A data lake1 is a pool of data from multiple sources. It is different from a data warehouse as it can store both structured and unstructured data, which can be processed and analyzed later. This eliminates a significant part of the overhead associated with traditional database architectures, which would commonly include lengthy ETL and data modeling when ingesting the data (to impose schema-on-write).
With ever-growing amounts of data being collected and the need to leverage that data to build solutions and strategies, organizations face a major challenge in maintaining these massive pools of data and extracting valuable business insights from them. If the data in a data lake is not well curated, the lake can become flooded with random information that is difficult to manage and consume, turning it into a data swamp. Therefore, before going forward with a data lake, it’s important to be aware of the best practices for designing, implementing, and operationalizing it.
Let’s look at the best practices that help build an efficient data lake.
Data Ingestion
Data lakes allow organizations to hold, manage, and exploit diverse data to their benefit. But here’s the reality: some data lakes fail to serve their purpose due to their complexity. This complexity can be induced by several factors, one of which is improper data ingestion. Building a sound data ingestion2 strategy is vital for succeeding with your enterprise data lake.
Addressing The Business Problem: It’s always better to question the need for a data lake before diving straight into it; opt for one only if the business problem demands it. Stay committed to the problem and find its answer, and if building a data lake turns out to be the right way to go, then great. A common misconception is that data lakes and databases are the same. The basics of a data lake should be clear, and it should be implemented for the right use cases. In general, data lakes are suitable for analyzing data from diverse sources, especially when the initial data cleansing is problematic. Data lakes also provide practically unlimited scalability and flexibility at a very reasonable cost. Let’s look at some use cases where businesses/industries use data lakes:
Healthcare- There is a lot of unstructured data in medical services (i.e. doctors’ notes, clinical information, etc.) and a constant need for real-time insights. Therefore, the use of data lakes turns out to be a better fit for companies in healthcare/insurance, as it gives access to both structured and unstructured data.
Transportation- Data lakes allow you to transform raw data into structured data that is ready for SQL analytics, data science, and machine learning with low latency. Raw data can be retained indefinitely at a low cost for future use in machine learning and analytics. In the transportation industry, the business insights derived from the data can help companies reduce their costs and increase their profits.
Schema Discovery Upon Ingest: It’s generally not a good idea to wait for the data to actually be in the lake to find out what’s in it. Having visibility into the schema and a general idea of what the data contains as it is being streamed into the lake eliminates the need for ‘blind ETLing’ or reliance on partial samples for schema discovery later on.
Ensure Zero Data Loss: Ingestion can be in batch or streaming form. The data lake must ensure zero data loss and write the data exactly once or at least once. Duplicate events or missed events can significantly hurt the reliability of the data stored in your lake, but exactly-once processing is notoriously difficult to implement when the storage layer is only eventually (not instantly) consistent. The data lake must also handle schema variability, ensure that data is written in the most optimized data format into the right partitions, and provide the ability to re-ingest data when needed.
Persist Data In Its Raw State: It’s always good to persist data in its original state so that it can be repurposed whenever new business requirements emerge. Furthermore, raw data is great for exploration and discovery-oriented analytics (e.g., mining, clustering, and segmentation), which work well with large samples, detailed data, and data anomalies (outliers, nonstandard data).
Data Transformation
Data generation and data collection across semi-structured and unstructured formats is both bursty and continuous. Inspecting, exploring, and analyzing these datasets in their raw form is tedious because the analytical engines must scan the entire data set across multiple files. Here are a few ways to reduce the data scanned and the query overheads:
Columnar Data Formats For Read Analytics: Columnar storage makes the data easy and efficient to read, so it is better to store data that will be used for analytics in a format such as Apache Parquet3 or ORC4. In addition to being optimized for reads, these file formats are open-source rather than proprietary, which means they can be read by a variety of analytics services.
Partition Data: Partitioning the data helps reduce query costs and improves performance by limiting the number of scans the query engines need to do in order to return the results for a specific query. Data is commonly partitioned by timestamp – which could mean by the hour, by the minute, or by the day – and the size of the partition should depend on the type of query intended to run. One can also use time, geo, or line of business to reduce data scans, and tune partition granularity based on the data set under consideration (by hour vs. by second).
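As a rough illustration of both points above – columnar formats and partitioning – here is a PySpark sketch that converts raw JSON into Parquet partitioned by date. The paths, column names, and the events dataset are assumptions for illustration, not part of the original article.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-transform").getOrCreate()

# Read the raw JSON exactly as it landed in the lake (illustrative path)
raw = spark.read.json("s3a://my-lake/raw/events/")

# Derive a partition column from the event timestamp
events = raw.withColumn("event_date", F.to_date("event_timestamp"))

# Write an analytics-friendly copy: columnar Parquet, partitioned by date
(events.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3a://my-lake/curated/events/"))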
Chunk Up Small Files: Small files can be compacted into bigger ones asynchronously to reduce network overheads.
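A minimal compaction sketch along those lines, assuming the curated layout from the previous example: periodically rewrite a partition’s many small files into a handful of larger ones. The paths and target file count are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

# Rewrite one day's partition into fewer, larger files
day = spark.read.parquet("s3a://my-lake/curated/events/event_date=2021-01-01/")

(day.coalesce(8)  # collapse many small files into roughly 8 larger ones
    .write
    .mode("overwrite")
    .parquet("s3a://my-lake/compacted/events/event_date=2021-01-01/"))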
Perform Stats-based, Cost-based Optimization: A cost-based optimizer5 (CBO) and statistics can be used to generate efficient query execution plans that improve performance. Statistics also help to explain the optimizer’s decisions, such as why it chooses a nested loop join instead of a hash join, and let you understand the performance of a query. Dataset statistics like file size, row counts, and histograms of values can be collected to optimize queries with join reordering. Column and table statistics are critical for estimating predicate selectivity and the cost of a plan, and certain advanced rewrites require column statistics.
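In Spark SQL, for example, such statistics can be collected with ANALYZE TABLE; the table and column names below are placeholders, and this is only a sketch of one engine’s approach.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cbo-stats").enableHiveSupport().getOrCreate()

# Enable the cost-based optimizer and join reordering
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

# Collect table-level and column-level statistics used by the CBO
spark.sql("ANALYZE TABLE curated.events COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE curated.events COMPUTE STATISTICS FOR COLUMNS user_id, event_date")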
Use Z-order Indexed Materialized Views For Cost-based Optimization: A materialized view6 is like a query whose result is materialized and stored in a table. When a user query is found to be compatible with the query associated with a materialized view, the user query can be rewritten in terms of the materialized view. This technique improves the execution of the user query because most of the query result has been precomputed. A z-order index serves queries with multiple columns in any combination, not just data sorted on a single column.
Data Governance
Don’t wait until after your data lake is built to think about data quality. Having a well-crafted data governance7 strategy in place from the start is a fundamental practice for any big data project, helping to ensure consistent, common processes and responsibilities.
Maintaining Data Catalogs: Data should be cataloged and identified, with sensitive data clearly labeled. A data catalog helps users discover and profile datasets, enriches metadata through different mechanisms, documents datasets, and supports a search interface.
Ensuring Correct Metadata For Search: It’s important for every bit of data in a data lake to have information about it (metadata). Creating metadata is quite common among enterprises as a way to organize their data and prevent a data lake from turning into a data swamp. It acts as a tagging system that helps people search for different kinds of data. Without metadata, people accessing the lake may not know how to search for the information they need.
Set A Retention Policy: Data should not be stored forever in a data lake, as it will incur cost and may also result in compliance-related issues. Therefore, it is better to have appropriate retention policies for the incoming data.
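On AWS, for instance, such a policy can be enforced with S3 lifecycle rules. A hedged boto3 sketch follows; the bucket name, prefix, and retention period are placeholders.

import boto3

s3 = boto3.client("s3")

# Expire objects in the raw landing zone 365 days after creation (values are illustrative)
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-after-one-year",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            }
        ]
    },
)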
Privacy/Security: A key component of a healthy data lake is privacy and security, including role-based access control, authentication, authorization, and encryption of data at rest and in motion. A data lake security plan needs to address the following five important challenges:
Data access control – The standard approach calls for using built-in Identity and Access Management (IAM) controls from the cloud vendor.
Data protection – Encryption of data at rest is a requirement of most information security standards.
Data leak prevention – Most major data leaks come from within the organization, sometimes inadvertently and sometimes intentionally. Fine-grained access control is critical to preventing data leaks. This means limiting access at the row, column, and even cell level, with anonymization to obfuscate data correctly.
Prevent accidental deletion of data – Data resiliency through automated replicas does not prevent an application (or developers/users) from corrupting data or accidentally deleting it. To prevent accidental deletion, it is recommended to first set the correct access policies for the data lake, including account- and file-level access control using the security features provided by the cloud service. It is also recommended to routinely create copies of critical data in another data lake, which can be used to recover from data corruption or deletion incidents.
Data governance, privacy, and compliance – Every enterprise must deal with its users’ data responsibly to avoid the reputation damage of a major data breach. The system must be designed to quickly enable compliance with industry and data privacy regulations.
Following the above best practices will help create and maintain a sustainable and healthy data lake. By devising the right strategy of collecting and storing data in the right way, one can reduce the cost of the storage, make data access efficient and cost-effective, and ensure data security.
Data engineering1 is the aspect of data science that focuses on practical applications of data collection and analysis. It focuses on designing and building pipelines that transport and transform data into a highly usable format. These pipelines can take data from a wide range of sources and collect it in a data warehouse or data lake that represents the data uniformly as a single source of truth. The ability to quickly build and deploy new data pipelines, or to easily adapt existing ones to new requirements, is an important factor in the success of a company’s data strategy. The main challenge in building such a pipeline is to minimize latency and achieve a near real-time processing rate for high-throughput data.
Building a highly scalable data pipeline provides significant value to any company doing data science. Here are a few important points to consider while building robust data pipelines:
Pick the Right Approach
The first and foremost thing is to choose appropriate tools and frameworks to build a data pipeline as it has a huge impact on the overall development process. There are two extreme routes and many variants one can choose between.
The first option is to select a data integration platform that offers graphical development environments and fully integrated workflows for building ETL pipelines. This seems very promising but often turns out to be the harder route, as such platforms can lack significant features.
Another option is to create a data pipeline using powerful frameworks like Apache Spark or Hadoop. While this approach implies a much higher effort upfront, it often turns out to be more beneficial, since the complexity of the solution can grow with your requirements.
Apache Spark vs. Hadoop
Big Data analytics with Hadoop and MapReduce was powerful but often slow, and it gave users a low-level, procedural programming interface that required a lot of code for even very simple data transformations. Spark has been found to be preferable to Hadoop for several reasons:
Lazy evaluation in Apache Spark saves time because transformations are not executed until an action triggers them.
Spark has a DAG execution engine that facilitates in-memory computation and acyclic data flow, resulting in high speed. Data is cached so that it does not have to be fetched from disk every time, which saves further time.
Spark was designed to be a Big Data tool from the very beginning. It provides out-of-the-box bindings for Scala, Java, Python, and R.
Scala – It is generally good to use Scala when doing data engineering (read, transform, store). For implementing new functionality not found in Spark, Scala is the best option, as Apache Spark itself is written in Scala. Although Spark supports UDFs in Python well, there is a performance penalty, and diving deeper is not possible. Implementing new connectors or file formats with Python is very difficult, maybe even unachievable.
Python – In the case of Data Science, Python is a much better option with all those Python packages like Pandas, SciPy, SciKit Learn, Tensorflow, etc.
R – It is popular for research, plotting, and data analysis. Together with RStudio, it makes statistics, plotting, and data analytics applications. It is majorly used for building data models to be used for data analysis.
Java – It is the least preferred language because of its verbosity. It also lacks a Read-Eval-Print Loop (REPL), which is a major deal-breaker when choosing a programming language for big data processing.
Working With Data
As data starts increasing in volume and variety, the relational approach does not scale well enough for building Big Data applications and analytical systems. The major challenges include:
Managing different types and sources of data, which can be structured, semi-structured, or unstructured.
Building ETL pipelines to and from various data sources, which may lead to developing a lot of specific custom code, thereby increasing technical debt over time.
Having the capability to perform both traditional business intelligence (BI)-based analytics and advanced analytics (machine learning, statistical modeling, etc.), the latter of which is challenging to perform in relational systems.
The ability to read and write from different kinds of data sources is unarguably one of Spark’s greatest strengths. As a general computing engine, Spark can process data from various data management and storage systems, including HDFS, Hive, Cassandra, and Kafka. Apache Spark also supports a variety of data formats like CSV, JSON, Parquet, Text, JDBC, etc.
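A small PySpark sketch of that flexibility – every path, table name, and connection detail here is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-source").getOrCreate()

# Files in different formats
csv_df = spark.read.option("header", "true").csv("s3a://bucket/input/*.csv")
json_df = spark.read.json("hdfs:///landing/events/")
parquet_df = spark.read.parquet("s3a://bucket/curated/orders/")

# A relational source over JDBC
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/reporting")
    .option("dbtable", "public.customers")
    .option("user", "reporter")
    .option("password", "secret")
    .load())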
Writing Boilerplate Code
Boilerplate code refers to sections of code that have to be included in many places with little or no alteration.
RDDs – When working with big data, programming models like MapReduce are required for processing large data sets with a parallel, distributed algorithm on a cluster. But MapReduce code requires a significant amount of boilerplate.
This problem can be solved with Apache Spark’s Resilient Distributed Datasets2 (RDDs), the main abstraction for computations in Spark. Thanks to its simplified programming interface, the RDD unifies computational styles that were spread across the traditional Hadoop stack. It abstracts away traditional map-reduce-style programs, giving us the interface of a (distributed) collection, so many operations that required quite a bit of boilerplate in MapReduce are now just collection operations, e.g. groupBy, join, count, distinct, max, and min.
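For instance, a word count that needs a full mapper/reducer pair plus a driver in classic MapReduce collapses to a few collection-style operations on an RDD. A PySpark sketch, with an illustrative input path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("hdfs:///data/articles/*.txt")
    .flatMap(lambda line: line.split())   # split lines into words
    .map(lambda word: (word, 1))          # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.take(10))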
CleanFrames – Nowadays, data is everywhere and drives companies and their operations. Ensuring the data’s correctness deserves a special discipline, known as data cleansing, which is focused on removing or correcting corrupt records. Doing this by hand involves a lot of boilerplate code. To make data cleansing automated and enjoyable, there is a small Spark library called CleanFrames3: simply import the required code and call the clean method. The clean method expands code through implicit resolution based on a case class’s elements; the Scala compiler applies a specific method to a corresponding element’s type. CleanFrames comes with predefined implementations that are available via a simple library import.
Writing Simple And Clear Business Logic
Business logic is the most important part of an application and it is the place where most changes occur and the actual business value is generated. This code should be simple, clear, concise, and easy to adapt to changes and new feature requests.
Some features offered by Apache Spark for writing business logic are –
The RDD abstraction, with many common transformations like filtering, joining, and grouped aggregations, is provided by the core libraries of Spark.
New transformations can be easily implemented with so-called user-defined functions (UDFs), where one only needs to provide a small snippet of code working on an individual record or column, and Spark wraps it up such that it can be executed in parallel and distributed across a cluster of computers (see the sketch after this list).
Using the internal developers API, it is even possible to go down a few layers and implement new functionalities. This might be a bit complex, but can be very beneficial for those rare cases which cannot be implemented using user-defined functions (UDFs).
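As a hedged PySpark sketch of the UDF approach mentioned above – the masking rule and column names are invented for illustration, and the Python-UDF performance caveat discussed earlier still applies:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

def mask_email(email):
    # Keep the domain, hide the local part (an illustrative business rule)
    if email is None or "@" not in email:
        return None
    _, domain = email.split("@", 1)
    return "***@" + domain

mask_email_udf = udf(mask_email, StringType())

df = spark.createDataFrame([("alice@example.com",), ("bob@example.org",)], ["email"])
df.withColumn("masked_email", mask_email_udf("email")).show()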
To sum up, Spark is preferred because of its speed – it is faster than most large-scale data processing frameworks. It supports multiple languages like Java, Scala, R, and Python, and a plethora of libraries, functions, and collection operations that help write clean, minimal, and maintainable code.
The COVID-19 pandemic has significantly accelerated the pace at which companies are migrating their application workloads to the cloud. To keep up with the increasingly digitally-enabled world, companies of all sizes are evolving the way they work, to drive agile product delivery and innovation.
Cloud-native technologies are positioned to be the key enabler that helps companies meet the challenges of the digital-focused marketplace. They are a natural fit for IT organizations looking to deploy resilient and flexible applications that can be managed seamlessly from anywhere.
Growth of Cloud-Native
Companies have gradually realized the need to move important components of their business infrastructure to the cloud. The business challenges cropping up due to the restrictions put in place during the COVID-19 pandemic have considerably reinforced this need. While many companies have opted to migrate their legacy systems to easy-to-manage and cost-effective cloud platforms, many others are choosing to develop cloud-native apps. Unlike typical legacy applications that have to be migrated to the cloud, cloud-native ones are built for the cloud from day one. These apps tend to be deployed as microservices, run in containers, and are typically managed using agile methodologies and DevOps.
There are several advantages a business can enjoy by going cloud-native, such as:
Faster deployment: To keep pace with the competition levels and meet the evolving needs of the customers, companies today have to keep innovating their apps frequently. In many scenarios, however, deploying new app features can be extremely cumbersome and complex, even for large companies having a skilled IT team. Cloud-native computing makes this process much simpler. Being based on microservices, this system facilitates the deployment of new apps and features at a rapid pace. It allows developers to deploy new and advanced features within a day, without having to deal with any dependencies.
Wider reach: Cloud-native technologies help companies reach a global audience, expand their market reach, and increase their prospects. The low-latency solutions enjoyed by companies going cloud-native allow them to seamlessly integrate their distributed applications.1 For modern video streaming and live streaming requirements, low latency is extremely important. This is where edge computing and Content Delivery Networks (CDNs) additionally shine, bringing data storage and computation closer to users.2
Leverage bleeding-edge technologies: Bleeding-edge services and APIs that were once available only to businesses having expansive resources are now made available to almost every cloud subscriber at nominal rates. These all-new cloud-native technologies are built to first cater to cloud-based systems and their discerning users.3
Quick MVP: Due to the elasticity of the cloud, anyone can do an MVP/POC and check their products across geographies.
Leaner Teams: Owing to the reduced operational overheads, teams working with cloud-native technologies can stay leaner.
Easy scalability: Most legacy systems depended on plugs, switches, and various other types of tangible components and hardware. Cloud-native computing, on the other hand, is wholly based on virtualized software, where everything happens on the cloud. Hence, the application performance is not affected if the hardware capacity is scaled down or up, in this system. By going cloud-native, companies need not invest in expensive processors or storage for servers to meet their scalability challenges. They can just opt to reallocate the resources and scale up and down. All of it is done seamlessly without impacting the end-users of the app.
Better agility: The reusability and modularity of cloud-native computing make it a perfect fit for firms practicing agile and DevOps methodologies, allowing frequent releases of new features and apps. Such companies also follow continuous integration and delivery (CI/CD) processes to launch new features. Given that cloud-native computing uses microservices for deployment, developers can swiftly write and deploy code. This streamlines the deployment process, making the whole system much more efficient.
Low-code development: With low-code development, developers can shift their focus from distinctive low-value tasks to high-value ones that are better aligned to their business requirements. It allows the developers to create a frontend of discerning apps quite fast and streamline workflow designs with the aim of accelerating the go-to-market time.
Saves cost: Cloud-native computing is based on virtualization and containerization of software. Hence, firms do not have to spend their resources on additional hardware or servers. They can easily deploy apps on any server while optimizing their usage.
Technologies that are a perfect fit for cloud-native:
K8s: The declarative, API-driven infrastructure of Kubernetes empowers teams to operate independently while focusing on important business objectives. It helps development teams enhance their productivity while reducing the complexity and time involved in app deployment. Kubernetes plays a key role in enabling companies to enjoy the best of cloud-native computing and avail the prime benefits offered by this model.
Managed Kafka: With several Fortune 100 companies depending on Apache Kafka, this service has become quite entrenched and popular. As cloud technology keeps expanding, a few changes are needed to make Apache Kafka truly cloud-native. Cloud-native infrastructures allow people to leverage SaaS/serverless features in their own self-managed infrastructure.4
ElasticSearch: Elasticsearch is a distributed search and analytics system that allows complex search capabilities across numerous types of data. For larger cloud-native applications that have complex search requirements, Elasticsearch is available as a managed service in Azure.5
ML/AI: The cloud enables firms to use managed ML to remove the burden on human resources. It reduces limitations in regards to data and ML, allowing all stakeholders to access the program and insights. AI takes the same approach on the cloud as machine learning but has a wider focus. Several firms today are able to deploy AI models and deep learning to the elastic and scalable environment of the cloud.6
All companies, no matter their core business, have to embrace digital innovation to stay competitive and ensure their continued growth. Cloud-native technologies help companies gain an edge over their market competition and manage applications at scale and with high velocity. Firms previously constrained to quarterly deployments of important apps are now able to deploy safely several times a day.
Transit Gateway is a highly available network gateway offered by Amazon Web Services. It eases the burden of managing connectivity between VPCs and from VPCs to on-premise data-center networks, allowing organizations to build globally distributed networks and centralized network monitoring systems with minimal effort.
Earlier, the limitations of VPC Peering made it impossible to connect VPN connections to on-premise networks directly. Also, to use a transit VPC, a VPN appliance had to be purchased from the AWS Marketplace and connected to all the VPCs and on-premise networks, which increased both cost and maintenance.
Advantages of AWS Transit Gateway
Transit Gateway is highly available and scalable.
The best solution for hybrid cloud connectivity between On-premise and multiple cloud provider VPCs.
It provides better security and more efficient control of traffic through various route tables.
It helps manage routing across AWS accounts globally.
Manage AWS and on-premise networks using a centralized dashboard.
It helps protect against distributed denial-of-service attacks and other common exploits.
Cost comparison between Transit Gateway and VPC Peering:

|                                                                 | VPC Peering                                                             | Transit Gateway                                                       |
|-----------------------------------------------------------------|-------------------------------------------------------------------------|-----------------------------------------------------------------------|
| Cost per VPC connection                                         | None                                                                    | $0.05/hour                                                            |
| Cost per GB transferred                                         | $0.02 ($0.01 charged to the sender VPC owner, $0.01 to the receiver)    | $0.02                                                                 |
| Overall monthly cost with 3 connected VPCs and 1 TB transferred | Connection charges – $0; Data transfer – $20; Total = $20/month         | Connection charges – $108; Data transfer – $20; Total = $128/month    |
Transit gateway design best practices:
Use a smaller CIDR subnet and use a separate subnet for each transit gateway VPC attachment.
Based on the traffic, you can restrict NACL rules.
Limit the number of transit gateway route tables.
Associate the same VPC route table with all of the subnets that are associated with the transit gateway.
Create one network ACL and associate it with all of the subnets that are associated with the transit gateway. Keep the network ACL open in both the inbound and outbound directions.
In the architecture below, HashedIn, a cloud organization, has a master billing account, logging and security accounts, a network account hosting the networking infrastructure, a shared services account, three development accounts, and one production-level account. AWS Transit Gateway is the single point for all connectivity.
For each of the accounts, the VPCs are connected to the Transit Gateway via a Transit Gateway attachment. Each account has a Transit Gateway route table associated with the appropriate attachment through which traffic is sent, and subnet route tables can then be used to route to other networks. The network account’s transit gateway is connected to the on-premise data center and other networks.
Here are the steps to configure multiple AWS accounts with AWS Transit Gateway (a code sketch follows the steps below):
Firstly, access the AWS Console
Up next, create the Transit Gateway
Lastly, create Transit Gateway Attachment
The three options available while creating a Transit Gateway Attachment are:
VPC
VPN
Peering Connection
When attaching a VPC, an ENI is created in each chosen availability zone. A TGW attachment should be created in all availability zones of the VPC so that the TGW can communicate with the ENI attachment in the same availability zone.
Create Transit Gateway Route Table
Add the routing rule for the respective TransitGateway ID
Create an association and attach TransitGateway attachments
Create static routes
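A hedged boto3 sketch of these steps – every ID, region, and CIDR range below is a placeholder, and the same can equally be done from the console or with infrastructure-as-code tooling:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Create the Transit Gateway
tgw = ec2.create_transit_gateway(Description="shared-network-hub")
tgw_id = tgw["TransitGateway"]["TransitGatewayId"]

# 2. Attach a VPC (one subnet per availability zone is recommended)
attachment = ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw_id,
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],
)
attachment_id = attachment["TransitGatewayVpcAttachment"]["TransitGatewayAttachmentId"]

# 3. Create a route table, associate the attachment, and add a static route
route_table = ec2.create_transit_gateway_route_table(TransitGatewayId=tgw_id)
route_table_id = route_table["TransitGatewayRouteTable"]["TransitGatewayRouteTableId"]

ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId=route_table_id,
    TransitGatewayAttachmentId=attachment_id,
)
ec2.create_transit_gateway_route(
    DestinationCidrBlock="10.1.0.0/16",
    TransitGatewayRouteTableId=route_table_id,
    TransitGatewayAttachmentId=attachment_id,
)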
Transit Gateway attachments are associated with a Transit Gateway route table, and multiple attachments can be related to a single route table. Propagation dynamically populates the routes of one attachment into the route table of another attachment, while associations attach Transit Gateway attachments to a route table.
In addition, the Network Manager helps reduce the operational complexity of connecting remote locations and other cloud resources. It also acts as a centralized dashboard to monitor end-to-end networking activity in your AWS account.
Conclusion
The Transit Gateway is a centralized gateway where we can manage AWS and on-premise networks from a single dashboard. It also helps simplify the network architecture, which earlier was complicated by managing inter-VPC connectivity and Direct Connect.
Henry Mintzberg, the Canadian author and academic, once said, “Corporations are social institutions. If they don’t serve society, they have no business existing.” To serve society implies keeping it safe and healthy. Even so, the dramatic collapse of major economies was inevitable: the UK is experiencing its first recession in 11 years, and India might have a similar fate, with GDP falling by a significant 23.9%. Technology has played a huge role in enabling individuals and organizations to rapidly adapt to our new COVID world. Many countries have developed tools to ensure public-health safety measures are being disseminated and implemented, and technology is enabling people to be more in control of their own health and less dependent on medical professionals. Healthtech will help in identifying illnesses and treatment options, leading to increased access to medical information and improved quality of life. Whether it’s through telemedicine, remote monitoring, or workforce management, this blog looks at how businesses can get back on their feet.
Telehealth
The pandemic instilled fears that one would not have previously thought of – what if seeking medical help led to more health complications? Although laden with skepticism at first, the overburdening of the healthcare system due to the coronavirus led to a rise in telemedicine all over the world. Research shows that more than 80% of doctors expect to offer telehealth at the same or a greater level than they do now.1 In India, on the same day the nationwide lockdown began, the Telemedicine Practice Guidelines were issued, making it legal for registered medical practitioners to provide healthcare services using digital technologies, such as chat, audio, and video, for a diagnosis. A recent report2 found that at least 50 million Indians opted for online healthcare during the pandemic, out of which 80% were first-time telemedicine users; moreover, 44% of these users were from non-metro cities. According to a report by the McKinsey Global Institute, India could save up to $10 billion by 2025 if telemedicine services were to replace 30-40 percent of in-person consultations.3
Taking this opportunity, many healthtech organizations and experts voluntarily created apps for free teleconsultation. One such app is SWASTH – an effective telemedicine platform for quality COVID-related healthcare access in India. It has over 2,000 registered doctors in its database who can provide audio/video teleconsultations; other features include a listing of fever clinics, online medicine ordering, support for multiple languages, etc.
IoMT | Wearable Devices
Like telemedicine, another tech solution that can complement and enable well-being from a distance is connected wearable health devices together with remote patient monitoring (RPM) technologies. Beyond the cases where hospital readmissions are critical, it was found that the Medicare industry spends an estimated $17 billion a year on avoidable readmissions that could be mitigated by early detection, intervention, and better at-home care.4
Behind all of this is the Internet of Medical Things (IoMT), which will allow cloud-connected medical devices for improved patient compliance, detecting device failures before they become critical, and collecting data for more personalized therapy.5 As internet penetration increases around the world, and wireless capabilities and computing power improve, many IoMT devices are changing the healthcare landscape. With the ability to track and prevent chronic illnesses, connected devices will be able to collect, analyze, and transmit health data for both patients and healthcare professionals.
Workforce Management
COVID’s impact on businesses has been immense, and transitioning to a post-COVID era will require relying heavily on digital technologies for self-assessment, contact tracing, and syndromic surveillance. Initially, a false trade-off was drawn between the economy and the virus. However, it was never a debate – for economies to regain their momentum, the coronavirus has to be defeated. The world can’t come to a standstill; and although remote working is encouraged and has become the “new normal”, many organizations are eagerly waiting for things to settle down so they can resume work in a staggered manner.
As organizations resume their operations, ensuring the well-being of their employees is going to be their priority. Technology can play a role in effectively managing the workforce while complying with all the government protocols. One such innovation from HashedIn Technologies, called Heed, is a platform that allows corporations to get back on their feet despite the ongoing pandemic. Some of its unique features include contact tracing, employee self-assessment, alerts, employee/visitor rostering and access, security, cloud syncing, etc.
Healthtech is here to stay
As Amina J. Mohammed, the UN Deputy Secretary-General, said, there is no exit strategy until a vaccine is found.6 However, innovation in the healthtech space is enabling us to contain the virus and assuring us that we will be better equipped to adapt to and fight off a health crisis of this scale through technology. The accelerated adoption of digitization in the healthcare sector has helped us cope with the repercussions of COVID. Healthtech has not only reduced the impact of the pandemic but has also increased access, bridging the urban-rural healthcare divide while improving personalized patient care. Healthtech solutions are here to stay, long after the world has recovered from this pandemic.
A progressive web application (PWA) is software built using common web technologies like HTML, CSS, and JavaScript that is intended to run on any platform with a standards-compliant browser. It offers functionality such as working offline, push notifications, and device hardware access to create a native-application-like experience. PWAs make a web application act and feel like an app by applying different technologies: using progressive enhancement, new capabilities are enabled in modern browsers through web app manifest files and service workers. With these technologies, a PWA can close the gap between a classic web application and a desktop or native mobile application.
This article talks about how we can enable PWA features to make our app installable, i.e. add the app to the home screen and cache files such as CSS, JS, and data responses for a native-app-like experience.
To enhance a classic web application with PWA features, two essential building blocks must be added to the application, a web app manifest file and a service worker. Let’s look into the purpose of these two elements.
Web App Manifest
The purpose of the web app manifest file is to provide information about the web application. The JSON structure of that file can contain information such as name, display, icons, and description. This information is used to install the web application to the home screen of a device, helping the user access the application conveniently with an overall app-like experience.
Following the web application manifest specification (https://www.w3.org/TR/appmanifest/), we add a reference to the manifest file in index.html and create the file itself with the desired properties.
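The original snippets are not reproduced here; a minimal sketch of what that reference and manifest might look like (the app name, colors, and icon paths are assumptions):

In index.html:

<link rel="manifest" href="manifest.webmanifest">

And a minimal manifest.webmanifest:

{
  "name": "My PWA",
  "short_name": "MyPWA",
  "start_url": "/",
  "display": "standalone",
  "theme_color": "#1976d2",
  "background_color": "#ffffff",
  "icons": [
    { "src": "assets/icons/icon-192x192.png", "sizes": "192x192", "type": "image/png" },
    { "src": "assets/icons/icon-512x512.png", "sizes": "512x512", "type": "image/png" }
  ]
}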
After adding the web app manifest file, it is mandatory to add a service worker to our application as well.
Service Worker
A service worker is a JavaScript file that is added to the project and registered in the browser. The service worker script has the capability to run in the background and perform tasks like push notifications and background sync. To use this API we need to follow three steps:
Register a Service Worker.
Install a Service Worker.
Update a Service Worker
Typically, this can be achieved by writing JavaScript code and listening to events like download, install, and fetch. But to simplify our job, the Angular CLI can set all of this up with a single command.
Steps To Add Service Worker in Angular App
To set up the Angular service worker in a project, use the CLI command ng add @angular/pwa.
Running the above command completes the following:
1. Adds the @angular/service-worker package to your project.
2. Enables service worker build support in the CLI.
3. Imports and registers the service worker in the app module.
4. Updates the index.html file to include the manifest.webmanifest file.
5. Creates the service worker configuration file ngsw-config.json, which holds the detailed description of the caching properties.
After the setup is complete, we structure “SW-master.js”, which is referenced in the app module. This file contains the script that will be generated while building the app in production mode.
This completes the framework of the service worker, and our app now qualifies as a PWA. The app caches all the files mentioned in ngsw-config.json under the asset group and data group. Wondering where it stores all the cached assets? Your origin (domain name) is given a certain amount of free space that is shared between all origin storage, i.e. local storage, IndexedDB, and the Cache API. The amount available to service workers isn’t specified, but rest assured that the space is enough to make the site work offline.
As a matter of fact, though, the app does not automatically update itself when a new version is released. To address this, the service worker module provides a few services that can be used to interact with the service worker and control the caching of the app.
The SwUpdate service gives us access to events that fire when an update is available for the app. This service can be utilized to constantly check for updates so that the app can track the latest changes and reload to apply them.
This can be done at various stages of an app. For simplicity, we do this in the “app.component.ts” file, in the OnInit lifecycle hook.
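The original snippet is not reproduced here; a hedged TypeScript sketch of that check, using the SwUpdate service’s available stream as it existed around the time of writing (newer Angular versions expose a versionUpdates stream instead, and the prompt text is illustrative):

import { Component, OnInit } from '@angular/core';
import { SwUpdate } from '@angular/service-worker';

@Component({
  selector: 'app-root',
  templateUrl: './app.component.html'
})
export class AppComponent implements OnInit {
  constructor(private swUpdate: SwUpdate) {}

  ngOnInit(): void {
    if (this.swUpdate.isEnabled) {
      // Fires when the service worker has downloaded a new version of the app
      this.swUpdate.available.subscribe(() => {
        if (confirm('A new version is available. Reload now?')) {
          window.location.reload();
        }
      });
      // Ask the service worker to look for an update as soon as the app starts
      this.swUpdate.checkForUpdate();
    }
  }
}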
The above piece of code injects the SwUpdate service and checks for any available update as soon as the app starts. If there is an update, it prompts the user to reload.
Summary
It can be concluded that by adding a manifest file and enabling a service worker with a few simple steps, we configured our app to behave like a native application. This doesn’t eliminate the need for native applications, as some capabilities are still out of the web’s reach, but new and upcoming APIs are looking to change that. These capabilities are built with the web’s secure, user-centric permission model, ensuring that going to a website is not a scary proposition for users.
To check the app in action, visit https://tractorbazar.com/ and click on the ‘add app’ option. You will notice that the app installs like a native app while remaining lightweight.
Installed Progressive Web Apps run in a standalone window instead of a browser tab. They’re launchable from the user’s home screen, taskbar, or shelf. It’s possible to search for them on a device and jump between them with the app switcher, making them feel like part of the device they’re installed in.
Using the latest web features to bring native-like capabilities and reliability, Progressive Web Apps allow what you build to be installed by anyone, anywhere, on any device, from a single lightweight codebase.
Picture yourself in possession of a sample web application and an HA (multi-zone) Kubernetes cluster, and in need of high availability for the application on that cluster. The first move any K8s expert would make is to deploy the application onto the nodes, but there is no assurance that each node will run at least one pod. This blog gives you a clear solution to that scenario.
Kubernetes allows selective deployment of new pods using affinities. These can be good solutions to common HA problem statements like –
Run n pods on each node
Ignore a certain node group for a group of pods
Prefer certain regions, AZs, or nodes when deploying auto-scaled pods
Let’s discuss all these in detail. Before proceeding please find below some of the common terminologies that would be used in this blog.
podAffinity: can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
podAntiAffinity: can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod.
weight: can be any value from 1 to 100. The weight gives the matching node a relatively higher priority than other nodes; the more you want your preference to be fulfilled, the higher you should set the weight.
topology: can be defined as node labels
requiredDuringSchedulingIgnoredDuringExecution (HARD): with this approach, the scheduler places a pod on a node only if the rule is satisfied. As a result, only one pod will be deployed to each node, and the remaining pods will stay in the Pending state.
preferredDuringSchedulingIgnoredDuringExecution (SOFT): with this approach, the scheduler first prefers the nodes that satisfy the rule, and if none exist it deploys to non-preferred nodes. Combined with a weight in the deployment, this lets pods be distributed evenly among nodes.
So, a rule of podAntiAffinity with SOFT scheduling will do the task here!
First, let’s have a look at the deployment defined by the yaml file below, which uses podAntiAffinity with a replica count of 3.
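The original manifest is not reproduced here; a minimal sketch of such a deployment, with the image and label values assumed to be consistent with the discussion that follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      run: nginx
  template:
    metadata:
      labels:
        run: nginx
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: run
                  operator: In
                  values:
                  - nginx
              topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80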
This anti-affinity rule ensures that two pods with the label run=nginx should preferably not run on the same node. This deployment was applied to an HA cluster with 3 master nodes.
Result:
As you can see here, each pod is deployed to a different node (master1, master2, and master3).
We have already seen what happens on deployment; now let’s see what happens while scaling the deployment.
Scale the replica to a higher count, let’s say 6!
kubectl scale deployment nginx --replicas=6
Result:
As you can see, the next set of 3 pods also got distributed evenly over the nodes; each node now runs 2 nginx pods.
Will the same work with HPA (Horizontal Pod Autoscaler)?
Configure the HPA and create a load-generator pod that hits the application endpoint with multiple requests so that it triggers scaling.
Now as you can see the newly launched pods are successfully distributed among the nodes.
Result:
podAntiAffinity with soft scheduling is therefore a simple solution for achieving high availability of deployments. We hope this blog has helped you understand the importance of pod affinity rules.
RedMask is a command-line tool to mask sensitive data in a relational database. Currently, it supports Redshift and Postgres. This tool helps organizations reduce the time needed to share data, build data warehouse solutions, ensure security, and reduce compliance costs.
Why Data Masking?
Masking is an essential use case when huge volumes of data are managed. It is a critical step when dealing with personal or commercially sensitive and identifiable information: the data should be protected while being shared with other internal or external people or agencies for consumption. Here are a few examples where masking would be necessary:
FinTech/BFSI organization – Sharing customer, product, and transaction details.
Healthcare industry – Sharing patient data, diagnosis, and health-related information.
E-commerce organization – Sharing the details of the consumer and product database.
Transportation and urban logistics companies – Sharing user and location data.
How RedMask Works?
Administrators can mask data using a variety of techniques. RedMask uses native database queries to mask data and manage permissions, and thus has minimal impact on performance. RedMask supports both dynamic and static masking.
Static Masking:
In static masking, a new table is created with the selected columns masked. This increases storage costs. This technique is suitable for data sets that do not change often.
Dynamic Masking:
In dynamic masking, RedMask creates a view that masks the desired columns. The view has the same name and columns as the underlying table but is in a different schema. When a query is executed, the data warehouse picks either the table or the view, depending on the search path/default schema.
RedMask creates a masked view for data consumers with lesser privileges, based on the settings; these consumers see masked data instead of the real data. In addition, RedMask supports a dry-run mode, wherein it just generates a SQL file with the required queries while making no changes to the database, allowing database administrators to verify the underlying queries.
What Makes RedMask Different?
RedMask addresses the challenges faced while using other masking tools in the market.
RedMask is a proxy-less application, therefore it requires minimal setup time.
RedMask uses underlying database queries, therefore it has a negligible impact on performance and keeps the data within the database.
It has an additional dry run mode for administrators who want to verify the script before it gets executed onto the database.
Since RedMask doesn’t interact directly with the underlying data, it needs almost zero setup infrastructure and has a negligible communication overhead.
Note that the masked view is created under a schema named after the username entered in the config file.
The user will only be able to access this view when they query this particular table.
FAQs
Does it only support Redshift?
No, RedMask also supports PostgreSQL and Snowflake. Support for other databases is in progress.
Does this involve any additional infrastructure cost?
The RedMask application itself has negligible setup and operating overhead. Additional storage is only required to store masked data in static masking mode.
Will the new tables and views that are created, be secure?
Yes, the new tables and views are secure. They allow access only to the users or roles specified at the time of masking. Additional permissions can be assigned by the database administrator at their own discretion.
When React Hooks were released at the React Conference 2018, they took the React community all across the globe by storm. Finally, there was a refreshing and fundamental change in the way everyone thinks about and writes React code. One of the attractive (yet mysterious?) aspects of using hooks was the brevity they imparted to projects. As everyone was pumped up about giving these all-new hooks a spin, so was I. Sure, useState, useMemo and useContext look neat and are fairly simple to understand. But there, in plain sight is probably one of the most critical hooks – useEffect.
useEffect is supposed to be this all-in-one replacement for a bunch of old lifecycle methods like componentDidMount, componentDidUpdate to name a few, and hence getting the right understanding of useEffect hook and hooks, in general, is paramount to writing consistent, clean and bug-free data-driven React applications.
Here are some of the questions you might have after employing the useEffect hook, which we’ll seek answers to:
How to emulate componentDidMount with useEffect?
How to emulate componentDidUpdate with useEffect?
How to access previousProps as we did in componentDidUpdate with useEffect?
How to stop an infinite refetching loop?
Should functions be specified as dependencies to useEffect?
In case you are looking for answers to these questions real quick, check out the TLDR; section at the end.
Understanding Rendering
Functional components differ from traditional class components in a prominent way: how they close over state and props. In simplified terms, each ‘render’ owns a separate set of state and props. That might not be simple to grasp at first, so let’s go with a simple example –
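The original snippet is not reproduced here, so here is a minimal counter along those lines (a sketch, not the original code):

import React, { useState } from "react";

function Counter() {
  const [count, setCount] = useState(0);

  return (
    <div>
      <p>You clicked {count} times</p>
      <button onClick={() => setCount(count + 1)}>Click me</button>
    </div>
  );
}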
Every time the user clicks the button, setCount updates the state, which triggers a re-render with the new value. Initially count is 0, and when the first click happens, setCount updates the state, calling the Counter function again with a new value of count, which is 1. For each render, Counter sees its own value of count. So count is constant within each render, and its value is isolated between different renders. This is not only true for state values but also for props, event handlers, functions, and pretty much everything. Here is another example to clarify this –
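Again, a sketch of the component being discussed next – a name input plus a greet button whose handler alerts after five seconds; the exact details are assumed:

import React, { useState } from "react";

function Greeting() {
  const [name, setName] = useState("");

  const handleGreet = () => {
    setTimeout(() => {
      // Alerts the `name` captured by the render in which the click happened
      alert("Hey " + name);
    }, 5000);
  };

  return (
    <div>
      <input value={name} onChange={event => setName(event.target.value)} />
      <button onClick={handleGreet}>Greet</button>
    </div>
  );
}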
Okay, let’s say we type ‘Dan’ in the input and click the greet button, then type ‘Sunil’ and click the greet button again before 5 seconds have passed. What message will the greeting alert show? Will it be ‘Hey Dan’ or ‘Hey Sunil’? It is the latter one, right? No. Here’s a sandbox for you to give it a spin and see the difference between class components and functional components.
So why is it so? We’ve discussed this above. Each render owns the state for functional components which is not true for class components. Next time someone asks you the difference between class and functional components, do remember this.
Although, you should totally follow Dan and Sunil on Twitter!
Each render owns its State, Props, Effects …Everything.
As we saw above, each render owns its state, props, and pretty much everything including effects.
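To make that concrete, here is a sketch of the kind of example discussed next – a name input with an effect that logs the name after a delay (an assumed snippet, not the original one):

import React, { useState, useEffect } from "react";

function Greeting() {
  const [name, setName] = useState("");

  useEffect(() => {
    setTimeout(() => {
      // Logs the `name` belonging to the render this effect was created in
      console.log(`You entered ${name}`);
    }, 5000);
  });

  return <input value={name} onChange={event => setName(event.target.value)} />;
}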
In the example above, as the user types a name in the input field, setName gets triggered and sets the new name in the state, leading to a re-render. How does the effect know about the latest value of name? Is it some sort of data binding? Nope. We learned earlier that count is different and isolated for each render. Functions, event handlers, and even effects can ‘see’ only the values from the render they belong to. So the correct mental model is: it is not a changing name inside an unchanging effect; the effect itself is different for each render.
Let’s set this straight: every function inside a component render sees the state and props particular to that render only.
So, the eventual question in your mind would be – how can I access state and props from future or previous renders, just like class components do? It would be going against the flow, but it is definitely possible using refs. Be aware that when you want to read the future props or state from a function in a past render, you’re intentionally going against the flow. It’s not wrong (and in some cases necessary), but it might look less “clean”, and that’s intended to highlight which code is fragile and depends on timing. In classes, it’s less obvious when this happens, which can lead to strange bugs in some cases.
import React, { useState, useRef, useEffect } from "react";

function Greeting() {
  const [name, setName] = useState("");
  const latestName = useRef();

  useEffect(() => {
    // Set the mutable latest value
    latestName.current = name;
    setTimeout(() => {
      // Read the mutable latest value
      console.log(`You entered ${latestName.current}`);
    }, 5000);
  });

  return (
    <div>
      <p>Name is {name}</p>
      <input type="text" onChange={event => setName(event.target.value)} />
    </div>
  );
}
This code will always log the latest value of name, behaving like classes. So it might feel a bit strange but React mutates this.state the same way in classes.
What about the ‘cleanup’ Function?
Going through the docs, you will find mentions of a certain cleanup function. The cleanup function does exactly what it says – it cleans up an effect. Where will it be required, you might ask? A cleanup function makes sense in cases where you have to unsubscribe from subscriptions or, say, remove event listeners.
Many of us have a slightly wrong mental model regarding how cleanup functions work. Borrowing the example from the React docs:
useEffect(() => {
  ChatAPI.subscribeToFriendStatus(props.id, handleStatusChange);
  return () => {
    ChatAPI.unsubscribeFromFriendStatus(props.id, handleStatusChange);
  };
});

If props.id is 1 on the first render and 2 on the second, how will this effect work? The intuitive answer is:

subscribe with id = 1
unsubscribe with id = 1
then
subscribe with id = 2
unsubscribe with id = 2
That is the synchronous model we derive from our understanding of classes, but it doesn’t quite hold here. It assumes that the cleanup ‘sees’ the old props because it runs before the next render, and that’s not how it happens. React runs an effect only after the browser has painted, making apps feel faster since effects no longer block painting – so why should the cleanup function block the next render? Thankfully, it doesn’t. The previous effect is cleaned up after the re-render with the new props.
You must be a little uncomfortable with this. How can the cleanup function still see the ‘old’ props once the next render has happened? Well, reiterating our learning from above –
EVERY FUNCTION INSIDE A COMPONENT, INCLUDING EFFECTS AND TIMEOUTS, CAPTURES THE STATE AND PROPS OF THE RENDER.
So the cleanup function does not get the ‘new’ props and state; it gets the ones available to the other functions and effects of its own render. Hence cleanups don’t need to run right before the next render – they can run after the next render as well.
Telling React When to Run an Effect
This is pretty much in line with how React works: it learned the lesson that instead of re-rendering the whole DOM on every change, it should only update the parts that actually need updating. There are times when running an effect is not necessary, so how do you instruct React whether to run an effect or not?
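The snippet under discussion is not shown in the original; here is a sketch of the sort of component meant – a counter button next to a name input, with an effect that only uses name (the names are illustrative):

import React, { useState, useEffect } from "react";

function Example() {
  const [count, setCount] = useState(0);
  const [name, setName] = useState("Mary");

  useEffect(() => {
    // Uses only `name`, yet without a dependency array it re-runs on every render,
    // including renders caused purely by `count` changing
    document.title = `Hello, ${name}`;
  });

  return (
    <div>
      <input value={name} onChange={event => setName(event.target.value)} />
      <button onClick={() => setCount(count + 1)}>Increment ({count})</button>
    </div>
  );
}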
In this example, when you click the increment button, the state updates and the effect fires unnecessarily, even though the name it uses hasn’t changed. So should React perhaps diff the old effect and the new effect to see if anything has changed? No, that can’t happen either, because React can’t magically tell the difference between two functions without executing them, which defeats the purpose.
To get out of this problem, useEffect hook has a provision for providing a second argument – An array of dependencies. An array of dependencies or ‘deps’ signify what are the dependencies for that effect. If the dependencies are the same between renders, React understands that the effect can be skipped.
Being Honest About Effect Dependencies
Dependency array or ‘deps’ are critical to bug-free usage of useEffect. Giving incomplete or wrong dependencies can certainly lead to bugs. The statement to internalize here is –
ALL VALUES FROM INSIDE YOUR COMPONENT, USED BY THE EFFECT MUST BE SPECIFIED IN THE DEPENDENCY ARRAY.
This is of great relevance, and we as developers often lie to React about the dependencies of an effect – often unintentionally, owing to our reliance on the class lifecycle model, where we still try to emulate componentDidMount or componentDidUpdate. We need to unlearn that paradigm before using hooks because, as we have seen so far, hooks are quite different from classes, and carrying the baggage of the class model onto hooks can lead to inaccurate mental models and wrong concepts that are hard to shake.
Being honest about dependencies is quite simple if you follow these two conventions –
Include all the values inside the component that are used inside the effect.
OR
Change the effect code, so that you don’t need to specify dependency anymore.
The mental model of hooks generated by our usage of classes sways us against considering functions as part of the data flow, and hence as a dependency of an effect – which is not right. In reality, classes obstruct us from considering functions as part of the data flow, as this example shows –
import React, { Component } from "react";

class Parent extends Component {
  state = {
    query: "react",
  };

  fetchData = () => {
    const url = "https://hn.algolia.com/api/v1/search?query=" + this.state.query;
    // ... Fetch data and do something ...
  };

  render() {
    return <Child fetchData={this.fetchData} />;
  }
}

class Child extends Component {
  state = {
    data: null,
  };

  componentDidMount() {
    this.props.fetchData();
  }

  render() {
    // ...
  }
}
Now, if we want to refetch in the child whenever query changes, we cannot just compare fetchData between renders, because it is a class property and will always be the same, so such a comparison would never trigger a refetch. We have to take another, unnecessary step – passing query as a prop to Child and comparing the value of query in componentDidUpdate between renders.
class Child extends Component {
  state = {
    data: null,
  }
  componentDidMount() {
    this.props.fetchData()
  }
  componentDidUpdate(prevProps) {
    // This condition will never be true, as fetchData is a class property
    // and will always remain the same
    if (this.props.fetchData !== prevProps.fetchData) {
      this.props.fetchData()
    }
  }
}
Here you can clearly see how classes break the data flow and do not allow us to have functions as part of it, but hooks do – with hooks you can specify functions as dependencies, making them part of the data flow.
✅ Though there is one minor catch here, which you must understand. If we hadn’t wrapped fetchData in a useCallback with query as a dependency, fetchData would change on every render, which, when supplied as a dependency to useEffect in the child component, would needlessly trigger the effect over and over again. Not ideal, huh.
useCallback allows functions to fully participate in the data flow: whenever a function’s inputs (its dependencies) change, the function it returns changes; otherwise, it stays the same between renders.
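For reference, here is a rough sketch of the hooks version being described, rewriting the class example above (treat it as an illustration rather than the exact code):

import { useState, useEffect, useCallback } from "react"

function Parent() {
  const [query, setQuery] = useState("react")

  // fetchData only changes when `query` changes, so it can safely
  // participate in the data flow as an effect dependency.
  const fetchData = useCallback(() => {
    const url = "https://hn.algolia.com/api/v1/search?query=" + query
    // ... Fetch data and do something ...
  }, [query])

  return <Child fetchData={fetchData} />
}

function Child({ fetchData }) {
  useEffect(() => {
    fetchData()
  }, [fetchData]) // refetches whenever the parent's query changes

  return null // render the fetched data here
}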
More importantly, passing a lot of callbacks wrapped in useCallback isn’t the best choice and can often be avoided. As the React docs say –
We recommend to pass dispatch down in context rather than individual callbacks in props. The approach below is only mentioned here for completeness and as an escape hatch. Also, note that this pattern might cause problems in the concurrent mode. We plan to provide more ergonomic alternatives in the future, but the safest solution right now is to always invalidate the callback if some value it depends on changes.
TLDR
As discussed in the introduction, here are my answers, based on this understanding of the useEffect hook.
How to emulate componentDidMount with useEffect?
It’s not an exact equivalent of componentDidMount because of the differences between the class and hook models, but useEffect(fn, []) will cut it. As repeated throughout this post, useEffect captures the state and props of the render it belongs to, so you might have to put in some extra effort, e.g. with useRef, to read the latest value. Still, try to come to terms with thinking in effects rather than mapping class lifecycle concepts onto hooks.
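A small sketch of that extra effort – a made-up component using useRef so a mount-only effect can still read the latest value:

import { useState, useEffect, useRef } from "react"

function Ticker() {
  const [count, setCount] = useState(0)

  // Keep a mutable ref pointing at the latest count on every render.
  const latestCount = useRef(count)
  useEffect(() => {
    latestCount.current = count
  })

  // Runs once, like componentDidMount, but reads the latest value via the ref.
  useEffect(() => {
    const id = setInterval(() => {
      console.log("latest count:", latestCount.current)
    }, 1000)
    return () => clearInterval(id)
  }, [])

  return <button onClick={() => setCount(count + 1)}>{count}</button>
}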
How to emulate componentDidUpdate with useEffect?
Again, this does not exactly emulate componentDidUpdate, and I do not encourage thinking of hooks with the same philosophy as class lifecycle methods, but useEffect(fn, [deps]) is the replacement. You do need to specify the correct dependencies in the array; omitting or adding dependencies might result in bugs. To access previous or latest props, employ useRef.
How to access previousProps as we did in componentDidUpdate with useEffect?
Accessing previous or latest props means breaking out of the paradigm and takes a little more effort: employ useRef to keep track of previous or latest values of state and props. usePrevious, described here, is a widely used custom hook for the job.
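For completeness, here is the usual shape of that usePrevious hook (essentially the version from the React docs FAQ):

import { useEffect, useRef } from "react"

function usePrevious(value) {
  const ref = useRef()
  // After every render, stash the value for the *next* render to read.
  useEffect(() => {
    ref.current = value
  })
  // During render, this still holds the value from the previous render.
  return ref.current
}

// Usage: const prevCount = usePrevious(count)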
How to stop an infinite refetching loop?
Data fetching in useEffect is a routine use case, even though useEffect is not exactly made for the job (waiting for Suspense to become production-ready 💓). Meanwhile, this guide covers how to fetch data properly using hooks. You usually land in infinite-refetching territory when you don’t specify the second argument (deps) for the effect, which leads to the effect being triggered after every render. Give useEffect a proper set of dependencies to avoid the infinite refetching problem.
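A minimal sketch of the difference (the endpoint mirrors the earlier example; component names are made up):

import { useState, useEffect } from "react"

function Results({ query }) {
  const [data, setData] = useState(null)

  useEffect(() => {
    let cancelled = false
    fetch("https://hn.algolia.com/api/v1/search?query=" + query)
      .then(res => res.json())
      .then(json => {
        if (!cancelled) setData(json) // setting state re-renders, but the effect
      })                              // only re-runs when `query` changes
    return () => {
      cancelled = true
    }
  }, [query]) // omit this array and every setData-triggered render refetches

  return <pre>{JSON.stringify(data, null, 2)}</pre>
}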
Should functions be specified as dependencies to useEffect?
Yes, absolutely. If your functions use state or props, do wrap them in useCallback, as we saw above. Functions relying on state/props should be part of the data flow with hooks; otherwise, you can simply hoist them outside your component.
Closing Notes
This post is an attempt to make useEffect more understandable and to document my understanding of it. Most of the examples and inspiration come from the exhaustive A Complete Guide to useEffect by Dan Abramov, which you should definitely go through. I thank him for teaching me, and you, for taking the time to read this post.
Redux as a global state management solution has been around for a while now and has become almost the go-to way of managing state in React applications. Though Redux works with JS apps, the focus of this post will be React apps.
I’ve seen people add Redux to their dependencies without even thinking about it, and I’ve been one of those people. A small disclaimer: this is no rant about Redux being good or bad. Of course Redux is great; that’s why so many people have been using it for years now. The aim of this post is instead to share my opinions on where Redux makes sense and, more importantly, where it doesn’t.
React’s Context API has been around for some time now and it is good, and useReducer is great as well, but that doesn’t make Redux obsolete; Redux still makes sense to me. Let’s not make size the parameter for using (or not using) Redux. It is not about the size of the application, it is about the state. That is also where the pain begins: in large applications with multiple developers working on the same codebase, it is easy to abuse Redux. You are just one bad choice away from everyone pushing anything and everything into the Redux store. I’ve been one of those people, again.
“Hey! What do you mean by abusing Redux? It is meant to be a global data store right?”
Yes, it is meant to be a global data store, but the term ‘global data store’ is often translated into a store that holds every piece of state, value, data, and signal. That is wrong, and it is a slippery slope – it goes from 0 to 100 quite fast. Soon enough you will find yourself working on an application with an absolutely messed-up global state, where a newly onboarded developer doesn’t know which reducer to read data from because there are multiple copies and derived states.
People often get used to the fact that they have to make changes in three files whenever something changes. Why? That’s a pain we’ve simply grown accustomed to, and as the size or scope of the application increases, it only gets worse. Every change becomes incrementally more difficult because you don’t want to break existing things, and you end up abusing Redux further.
When that stage comes, we often like to blame React and Redux for the meaty boilerplate they ask you to write.
“It’s not my problem or my code’s problem, that’s just how Redux is…”
Yes and no. Redux definitely asks you to write some boilerplate code, but that’s not a bad deal until you overuse it. As is often said, nothing good comes easy, and the same applies to Redux. You have to follow a certain pattern that involves:
– Keeping the application state as plain objects. (Store)
– Maintaining changes in the system in the form of plain objects. (Actions)
– Describing the logic that handles those changes as pure functions. (Reducers)
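In concrete terms, even a single value typically touches all three pieces – a minimal sketch with plain Redux (the names here are made up):

import { createStore } from "redux"

// Action (plain object, usually produced by an action creator)
const setQuery = query => ({ type: "SET_QUERY", payload: query })

// Reducer (pure function describing how state changes)
const initialState = { query: "react" }
function appReducer(state = initialState, action) {
  switch (action.type) {
    case "SET_QUERY":
      return { ...state, query: action.payload }
    default:
      return state
  }
}

// Store (the plain-object state lives here)
const store = createStore(appReducer)
store.dispatch(setQuery("redux"))
console.log(store.getState()) // { query: "redux" }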
In my opinion, that’s not a very easy pattern to follow and introduce in your applications. The steepness of this curve should deter the abuse of Redux and should make you think before opting for Redux.
React gives you local state. Let’s not forget it, and use it as much as possible before looking for ‘global’ solutions, because if you pay close attention, most data in your Redux store is actually used by just one or two components. ‘Global’ means much more than one or two components, right?
Redux for state
Right, that’s what Redux is for – maintaining your global state. But pay close attention and you will find values that are required by just a single component, yet ended up in the store because you thought someone might need them in some other component in the future. So why not put in the effort to move this local data into Redux? That’s where we often go wrong, because the chance of someone later needing the same data in another component is really low, and even if that happens, the chances of duplicated data and derived Redux states are high. Over time, this practice of putting values into the Redux store unnecessarily becomes perfunctory. You eventually land in a big, stinking mess of reducers and state where nobody wants to touch anything; they would rather create a new reducer, for fear of causing regressions in god-knows-which component. I know proper code reviews and processes will keep the situation from getting that dire that fast, but eventually the morning will come when the realization strikes, and all you are left with is an insurmountable tech debt of fixing state management in an existing codebase with minimal regressions.
So heed my words – whenever you are thinking of going the Redux way with a piece of state, give it a good thought. Is that data really of ‘global’ use? Are there other (non-immediate-child) components that require it? If the answer is yes, Redux-land is the home for that data.
Redux for data fetching
Redux mostly gets looped into the data-fetching scene in the React world, but why? Why do you have to write an action and a reducer every time you want to make an API call? In huge applications it might make sense, but for small to medium projects it is overkill. Redux is meant to store values/data that might be used by multiple components in your application, possibly on different component trees. Data fetched from an API to fill up a single component isn’t really something that fits that definition, so why should it go through the store to reach the component in question?
Now, the first point against this line of thought would be:
We don’t want to pollute our components with data fetching logic…
I am all for clean and readable code, and readability principles demand that one should be able to comprehend what is happening in your code as easily as possible (opening as few files as possible). Keeping that in mind, I don’t think that having a component’s data-fetching logic inside that component is polluting it. It actually makes your code readable and understandable, as it is probably the most critical part of your component. The data it fetches, and how it fetches it, is quite conspicuous to the reader, isn’t it?
Moreover, I am not asking you to put raw fetch calls inside your components just as they are; there are a lot of great abstractions out there which are easy to use and do not take away the brevity and readability of your code.
Since Redux is a global store, we don’t have to fetch the data again…
Some of you might raise this argument. Well, most of us make API calls to fill up our components whenever the component “mounts”, and that data comes via Redux, right? So unless you have a proper invalidation mechanism to know when your Redux store needs to be repopulated with fresh data from an API call, you are going to make that API call on every “mount” anyway – so what are we saving there? And if what you really want is caching so that you can save on API calls, why not go with solutions built for that, instead of trying to mold Redux into a cache?
There are a lot of brilliant libraries out there to help you tidy up your data fetching. SWR, by the good folks at Zeit, is one amazing utility to get started with. If you want to take it up a notch, you can consider going with react-query by Tanner Linsley. Both of these are mostly built around render-time data fetching and provide great ways to optimize your data-fetching operations.
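To give a feel for how lean this can be, here is a minimal SWR sketch (the endpoint and component are made up; useSWR takes care of caching and revalidation):

import useSWR from "swr"

const fetcher = url => fetch(url).then(res => res.json())

function Profile() {
  const { data, error } = useSWR("/api/user", fetcher)

  if (error) return <div>Failed to load</div>
  if (!data) return <div>Loading...</div>
  return <div>Hello, {data.name}!</div>
}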
Sometimes you might need event-based data fetching – fetching data when some particular ‘event’ happens. For such cases, I authored a utility called react-str, which I’ve been using a lot, and it makes things quite nice and concise.
I’ve been walking this path of avoiding Redux for data fetching for some time now, and I am quite happy with the results. The most performant and most maintainable lines of code are the ones never written, and avoiding Redux in your data fetching can save you a lot of time and code with almost no harm.
Redux for signaling
At times in our applications, we need to do something based on the occurrence of some event somewhere else in the code. For example, I want to change the text inside a header component whenever the user scrolls beyond a certain limit in another component. Usually we solve this by maintaining a boolean in Redux that keeps track of whether the user has scrolled past that limit; we then watch that state value in our header, and when it changes, we change the header text. Sounds familiar? Or suppose you have a general confirmation modal which you might want to show at multiple places in your application: you can either render it in all those places or just put it in a parent component and toggle its visibility somehow. Again, you’ll be maintaining a set of actions and a reducer just for the modal’s visibility state, right? These cases might not be that frequent, but they are not rare either.
To reiterate, signaling flags or values like the one above are rarely ‘global’ and are required to be received by one or a few components at most. So why do we put these values in our so-called global store?
Maintaining an action, a reducer, and a single boolean value in Redux just to signal the occurrence of an event to another component seems like overkill to me. So I wrote an event emitter, which lets you fire events and handle them anywhere in your code without involving Redux.
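That utility isn’t reproduced here, but the idea can be sketched with a tiny generic emitter (names are my own, purely for illustration):

// A minimal pub/sub emitter – enough to signal events across components
// without putting a boolean flag in a global store.
const listeners = {}

export function on(event, handler) {
  listeners[event] = listeners[event] || []
  listeners[event].push(handler)
  // Return an unsubscribe function, handy as a useEffect cleanup.
  return () => {
    listeners[event] = listeners[event].filter(h => h !== handler)
  }
}

export function emit(event, payload) {
  const handlers = listeners[event] || []
  handlers.forEach(handler => handler(payload))
}

// In the scrolling component: emit("scrolled-past-limit")
// In the header component:    useEffect(() => on("scrolled-past-limit", handleIt), [])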
Redux for future-proof applications
In most cases, teams opt for Redux because some senior members of the team think they might need it in the future. This often becomes a very serious mistake later on, because it is not easy to roll back such decisions. You, and possibly your team, might be taking the future, and especially Redux as a mandatory part of the React world, too seriously. As Dan Abramov, co-creator of Redux, has said:
It’s just one of the tools in your toolbox, an experiment gone wild.
Redux is just another library, albeit a great one. Use it when you identify the need for it, not just because it is such a major name in the React ecosystem.
And if you love the pattern Redux enforces, you can follow it without Redux too; the pattern is called Flux, and you can read more about it here.
I think it is the right time to share some wisdom: nobody knows the future, not even the seniors – at least not all the time. The future depends on the present, and the decisions we make today shape tomorrow. It is okay to ask why before adding anything to your bundle. Remember, for front-end engineers the user is king, and every avoidable package you use adds to the bundle size, making it heavier and ultimately slower to load.
Maybe I am exaggerating things a bit, but this is not just about Redux. For every package you add to your project, think, think, think! Is it really needed? And how can you minimize its cost on the bundle size? A few tens of kilobytes might not seem like a lot, but on a mobile network in some faraway, developing land, that might be the deciding factor in whether a user is able to use your application or not. That’s all that matters, right?
Closing Notes
So, these are my thoughts on the current state (pun definitely intended) of Redux usage across React apps. The problems shared here are the ones I have observed first-hand on the projects I’ve worked on, and the solutions shared are the ones we used or developed to get out of these problems. Are they perfect? No, nothing is, but they definitely seem better at the moment.
I hope this post gave you some points worth your time on why you probably don’t need Redux. If you are going to start a new project soon, this might help you decide whether you really need it; and if you have an existing project with Redux, these practices can make your store quite lean and free your codebase of much boilerplate.
If the title seems familiar, it is derived (more unsolicited puns) from You Probably Don’t Need Derived State, an excellent official React blog post by Brian Vaughn. If you haven’t read it, do check it out!
Through this post, I would like to thank Rishabh Gupta for the ideas and learnings shared. This would not have been complete without his guidance and help.