A data lake [1] is a pool of data from multiple sources. It differs from a data warehouse in that it can store both structured and unstructured data, which can be processed and analyzed later. This eliminates a significant part of the overhead associated with traditional database architectures, which commonly involve lengthy ETL and data modeling when ingesting data (to impose schema-on-write).

 

With ever-growing amounts of data being collected and the need to leverage that data to build solutions and strategies, organizations face a major challenge in maintaining these massive pools of data and extracting valuable business insights from them. If the data in a data lake is not well curated, the lake can be flooded with random information that is difficult to manage and consume, turning it into a data swamp. Therefore, before going forward with a data lake, it is important to be aware of the best practices for designing, implementing, and operationalizing it.
 
Let’s look at the best practices that help build an efficient data lake.

 

 

Following the above best practices will help create and maintain a sustainable and healthy data lake. By devising the right strategy for collecting and storing data, one can reduce storage costs, make data access efficient and cost-effective, and ensure data security.

 


Data engineering [1] is the aspect of data science that focuses on practical applications of data collection and analysis. It focuses on designing and building pipelines that transport and transform data into a highly usable format. These pipelines can take data from a wide range of sources and collect it in a data warehouse or data lake that represents the data uniformly as a single source of truth. The ability to quickly build and deploy new data pipelines, or to easily adapt existing ones to new requirements, is an important factor in the success of a company's data strategy. The main challenge in building such a pipeline is to minimize latency and achieve a near real-time processing rate for high-throughput data.
 
Building a highly scalable data pipeline provides significant value to any company doing data science. So, here are a few important points to consider while building robust data pipelines:
 

  1. Pick the Right Approach
    The first and foremost thing is to choose appropriate tools and frameworks for building a data pipeline, as this choice has a huge impact on the overall development process. There are two extreme routes, and many variants in between, to choose from.
     

    • The first option is to select a data integration platform that offers a graphical development environment and fully integrated workflows for building ETL pipelines. This seems very promising but often turns out to be the tougher route, as such platforms can lack significant features.
    • Another option is to create a data pipeline using powerful frameworks like Apache Spark or Hadoop. While this approach implies a much higher upfront effort, it often turns out to be more beneficial, since the complexity of the solution can grow with your requirements.

     

    Apache Spark vs. Hadoop

    Big Data analytics with Hadoop and MapReduce was powerful but often slow, and it gave users a low-level, procedural programming interface that required writing a lot of code for even very simple data transformations. Spark has been found preferable to Hadoop for several reasons:
     

    • Lazy evaluation in Apache Spark saves time: transformations are not executed until an action triggers them, which allows Spark to plan the whole computation before running it.
    • Spark has a DAG execution engine that facilitates in-memory computation and acyclic data flow, resulting in high speed. Intermediate data can be cached in memory so that Spark does not have to fetch it from disk every time, saving further time.

     
     
     

    Spark was designed to be a Big Data tool from the very beginning. It provides  out-of-the-box bindings for Scala, Java, Python, and R.
     

    • Scala – It is generally good to use Scala for data engineering (read, transform, store). For implementing new functionality not found in Spark, Scala is the best option, as Apache Spark itself is written in Scala. Although Spark supports UDFs in Python well, there is a performance penalty, and diving deeper is not possible. Implementing new connectors or file formats in Python would be very difficult, maybe even unachievable.
    • Python – In the case of Data Science, Python is a much better option with all those Python packages like Pandas, SciPy, SciKit Learn, Tensorflow, etc.
    • R – It is popular for research, plotting, and data analysis. Together with RStudio, it is well suited to statistics, plotting, and data analytics applications. It is mainly used for building data models to be used for data analysis.
    • Java – It is the least preferred language because of its verbosity. Also, there is no interactive Spark shell (Read-Evaluate-Print Loop, or REPL) for Java, which is a major deal-breaker when choosing a programming language for big data processing.

     

  2. Working With Data
    Relational databases handle modest volumes of structured data well. However, as data starts increasing in volume and variety, the relational approach does not scale well enough for building Big Data applications and analytical systems. The major challenges include:
     

    • Managing different types and sources of data, which can be structured, semi-structured, or unstructured.
    • Building ETL pipelines to and from various data sources, which may lead to developing a lot of specific custom code, thereby increasing technical debt over time.
    • Having the capability to perform both traditional business intelligence (BI)-based analytics and advanced analytics (machine learning, statistical modeling, etc.), the latter of which is challenging to perform in relational systems.

     

    The ability to read and write from different kinds of data sources is unarguably one of Spark’s greatest strengths. As a general computing engine, Spark can process data from various data management and storage systems, including HDFS, Hive, Cassandra, and Kafka. Apache Spark also supports a variety of data formats like CSV, JSON, Parquet, Text, JDBC, etc.
     

  3. Writing Boilerplate Code
    Boilerplate code refers to sections of code that have to be included in many places with little or no alteration.
     

    • RDDs – When working with big data, programming models like MapReduce are required for processing large data sets with a parallel, distributed algorithm on a cluster.  But MapReduce code requires a significant amount of boilerplate. 
       
      This problem can be solved with Apache Spark's Resilient Distributed Datasets [2], the main abstraction for computations in Spark. Thanks to its simplified programming interface, the RDD unifies computational styles that were spread across the traditional Hadoop stack. It abstracts us away from traditional map-reduce style programs by exposing the interface of a (distributed) collection, so many operations that required quite a lot of boilerplate in MapReduce become simple collection operations, e.g. groupBy, join, count, distinct, max, and min.

    • CleanFrames – Data is everywhere nowadays and drives companies and their operations. Ensuring the data's correctness deserves a special discipline, known as data cleansing, which is focused on removing or correcting corrupt records, and it usually involves a lot of boilerplate code. CleanFrames [3] is a small library for Spark that makes data cleansing automated and enjoyable: simply import the required code and call the clean method. The clean method expands code through implicit resolution based on a case class's elements; the Scala compiler applies a specific method to each corresponding element's type. CleanFrames comes with predefined implementations that are available via a simple library import.

     

  4. Writing Simple And Clear Business Logic
    Business logic is the most important part of an application: it is where most changes occur and where the actual business value is generated. This code should be simple, clear, concise, and easy to adapt to changes and new feature requests.
     
    Some features offered by Apache Spark for writing business logic are – 
     

    • The RDD abstraction, with many common transformations like filtering, joining, and grouped aggregations, is provided by Spark's core libraries.
    • New transformations can be easily implemented with so-called user-defined functions (UDFs), where one only needs to provide a small snippet of code working on an individual record or column and Spark wraps it up such that it can be executed in parallel and distributed in a cluster of computers.
    • Using the internal developer APIs, it is even possible to go down a few layers and implement new functionality. This might be a bit complex, but it can be very beneficial for those rare cases that cannot be implemented using user-defined functions (UDFs).

    To sum up, Spark is preferred because of its speed: it is faster than most large-scale data processing frameworks. It supports multiple languages (Java, Scala, R, and Python) and offers a plethora of libraries, functions, and collection operations that help write clean, minimal, and maintainable code.

     


The COVID-19 pandemic has significantly accelerated the pace at which companies are migrating their application workloads to the cloud. To keep up with the increasingly digitally-enabled world, companies of all sizes are evolving the way they work, to drive agile product delivery and innovation.

 

Cloud-native technologies are positioned to be the key enablers that help companies meet the challenges of a digital-focused marketplace. They are a natural fit for IT organizations looking to deploy resilient and flexible applications that can be managed seamlessly from anywhere.

 

Growth of Cloud-Native     

 

Companies have gradually realized the need to move important components of their business infrastructure to the cloud. The business challenges cropping up due to the restrictions put in place for the COVID-19 pandemic have considerably reinforced this need. While many companies have opted to migrate their legacy systems to easy-to-manage and cost-effective cloud platforms, many are choosing to develop cloud-native apps. Unlike typical legacy applications that have to be migrated to the cloud, cloud-native ones are built for the cloud from day one. These apps tend to be deployed as microservices and run in containers, and they are typically managed using agile methodologies and DevOps practices.

 

There are several advantages a business can enjoy by going cloud-native, such as:

 

 

Technologies that are a perfect fit for cloud-native:

 

 

All companies, no matter their core business, have to embrace digital innovation to stay competitive and ensure continued growth. Cloud-native technologies give companies an edge over their market competition and let them manage applications at scale and with high velocity. Firms previously constrained to quarterly deployments of important apps can now deploy safely several times a day.

 


Transit Gateway is a highly available network gateway offered by Amazon Web Services (AWS). It eases the burden of managing connectivity between VPCs and from VPCs to on-premises data center networks. This allows organizations to build globally distributed networks and centralized network monitoring systems with minimal effort.

 

Earlier, the limitations of VPC peering meant it could not be used to connect VPCs to on-premises networks over VPN directly. The alternative, a transit VPC, required purchasing a VPN appliance from the AWS Marketplace and connecting all the VPCs to on-premises networks through it, which increased both cost and maintenance.
 


 

Advantages of AWS Transit Gateway

 
Cost comparison between Transit Gateway vs VPC peering
 

| Cost item | VPC Peering | Transit Gateway |
| --- | --- | --- |
| Cost per VPC connection | None | $0.05/hour |
| Cost per GB transferred | $0.02 ($0.01 charged to the sender VPC owner and $0.01 to the receiver VPC owner) | $0.02 |
| Overall monthly cost with 3 connected VPCs and 1 TB transferred | Connection charges: $0; Data transfer cost: $20; Total = $20/month | Connection charges: $108; Data transfer cost: $20; Total = $128/month |

 

Transit gateway design best practices:

 

In the architecture below, HashedIn's cloud organization has a master billing account; accounts for logging, security, and hosting networking infrastructure; a shared services account; three development accounts; and one production account. AWS Transit Gateway is the single point for all connectivity.
 
For each account, the VPCs are connected to the Transit Gateway via Transit Gateway attachments. Each account has a Transit Gateway route table associated with the appropriate attachment to direct traffic, and subnet route tables are then used to reach other networks. The transit gateway in the network account is connected to the on-premises data center and other networks.
 

 
Here are the steps to configure multiple AWS accounts with AWS Transit Gateway:

 
The three available options while creating Transit Gateway Attachment:
 
With a VPC attachment, an elastic network interface (ENI) is created in the selected availability zones. The TGW attachment should be created in all availability zones used by the VPC so that the TGW can communicate with resources through the ENI in the same availability zone.
 

 

 

Transit Gateway attachments are associated with a Transit Gateway route table, and multiple attachments can be associated with a single route table. Propagation dynamically populates a route table with the routes of an attachment, while an association determines which route table is used to route the traffic coming from an attachment.
 


 

In addition, Network Manager helps reduce the operational complexity of connecting remote locations and other cloud resources. It also acts as a centralized dashboard to monitor end-to-end networking activity in your AWS account.
 

 

 
Conclusion
 
The Transit Gateway is a centralized gateway through which we can manage AWS and on-premises networks from a single dashboard. It also helps simplify network architectures that were earlier complicated by managing inter-VPC connectivity and Direct Connect separately.

 

Henry Mintzberg, the Canadian author and academic, once said, "Corporations are social institutions. If they don't serve society, they have no business existing." To serve society implies keeping it safe and healthy. Even so, the dramatic collapse of major economies was inevitable: the UK is experiencing its first recession in 11 years, and India might have a similar fate, with GDP falling by a significant 23.9%. Technology has played a huge role in enabling individuals and organizations to rapidly adapt to our new COVID world. Many countries have developed tools to ensure public-health safety measures are disseminated and implemented, and technology is enabling people to be more in control of their own health and less dependent on medical professionals. Healthtech helps in identifying illnesses and treatment options, increasing access to medical information, and improving quality of life. Whether it's through telemedicine, remote monitoring, or workforce management, this blog looks at how businesses can get back on their feet.
 
Telehealth
 
The pandemic instilled fears that one would not have previously thought of: what if seeking medical help led to more health complications? Although laden with skepticism at first, the overburdening of the healthcare system due to the coronavirus led to the rise of telemedicine all over the world. Moreover, research shows that more than 80% of doctors expect to use telehealth at the same or a greater level than they do now. [1]
In India, on the same day as the nation-wide lockdown began, the Telemedicine Practice Guidelines were issued, making it legal for registered medical practitioners to provide healthcare services using digital technologies, such as chat, audio, and video for a diagnosis. A recent report2 found that at least 50 million Indians opted for online healthcare during the pandemic, out of which 80% were first-time telemedicine users. Moreover, 44% of these users were from non-metro cities. According to a report by McKinsey Global Institute, India could save up to $10 billion by 2025, if telemedicine services could replace 30-40 percent of in-person consultations.3
 
Taking this opportunity, many healthtech organizations and experts voluntarily created an app for free teleconsultation. One such app is SWASTH – An effective telemedicine platform for quality COVID related healthcare access in India. It has over 2,000 registered doctors on its database, which can provide audio/video teleconsultation; other features include the listing of fever clinics, order medicines online, multiple languages, etc.
 
IoMT | Wearable Devices
 
Like telemedicine, another tech solution that can complement and enable well-being from a distance is connected wearable health devices and remote patient monitoring (RPM) technologies. Setting aside the cases where hospital readmission is truly necessary, the Medicare system spends an estimated $17 billion a year on avoidable readmissions that could be mitigated by early detection, intervention, and better at-home care. [4]
 
Behind all of this is the Internet of Medical Things (IoMT), which will allow cloud-connected medical devices for improved patient compliance, detecting device failures before they become critical, and collecting data for more personalized therapy.5 As internet penetration increases around the world, and wireless capabilities and computing power improve, many IoMT devices are changing the healthcare landscape. With the ability to track and prevent chronic illnesses, connected devices will be able to collect, analyze, and transmit health data for both patients and healthcare professionals.
 
Workforce Management
 
COVID's impact on businesses has been immense, and transitioning to a post-COVID era will require relying heavily on digital technologies for self-assessment, contact tracing, and syndromic surveillance. Initially, the situation was framed as a trade-off between the economy and the virus, but it was never a real debate: for economies to regain their momentum, the coronavirus has to be defeated. The world can't come to a standstill; and although remote working is encouraged and has become the "new normal", many organizations are eagerly waiting for things to settle down so they can resume work in a staggered manner.
 
As organizations resume their operations, ensuring the well-being of their employees is going to be their priority. Technology can play a role in effectively managing the workforce while complying with all the government protocols. One such innovation from HashedIn Technologies, called Heed, is a platform that allows corporations to get back on their feet despite the ongoing pandemic. Some of its unique features include contact tracing, employee self-assessment, alerts, employee/visitor rostering and access, security, cloud syncing, etc.
 
Healthtech is here to stay
 
As Amina J. Mohammed, UN Deputy-Secretary-General said, there is no exit strategy until the vaccine is found.6 However, innovation in the healthtech space is enabling us to contain the virus and assuring us that we will be better equipped to adapt and fight off a health crisis at such a scale through technology. The accelerated adoption of digitization in the healthcare sector has helped us to cope with the repercussions of COVID. Healthtech has not only reduced the impact of the pandemic but has also increased access, bridging the urban and rural healthcare divide, while improving personalized patient care. Healthtech solutions are here to stay, long after the world has recovered from this pandemic.
 


A progressive web application (PWA) is software built using common web technologies like HTML, CSS, and JavaScript that is intended to run on any platform with a standard browser. It offers functionality such as working offline, push notifications, and device hardware access to create a native-application-like experience. PWAs make a web application act and feel like an app by applying different technologies: using progressive enhancement, new capabilities are enabled in modern browsers through web app manifest files and service workers. With those technologies, a PWA can close the gap between a classic web application and a desktop or native mobile application.

 

This article talks about how we can enable PWA features to make our app installable, i.e. add the app to the home screen, while caching files such as CSS, JS, and data responses for a native-app-like experience.

 

To enhance a classic web application with PWA features, two essential building blocks must be added to the application, a web app manifest file and a service worker. Let’s look into the purpose of these two elements.

 

Web App Manifest
 
The purpose of the web app manifest file is to provide information about the web application. The JSON structure of this file can contain information such as the name, display mode, icons, and description. This information is used to install the web application to the home screen of a device, giving the user convenient access and an overall app-like experience.

 

Following the Web Application Manifest specification (https://www.w3.org/TR/appmanifest/), we add a reference to the manifest file in the index.html file and create the file itself with the desired properties.
 
After adding the web app manifest file, we must add a service worker to our application as well.
 

 


 

Service Worker
 

A service worker is a JavaScript file that is added to the project and registered in the browser. The service worker script has the capability to run in the background and perform tasks like push notifications and background sync. To use this API we need to follow three steps:

  1. Register a Service Worker.
  2. Install a Service Worker.
  3. Update a Service Worker

Typically, this can be achieved by writing JavaScript code and listening to events like install, activate, and fetch. But to simplify the job, the steps below let it all run with just a CLI command.
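For comparison, manually registering a hand-written service worker would look roughly like the sketch below (the file name sw.js is a placeholder, not part of the Angular setup):

// Manual registration sketch – the Angular CLI automates this step for us.
if ('serviceWorker' in navigator) {
  window.addEventListener('load', () => {
    navigator.serviceWorker
      .register('/sw.js') // placeholder path to a hand-written worker script
      .then(reg => console.log('Service worker registered with scope:', reg.scope))
      .catch(err => console.error('Service worker registration failed:', err));
  });
}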
 
Steps To Add Service Worker in Angular App
 
To set up the Angular service worker in a project, use the CLI command ng add @angular/pwa:
 

Running the above command does the following:

1. Adds the @angular/service-worker package to your project.
 

 

2. Enables service worker build support in the CLI.
3. Imports and registers the service worker in the app module (see the sketch after this list).
 

 
4. Updates the index.html file to include the manifest.webmanifest file.
5. Creates the service worker config file "ngsw-config.json". For a detailed description of its properties, visit https://angular.io/guide/service-worker-config.
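For reference, the registration added in step 3 typically looks like the sketch below, using the CLI's default generated worker file name ngsw-worker.js (the project described in this post refers to its worker script as "SW-master.js", so the exact name may differ):

// app.module.ts (excerpt) – registration added by `ng add @angular/pwa`
import { NgModule } from '@angular/core';
import { ServiceWorkerModule } from '@angular/service-worker';
import { environment } from '../environments/environment';

@NgModule({
  imports: [
    // Register the worker generated by the Angular CLI at build time,
    // enabled only for production builds.
    ServiceWorkerModule.register('ngsw-worker.js', { enabled: environment.production }),
  ],
})
export class AppModule {}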

 

 
After the setup is complete, we structure the "SW-master.js" file, which is referenced in the app module. This file contains the script that will be generated while building the app in production mode.
 

 
This completes the service worker setup, and our app now qualifies as a PWA. The app caches all the files mentioned in "ngsw-config.json" under the asset group and data group. Wondering where it stores all the cached assets? Your origin (domain name) is given a certain amount of free space that is shared between all origin storage, i.e. local storage, IndexedDB, and the Cache API. The amount available to service workers isn't specified, but rest assured that the space is enough to make the site work offline.
 
However, the service worker doesn't automatically update the app when a new feature is added. To address this, the service worker module provides a few services that can be used to interact with the service worker and control the caching of the app.
 
The SwUpdate service gives us access to events that fire when an update is available for the app. This service can be used to regularly check for updates so that the app can track the latest changes and reload to apply them.
 
This check can be done at various stages of an app. For simplicity, we did this in the "app.component.ts" file, in the OnInit lifecycle hook.
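A minimal sketch of such a check, based on the SwUpdate API from @angular/service-worker, could look like this (the available observable used here exists in the Angular versions current at the time of writing; newer versions expose versionUpdates instead):

// app.component.ts (sketch) – checking for and applying service worker updates
import { Component, OnInit } from '@angular/core';
import { SwUpdate } from '@angular/service-worker';
import { interval } from 'rxjs';

@Component({ selector: 'app-root', templateUrl: './app.component.html' })
export class AppComponent implements OnInit {
  constructor(private swUpdate: SwUpdate) {}

  ngOnInit(): void {
    if (!this.swUpdate.isEnabled) {
      return; // service worker not enabled (e.g. a development build)
    }
    // Check for a new version as soon as the app starts, then every 6 hours.
    this.swUpdate.checkForUpdate();
    interval(6 * 60 * 60 * 1000).subscribe(() => this.swUpdate.checkForUpdate());

    // When an update is available, prompt the user and reload to apply it.
    this.swUpdate.available.subscribe(() => {
      if (confirm('A new version of the app is available. Load it?')) {
        this.swUpdate.activateUpdate().then(() => document.location.reload());
      }
    });
  }
}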
 

 
The above piece of code creates an instance of the SwUpdate API and checks for any available update as soon as the app starts. If there is an update, it prompts the user about it.
 
Summary
 
It can be concluded that by adding a manifest file and enabling a service worker with a few simple steps, we configured our app to behave like a native application. This doesn't eliminate the need for native applications, as some capabilities are still out of the web's reach, but new and upcoming APIs are looking to change that, and these capabilities are built with the web's secure, user-centric permission model, ensuring that going to a website is not a scary proposition for users.
 
To check the app in action, visit https://tractorbazar.com/ and click on the 'add app' option. You can notice that the app is added like a native app, yet it remains lightweight.
 
Installed Progressive Web Apps run in a standalone window instead of a browser tab. They’re launchable from the user’s home screen, taskbar, or shelf. It’s possible to search for them on a device and jump between them with the app switcher, making them feel like part of the device they’re installed in. 

 

Using the latest web features to bring native-like capabilities and reliability, Progressive Web Apps allow what you build to be installed by anyone, anywhere, on any device, from a single lightweight codebase.

 

Picture yourself in possession of a sample web application and an HA (multi-zone) Kubernetes cluster, and in need of high availability for the application using that cluster. The first move any K8s expert would make is to deploy the application onto the nodes, but there is no assurance that each node has at least one pod running. This blog gives you a clear solution to the above scenario.
 
Kubernetes allows selective placement of new pods using affinities. These can be good solutions to common HA problem statements like –

  1.  Running n pods on each node
  2.  Ignoring a certain node group for a group of pods
  3.  Preferring specific regions, AZs, or nodes for auto-scaled pods

 
Let’s discuss all these in detail. Before proceeding please find below some of the common terminologies that would be used in this blog.

So, a rule of podAntiAffinity with SOFT scheduling will do the task here!
 
First, let's have a look at the deployment below, defined by a YAML file that uses podAntiAffinity with a replica count of 3.
 

Deployment with soft podAntiAffinity:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      run: nginx
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: nginx
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: run
                  operator: In
                  values:
                  - nginx
              topologyKey: failure-domain.beta.kubernetes.io/zone
            weight: 100
      containers:
      - image: nginx
        name: nginx
        resources:
          limits:
            memory: "200Mi"
            cpu: "200m"
          requests:
            memory: "100Mi"
            cpu: "100m"
status: {}

 
This anti-affinity rule tells the scheduler to avoid running two pods with the label run=nginx in the same topology domain (the zone, per the topologyKey above). Because it is a soft (preferred) rule, pods are still scheduled even when it cannot be satisfied. This deployment was deployed to an HA cluster with 3 master nodes.
 

Result:

As you can see here, each pod is deployed to a different node (master1, master2, and master3).
 

 
We have already seen what happens with the initial deployment; now let's see what happens while scaling it.
 
Scale the replica to a higher count, let’s say 6!
 
kubectl scale deployment nginx --replicas=6

Result:

As you can see, the next set of 3 pods was also distributed evenly over the nodes; each node now has 2 nginx pods running.
 

 
Will the same work with HPA (Horizontal Pod Autoscaling)?
 
Configure the HPA and create a load generator pod which hits the application endpoint with multiple requests so that it can trigger the scaling.
 
Now as you can see the newly launched pods are successfully distributed among the nodes.

Result:

 

 
podAntiAffinity is therefore an easy and effective solution for highly available deployments. We hope this blog has helped you understand the importance of pod (anti-)affinity.
 

RedMask is a command-line tool to mask sensitive data in a relational database. Currently, it supports Redshift and Postgres. This tool helps organizations to reduce the time for sharing data, building data warehouse solutions, ensuring security, and reducing compliance costs.

 

Why Data Masking?

 

Masking is an essential use case when huge volumes of data are managed. It is a critical step when dealing with personal or commercially sensitive and identifiable information. The data should be protected while being shared with other internal or external people or agencies for consumption. Here are a few examples where masking would be necessary:

 

  1. FinTech/BFSI organization – Sharing customer, product, and transaction details. 
  2. Healthcare industry – Sharing patient data, diagnosis, and health-related information. 
  3. E-commerce organization – Sharing the details of the consumer and product database.
  4. Transportation and urban logistics companies – Sharing user and location data.

 

How RedMask Works?

 

Administrators can mask data using a variety of techniques. RedMask uses native database queries to mask data and manage permissions, and thus has minimal impact on performance. RedMask supports both dynamic and static masking.

 

Static Masking: 

In static masking, a new table is created with the selected columns masked. This increases storage costs. This technique is suitable for data sets that do not change often.

 

Dynamic Masking:

In dynamic masking, RedMask creates a view that masks the desired columns. The view has the same name and columns as the underlying table but is in a different schema. When a query is executed, the data warehouse picks either the table or the view, depending on the search path/default schema.

RedMask creates a masked view for data consumers with lower privileges, based on the settings. These consumers see the masked data instead of the real data. In addition, RedMask supports a dry-run mode, wherein it just generates a SQL file with the required queries while making no changes to the database. This allows database administrators to verify the underlying queries.

 

What Makes RedMask Different?

 

RedMask addresses the challenges faced while using other masking tools in the market.

 

 

Building the RedMask Application

 

Step 1) Clone the RedMask repository using git clone https://github.com/hashedin/redmask.git

 

Step 2) Install Gradle for your respective operating system.

 

Step 3) Run the following commands

 

Step 4) Your application will be built in the following folder build/graal/redmask.

 

Masking Data on Redshift

 

Let us take the following customer table for masking data:

 

 

 

Now we will mask the Name and the Age field with the STRING_MASKING and RANDOM_INTEGER_WITHIN_RANGE masking rules respectively.

 

Step 1) Create a JSON File as config.json.

 

 

 


DB_SUPER_USER = <Administrator_username>  

DB_SUPER_USER_PASSWORD = <Administrator_user_password>  

DB_USER = <user_name>  

DB_USER_PASSWORD = <user_password>

 

Step 2) Run the RedMask CLI command

 


./redmask -f=/<path_to_json_file>/config.json -r=false -m=static

 

where:

 

With just that, you get your masked table:

 

 

To run it in dynamic mode, we just need to change the mode flag to -m=dynamic in the CLI command:

 

./redmask -f=/<path_to_json_file>/config.json -r=false -m=dynamic

 

 

Note that the masked view is created under a schema named after the username entered in the config file.

The user will only be able to access this view when they query this particular table.

 


When React Hooks were released at the React Conference 2018, they took the React community all across the globe by storm. Finally, there was a refreshing and fundamental change in the way everyone thinks about and writes React code. One of the attractive (yet mysterious?) aspects of using hooks was the brevity they imparted to projects. As everyone was pumped up about giving these all-new hooks a spin, so was I. Sure, useState, useMemo and useContext look neat and are fairly simple to understand. But there, in plain sight is probably one of the most critical hooks – useEffect.

 

useEffect is supposed to be an all-in-one replacement for a bunch of old lifecycle methods like componentDidMount and componentDidUpdate, to name a few, and hence getting the right understanding of the useEffect hook, and hooks in general, is paramount to writing consistent, clean, and bug-free data-driven React applications.

 

Here are some of the questions you might have after employing the useEffect hook, which we'll seek answers to:

 

How to emulate componentDidMount with useEffect?

How to emulate componentDidUpdate with useEffect?

How to access previousProps as we did in componentDidUpdate with useEffect?

How to stop an infinite refetching loop?

Should functions be specified as dependencies to useEffect?

In case you are looking for answers to these questions real quick, check out the TLDR; section at the end.

 

Understanding Rendering

Functional components differ from traditional class components in a prominent way: they close over state and props. In more simplified terms, each 'render' owns a separate set of state and props. That might not be that simple to understand, so let's go with a simple example –

 


function Counter() {
  const [count, setCount] = useState(0)
  return (
    <div>
      <p>You clicked {count} times</p>
      <button onClick={() => setCount(count + 1)}>Click me</button>
    </div>
  )
}



Every time the user clicks the button, setCount updates the state, which triggers a re-render with the new value. Initially, count is 0; when the first click happens, the state updates via setCount, calling the Counter function again with a new value of count, which is 1. For the second render, Counter sees its own value of count. So count is constant within each render, and its value is isolated between different renders. This is not only true for state values but also for props, event handlers, functions, and pretty much everything. Here is another example to clarify this –

 


function Greeting() {
  const [name, setName] = useState("")

  function handleGreetClick() {
    setTimeout(() => {
      alert("Hey " + name)
    }, 5000)
  }

  return (
    <div>
      <p>Name is {name}</p>
      <input type="text" onChange={event => setName(event.target.value)} />
      <button onClick={handleGreetClick}> Greet </button>{" "}
    </div>
  )
}



Okay, let's say we type 'Dan' in the input and click the greet button, then we enter 'Sunil' and click the greet button again before 5 seconds have passed. What message will the greeting alert show? Will it be 'Hey Dan' or 'Hey Sunil'? The latter one, right? No. Here's the sandbox for you to give it a spin and see the difference between class components and functional components.

 

So why is it so? We’ve discussed this above. Each render owns the state for functional components which is not true for class components. Next time someone asks you the difference between class and functional components, do remember this.

 

Although, you should totally follow Dan and Sunil on Twitter!

 

Each render owns its State, Props, Effects …Everything.

As we saw above, each render owns its state, props, and pretty much everything including effects.

 


function Greeting() {
  const [name, setName] = useState("")

  React.useEffect(() => {
    document.title = `Hey ${name}`
  })

  return (
    <div>
      <p>Name is {name}</p>
      <input type="text" onChange={event => setName(event.target.value)} />
    </div>
  )
}



In the example above, as the user types a name in the input field, setName gets triggered and sets the new name in the state, leading to a re-render. How does the effect know about the latest value of name? Is it some sort of data binding? Nope. We learned earlier that state is isolated per render. Functions, event handlers, and even effects can 'see' only the values from the render they belong to. So the correct mental model is: it is not a changing name inside an unchanging effect; the effect itself is also different for each render.

 

Let’s set this straight. Every function inside a component render sees the state, props particular to that render only.

 

So, the eventual question in your mind would be: how can I access state and props from the next or previous renders, just like in class components? Well, it would be going against the flow, but it is definitely possible using refs. Be aware that when you want to read future props or state from a function in a past render, you're intentionally going against the flow. It's not wrong (and in some cases necessary), but it might look less "clean" to break out, and that's intended: it highlights which code is fragile and depends on timing. In classes, it's less obvious when this happens, which can lead to strange bugs in some cases.

 


function Greeting() {
  const [name, setName] = useState("")
  const latestName = useRef()

  useEffect(() => {
    // Set the mutable latest value
    latestName.current = name
    setTimeout(() => {
      // Read the mutable latest value
      console.log(`You entered ${latestName.current}`)
    }, 5000)
  })

  return (
    <div>
      <p>Name is {name}</p>
      <input type="text" onChange={event => setName(event.target.value)} />
    </div>
  )
}

This code will always log the latest value of name, behaving like classes. So it might feel a bit strange but React mutates this.state the same way in classes.

 

What about the ‘cleanup’ Function?

Going through the docs, you will find mentions of a certain cleanup function. The cleanup function does exactly as it says: it cleans up an effect. Where would it be required, you might ask? A cleanup function makes sense in cases where you have to unsubscribe from subscriptions or, say, remove event listeners.

 

Many of us have a slightly wrong mental model regarding the working of cleanup functions. Borrowing the example from the React docs:

 


useEffect(() => {
  ChatAPI.subscribeToFriendStatus(props.id, handleStatusChange)
  return () => {
    ChatAPI.unsubscribeFromFriendStatus(props.id, handleStatusChange)
  }
})

If props.id is 1 on the first render and 2 on the second, how will this effect work?

subscribe with id = 1

unsubscribe with id = 1

then

subscribe with id = 2

unsubscribe with id = 2



That is the synchronized model we derive from our understanding of classes. Well, it doesn't quite hold true here, because you are assuming that the cleanup 'sees' the old props since it runs before the next render, and that's not how it happens. We have to understand that React runs an effect only after the browser has painted, making apps feel faster as effects no longer block painting, so why should the cleanup function block the next render? Thankfully it doesn't. The previous effect is cleaned up after the re-render with new props.

 

You must be a little uncomfortable with this. How can cleanup function still see ‘old’ props once the next render has happened? Well, reiterating our learnings from above –

 

EVERY FUNCTION INSIDE A COMPONENT, INCLUDING EFFECTS AND TIMEOUTS, CAPTURES THE STATE AND PROPS OF THE RENDER.

 

So no 'new' props and state are available to the cleanup function; it sees only the ones available to the other functions and effects of its own render. Hence cleanups don't have to run right before the next render; they can run after the next render as well.

 

Telling React When to Run an Effect

This is pretty much in line with how React itself works: it learned the lesson that instead of changing the whole DOM on every change, it should only update the parts that actually need updating. There are times when running an effect is not necessary, so how do you instruct React on whether to run an effect or not?

 


function Greeting() {
  const [counter, setCounter] = useState(0)
  const [name, setName] = useState("Tejas")

  useEffect(() => {
    document.title = `Hello, ${name}`
  })

  return (
    <h1 className="Greeting">
      Hello, {name}
      <button onClick={() => setCounter(counter + 1)}> Increment </button>{" "}
    </h1>
  )
}

In this example, when you click the increment button, the state updates and the effect fires unnecessarily, even though the name it uses hasn't changed. So probably React should diff the old effect and the new effect to see if anything has changed? No, that can't happen either, because React can't magically tell the difference between two functions without executing them, which defeats the purpose, right?

 

To get out of this problem, the useEffect hook accepts a second argument: an array of dependencies. The dependency array, or 'deps', signifies what the effect depends on. If the dependencies are the same between renders, React understands that the effect can be skipped.
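Applied to the document-title example above, declaring the dependency looks like this:

useEffect(() => {
  document.title = `Hello, ${name}`
}, [name]) // the effect is skipped on renders where `name` has not changed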

 

Being Honest About Effect Dependencies

Dependency array or ‘deps’ are critical to bug-free usage of useEffect. Giving incomplete or wrong dependencies can certainly lead to bugs. The statement to internalize here is –

 

ALL VALUES FROM INSIDE YOUR COMPONENT, USED BY THE EFFECT MUST BE SPECIFIED IN THE DEPENDENCY ARRAY.

This is of great relevance, and we as developers often lie to React about the dependencies of an effect. Often unintentionally, owing to our reliance on the class lifecycle model, we still try to emulate componentDidMount or componentDidUpdate. We need to unlearn that paradigm before using hooks, because as we have seen so far, hooks are quite different from classes, and carrying the baggage of the class model onto hooks can lead to inaccurate mental models and wrong concepts that can be hard to shake.

 

Being honest about dependencies is quite simple if you follow these two conventions –

 

Include all the values inside the component that are used inside the effect.

 

OR

 

Change the effect code so that you don't need to specify that dependency anymore.

The mental model of hooks generated by our usage of classes sways us against considering functions as part of the data flow, and hence as dependencies of an effect, which is not right. In reality, classes obstruct us from considering functions as part of the data flow, and that can be understood better with this example –

 


class Parent extends Component {
  state = {
    query: "react",
  }
  fetchData = () => {
    const url = "https://hn.algolia.com/api/v1/search?query=" + this.state.query
    // ... Fetch data and do something ...
  }
  render() {
    return <Child fetchData={this.fetchData} />
  }
}

class Child extends Component {
  state = {
    data: null,
  }
  componentDidMount() {
    this.props.fetchData()
  }
  render() {
    // ...
  }
}

Now, if we want to refetch in the child based on changes in query, we cannot just compare fetchData between renders, because it is a class property and will always be the same, so the comparison would never trigger a refetch. We have to take another, unnecessary step: passing query as a prop to Child and comparing the value of query between renders in componentDidUpdate.

 


class Child extends Component {
  state = {
    data: null,
  }
  componentDidMount() {
    this.props.fetchData()
  }
  componentDidUpdate(prevProps) {
    // This condition will never be true, as fetchData is a class property and will always remain the same
    if (this.props.fetchData !== prevProps.fetchData) {
      this.props.fetchData()
    }
  }
}



Here you can clearly see how classes break out and do not allow us to have functions as part of the data flow, but hooks do, because with hooks you can specify functions as a dependency, hence making them part of the data flow.

 


function Parent() {
  const [query, setQuery] = React.useState("react")

  const fetchData = useCallback(() => {
    const url = "https://hn.algolia.com/api/v1/search?query=" + query
    // ... Fetch data and return it ...
  }, [query])

  return <Child fetchData={fetchData} />
}

function Child({ fetchData }) {
  let [data, setData] = useState(null)

  useEffect(() => {
    fetchData().then(setData)
  }, [fetchData]) // Effect deps are OK

  // ...
}



✅ Though there is one minor catch here, which you must understand. If we hadn't wrapped fetchData in a useCallback with query as a dependency, fetchData would change on every render, which, when supplied as a dependency to useEffect in the child component, would needlessly trigger the effect over and over again. Not ideal, huh.

 

useCallback allows functions to fully participate in the data flow. Whenever function inputs or dependencies change, the output function changes, else it remains the same.

 

More importantly, passing a lot of callbacks wrapped in useCallback isn’t the best of choices and can be avoided. As React docs say –

 

We recommend to pass dispatch down in context rather than individual callbacks in props. The approach below is only mentioned here for completeness and as an escape hatch. Also, note that this pattern might cause problems in the concurrent mode. We plan to provide more ergonomic alternatives in the future, but the safest solution right now is to always invalidate the callback if some value it depends on changes.

 

TLDR

As discussed in the introduction, here are my answers from the understanding of useEffect hook.

 

How to emulate componentDidMount with useEffect?

It's not an exact equivalent of componentDidMount because of the differences between the class and hook models, but useEffect(fn, []) will cut it. As told over and over again, useEffect captures the state and props of the render it is in, so you might have to put in some extra effort to get the latest value, i.e. useRef. Although, try to come to terms with thinking in terms of effects rather than mapping concepts from class lifecycles to hooks.

 

How to emulate componentDidUpdate with useEffect?

Again, this does not exactly emulate componentDidUpdate, and I do not encourage thinking of hooks with the same philosophy as class lifecycle methods, but useEffect(fn, [deps]) is a replacement. You do need to specify correct dependencies in the array; omitting or adding dependencies might result in bugs. To access previous or latest props, employ useRef.

 

How to access previousProps as we did in componentDidUpdate with useEffect?

Accessing previous or latest props means breaking out of the paradigm and takes a little more effort: employ useRef to keep track of previous or latest values for state and props. usePrevious is a widely used custom hook for the job.
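For reference, a common usePrevious implementation is roughly the following sketch:

function usePrevious(value) {
  const ref = useRef()
  useEffect(() => {
    // store the value after each render, so during the next render
    // `ref.current` still holds the previous one
    ref.current = value
  })
  return ref.current
}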

 

How to stop an infinite refetching loop?

Data fetching in useEffect is a routine use case, even though useEffect is not exactly made for the job (waiting for Suspense to become production-ready 💓). Meanwhile, this guide covers how to fetch data properly using hooks. You usually enter infinite-refetching land when you don't specify the second argument (deps) for an effect that fetches data and sets state, which leads to the effect being triggered after every render. Give useEffect a proper set of dependencies to avoid the infinite refetching problem.
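As a quick sketch of the fix (the component, prop, and rendering here are illustrative):

function Results({ query }) {
  const [data, setData] = useState(null)

  useEffect(() => {
    fetch("https://hn.algolia.com/api/v1/search?query=" + query)
      .then(res => res.json())
      .then(setData)
  }, [query]) // without this array, setData would re-trigger the effect after
              // every render, causing an infinite fetch loop

  return null // render `data` here ...
}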

 

Should functions be specified as dependencies to useEffect?

Yes, absolutely. If your functions use state or props, do wrap them in useCallback, as we saw above. Functions relying on state or props should be part of the data flow when using hooks; otherwise, you can just hoist them outside your component.

 

Closing Notes

This post is an attempt to make useEffect more understandable and to document my understanding of it. Most of the examples and inspiration come from the exhaustive "A Complete Guide to useEffect" by Dan Abramov, which you should definitely go through. I thank him for teaching me, and you, for taking the time to read this post.

 

Redux as a global state management solution has been around for a while now and has become almost the go-to way of managing state in React applications. Though Redux works with JS apps, the focus of this post will be React apps.

 

I've seen people add Redux to their dependencies without even thinking about it, and I've been one of those people. A small disclaimer: this is no rant about Redux being good or bad. Of course Redux is great; that's why so many people have been using it for years now. The aim of this blog is instead to share my opinions on the use of Redux, where it makes sense, and more importantly where it doesn't.

 

React's Context API has been around for some time now and it is good, and useReducer is great as well, but that doesn't make Redux obsolete; Redux still makes sense to me. Let's not make size the parameter for using (or not using) Redux: it is not about the size of the application, it is about the state. That is also where the pain begins. In large applications with multiple developers working on the same codebase, it is easy to abuse Redux. You are just one bad choice away from everyone pushing anything and everything into the Redux store. I've been one of those people, again.

 

“Hey! What do you mean by abusing Redux? It is meant to be a global data store right?”

 

Yes, it is meant to be a global data store, but the term "global data store" is often translated as a state that holds every value, piece of data, and signal. That is wrong, and it is a slippery slope that goes from 0 to 100 quite fast. Soon enough you will find yourself working on an application with an absolutely messed-up global state, where, when a new developer onboards, they don't know which reducer to read data from because there are multiple copies or derived states.

 

People often get used to the fact that they have to make changes in three files whenever something changes. Why? That's a pain we've grown accustomed to, and as the size or scope of the application increases, it only gets worse. Every change becomes incrementally more difficult because you don't want to break existing things, and you end up abusing Redux further.

 

When that stage comes, we often like to blame React and Redux for the meaty boilerplate they ask you to write.

 

“It’s not my problem or my code’s problem, that’s just how Redux is…”

 

Yes and No. Redux definitely asks you to write some boilerplate code, but that’s not a bad deal until you overuse it. As often said, nothing good comes easy, and the same applies to Redux. You have to follow a certain way that involves:

 

– Keeping the application state as plain objects. (Store)

– Describing changes in the system in the form of plain objects. (Actions)

– Describing the logic that handles state changes as pure functions. (Reducers)
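For illustration, the classic minimal version of this pattern looks something like the sketch below:

import { createStore } from "redux"

// Action creator: changes are described as plain objects
const increment = () => ({ type: "counter/increment" })

// Reducer: a pure function describing how state changes in response to actions
function counterReducer(state = { value: 0 }, action) {
  switch (action.type) {
    case "counter/increment":
      return { value: state.value + 1 }
    default:
      return state
  }
}

// Store: holds the application state as a plain object
const store = createStore(counterReducer)
store.dispatch(increment())
console.log(store.getState()) // { value: 1 }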

In my opinion, that’s not a very easy pattern to follow and introduce in your applications. The steepness of this curve should deter the abuse of Redux and should make you think before opting for Redux.

 

React gives you a local state. Let’s not forget it and use it as much as possible before looking for ‘global’ solutions, because if you pay close attention, most data in your Redux store is actually just used by one or two components. ‘Global’ means much more than one or two components, right?

 

Redux for state

Right, that's what Redux is for: maintaining your global state. But pay close attention and you will find values that are required by just a single component, which you put in Redux because you thought someone might need them in the future in some other component. That's where we often go wrong, because the chance of someone requiring the same data in another component later is really low, and even if that happens, the chances of duplicated data and derived Redux states are high. Over time, this practice of putting values unnecessarily in the Redux store becomes perfunctory. You eventually land in a big, stinking mess of reducers and states where nobody wants to touch anything; they would rather create a new reducer, for fear of causing regressions in, god knows which, component. I know proper code reviews and processes will not let the situation get that dire that fast, but the morning will come when the realization strikes, and all you are left with is an insurmountable tech debt of fixing state management in an existing codebase with minimum regressions.

 

So heed my words – whenever you are thinking of going the Redux way with a state, give a good thought to it. Is that data really of ‘global’ use, are there other (non-immediate child) components which require that data? If the answer is yes, Redux-land is the home for that data.

 

Redux for data fetching

Redux mostly gets looped into the data fetching scene in the React world, but why? Why do you have to write an action and a reducer every time you want to make an API call? In huge applications it might make sense, but for small to medium projects it is overkill. Redux is meant to store values/data that might be used by multiple components in your application, possibly on different component trees. Data fetching from an API to fill up a component isn't really something that fits that definition. Why should data fetched from an API, to be used by a single component, go through the store on its way to the component in question?

 

Now, the first point against this line of thought would be:

 

We don’t want to pollute our components with data fetching logic…

 

I am all for clean and readable code, and readability principles demand that one should be able to comprehend what is happening in your code as easily as possible (opening as few files as possible). Keeping that in mind, I don't think having a component's data-fetching logic inside that component is polluting it. It actually makes your code readable and understandable, as it is probably the most critical part of your component. The data it fetches, and how it fetches it, is quite conspicuous to the reader, isn't it?

 

Moreover, I am not asking you to put your fetch calls inside your components just as they are; there are a lot of great abstractions out there which are easy to use and do not take away the brevity and readability of your code.

 

Since Redux is a global store, we don’t have to fetch the data again…

 

Some of you might have this as an argument. Well, most of us make API calls to fill up our components whenever the component "mounts", and that data comes via Redux, right? So unless you have a proper mechanism to know whether your Redux store needs to be repopulated with fresh data, you are going to make that API call on every "mount" anyway, so what are we saving there? And if what you really want is caching, so that you can save on your API calls, why not go with solutions built for that, instead of trying to mold Redux into your cache?

 

There are a lot of brilliant libraries out there to help you tidy up your data fetching. SWR by the good folks at Zeit is one amazing utility to get started with. If you want to take it a notch up, you can consider going with react-query by Tanner Linsley. Both of these are mostly based around render-time data fetching and provide great ways to optimize your data-fetching operations.
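For instance, a minimal SWR usage looks roughly like this (the endpoint and component are illustrative):

import useSWR from "swr"

// SWR handles caching, deduplication, and revalidation for us
const fetcher = url => fetch(url).then(res => res.json())

function Profile() {
  const { data, error } = useSWR("/api/user", fetcher)
  if (error) return <div>Failed to load</div>
  if (!data) return <div>Loading...</div>
  return <div>Hello, {data.name}</div>
}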

 

Sometimes you might need event-based data fetching – fetching data when some particular ’event’ happens. For such cases, I authored a utility called react-str, which I’ve been using a lot by now and it makes things quite nice and concise.

 

I've been walking this path of avoiding Redux for data fetching for some time now, and I am quite happy with the results. The most performant and most maintainable lines of code are the ones not written, and avoiding Redux in your data fetching can save you a lot of time and code with almost no harm.

 

Redux for signaling

At times in our applications, we need to do something based on the occurrence of some event somewhere else in the code. For example, I want to change the text inside a header component whenever the user scrolls beyond a certain limit in another component. Usually we solve this by maintaining a Redux state, a boolean keeping track of whether the user scrolled past that limit or not; then we watch for that state value in our header, and when it changes we change the header text. Sounds familiar? Or suppose you have a general confirmation modal which you might want to show at multiple places in your application: you can either call it in all those places or just put it on a parent component and toggle its visibility somehow. Again you'll be maintaining a set of actions and a reducer for the modal visibility state, right? These cases might not be that frequent, but they are not rare either.

 

To reiterate, signaling flags or values like the one above are rarely ‘global’ and are required to be received by one or a few components at most. So why do we put these values in our so-called global store?

 

Maintaining an action, a reducer, and a single boolean value in Redux just to signal the occurrence of an event to another component seems overkill to me. So I wrote an event emitter, which lets you fire events and handle them anywhere in your code, without involving Redux.
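As a rough illustration of the idea (this is not the actual react-str API; the names here are hypothetical):

// tiny-emitter.js – a hypothetical, minimal event emitter
const handlers = {}

export function on(event, handler) {
  if (!handlers[event]) handlers[event] = []
  handlers[event].push(handler)
  // return an unsubscribe function, handy in a useEffect cleanup
  return () => {
    handlers[event] = handlers[event].filter(h => h !== handler)
  }
}

export function emit(event, payload) {
  (handlers[event] || []).forEach(h => h(payload))
}

// Usage: the scrolling component calls emit("scrolled-past-limit"), and the
// header subscribes with on("scrolled-past-limit", handler) instead of routing
// a boolean flag through the Redux store.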

 

Redux for future proof applications

In most cases, teams opt for Redux because some senior members of the team think they might need it in the future. This often becomes a very serious mistake later on, because it is not easy to roll back such decisions. You, and possibly your team, might be taking the future, and especially Redux as part of the React world, too seriously. As Dan Abramov, co-creator of Redux, has said:

 

It’s just one of the tools in your toolbox, an experiment gone wild.

 

Redux is just another library, albeit a great one. Use it when you identify the need for it, not just because it is such a major name in the React ecosystem.

 

And if you are influenced by and love the pattern Redux enforces, you can enforce it without Redux as well; the pattern is called Flux and you can read more about it here.

 

I think it is the right time to share some wisdom: nobody knows the future, not even the seniors, all the time. The future depends on the present, and the decisions we make today are reflected tomorrow. It is okay to ask why before adding anything to your bundle. Remember, for front-end engineers the user is king, and every avoidable package you use adds to the bundle size, making it heavier and ultimately slower to load.

 

Maybe I am exaggerating things a bit, but this is not just about Redux. For every package you add to your project, think, think, think! Is it really needed? And how can you minimize the cost to bundle size? A few tens of kilobytes might not seem a lot, but on a mobile network in some faraway, developing land, that might be the deciding factor in whether a user is able to use your application or not. That's all that matters, right?

 

Closing Notes

So, these are my thoughts on the current state (pun definitely intended) of Redux usage across React apps. The problems shared here, are the ones I have observed first hand on the projects that I’ve worked on. Also, the solutions shared are the ones we used/developed to get out of these problems. Are they perfect? No, nothing is, but they definitely seem better at the moment.

 

I hope this post gave some points worth your time on why you probably don’t need Redux. If you are going to start a new project soon, this might help in deciding if you really need it, or if you have an existing project with Redux, you can make your store quite lean and free your codebase of much boilerplate by employing these practices.

 

If the title seems familiar, it is derived (more unsolicited puns) from – You probably don’t need derived state – this excellent official React blog post by Brian Vaughn, if you haven’t read it, do check it out!

 

Through this post, I would like to thank Rishabh Gupta for the ideas and learnings shared. This would not have been complete without his guidance and help.
