Kousick S
#Frontend | 14 Min Read

While designing and working on a number of React applications, I have always wanted my team and myself to write unit tests, and over time I was able to learn, write, and improve the way we write test cases as a team.

There isn't a standard way or rule book on how (and what) to unit test with React and Redux, so this blog talks about a few patterns that I have started using to unit test my applications. The examples in this series cover the breadth of how to write unit tests.

What we’ll cover by the end of this series

  • Important libraries that we'd use to unit test
  • Testing do's and don'ts in a React component
  • Identifying the “Component Contract” to test with an example
  • Testing redux asynchronous action creators
  • Testing redux reducers
  • Configuring Jest code coverage report

Important libraries with React for Unit Tests

Jest

Jest is the testing library written by Facebook for React apps. Though there are other libraries that help us unit test React, Jest stands out for the following reasons.

  • Easy setup; comes built in with create-react-app
  • Code coverage out of the box – no additional libraries required
  • The assertion module uses Jasmine-style matchers, making it familiar to most JavaScript developers

There is a lot more to it, and you may read about it here.

Enzyme

Enzyme is a jQuery-like testing utility for React written by Airbnb. Enzyme helps you traverse, assert on, and manipulate React components. Most importantly, it lets you render components either shallowly or fully while running tests. Again, you may spend some time reading about it here.

More about setting up your project with Jest and Enzyme can be learned from the Setting Up section of this article.

Much of how I describe writing unit tests in this series has been inspired by this article.

Configuring code coverage

Jest ships code coverage as part of its long list of utilities. We only have to configure Jest to generate the coverage report the way we want it. Here is a basic setup for getting code coverage going, while detailed documentation on how to set it up can be read here.


"jest": {
  "collectCoverage": true,
  "collectCoverageFrom": [
    "src/**/*.{js,jsx}"
  ],
  "coverageDirectory": "/coverage/",
  "coveragePathIgnorePatterns": ["/build/", "/node_modules/"],
  "setupFiles": [
    "/config/polyfills.js"
  ],
  "testPathIgnorePatterns": [
    "[/\\\\](build|docs|node_modules|scripts)[/\\\\]"
  ],
  "testEnvironment": node,
  "testURL": "http://localhost",
  "transform": {
    "^.+\\.(js|jsx)$": "/node_modules/babel_jest",
    "^.+\\.css$": "/config/jest/cssTransform.js",
    "^(?!.*\\.(js|jsx|css|json)$)": "/config/jest/fileTransform.js"
  },
  "transformIgnorePatterns": [
    "[/\\\\]node_modules[/\\\\].+\\.(js|jsx)$"
  ],
  "moduleNameMapper": {
    "^react_native$": "react-native-web"
  }
},

What to test in a React Component

The first thing that came to mind when I decided to write unit tests was deciding what to test in the whole component. Based on my research and trials, here is an idea of what you would want to test in your component. Having said that, this is not a rule; your use case may want to test a few aspects outside of this list as well.

On a general note, if we are unit testing, we should be testing a dumb component (a plain React component) and not a connected component (a component connected to the Redux store).

A component connected to the store can be tested both as a connected component and as a dumb component; to do this, we export the component class itself as a named (non-default) export alongside the connected default export. Testing connected components is generally an integration test.
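A minimal sketch of what that looks like (the file and component names here are illustrative, not from the example application discussed later):

// MyComponent.js – export the plain class as a named export for unit tests
export class MyComponent extends Component { /* ... */ }
// ...and export the connected component as the default export for the app
export default connect(mapStateToProps, mapDispatchToProps)(MyComponent);

// MyComponent.test.js – import the unwrapped (dumb) component in unit tests
import { MyComponent } from './MyComponent';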

Test what the component renders by itself and not child behavior

It's important to test all the direct elements that the component renders by itself; at times, this might be nothing. Also, ensure you test the elements rendered by the component that are not dependent on the props passed to it. This is why shallow rendering is recommended.

Test the behavior of the component by modifying the props

Every component receives props, and the props are often the deciding attributes of how the component renders or interacts. Your test cases can pass different props and test the resulting behavior of the component.
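As a quick illustration, here is a minimal sketch using Enzyme's setProps; the JobList component and its props are hypothetical placeholders, not part of the example application discussed later:

it("renders one list item per job passed in `jobs`", () => {
    // JobList is a hypothetical dumb component that renders an <li> per job
    const wrapper = shallow(<JobList jobs={[]} />);
    expect(wrapper.find('li').length).toBe(0);
    // Changing the props should change what the component renders
    wrapper.setProps({ jobs: [{ title: 'Frontend Engineer' }] });
    expect(wrapper.find('li').length).toBe(1);
});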

Test user interactions, thus testing component internal methods

Components are bound to have user interactions, and these interactions are handled either through props or through methods internal to the component. Testing these interactions, and thereby the component's internal (private) methods, is essential.

What not to test in a React Component

What to test was fairly straightforward. We also need to be well aware of what not to test in a React component as part of unit tests.

Do not test PropTypes and Library functions

It does not make sense to test library functions or PropTypes; these are functionalities that should be tested as part of the framework or library itself.

Do not test style attributes

Styles for a component tend to change. Testing the styles of a component adds little value and is not maintainable: when styles change, the test cases are bound to change with them.

Do not test default state or state internal to a component

It's not important to test the state internal to the component; it is present by default and gets tested indirectly when we test user interactions and the methods internal to the component.

Identifying the “Component Contract”

When we start writing test cases for a component, deciding what to test and what not to test becomes much easier once we understand the "Component Contract".

To explain how we identify the contract, let’s discuss with an example.

Consider the following page, which is a component called ReferralJobs.

Component Contract

Here is the code snippet showing how this component is written:


export class ReferralJobs extends Component {
    constructor(props) {
        super(props);
        this.state ={pageNumber: 1, showReferDialog: false}
    }
    componentDidMount() {
        let data = {"pageNumber": this.state.pageNumber, showReferDialog: false};
        this.props.getReferralJobs(data);
    }
    searchJob = (data) => {
        let defaultData = {"pageNumber": this.state.pageNumber};
        if(data !== defaultData) {
        this.props.getReferralJobs(data);
        }
    }
    handleReferJobDialog = (Job_Posting_ID ,JobTitle) => {
        let currentState = this.state.showReferDialog;
        this.setState({showReferDialog: !currentState});
        this.setState({jobId: Job_Posting_ID});
        this.setState({jobTitle: JobTitle});
    }
    referJob = (data) => {
        this.props.referAFriend(data);
    }
    render() {
        return (
            <div>
                <ReferralSearch searchJob={this.searchJob} />
                <h2>Suggested jobs to refer</h2>
                <Grid container>
                    {this.props.jobs.tPostingJobLists && this.props.jobs.tPostingJobLists.map((job, index) => {
                        return (
                            <Paper key={index}>
                                <h3>{job.JobTitle}</h3>
                                <span>{job.MinExperience} - {job.MaxExperience} Years</span>
                                <span>{job.State}</span>
                                <p>{job.ShortDesc}</p>
                                <a className="details-link" onClick={() => this.handleReferJobDialog(job.Job_Posting_ID, job.JobTitle)}>Refer a Friend</a>
                                <a href={job.shareLink}>Share</a>
                            </Paper>
                        );
                    })}
                </Grid>
                <PortalappDialog open={this.state.showReferDialog}>
                    <ReferJob jobId={this.state.jobId} jobTitle={this.state.jobTitle} referJob={this.referJob} />
                </PortalappDialog>
            </div>
        );
    }
}
const mapDispatchToProps = (dispatch, ownProps) => { return bindActionCreators({ getReferralJobs: getReferralJobsAction, referAFriend: getReferAFriendAction }, dispatch); };
const mapStateToProps = (state, ownProps) => { return { jobs: state.referralsState.ReferralJobs }; };
export default connect(mapStateToProps, mapDispatchToProps)(ReferralJobs);

The component is composed of three parts: the Search, the Job Posting container, and the PortalApp Dialog. Let's identify the contract for this component and write test cases for it.

Search is Always Rendered

The Search container is always rendered and is not conditional on any of the props, and it accepts a prop as its click handler.

This can be divided into two pieces

  • Search Container is always rendered no matter what props are being passed to the container
  • A click handler is passed as a prop, and we need to test that the prop has been set

Let’s write a couple of test cases for the same


describe ("ReferralJobs", () => {
         let props;
         let mountedReferralJobs;
         const referralJobs = () => {
         if (!mountedReferralJobs) {
             mountedReferralJobs = shallow();
         }
     return mountedReferralJobs;
  }
  const referralJobsMounted = () => {
     if (!mountedReferralJobs) {
         mountedReferralJobs = mount();
     }
  return mountedReferralJobs;
}
beforeEach(() => {
    props = {
        jobs: {tPostingJobLists: []},
        getReferralJobs: jest.fn
    };
    mountedReferralJobs = undefined;
});
it("Always renders a `ReferralSearch`", () => {
    expect(referralJobs().find(ReferralSearch).length).toBe(1);
});
it("sets the rendered `ReferralSearch`'s `onClick` prop to the same value as `getReferralJobs`'", () => {
    expect(referralJobsMounted().props().getReferralJobs).toBe(props.getReferralJobs);
});

There are two test cases within a test suite in the above code snippet. Let’s first try to understand how the test suite is configured.

Our describe method initializes a test suite, and it can contain any number of test cases within it. It also provides various setup and teardown hooks such as beforeEach, afterEach, etc.; read more here.

Within our describe block we have two helper functions, one called referralJobs and the other referralJobsMounted. These represent the two ways in which we can render our component onto the virtual DOM during testing.

Shallow Rendering

Shallow rendering is the most widely used form of rendering when writing unit tests with Enzyme. It renders the component only one level deep and does not care about the behavior of child components. Use it when you only care about what the component of interest itself renders.

Mount

Mount is a form of rendering that renders the complete component, including its children, into a DOM and returns an instance of the component. This helps in cases where we need to test component-level props.

In our above snippet, we are also passing default props to the component so that the ReferralJobs component renders successfully.

Moving on to our test cases, we have two, matching our identified contract: one to verify that the search component is rendered, and the other to verify that the prop set on the component is the same prop we passed in when rendering it.

As you may notice, we use the Jasmine-style expect library to make assertions, and it comes built in with Jest.

We pass jobs as props and ‘n’ jobs to be rendered as Paper components

The main functionality of the component is to display jobs that are given to it as props. This can again be separated into two parts.

  • ‘n’ jobs have to be rendered, as Paper objects within the Grid Container
  • Each job must display the right information on its card

To test this we have two test cases, and we pass an object containing two jobs as props in our test case.


describe("when `jobs` is passed", () => {
    beforeEach(() => {
        props.jobs = {
            "tPostingJobLists": [
             {
              "RowID":1,
              "MaxExperience":0,
              "MinExperience":0,
              "GCI_Rec_ID":0,
              "Job_Posting_ID":4215,
              "JobTitle":"Java develper -Test",
              "Skill_Required":"java oracle",
              "City":"",
              "State":"Arunachal Pradesh",
              "Country":"India",
              "Area_Code":"10000",
              "ShortDesc":"test descriptuin test descriptuin test descriptuin v test descriptuin test descriptuin test descriptuin vtest descriptuin test descriptuin test descri....",
              "shareLink":"http://apps.portalapp.com/jobs/4215-Java-develper--Test-jobs",
              "ImagesLink":"http://appst.portalapp.com/eep/Images/app_0007.jpg"
             },
             {
              "RowID":1,
              "MaxExperience":0,
              "MinExperience":0,
              "GCI_Rec_ID":0,
              "Job_Posting_ID":4215,
              "JobTitle":"Java develper -Test",
              "Skill_Required":"java oracle",
              "City":"",
              "State":"Arunachal Pradesh",
              "Country":"India",
              "Area_Code":"10000",
              "ShortDesc":"test descriptuin test descriptuin test descriptuin v test descriptuin test descriptuin test descriptuin vtest descriptuin test descriptuin test descri....",
              "shareLink":"http://apps.portalapp.com/jobs/4215-Java-develper--Test-jobs",
              "ImagesLink":"http://appst.portalapp.com/eep/Images/app_0007.jpg"
             }
         ]
     };
});
/**
* Tests that the count of Grids rendered is equal to the count of jobs
* in the props
*/
it("Displays job cards in the `Grid`", () => {
    const wrappingGrid = referralJobs().find(Grid).first();
    expect(wrappingGrid.children().find(Paper).length).toBe(props.jobs.tPostingJobLists.length);
});
/**
* Tests that the first job paper has rendered the right
* job title
*/
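The second test asserts on the job title. A minimal sketch of it, assuming the title is rendered inside an h3 within each Paper as in the component snippet above, might look like this:

it("Displays the job title of the first job", () => {
    // Pick the first rendered job card and read the title element inside it
    const firstJobTitle = referralJobs().find(Grid).first().find(Paper).first().find('h3');
    expect(firstJobTitle.text()).toBe(props.jobs.tPostingJobLists[0].JobTitle);
});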

On clicking Refer a Friend component state must change

Each job card has a "Refer a Friend" link; on clicking this link, the click handler changes the state and a dialog opens.

If you read the component code for ReferralJobs, the dialog has an open prop whose value comes from the component state. When this value is true, the dialog is expected to open.

We need to test that clicking the "Refer a Friend" link changes the state, and then check whether the dialog's open prop is set to true.


/**
* Tests that clicking on refer a friend changes the state of the PortalApp dialog.
* The ReferJob pop-up itself cannot be tested here, as Enzyme's shallow rendering does not
* render inner components.
*
* This illustrates how we can test component-level methods
*/
it("opens a `PortalAppDialog` and shows `ReferJob` on clicking on refer a friend link", () => {
    const firstJob = referralJobs().find(Grid).first().find(Paper).first();
    const referAFriend = firstJob.find('.details-link');
    expect(referAFriend.length).toBe(1);
    const referDialogBeforeClick = referralJobs().find(PortalappDialog);
    expect(referDialogBeforeClick.first().prop('open')).toBe(false);
    const referLink = referAFriend.first();
    referLink.simulate('click');
    const referDialogAfterClick = referralJobs().find(PortalappDialog);
    expect(referDialogAfterClick.first().prop('open')).toBe(true);
  });
});

In the above snippet we have introduced the simulate functionality. simulate is an Enzyme function that helps you trigger events on elements.

In our ReferralJobs component code, the click handler for the "Refer a Friend" link is an internal component method. Thus, by performing this test we are also testing the internal private method of the component.

Components form the core of a React application. I hope this helped you understand how and what to unit test in a component of a React application.

Testing Redux Asynchronous action creators

Assuming that you know what action creators are, let me move on to asynchronous action creators. Your actions may make asynchronous HTTP requests to fetch data, and testing them is important. Action creators are responsible for dispatching the right actions to the Redux store.

In our ReferralJobs application, we have used axios to make HTTP requests. You may alternatively use any library (e.g., fetch) to make your HTTP requests. In order to test this action creator, we have to mock the HTTP request and the response. We also need to mock the Redux store that handles the actions.
For this, we have used redux-mock-store to mock the store and axios-mock-adapter to mock our axios HTTP requests. (If you are using anything other than axios to perform HTTP requests, consider using nock.)


const middlewares = [thunk];
const mockStore = configureMockStore(middlewares);
axios.defaults.baseURL = 'https://appst.portalapp.com/';
const mock = new MockAdapter(axios);

In our test case, we are simply using the libraries and setting up the mocks. We will create a mock store for testing the action creators.


const middlewares = [thunk];
const mockStore = configureMockStore(middlewares);
axios.defaults.baseURL = 'https://appst.portalapp.com/';
const mock = new MockAdapter(axios);
/**
* Test Suite to test Asynchronous actions
* This test case mocks the API request made by axios with the help of axios-mock-adapter
*/
describe('Testing Async Actions', () => {
    afterEach(() => {
        mock.reset();
    })
    it('creates GET_REFERRAL_JOBS when fetching referral jobs has been done', done => {
        mock.onPost(getReferralJobsURL).reply(200, { data: { todos: ['do something'] } });
        const expectedActions = [
            {"type":"FADE_SCREEN"},
            {"type":"GET_REFERRAL_JOBS","payload":{"todos":["do something"]}},
            {"type":"STOP_FADE_SCREEN"}
        ]
        const store = mockStore({ todos: [], loginState: {accessToken: "sampleAccessToken" }})
        store.dispatch(getReferralJobsAction({}));
        /**
         * Setting a timeout as axios and axios-mock-adapter don't play well together.
         * We should move from axios to fetch if we don't have a specific reason to use axios.
         */
        setTimeout(() => {
            expect(store.getActions()).toEqual(expectedActions);
                done();
        }, 1000)
    })
})

Once our mockStore is ready, in our test case we mock the HTTP request that is expected to be made and give it a sample response.

When an action creator is called, it is expected to dispatch actions to the store and this is a known object defined as expectedActions in our test case.

We then dispatch the action creator on our store with empty data; the action creator is now expected to make an HTTP request and dispatch the necessary actions to the store.

We run the assertions after a timeout since the HTTP request is an asynchronous call (this is a workaround for axios-mock-adapter; if you are using nock, the timeout is not required).

Testing Redux Reducers

Testing reducers is straightforward and does not depend on any setup. Any reducer, when called with an action, must compute the new state and return it. Hence we only have to assert on the returned state object.

In our test case, we pass a payload along with the action type and expect the state to change to reflect that payload.
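The reducer itself is not shown in this post, so here is a minimal sketch of what such a test might look like; the referralsReducer name and its state shape are assumptions based on the component code above:

describe('referrals reducer', () => {
    it('stores the payload on GET_REFERRAL_JOBS', () => {
        // referralsReducer and this state shape are illustrative assumptions
        const initialState = { ReferralJobs: {} };
        const payload = { tPostingJobLists: [{ JobTitle: 'Java developer' }] };
        // Calling the reducer directly with a state and an action returns the new state
        const newState = referralsReducer(initialState, { type: 'GET_REFERRAL_JOBS', payload });
        expect(newState.ReferralJobs).toEqual(payload);
    });
});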

Redux handles the state and data flow within a React application; writing unit tests for its action creators and reducers brings a lot of value to your application's quality. I hope this article helped you.

Sripathi Krishnan
#Business | 3 Min Read

New Employee Welcome Kit

Yesterday was my first day at HashedIn. My onboarding experience was so good that I had to write and share it with the world.

I have been designing consumer-facing experiences since 2004 and I can very well gauge the level of subjective satisfaction a candidate gets during his onboarding “experience”. The first day at work is just like your first date. It is the first impression that either breaks or drives your dreams. If your first day at work is great, you know that you have a great start in the organization. Believe me, when I say this, HashedIn scores a perfect 10/10 on this aspect.

It all started with HR meeting me sharp at the mentioned time. Time being an important, finite resource, I could feel how much the company values it. All the statutory form-filling was done in less than 30 minutes, something that generally takes hours in other organizations.

HR briefed me on the important pointers essential for the first two days of induction as well as for the long run. She gave me all the information I needed and checked with me regularly to see if I was up to speed or needed further clarification.

An Agile, Fun and Flexible Culture

My experience with IT folks had never been this hassle-free. While people joining other firms run from pillar to post to get their machines configured, I was up and running in less than 15 minutes with a brand-new MacBook Pro, with all the necessary software installed and running. My new bank account was created in just under 10 minutes!

There were some really cool goodies in the Welcome Kit, like the Hasher's T-shirt, a cool badge, the company handbook, etc. My buddy and I were given matching T-shirts, and I was able to quickly relate to the pre-designed experience without anyone having to explain it to me. I felt my learning got a fresh start.

New Employee Welcome Kit

With my buddy detailing the business models, clientele and operating mechanisms, I was able to get all the important information quickly without skimming through pages and pages of process documents. All my queries were answered, and I felt really excited to get started with my project. He took me around and introduced me to almost 100 other Hashers. Everyone across the four widely spread floors was warm and welcoming. The work culture here is so positive that I instantly felt like a part of the HashedIn team and got super excited to work with everyone.

New Employee CEO Designers Connect

Towards the end of the day, it was time for chai with the CEO. Yes! You heard it right. I had a fruitful interaction with the CEO of the company on the very first day at work. I also had the chance to interact with all of my fellow designers in the company. We shared interesting stories and exchanged cool ideas. Time flew by, and I wished the day would never come to an end.

With an open and fun work culture like this, I am looking forward to coming to work every day and experiencing professional happiness like never before.

HashedIn, you rock!

A Big shout-out to fellow Hashers Monica, Thanmayi, Ashish, Azhar and our CEO Himanshu for creating the best onboarding experience ever!

Executive Summary
Our client’s organization was founded by members of the Berkeley AI Research Lab with families of patients affected with Dementia. The objective was to reduce the frequency and impact of falls which has been the leading cause of hospitalization in Alzheimer’s care. Their existing platform was unable to scale and monitor at a facility level. It also lacked the ability to tag the recordings of fall videos to allow the facilities team to prevent future incidents.
Problem Statement
Our client was seeking to enhance their existing platform and reduce the complexity of the features present in it. They wanted to scale their application through deployment at multiple facilities and integrate it with complementary platforms in the space. Their existing platform lacked a triggering mechanism which could serve as a warning before the patient could get entangled in any untoward incident. It was also of great necessity that the application possessed the capability to work along with the AI module by capturing and analyzing the video when an incident occurs and triggering an alarm to alert the concerned authority.
HashedIn’s Proposal
HashedIn suggested performing an assessment of their code base and developing a roadmap of technological initiatives that could be executed. Based on the assessment results, HashedIn proposed to re-architect and refactor their cloud services by moving away from the tightly coupled Django and Angular 1.x setup. HashedIn intended to separate the UI from the backend with APIs and use industry best practices along with proprietary frameworks to restructure the code in a short period.
Business Requirements
Our client needed their existing AI-based patient monitoring application to be refactored to make it deployable across multiple centres, with a central monitoring system that allows any potential fall to be quickly tagged and alerts the facilities so they can quickly attend to patients in need of immediate attention. The videos of potential falls need to be tagged for study so that preventive measures can be taken by the facilities team to ensure patient safety.
Impact and involvement of stakeholders
Facility Manager – Will be able to take prompt actions based on the tagged videos
Alarm Taggers – They will be involved in tagging the video captured by the AI modules, which are located in Old age homes
Facilities for patients suffering from mental disorders
Solution Approach
Our Solution Structure
We introduced a queuing mechanism using RabbitMQ to improve the alarm tagging activity, store the alarms, and distribute them among alarm taggers based on their availability.
The RabbitMQ module helps the alarm taggers get a summary of the videos waiting to be tagged, which helps them prioritize their upcoming work.
The database was optimized and indexing was introduced to improve the performance of the existing APIs in fetching data from the database.
We also proposed and implemented record keeping of various activities that can be used in MIS reporting.
We migrated the UI code from Angular 1 to Angular 7 to introduce new features.
The client's earlier UI code was based on Angular 1 and Django and used server-side rendering, which made the application slow. We helped migrate the application from Angular 1 to Angular 7 and used REST APIs to make the application faster and more efficient.
We built a web interface through which the alarm taggers can toggle between different alarms.
Our web interface enabled alarm taggers to monitor the list of pending alarms of different types from the same screen and prioritize their tagging activity accordingly.
We introduced functionality that triggers a notification when an alarm tagger's session expires; a notification is also triggered if they are inactive, and the alarm is re-queued with a higher priority.
Technology Stack
Back end: Django(Python), RabbitMQ
Front end: AngularJS, Angular 7
Business Outcomes
  • This solution helped facilities facing a shortage of staff manage it effectively.
  • The solution helped reduce the number of falls by 40% and avoid injuries to patients suffering from mental disabilities.
  • Our solution helped reduce the time taken in the video tagging activity and reduced the frequency of emergency room visits by 70%.
  • Our solution helped improve the efficiency of the video tagging activity by reducing the number of clicks involved between different events (reduced from a 3-click task to a single-click task to ease the process).
  • We also gave managers the ability to track the efficiency and productivity of the alarm taggers.
Executive Summary
The customer is a recently founded fintech company based out of the USA which is both an investment manager and a SaaS provider. They intend to build a product to assist financial analysts at investment firms in trading stocks. The goal of this product is to provide customers a managed portfolio of stocks which are predicted to give the best returns for the foreseeable future. The returns for each company are predicted using a machine learning algorithm.
Problem Statement
The client needed a product/solution to advise financial analysts investing in stocks in the US market, predicting the best-performing stocks from the Russell 1000 index in order to maximize their customers' profits by building portfolios of stocks from the index.
Business Requirements
End Objective
To provide an application that can predict stock returns for each stock against the performance of the Russell 1000 index and create a portfolio of the predicted best-performing stocks in the index, giving financial analysts the insight to buy/sell stocks based on predicted performance.
Key Requirements
Historical Data for the stocks in the index
Understanding of securities and options.
A user interface for visual representation of calculated portfolios and the metrics associated with the generated portfolios
Impact and involvement of stakeholders
Large investment management firms
AI-based forecasting of alpha also helps smaller or non-specialized institutions and individual investors.
Solution Approach
Our Solution Structure
All data imported from various data-sourcing companies is stored in a framework using resources in a cloud infrastructure. In this case, we extensively use Amazon Web Services (AWS).
A custom-built Machine Learning model using Long Short Term Memory (LSTM) Neural networks, which predicts the returns of a given stock within a stipulated time frame.
A service to regularly train the existing Models based on daily updated data
A custom-built selection Algorithm for building a portfolio by judging the predicted returns from the models based on several parameters.
A user interface for viewing the results
We used TensorFlow and Keras libraries to develop an LSTM Neural network factoring in the following inputs:
Historical Prices of the stocks, their history of splits, spinoffs, etc…
External variables (For example. Oil Prices, Gold Prices, etc…)
CBOE options data (Stock Volatilities)
Custom created Variables that track the actual performance of a company without relying solely on the prices data.
Among others…
Solution Dynamics and Interactions
Financial prediction problems are a difficult class of predictive modeling. The complexity comes from the fact that the dependencies between factors vary over time. To tackle this problem we need a recurrent neural network, which can learn the patterns in which the stock market moves and predict based on them.
We have more than 50 input factors which are fed into our model. The LSTM model is trained on GPU-accelerated machines. We predict the excess return over the Russell 1000; based on the predicted excess return, a portfolio is generated. The portfolio holds the top N (adjustable by users as per preference) companies that can outperform the index.
User Interface
We built a web interface which creates and displays a portfolio for different time periods which are limited to the backtesting period.
The user can change the financial index (benchmark) against which they want to compare the stocks' performance.
The user also has the option to generate graphical results which assist in visually inspecting Portfolio performance.
Technology Stack
Tensorflow / Keras (Python)
Django Framework (Back-end)
ReactJs (Front-end)
Business Outcomes
We helped our client capitalize on state-of-the-art AI techniques to predict alpha in the market.
  • Our solution is helping the customer grow as a large institutional trader.
  • It safeguards retail and small institutional traders.
  • In a market crash situation, it helps avoid big losses.
  • The graphical representation will help traders to understand how a particular stock may move in the future before they invest in it.
Executive Summary
Our client provides customizable supply chain solutions for grocers. These include several features such as distributor onboarding, pickup locations, syncing products and customers, maps search, and order placements for both B2B and B2C customers. They wanted to optimize their supply chain from farm to fork.
About Client
Our client brings diverse food value chain knowledge combined with experience from outside the food industry implementing customer-centric internet solutions for F500 companies. They provide easy solutions for the grocery supply chain with several features such as distributor onboarding, pickup locations, syncing of products and customers, maps search, and order placement for both B2B and B2C businesses.
Problem Statement
Our client wanted to optimize the food supply chain by helping restaurants source agricultural products directly from the farmers. They wanted a seamless experience for their customers, by ensuring freshness and reducing costs. However, they were facing issues with multiple aspects of their operations, listed below:
Syncing orders, customers, products, and various other data from the distributors’ server
Onboarding solution for distributors
Error reporting and resolutions
Promotional material
User activity and tracking of promotion downloads
Elasticsearch upgrade and periodic backup of Elasticsearch
Faster image loading on the website using CDN
SMPS for storing API keys, database credentials, and several sensitive data
Location searching capabilities
Elasticsearch autocomplete feature
Implementing MFA for AWS local testing
Upgraded frontend AngularJS code to latest Angular version
Promo Code Module
HashedIn’s Solution
We developed a web application, with features like tracking the origin of the product, automatic sync of available items, and automated demand forecasting. Restaurants can use it to view and order agricultural products from their nearest source and are also able to track the product to its origin. With automated sync, available items were updated at each source. Moreover, with a smart population of base products of food items, restaurants were able to automate demand planning and forecasting. HashedIn showcased its technical expertise by solving the challenges the client was facing, through various methods mentioned below.
Scheduled jobs to sync various orders, products, locations, etc., to and from the distributor with built-in fault tolerance and issue reporting through email and SMS while leveraging multi-threading for optimal utilization of computing resources.
Upgraded the web application to Angular 8 for better browser compatibility, performance, and maintainability.
Templated the promotional materials like flyers and stickers using Thymeleaf for easy customization
Increased reliability and performance of Elasticsearch used for the product suggestion in the application by regularly scheduled backups and adequate preprocessing.
Introduced AWS CloudFront & Lambda@Edge for image delivery thereby reducing the loading time of the website.
Increased the security of the application by moving all the sensitive data like system parameters, keys, and URLs, which were earlier stored in a properties file as plain text, to AWS SMPS (Systems Manager Parameter Store).
To increase security towards unauthorized access, multi-factor authorization was introduced.
Technology Stack
Backend – Java, Spring boot, Hibernate
Frontend – Angular
Database – MYSQL
Cloud Service Provider – AWS
Cloud Services – EC2, RDS, EBS, SNS, SMPS, Secrets Manager, CDN
Business Outcomes
  • We right-sized AWS instances based on the usage and reduced the bill by $130/month.
  • The load time of the website was reduced by 85% (from 12.6s to 1.9s), leading to enhanced customer experience.
  • To understand user behavior better, promotion and user tracking were enabled. The promotion codes led to higher customer attraction and acquisition.
  • Moreover, various parameters like user login count and promotional card download count could now be analyzed.
  • Product data was periodically backed up to minimize data loss.
  • Improved fault tolerance for the batch processes, resulting in less maintenance.
About the Company
This conglomerate is the third-largest in the Indian private sector. It has interests in viscose staple fibre, metals, cement (largest in India), viscose filament yarn, branded apparel, carbon black, chemicals, fertilisers, insulators, financial services, and telecom. This multinational has a dedicated cell, which is the data science center of the conglomerate, which is managed from their Mumbai headquarters.
Problem Statement
One division of the organization is building a platform that deploys custom-built AI/ML models for analyzing videos and providing customized solutions for solving business problems. The platform is scalable and built for quick deployment to process multiple camera streams in parallel and apply advanced video analytics in near real-time.
They required solutions to automate the safety guidelines that had to be followed within their cement and aluminum plants, using pre-installed cameras and computer vision processing. Compliance with guidelines has become even more significant to avoid spreading any infectious diseases, such as COVID-19.
HashedIn’s Solution
HashedIn built the data pipelines and a dashboard for the plant supervisor to view and/or receive an alert (voice, SMS, email, etc.) for any employees that are breaking the stipulated safety guidelines, in near real-time. We built a dashboard for the business stakeholders to visualize the violations happening on the plant to make business decisions, accordingly. A technical dashboard was developed for system and application health anomalies and data transfer checks from one subsystem to another.
This platform can be used to build solutions across sectors and functions, with features such as:
Helmet detection
Vest detection
Fire detection
Intrusion detection
Social distancing detection
Arrow direction detection
Unattended object detection
Face-mask detection
Vehicle detection
The administrator can configure the email IDs of area managers and plant heads in the system so that they can receive detailed daily/monthly/annual reports. The platform also exposes this data, events, and alerts to external systems for integration, similar to the existing voice alert system. External systems include turnstiles (automatic doors that will not open for unauthorized personnel), the plant siren (which can be used as a fire alert to tell all the workers to evacuate the area), etc. For the safety of the employees who were back to work amid the COVID-19 circumstances, the face-mask and social distancing features were deployed before the plants were opened.
Let us look at the use case for Helmet and Vest Identification:
A helmet and vest should be worn by everyone present on the plant, as heavy equipment operates in the area. Employees are exposed to risk if their helmet or vest is removed while working on the site. To check compliance (whenever anyone is observed without their helmet or vest), the client wanted a system that plays a voice alert, along with a supervisor dashboard that gives real-time data to the supervisor or plant head.
Technology Stack
Django
Celery
RabbitMQ
FFMPEG
React (Typescript)
Python Scripting
Grafana
PostgreSQL
Nginx
Docker
Business Outcomes
  • A web interface to the end-user (plant supervisors, plant safety managers, compliance officers, etc.) to view any safety violations/intrusions/compliance exceptions/etc. happening at the plant in a near real-time environment.
  • Enabled real-time alerts on specific exception events through multiple channels (on-premise speakers, dashboards, SMS, emails, etc.)
  • A web interface for the business stakeholders to view the summary of any violations happening at the plant on a daily level.
  • The data is exported to the cloud on a daily basis.
  • In the near future, a cloud-based dashboard will be created for business users to analyze the violation trends across multiple plants, giving them reports at an hourly level.
Udit Sengupta
#Business | 7 Min Read
Google Cloud Firestore
Google Cloud Firestore is a serverless, real-time, highly scalable, NoSQL document-based database that is adept at syncing, storing and querying data and can be used for web, mobile or IoT apps. It provides built-in security, auto-scaling and multi-region replication. In addition, it offers software development kits (SDKs) for mobile apps and has server-side components with robust client libraries that connect your app directly to data, along with built-in offline support. Data is organized in collections and documents and can be accessed via performant queries or via a particular database entry path. Every document must belong to a collection and can have sub-collections of its own. The billing structure is also generous: you pay based on reads and writes.
Google Cloud Bigtable
Google Cloud Bigtable is a fully managed NoSQL big data service that is highly scalable and best suited for analyzing huge workloads. Bigtable powers substantial services like Google Maps and Gmail behind the scenes and distributes data to drive performance on massive datasets. It is designed as a sparsely populated database that can scale to thousands of columns and billions of rows, making it ideal for dealing with terabytes or petabytes of data. As a big data database service, it also integrates well with existing big data tools like Hadoop on Google Cloud. This data service can easily accommodate an increase in requests and is best suited for structured or semi-structured data like financial, reporting or marketing data, or data needed to run machine learning models. Google Cloud Bigtable can be set up using the Google Cloud command-line interface, the web console, or the API.
Google Cloud Storage
Google Cloud Storage (GCS) is the object storage service that can be used for large-scale data processing and provides reliable, scalable and consistent data storage. Contrary to a traditional file storage system, this is an object storage service that stores data as an arbitrary sequence of bytes addressed by a unique key in the form of a URL, and it can easily be used with other web-based technologies. Object storage groups data in unique namespaces, popularly called "buckets." Enterprises can primarily use GCS to store large binary objects, data to be served to websites, content data, historical data for compliance, and data archives.
Storage Classes in Google Cloud
Google Cloud buckets have various storage classes:

  • Standard: This is the most common choice and corresponds to the bucket being in a specific Google Cloud region or stored across multiple regions. Typically, multi-region is opted when the data is very frequently accessed and
    needs to be geo-redundant. In contrast, regional is opted for frequent data access in a specific region and is relatively less redundant. This storage class is best suited for data that needs to be highly available and quite
    performant.
  • Nearline: This choice is typically used if data must be accessed less than once per month. This is a low-cost option but is a highly durable option for monthly reports or similar scenarios. 
  • Coldline: This is like Nearline storage but is used for data typically accessed once per year or less frequently. This is a very low-cost, yet highly durable service commonly used for archival, backup and storing data for
    compliance purposes with a minimum 90-day storage duration and relatively higher costs per operation. 
Google Cloud SQL
Google Cloud SQL is a fully managed, easy-to-use relational database management system (RDBMS) that offers MySQL, PostgreSQL and SQL Server as a service on the cloud. Suppose you are in the starting phase of building your company and don't want to worry about the intricacies of applying patches or configuring replication, backups and updates; Cloud SQL takes care of these for you. Google Cloud SQL can also seamlessly integrate with other Google Cloud offerings like App Engine, Compute Engine or Kubernetes Engine.

Cloud SQL provides vertical and horizontal scaling and can be configured using either the cloud console or the Google Cloud command-line interface. This is the ideal RDBMS if you are looking for frequent queries and fast response times.

Google Cloud Spanner
Google Cloud Spanner can be leveraged when you need a SQL database at massive scale, to the tune of thousands of writes per second and hundreds of thousands of reads per second, globally. This is a fully managed, unlimited-scale, strongly consistent RDBMS on the cloud that supports secondary indexes and provides strong data consistency by employing hardware-assisted time synchronization. It is a globally replicated database that encrypts data at rest and in transit, requires low maintenance, and offers high availability of 99.999 percent along with import and export of data. This database can be set up via the cloud console or the Google Cloud command-line interface and is ideal for large-scale projects in domains like healthcare, retail, finance, etc.
Google Cloud BigQuery
BigQuery is a column-oriented data warehousing service and an analytical database that organizes data by column. It is a serverless data warehouse designed to ingest, store and query large amounts of data. BigQuery provides ways to aggregate data from disparate sources and make it available for business processing, and it integrates easily with third-party tools for data analysis and visualization.

It works with standard SQL, provides client libraries for interacting with it in multiple programming languages, and can be set up using the web console, the command-line tool, or API calls in the corresponding client libraries. Fully managed by Google Cloud, it only charges you for storing, querying and streaming the data.

Choosing the right storage option
  • Do you need a robust and scalable NoSQL database for cloud-native applications? Do you prefer a database that seamlessly integrates with serverless architecture? Do you need a database where you pay as you go? Consider Google Cloud Firestore.
  • Are you dealing with data that is at least 1 TB in size? Do you need a key-value NoSQL DB for mass storage of data rather than application state data? Consider Google Cloud Bigtable.
  • Do you require high-performing, scalable data storage with simple administrative overhead? Do you require an effective solution for storing large volumes of data, including but not limited to backup, content storage, archival, compliance and disaster recovery? Do you require encryption for the data at rest as well as in transit? Consider Google Cloud Storage.
  • Do you need a fully managed generic SQL system that encrypts data automatically at rest as well as in transit? Are you looking for a DB that encrypts external connections as well? Do you need an RDBMS on the cloud that scales horizontally and vertically? Consider Google Cloud SQL.
  • Do you need a highly available RDBMS for massive, large-scale data, which allows ACID updates? Do you need a system that is encrypted so you don't have to worry about data corruption? Do you need a database that auto-replicates and facilitates online schema changes? Consider Google Cloud Spanner.
  • Do you need a fast, highly scalable and reliable data warehouse for data analytics? Do you need a secure environment, where data is encrypted and protected with IAM support? Are you looking for a disaster-proof solution where you can easily revert to a previous state? Consider Google Cloud BigQuery.
Google Cloud's dynamic, scalable storage services for various application, database and business requirements can help ensure that your data is properly protected and retrievable with the necessary performance.
References
https://youtu.be/IemOAESlWKw
https://youtu.be/Lq9uDOM4whI
https://medium.datadriveninvestor.com/storage-options-in-google-cloud-platform-rundown-f78100c4ed37
https://cloud.netapp.com/blog/object-storage-block-and-shared-file-storage-in-google-cloud
https://www.youtube.com/watch?v=Kl8ig2BtLAY
https://www.youtube.com/watch?v=amcf6W2Xv6M
https://www.youtube.com/watch?v=m8WqxLd1jSc
https://www.netsolutions.com/insights/what-is-google-cloud-sql-its-features-and-some-products-that-have-benefited-from-it/
Sharvari Ravindra
#Business | 5 Min Read
Cloud Billing 101
Whether you're starting your business on the cloud or assessing a migration from your current cloud strategy, taking full advantage of any cloud transition means choosing project-based services that not only offer rich functionality but are also secure, affordable and easy to use. This guide to Cloud Billing is a step-by-step resource on how to make invoicing, payments and budget tracking more efficient for your business.
Billing accounts
Google Cloud is a collection of cloud computing services including compute, storage, networking and specialized services such as machine learning (ML). Cloud services used within a Google Cloud project are charged to the billing account linked to that project.
A Cloud Billing account is a cloud-level resource that stores information on how to pay the charges associated with each project. All projects require a billing account unless they use only free services. A billing account can be linked to one or more projects, following a structure similar to the resource hierarchy. If every department in a company pays for its cloud services from the same part of the company's budget, they can use one billing account; alternatively, different billing accounts can be linked to each department.
 

Fig 1 . Resource Hierarchy

Billing account types
  • Self-serve accounts are paid by credit/debit card or by direct debit from a bank account. The costs are charged automatically to the payment instrument connected to the Cloud Billing account. Documents generated for self-serve accounts include statements, payment receipts, and tax invoices, and are accessible in the Cloud Console.
  • Invoiced billing accounts receive bills or invoices electronically or by mail and are paid by check or wire transfer; they are mostly used by enterprises and other large customers.
Payments
Cloud Billing accounts are connected to a Google Payments profile, which requires a form of payment and a permanent profile type. The profile type can't be changed later, so it must be chosen carefully.
There are two types of payment profiles:

  • Individual
    • Used for personal payment accounts
    • Users can manage the profile, but you cannot add or remove users or change permissions on the profile
  • Business
    • Used for payments on behalf of a business, organization, partnership, or educational institution
    • Allows you to add other users to the Google Payments profile to access or manage the profile and view the payment information
Charging cycle
The charging cycle on your Cloud Billing account determines how and when you pay for your Google Cloud services using either monthly billing or threshold billing (costs are charged when your account has accrued a specific amount). For self-serve Cloud Billing accounts, the charging cycle is automatically assigned when the account is created and can’t be changed. For invoiced Cloud Billing accounts, one invoice is received per month and payment terms are determined by your Google agreement.

Important roles associated with billing
  • Billing account creator – Enables users to create new billing accounts. Recommended assignee: users with a financial role in the organization.
  • Billing account administrator – Allows users to manage billing accounts, but not create them. Recommended assignee: users who are more finance-minded, such as cloud admins.
  • Billing account user – Allows a user to associate projects with billing accounts. Recommended assignee: project creators, who will need this role.
  • Billing account viewer – Enables a user to view billing account costs and transactions. Recommended assignee: users like auditors, who need to be able to read billing account information but not change it.
Billing budgets and alerts
Google Cloud also provides an option to define budgets and set alerts. A budget helps you track your actual Google Cloud spend against your planned spend. Once the budget is defined, alerts can be set by defining threshold values; these trigger email notifications that help users take action to control costs and stay within their budget.
Setting up budgets and alerts


Fig 2. Budgets and alerts.
How to set up budgets and alerts
  1. From the console, go to billing and then budgets and alerts
  2. The form as shown in figure 2 will be displayed
  3. In the form, first you give a name to the budget and then specify the budget billing account
  4. While linking the billing account, keep in mind that the budget and alerts you specify should be based on what you expect to spend for all projects linked to the billing account
  5. Set the budget amount: you can either specify a particular amount or specify that your budget is the amount spent in the previous month
  6. Once the budget is set, you can set three alert percentages
  7. By default, the percentages are set to 50 percent, 90 percent, and 100 percent and you can change the percentages according to your needs
  8. If you’d like more alerts, click on “Add Item” in the “Set Budget Alerts” section and specify the alert percentage
  9. Whenever the specified percentage of the budget is spent, all the recipients are notified by email
  10. Recipients of the mail can be specified in two ways:
    • Role-based, i.e., billing administrators and billing account users are notified by email, which is the default option
    • Use cloud monitoring and notify others in the organization to receive the notification
  11. Check the box "Connect a Pub/Sub topic to this budget" in the "Manage Notifications" section if you want to respond to the alerts programmatically; this will send notifications to the Pub/Sub topic
  12. Click “Finish”

 
Creating a billing account and linking it to your project is the first step to using Google Cloud. Once the budgets and alerts are set, the billing administrators, billing account users and other recipients start receiving alerts as the cost increases. There is also an option to export billing data to either BigQuery or Cloud Storage for further analysis.

To learn more about Deloitte’s alliance with Google Cloud, visit www.deloitte.com/googlecloud.

References
Dan Sullivan. Official Google Cloud Certified Associate Cloud Engineer Study Guide. Sybex, April 2019.
https://cloud.google.com/billing/docs/concepts
https://cloud.google.com/billing/docs/how-to/budgets
Anant Agarwal
#Technology | 7 Min Read
Data engineering is the aspect of data science that focuses on the practical applications of data collection and analysis. It focuses on designing and building pipelines that transport and transform data into a highly usable format. These pipelines can take data from a wide range of sources and collect it into a data warehouse or data lake that represents the data uniformly as a single source of truth. The ability to quickly build and deploy new data pipelines, or to easily adapt existing ones to new requirements, is an important success factor for a company's data strategy. The main challenge in building such a pipeline is to minimize latency and achieve a near real-time processing rate for high-throughput data.
Building a highly scalable data pipeline provides significant value to any company doing data science. So, here are a few important points to consider while building robust data pipelines:

  1. Pick the Right Approach
    The first and foremost thing is to choose appropriate tools and frameworks to build the data pipeline, as this has a huge impact on the overall development process. There are two extreme routes, and many variants in between, that one can choose from.

    • The first option is to select a data integration platform that offers a graphical development environment and fully integrated workflows for building ETL pipelines. This seems very promising but often turns out to be the tougher route, as such platforms lack some significant features.
    • Another option is to build the data pipeline using powerful frameworks like Apache Spark or Hadoop. While this approach implies a much higher effort upfront, it often turns out to be more beneficial, since the complexity of the solution can grow with your requirements.
  2. Apache Spark v/s Hadoop

    Big data analytics with Hadoop and MapReduce was powerful but often slow, and it gave users a low-level, procedural programming interface that required writing a lot of code for even very simple data transformations. Spark, however, has been found to be a better fit than Hadoop MapReduce, for several reasons:

    • Lazy evaluation in Apache Spark saves time: transformations are not executed until an action triggers them, which lets Spark optimize the whole execution plan.
    • Spark has a DAG execution engine that facilitates in-memory computation and acyclic data flow, resulting in high speed. Data can be cached so that it does not have to be fetched from disk every time, saving further time.

     

    Spark was designed to be a Big Data tool from the very beginning. It provides out-of-the-box bindings for Scala, Java, Python, and R.

    • Scala – It is generally good to use Scala for data engineering (read, transform, store). For implementing new functionality not found in Spark, Scala is the best option, as Apache Spark itself is written in Scala. Although Spark supports UDFs in Python well, there is a performance penalty, and diving deeper is not possible. Implementing new connectors or file formats with Python will be very difficult, maybe even unachievable.
    • Python – In the case of Data Science, Python is a much better option with all those Python packages like Pandas, SciPy, SciKit Learn, Tensorflow, etc.
    • R – It is popular for research, plotting, and data analysis. Together with RStudio, it makes statistics, plotting, and data analytics applications. It is majorly used for building data models to be used for data analysis.
    • Java – It is the least preferred language because of its verbosity. Also, it does not support Read-Evaluate-Print-Loop (REPL) which is a major deal-breaker when choosing a programming language for big data processing.
  3. Working With Data
    As data grows in volume and variety, the traditional relational approach does not scale well enough for building Big Data applications and analytical systems. Some major challenges are –

    • Managing different types and sources of data, which can be structured, semi-structured, or unstructured.
    • Building ETL pipelines to and from various data sources, which may lead to developing a lot of specific custom code, thereby increasing technical debt over time.
    • Enabling both traditional business intelligence (BI) analytics and advanced analytics (machine learning, statistical modeling, etc.), the latter of which is challenging to perform in relational systems.

    The ability to read and write from different kinds of data sources is unarguably one of Spark’s greatest strengths. As a general computing engine, Spark can process data from various data management and storage systems, including HDFS, Hive, Cassandra, and Kafka. Apache Spark also supports a variety of data formats like CSV, JSON, Parquet, Text, JDBC, etc.
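    Below is a short sketch of that flexibility in Spark’s Scala API, written as one might run it in spark-shell. The file paths and the JDBC connection details are placeholders for illustration, not real endpoints.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sources-sketch")
  .master("local[*]")
  .getOrCreate()

// CSV, JSON, and Parquet are all read through the same DataFrame API.
val csvDf     = spark.read.option("header", "true").csv("data/orders.csv")
val jsonDf    = spark.read.json("data/events.json")
val parquetDf = spark.read.parquet("data/warehouse/orders")

// A JDBC source (e.g. a reporting database) uses the same abstraction.
val jdbcDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/analytics")   // placeholder URL
  .option("dbtable", "public.customers")                       // placeholder table
  .option("user", "reader")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

// Results can be written back out in a different format just as easily.
csvDf.write.mode("overwrite").parquet("data/warehouse/orders_parquet")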

  4. Writing Boilerplate Code
    Boilerplate code refers to sections of code that have to be included in many places with little or no alteration.

    • RDDs – When working with big data, programming models like MapReduce are needed to process large data sets with parallel, distributed algorithms on a cluster, but MapReduce code requires a significant amount of boilerplate. This problem is solved by Apache Spark’s Resilient Distributed Datasets (RDDs), the main abstraction for computations in Spark. Thanks to a simplified programming interface, RDDs unify computational styles that were spread across the traditional Hadoop stack. The RDD hides traditional map-reduce-style programs behind the interface of a (distributed) collection, so many operations that required considerable boilerplate in MapReduce become plain collection operations, e.g. groupBy, joins, count, distinct, max, min, etc. (a short sketch follows this list).
    • CleanFrames – Nowadays, data is everywhere and drives companies and their operations. Ensuring the data’s correctness deserves its own discipline, known as data cleansing, which focuses on removing or correcting corrupt records, and writing that cleansing logic by hand involves a lot of boilerplate code. cleanframes is a small library for Apache Spark that makes data cleansing automated and more enjoyable: simply import the library and call the clean method. The clean method expands the cleansing code through implicit resolution based on a case class’s fields, with the Scala compiler applying the appropriate transformation to each field’s type. cleanframes comes with predefined implementations that are available via a simple library import.
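    As an illustration of how much boilerplate the RDD abstraction removes, here is a small Scala sketch (assuming a spark-shell session where the spark entry point already exists); the log lines and field positions are made up for the example. The same aggregation in classic MapReduce would require mapper and reducer classes plus job configuration.

// A tiny, distributed "collection" of log lines (an RDD).
val logs = spark.sparkContext.parallelize(Seq(
  "2021-01-01 ERROR payment-service timeout",
  "2021-01-01 INFO payment-service ok",
  "2021-01-02 ERROR auth-service bad-token",
  "2021-01-02 ERROR payment-service timeout"
))

// Count ERROR lines per service with plain collection-style operations.
val errorsPerService = logs
  .filter(_.contains("ERROR"))
  .map(line => (line.split("\\s+")(2), 1))   // (service, 1)
  .reduceByKey(_ + _)

errorsPerService.collect().foreach(println)  // e.g. (payment-service,2), (auth-service,1)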
  5. Writing Simple And Clear Business Logic
    Business logic is the most important part of an application. It is where most changes occur and where the actual business value is generated, so this code should be simple, clear, concise, and easy to adapt to changes and new feature requests. Some features offered by Apache Spark for writing business logic are –

    • The RDD abstraction and many common transformations like filtering, joining, and grouped aggregations are provided by the core libraries of Spark.
    • New transformations can be easily implemented with so-called user-defined functions (UDFs), where one only needs to provide a small snippet of code working on an individual record or column, and Spark wraps it up so that it can be executed in parallel and distributed across a cluster of computers (a short UDF sketch follows this list).
    • Using the internal developer API, it is even possible to go down a few layers and implement new functionality. This is a bit more complex, but it can be very beneficial for the rare cases that cannot be implemented with user-defined functions (UDFs).
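    To show how small such a user-defined function can be, here is a brief Scala sketch (again assuming a spark-shell style session); the column names and the normalization rule are assumptions for illustration only.

import org.apache.spark.sql.functions.udf
import spark.implicits._

// A toy DataFrame standing in for real order data.
val orders = Seq(
  ("A-1", " gbp", 10.0),
  ("A-2", "USD", 25.0)
).toDF("id", "currency", "amount")

// The business rule lives in a small, testable Scala function ...
val normalizeCurrency = udf((code: String) => code.trim.toUpperCase)

// ... and Spark takes care of running it in parallel across the cluster.
orders.withColumn("currency", normalizeCurrency($"currency")).show()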

    To sum up, Spark is preferred because of its speed: it is faster than most large-scale data processing frameworks. It supports multiple languages (Java, Scala, R, and Python) and offers a plethora of libraries, functions, and collection operations that help write clean, minimal, and maintainable code.

References
https://www.altexsoft.com/blog/datascience/what-is-data-engineering-explaining-data-pipeline-data-warehouse-and-data-engineer-role/
http://spark.apache.org/docs/latest/rdd-programming-guide.html
https://medium.com/@dawid.rutowicz/cleanframes-data-cleansing-library-for-apache-spark-eaae526ee958
Vaishnavi Agarwal
#Technology | 9 Min Read
A data lake is a pool of data from multiple sources. It is different from a data warehouse as it can store both structured and unstructured data, which can be processed and analyzed later. This eliminates a significant part of the overhead associated with traditional database architectures, which would commonly include lengthy ETL and data modeling when ingesting the data (to impose schema-on-write).
With the ever-growing amounts of data collected and the need to leverage that data to build solutions and strategies, organizations face a major challenge in maintaining these massive pools of data and extracting valuable business insights from them. If the data in a data lake is not well curated, the lake may get flooded with random information that is difficult to manage and consume, turning it into a data swamp. Therefore, before going forward with a data lake, it is important to be aware of the best practices for designing, implementing, and operationalizing it.
Let’s look at the best practices that help build an efficient data lake.

  • Data Ingestion
    Data lakes allow organizations to hold, manage, and exploit diverse data to their benefit. But here’s the reality: some data lakes fail to serve their purpose due to their complexity. This complexity may be induced by several factors, one of which is improper data ingestion. Building a sound data ingestion strategy is vital for succeeding with your enterprise data lake.

    • Addressing The Business Problem: It’s always better to question the need for a data lake before diving straight into it; one should opt for a data lake only if the business problem demands it. It is important to stay committed to the problem and find its answer, and if building a data lake then turns out to be the right way to go, great! A common misconception is that data lakes and databases are the same. The basics of a data lake should be clear, and it should be implemented for the right use cases. In general, data lakes are suitable for analyzing data from diverse sources, especially when the initial data cleansing is problematic. Data lakes also provide practically unlimited scalability and flexibility at a very reasonable cost. Let’s look at some use cases where businesses/industries use data lakes:
      • Healthcare – There is a lot of unstructured data in medical services (e.g. doctors’ notes, clinical information) and a constant need for real-time insights. The use of data lakes therefore turns out to be a better fit for companies in healthcare and insurance, as it gives access to both structured and unstructured data.
      • Transportation – Data lakes allow you to transform raw data into structured data that is ready for SQL analytics, data science, and machine learning with low latency. Raw data can be retained indefinitely at a low cost for future use in machine learning and analytics. In the transportation industry, the business insights derived from this data can help companies reduce their costs and increase their profits.
    • Schema Discovery Upon Ingest: It’s generally not a good idea to wait until the data is actually in the lake to find out what’s in it. Having visibility into the schema, and a general idea of what the data contains, as it is being streamed into the lake eliminates the need for ‘blind ETLing’ or reliance on partial samples for schema discovery later on.
    • Ensure Zero Data Loss: Ingestion can be in batch or streaming form. The data lake must ensure zero data loss and write the data exactly once or at least once. Duplicate or missed events can significantly hurt the reliability of the data stored in your lake, but exactly-once processing is notoriously difficult to implement since the storage layer is often only eventually (not instantly) consistent. The data lake must also handle variability in schema, ensure that data is written in the most optimized format into the right partitions, and provide the ability to re-ingest data when needed (a minimal streaming-ingestion sketch follows this list).
    • Persist Data In The Raw State: It’s always good to persist data in its original state so that it can be repurposed whenever new business requirements emerge. Furthermore, raw data is great for exploration and discovery-oriented analytics (e.g., mining, clustering, and segmentation), which work well with large samples, detailed data, and data anomalies (outliers, nonstandard data).
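    One practical way to get these ingestion guarantees is a checkpointed streaming job. The sketch below uses Spark Structured Streaming in Scala purely as an example; the bucket paths and the event schema are assumptions, and other streaming engines can provide equivalent behaviour.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("ingest-sketch")
  .getOrCreate()

// Declare the schema up front instead of relying on "blind" inference later.
val eventSchema = new StructType()
  .add("event_id", StringType)
  .add("event_type", StringType)
  .add("event_ts", TimestampType)

val incoming = spark.readStream
  .schema(eventSchema)
  .json("s3a://example-lake/landing/events/")        // hypothetical landing zone

// Persist the raw events as-is; the checkpoint lets the job restart without
// re-ingesting or skipping input files.
val query = incoming.writeStream
  .format("parquet")
  .option("path", "s3a://example-lake/raw/events/")               // hypothetical raw zone
  .option("checkpointLocation", "s3a://example-lake/_checkpoints/events/")
  .start()

query.awaitTermination()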
  • Data Transformation
    Data generation and collection across semi-structured and unstructured formats is both bursty and continuous. Inspecting, exploring, and analyzing these datasets in their raw form is tedious because the analytical engines have to scan the entire data set across multiple files. Here are a few ways to reduce the amount of data scanned and the query overhead –

    • Columnar Data Formats For Read Analytics: Columnar storage makes the data easy and efficient to read, so it is better to store the data that will be used for analytics purposes in a format such as Apache Parquet or ORC. In addition to being optimized for reads, these file formats have the advantage of being open-source rather than proprietary, which implies they can be read by a variety of analytics services.
    • Partition Data: Partitioning the data helps reduce query costs and improves performance by limiting the number of scans the data query engines need to perform in order to return the results for a specific query. Data is commonly partitioned by timestamp – which could mean by hour, by minute, or by day – and the size of the partitions should depend on the type of query intended to run. One can also partition by time, geography, or line of business to reduce data scans, and tune partition granularity based on the data set under consideration (by hour vs. by second). A short layout sketch follows this list.
    • Chunk Up Small Files: Small files can be optimally chunked into bigger ones asynchronously to reduce network overheads.
    • Perform Stats-based Cost-based Optimization: A cost-based optimizer (CBO) and statistics can be used to generate efficient query execution plans that improve performance. Statistics also help in understanding the optimizer’s decisions, such as why it chooses a nested loop join instead of a hash join, and make query performance easier to reason about. Dataset statistics like file sizes, row counts, and histograms of values can be collected to optimize queries through join reordering. Column and table statistics are critical for estimating predicate selectivity and plan cost, and certain advanced rewrites require column statistics.
    • Use Z-order Indexed Materialized Views For Cost-based Optimization: A materialized view is like a query with a result that is materialized and stored in a table. When a user query is found compatible with the query associated with a materialized view, the user query can be rewritten in terms of the materialized view. This technique improves the execution of the user query because most of the query result has been precomputed. A z-order index serves queries with multiple columns in any combination and not just data sorted on a single column.
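    The short Scala sketch below ties several of these points together: columnar output, date-based partitioning, fewer and larger files, and statistics for the cost-based optimizer. It uses Spark as one concrete engine; the paths, column names, and table name are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date

val spark = SparkSession.builder()
  .appName("lake-layout-sketch")
  .getOrCreate()
import spark.implicits._

val raw = spark.read.parquet("s3a://example-lake/raw/events/")   // placeholder path

// Lay the data out by event date so queries filtering on a date range only scan
// the matching partitions; repartitioning by the same column avoids many tiny files.
raw.withColumn("event_date", to_date($"event_ts"))
  .repartition($"event_date")
  .write
  .mode("overwrite")
  .partitionBy("event_date")
  .parquet("s3a://example-lake/curated/events/")

// Expose the curated data as a table and collect statistics for the optimizer.
spark.sql(
  "CREATE TABLE IF NOT EXISTS curated_events USING PARQUET " +
  "LOCATION 's3a://example-lake/curated/events/'")
spark.sql("ANALYZE TABLE curated_events COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE curated_events COMPUTE STATISTICS FOR COLUMNS event_date")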
  • Data Governance
    Don’t wait until after your data lake is built to think about data quality. Having a well-crafted data governance strategy in place from the start is a fundamental practice for any big data project, helping to ensure consistent, common processes and responsibilities.

    • Maintaining Data Catalogs: Data should be cataloged and identified, with sensitive data clearly labeled. A data catalog helps users discover datasets and profile them for integrity by enriching metadata through different mechanisms, documenting datasets, and supporting a search interface.
    • Ensuring Correct Metadata For Search: It’s important for every bit of data in a data lake to have information about it (metadata). Creating metadata is quite common among enterprises as a way to organize their data and prevent a data lake from turning into a data swamp: it acts as a tagging system that helps people search for different kinds of data. Without metadata, people accessing the data may not know how to search for the information they need.
    • Set A Retention Policy: Data should not be stored forever in a data lake, as this incurs cost and may also lead to compliance issues. It is therefore better to have appropriate retention policies for incoming data.
    • Privacy/Security: A key component of a healthy Data Lake is privacy and security, including topics such as role-based access control, authentication, authorization, as well as encryption of data at rest and in motion. A data lake security plan needs to address the following five important challenges:
      • Data access control – The standard approach calls for using built-in Identity and Access Management (IAM) controls from the cloud vendor.
      • Data protection – Encryption of data at rest is a requirement of most information security standards.
      • Data leak prevention – Most major data leaks come from within the organization, sometimes inadvertently and sometimes intentionally. Fine-grained access control is critical to preventing data leaks. This means limiting access at the row, column, and even cell level, with anonymization to obfuscate data correctly.
      • Prevent accidental deletion of data – Data resiliency through automated replicas does not prevent an application (or developers/users) from corrupting data or accidentally deleting it. To prevent accidental deletion, it is recommended to first set the correct access policies for the data lake. This includes applying account- and file-level access control using the security features provided by the cloud service. It is also recommended to routinely create copies of critical data in another data lake, which can be used to recover from data corruption or deletion incidents.
      • Data governance, privacy, and compliance – Every enterprise must deal with its users’ data responsibly to avoid the reputation damage of a major data breach. The system must be designed to quickly enable compliance with industry and data privacy regulations.

 
Following the above best practices will help create and maintain a sustainable and healthy data lake. By devising the right strategy for collecting and storing data, one can reduce storage costs, make data access efficient and cost-effective, and ensure data security.

References
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
https://www.stitchdata.com/resources/data-ingestion/
https://parquet.apache.org/
https://orc.apache.org/
https://azure.microsoft.com/en-in/blog/optimize-cost-and-performance-with-query-acceleration-for-azure-data-lake-storage/
https://docs.snowflake.com/en/user-guide/views-materialized.html
https://www.talend.com/resources/what-is-data-governance/