The data landscape has boomed over the last couple of years. The volume of data grows with every passing day, and the data is becoming increasingly unstructured as well. According to an IDC report on the ever-growing datasphere, the collective sum of the world's data is expected to reach approximately 175 zettabytes by 2025.
Traditionally, companies produced a fairly limited amount of data and therefore kept it on their own premises. This data could be controlled easily, and alternatives such as cloud storage were not yet available. While this legacy approach of storing data on-premises is still in use today, the data produced by contemporary firms is outgrowing its capabilities.
Modern companies are gradually realizing the problems associated with on-premises storage as the data they produce keeps increasing. On-premises storage solutions can be slow and complex to procure, maintain, and extend, and they can be very expensive as well. High CAPEX is a major problem: significant upfront equipment expenses lead to long payback periods and, ultimately, poor ROI.
In addition to storing this huge amount of collected data, these organizations also struggle to analyze the data sets and make decisions on time. While many of these firms have on-premises analytics teams, their systems are often disconnected from parts of their operations. As a result, companies may lose out on real-time data and respond late.
The rapid growth of unstructured data is another problem for traditional, on-premises databases. Unstructured data is data that does not fit neatly into the row-and-column structure of relational databases. As per a report commissioned by Igneous, unstructured data is growing 23% annually and is therefore expected to double roughly every forty months. It has thus become crucial for companies to look for solutions that can handle this new-age unstructured data.
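The forty-month figure follows from compound growth: a quantity growing at a rate r per year doubles after ln 2 / ln(1 + r) years. A minimal sketch of that arithmetic, using the 23% rate cited above (the function name is purely illustrative):

```python
import math


def doubling_time_months(annual_growth_rate: float) -> float:
    """Months needed for a quantity to double under compound annual growth."""
    if annual_growth_rate <= 0:
        raise ValueError("growth rate must be positive")
    # Solve (1 + r) ** years == 2 for years.
    years = math.log(2) / math.log(1 + annual_growth_rate)
    return years * 12


# 23% annual growth, the rate quoted from the Igneous report.
print(round(doubling_time_months(0.23), 1))
```

At 23% per year the result comes out at just over 40 months, which matches the figure quoted in the report.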
To avoid the problems associated with storing data on-site, companies can opt to move their data to Amazon Web Services (AWS), which describes itself as the most comprehensive and broadly adopted cloud platform. AWS gives companies a more cost-effective and efficient way to store data. With AWS Glue, teams can also prepare and load their data for analytics with ease. Glue is a fully managed extract, transform, and load (ETL) service that can discover a company's data and store the associated metadata in the AWS Glue Data Catalog.
Amazon EMR is another useful AWS service that helps companies process vast amounts of data quickly and cost-effectively. EMR gives a company's analytics teams the elasticity and engines to run petabyte-scale analysis at a fraction of the cost of typical on-premises clusters.
Consulting partner for AWS
HashedIn is a software development company that has been an AWS consulting partner since 2010. The company was an early mover to AWS and, over the years, has helped numerous customers launch, migrate, and maintain their infrastructure on AWS. It became an Advanced AWS Consulting Partner in 2017.
HashedIn helped the analytics team of the Aditya Birla Group (ABG) migrate their on-premises architecture to the AWS cloud, and assisted them with the PoCs for shifting their processes to AWS. One of the ABG companies had an on-premises data warehouse that ran on shared infrastructure and was very slow. Because of its low speed, this data warehouse failed to serve its purpose, so ABG chose to move the database to AWS. Another ABG company collected huge amounts of data, approximately 1000 variables every 10 ms. The ABG team dumped this data into .HDF5 and .dat files. The ABG analytics team wanted to run a root cause analysis (RCA) on these files and find a scalable architecture for doing so.
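To get a feel for why the sampling rate described above calls for a scalable architecture, the raw data rate can be estimated with simple arithmetic. The sketch below is a rough estimate only: the 8 bytes per value (a 64-bit float), the lack of compression, and round-the-clock collection are all assumptions, not figures from the project.

```python
def values_per_second(variables: int, interval_ms: float) -> float:
    """Number of individual readings produced per second."""
    return variables * (1000.0 / interval_ms)


def bytes_per_day(variables: int, interval_ms: float,
                  bytes_per_value: int = 8) -> float:
    """Uncompressed bytes per day; bytes_per_value=8 is an assumed float64."""
    seconds_per_day = 24 * 60 * 60
    return values_per_second(variables, interval_ms) * bytes_per_value * seconds_per_day


# 1000 variables sampled every 10 ms, as stated in the text.
rate = values_per_second(1000, 10)
daily_gb = bytes_per_day(1000, 10) / 1e9
print(f"{rate:.0f} values/s, about {daily_gb:.2f} GB/day under the stated assumptions")
```

Under these assumptions a single source produces 100,000 readings per second, or roughly 69 GB of raw values per day.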
On the whole, HashedIn helped the ABG team solve two key problems:
- ABG had an on-premises data center that ran on Teradata servers. They wanted to move it to the AWS ecosystem in order to run their analytics at scale.
- The ABG team had their analytics running on an R server, which they wanted to move to Amazon EMR.
HashedIn Solution Structure
For the data migration, HashedIn proposed using AWS Glue over an AWS Direct Connect connection. This would be a fully serverless solution in which each extract, transform, and load (ETL) job is independent of the others. This independence minimizes the chance of conflicts or failures even when multiple people transfer data from the same source simultaneously. The solution was designed to migrate a 200 GB database from on-premises to AWS.
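The idea of independent ETL jobs can be sketched as one Glue job run per source table. This is an illustration under assumptions, not the configuration HashedIn used: the job names (`migrate-<table>`), the `--source_table` argument, and the table names are hypothetical, and the Glue jobs are assumed to exist already. The only AWS call involved is boto3's Glue `start_job_run`, and a stand-in client is used so that the sketch runs without AWS access.

```python
from typing import Any, Dict, List


def build_job_runs(tables: List[str]) -> List[Dict[str, Any]]:
    """Build one start_job_run request per table so runs share no state."""
    return [
        {
            "JobName": f"migrate-{table}",            # hypothetical job name
            "Arguments": {"--source_table": table},   # hypothetical job parameter
        }
        for table in tables
    ]


def launch(glue_client: Any, runs: List[Dict[str, Any]]) -> List[str]:
    """Submit each run separately and collect the returned run IDs."""
    return [glue_client.start_job_run(**kwargs)["JobRunId"] for kwargs in runs]


class FakeGlueClient:
    """Stand-in for boto3.client("glue"); records calls instead of calling AWS."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def start_job_run(self, **kwargs: Any) -> Dict[str, str]:
        self.calls.append(kwargs)
        return {"JobRunId": f"jr_{len(self.calls)}"}


client = FakeGlueClient()
run_ids = launch(client, build_job_runs(["orders", "customers"]))
print(run_ids)
```

With real credentials, `FakeGlueClient()` would be replaced by `boto3.client("glue")`. The Direct Connect link is part of the network setup and does not appear in the code.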
For the signal analysis, the proposal was to convert the HDF5 files to the Apache Parquet format, as it works better with Apache Spark. Thanks to its columnar storage, Parquet data is also more efficient to keep on Amazon S3 and to query using Athena. The end objective was to convert the provided sensor data from HDF5 into a suitable format, set up a running cluster on EMR, analyze the relevant data, and perform the necessary root cause analysis.
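The benefit of columnar storage can be shown with a toy model in plain Python. This is not Parquet itself, only an illustration of the layout idea: when a query needs a single variable, a column layout reads just that variable's values, while a row layout has to read every field of every record. The sensor names and readings below are made up.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def to_columns(rows: List[Row]) -> Dict[str, List[float]]:
    """Pivot row-oriented records into one list of values per variable."""
    columns: Dict[str, List[float]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def mean_row_layout(rows: List[Row], name: str) -> Tuple[float, int]:
    """Mean of one variable; counts every field read across all rows."""
    touched = sum(len(row) for row in rows)
    values = [row[name] for row in rows]
    return sum(values) / len(values), touched


def mean_column_layout(columns: Dict[str, List[float]], name: str) -> Tuple[float, int]:
    """Mean of one variable; only that variable's column is read."""
    values = columns[name]
    return sum(values) / len(values), len(values)


# Four made-up sensor records with three variables each.
rows = [{"temp": 20.0 + i, "pressure": 1.0, "vibration": 0.5} for i in range(4)]
print(mean_row_layout(rows, "temp"))
print(mean_column_layout(to_columns(rows), "temp"))
```

Both layouts give the same mean, but the row layout touches 12 values and the column layout only 4. With around 1000 variables per record, the gap for a single-variable query grows accordingly.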
HashedIn successfully migrated the data from the data center to Amazon RDS instances using AWS Glue. The team also set up an EMR cluster, replicated a few of the analytical models written in R in PySpark, and benchmarked the entire process.