Inexpensive Analytics for a Cloud SaaS Product

Anshuman Singh

17 Jul 2018

So you care about growth and metrics for your product? This guide will walk you through setting up inexpensive analytics for a cloud SaaS product: an end-to-end data pipeline. It is written for AWS; however, it should work on Google Cloud Platform (GCP) or Azure if you replace the AWS services with their equivalents.

Data Collection

It's important to make sure that you are collecting data from all user and system interaction points. Create a spreadsheet and document all interactions, events and their destinations. Typically, they fall into the three categories below: log files, user events and application data.

Collecting Log Files

Data written to log files is pushed to Logstash, which then forwards it to the AWS Elasticsearch service. When writing logs, make sure you use consistent formatting and include enough detail.
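
As a sketch of what consistent formatting can look like, here is a minimal JSON formatter built on Python's standard logging module. The field names (service, request_id) are illustrative assumptions, not something Logstash requires.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Format every log record as a single JSON line so Logstash can parse it reliably."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Illustrative context fields; adjust to whatever your app actually tracks.
            "service": getattr(record, "service", "webapp"),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user signed in", extra={"request_id": "abc-123"})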

User Events

All user interactions are tracked and sent to Google Analytics (GA). This includes user events from the UI, which is straightforward. However, it also includes non-UI events, e.g. custom events tracked from the backend.

With GA, you should switch on Custom User ID tagging. This is a UUID generated for every user in your app and sent to GA. It has multiple benefits and lets you associate user activity from multiple sessions with a single signed-up user. Make sure you send the right User ID for all custom events tracked from the backend.
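
For backend events, a minimal sketch of sending a custom event with the User ID via the (Universal Analytics) Measurement Protocol could look like this; the tracking ID, category and action values are placeholders.

import uuid
import requests

GA_ENDPOINT = "https://www.google-analytics.com/collect"  # Universal Analytics Measurement Protocol

def track_backend_event(user_id, category, action, label=None):
    """Send a custom event to GA, attaching the app's own User ID (the 'uid' field)."""
    payload = {
        "v": "1",                   # protocol version
        "tid": "UA-XXXXXXX-1",      # placeholder tracking / property ID
        "cid": str(uuid.uuid4()),   # client ID, required even for server-side hits
                                    # (a stable cid per user is better; random shown for brevity)
        "uid": user_id,             # the UUID your app generated for this user
        "t": "event",
        "ec": category,             # event category
        "ea": action,               # event action
    }
    if label:
        payload["el"] = label
    requests.post(GA_ENDPOINT, data=payload, timeout=5)

# e.g. from a billing job (event names are illustrative):
track_backend_event("the-user-uuid-from-your-db", "billing", "subscription_renewed")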

Application Data

This is straightforward. You are storing user data anyway; just add a few extra audit fields, such as created and last-modified timestamps, to (almost) every table and data entity. It will help analytics later on.

This helps in analytics and auditing when things go wrong. For critical entities, you can add more fields, e.g. timesModified. Another option is to keep track of each and every modification; in that case, a CQRS-like, command-based modification flow will ensure you have the full change history for the entity.
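
As an illustration, here is a sketch of such audit fields using SQLAlchemy, assuming that (or a similar ORM) is in your stack; the entity and column names are just examples.

from sqlalchemy import Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class AuditMixin:
    """Audit columns worth adding to (almost) every table."""
    created_at = Column(DateTime, server_default=func.now(), nullable=False)
    modified_at = Column(DateTime, server_default=func.now(), onupdate=func.now(), nullable=False)

class Subscription(Base, AuditMixin):  # hypothetical entity, for illustration only
    __tablename__ = "subscriptions"
    id = Column(Integer, primary_key=True)
    plan = Column(String(50))
    # For critical entities, add more, e.g. a modification counter:
    times_modified = Column(Integer, server_default="0", nullable=False)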

ETL Pipeline

All your data is now being collected. However, it is in multiple places. For analytics, you want to get it all into one place so you can run queries and get insights from it. Here is what you can do.

Elasticsearch Data with Kibana & ElastAlert

You shouldn't be logging user events in logs; they go to GA. If you ensure that, that's one less ETL activity. With that, there are two types of problems to solve with the Elasticsearch logs: exploring and visualising them in Kibana, and alerting on error patterns with ElastAlert.
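
ElastAlert rules themselves are written as YAML config; as a rough sketch of the kind of condition you would alert on, here is an equivalent check using the elasticsearch Python client (v7-style API). The index pattern and field names are assumptions.

from elasticsearch import Elasticsearch

es = Elasticsearch(["https://your-es-domain.us-east-1.es.amazonaws.com:443"])  # placeholder endpoint

# Count ERROR-level log lines from the last 15 minutes (index pattern and fields are assumptions).
result = es.count(
    index="logstash-*",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level.keyword": "ERROR"}},
                    {"range": {"@timestamp": {"gte": "now-15m"}}},
                ]
            }
        }
    },
)

if result["count"] > 50:  # threshold is a judgment call
    print("Error rate is unusually high; raise an alert")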

DMS Pipeline for Importing Application Data

Reporting queries shouldn't run on the production database. AWS offers the Database Migration Service (DMS), which can keep two databases in sync. You can use DMS to set up a replication task that copies data into a data warehouse, which here is just another AWS RDS instance. If you choose AWS Aurora as the target database, the DMS instance is free for the first six months. DMS uses log replication for syncing, so the load it puts on the production database is minimal.
In this step, you should exclude tables and fields that are not needed for reporting or that contain sensitive data.
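
Here is a sketch of what the replication task and those exclusions can look like with boto3; the ARNs, schema and table/column names are placeholders, and the mapping follows DMS's selection/transformation rule format.

import json
import boto3

dms = boto3.client("dms")

# Replicate everything in the app schema, but skip an audit-log table and
# drop a sensitive column (all names are placeholders).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-app-schema",
            "object-locator": {"schema-name": "app", "table-name": "%"},
            "rule-action": "include",
        },
        {
            "rule-type": "selection",
            "rule-id": "2",
            "rule-name": "exclude-audit-log",
            "object-locator": {"schema-name": "app", "table-name": "audit_log"},
            "rule-action": "exclude",
        },
        {
            "rule-type": "transformation",
            "rule-id": "3",
            "rule-name": "drop-password-hash",
            "rule-target": "column",
            "object-locator": {"schema-name": "app", "table-name": "users", "column-name": "password_hash"},
            "rule-action": "remove-column",
        },
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="prod-to-warehouse",
    SourceEndpointArn="arn:aws:dms:region:account:endpoint:source",      # placeholder
    TargetEndpointArn="arn:aws:dms:region:account:endpoint:target",      # placeholder
    ReplicationInstanceArn="arn:aws:dms:region:account:rep:instance",    # placeholder
    MigrationType="full-load-and-cdc",  # initial copy plus ongoing log-based replication
    TableMappings=json.dumps(table_mappings),
)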

GA API for Importing User Event Data

The Google Analytics API can be used directly to import data. However, a different approach works too. Metabase is an open-source project for creating beautiful dashboards. It integrates with data sources like MySQL, Postgres, AWS Redshift and Google Analytics. You will be using Metabase for creating dashboards later anyway, so you might as well use it for the GA integration: add GA as a data source in Metabase, create a question that pulls the user-event data you need, and then use a small script to push that question's results into the data warehouse.

You would want the uploads to be overlapping and incremental. The easy way to do this is to run the script every 6 hours and have the Metabase report return data for the last 24 hours. The Python script then deletes the existing data for the date range it already has and re-uploads everything from the response. This does more work than necessary but is simple to implement.
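
A sketch of that script, assuming Metabase's session and card-query APIs, a hypothetical ga_events table in the warehouse, and psycopg2 as the driver:

import datetime
import requests
import psycopg2  # assuming a Postgres-compatible warehouse; use a MySQL driver otherwise

METABASE_URL = "https://metabase.example.com"  # placeholder
CARD_ID = 42                                   # the saved Metabase question that pulls GA data

# 1. Authenticate and run the saved question through the Metabase API.
token = requests.post(
    f"{METABASE_URL}/api/session",
    json={"username": "etl@example.com", "password": "change-me"},
).json()["id"]

rows = requests.post(
    f"{METABASE_URL}/api/card/{CARD_ID}/query/json",
    headers={"X-Metabase-Session": token},
).json()  # a list of dicts; the keys depend on the columns in the question

# 2. Delete the overlapping window and re-insert (simple, slightly wasteful, idempotent).
cutoff = datetime.date.today() - datetime.timedelta(days=1)
conn = psycopg2.connect("dbname=warehouse host=warehouse-host user=etl")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("DELETE FROM ga_events WHERE event_date >= %s", (cutoff,))
    for row in rows:
        cur.execute(
            "INSERT INTO ga_events (event_date, user_id, category, action, events) "
            "VALUES (%s, %s, %s, %s, %s)",
            (row["Date"], row["User ID"], row["Event Category"], row["Event Action"], row["Total Events"]),
        )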

Analytics Pipeline

Data Warehouse

Now you have all the useful data in the Aurora DB. You need a reporting schema that is both easy to write reports against and fast. The ETL pipeline brings in normalized data; next, you will create a star schema for reporting. Here is how that is done.

Creating the Schema
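
As a minimal sketch, assuming Aurora PostgreSQL and a hypothetical daily user-activity fact with user and date dimensions (adjust names and types to your domain):

import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS dim_user (
    user_key     SERIAL PRIMARY KEY,
    user_id      UUID NOT NULL,        -- the app's user UUID (the same one sent to GA)
    plan         VARCHAR(50),
    signed_up_on DATE
);

CREATE TABLE IF NOT EXISTS dim_date (
    date_key     INTEGER PRIMARY KEY,  -- e.g. 20180717
    full_date    DATE NOT NULL,
    month        INTEGER,
    year         INTEGER
);

CREATE TABLE IF NOT EXISTS fact_user_activity (
    user_key     INTEGER REFERENCES dim_user (user_key),
    date_key     INTEGER REFERENCES dim_date (date_key),
    logins       INTEGER DEFAULT 0,
    events       INTEGER DEFAULT 0,
    PRIMARY KEY (user_key, date_key)
);
"""

conn = psycopg2.connect("dbname=warehouse host=warehouse-host user=etl")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(DDL)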

Data Population

Once the schema is created, write queries to populate the data. This is straightforward, except that the queries will be run periodically, so you have to think about incremental updates.

Incremental Updates

Depending on the amount of data, there are multiple approaches. The simplest one is to wipe the complete data set and repopulate it. It is the simplest, but it does more work than necessary and is untenable for larger data sizes. Here is a quick take on making things incremental.

Schema creation scripts are run once; incremental updates then run from an hourly cron job.
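
Here is a sketch of what the hourly incremental update can look like, assuming the star schema sketched above and a replicated app.events source table (both names are assumptions); it rebuilds only the last couple of days of the fact table, since older rows don't change.

import psycopg2

conn = psycopg2.connect("dbname=warehouse host=warehouse-host user=etl")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Rebuild only the last two days of the fact table; older rows never change.
    cur.execute("""
        DELETE FROM fact_user_activity
        WHERE date_key >= to_char(current_date - 1, 'YYYYMMDD')::int
    """)
    # app.events is the replicated source table (name is an assumption);
    # dim_user and dim_date are assumed to be kept up to date by their own jobs.
    cur.execute("""
        INSERT INTO fact_user_activity (user_key, date_key, logins, events)
        SELECT du.user_key,
               to_char(e.event_time, 'YYYYMMDD')::int,
               count(*) FILTER (WHERE e.event_type = 'login'),
               count(*)
        FROM app.events e
        JOIN dim_user du ON du.user_id = e.user_id
        WHERE e.event_time >= current_date - 1
        GROUP BY 1, 2
    """)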

Ongoing Maintenance

Once you set this up, there isn't really any ongoing maintenance needed to keep it operating as is. However, as things change and new reports are needed, you will need to extend each stage of the pipeline: track the new events, replicate any new tables, and add them to the reporting schema.

As a best practice, create a dashboard to monitor the pipeline itself. It should track the last update from each source; if that becomes too old, something is broken.
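
As a sketch of such a freshness check (the tables and columns queried are assumptions), a small script run from cron can compare each source's latest record against a threshold:

import datetime
import psycopg2

# One "latest record" query per source; the tables and columns are assumptions.
CHECKS = {
    "dms_replication": "SELECT max(modified_at) FROM app.users",
    "ga_import": "SELECT max(event_date) FROM ga_events",
    "star_schema": "SELECT max(full_date) FROM fact_user_activity JOIN dim_date USING (date_key)",
}
MAX_AGE = datetime.timedelta(hours=36)  # threshold is a judgment call per source

conn = psycopg2.connect("dbname=warehouse host=warehouse-host user=etl")  # placeholder DSN
with conn, conn.cursor() as cur:
    for source, query in CHECKS.items():
        cur.execute(query)
        latest = cur.fetchone()[0]
        if latest is not None and not isinstance(latest, datetime.datetime):
            latest = datetime.datetime.combine(latest, datetime.time.min)  # dates -> midnight
        stale = latest is None or datetime.datetime.now() - latest > MAX_AGE
        print(f"{source}: last update {latest} {'STALE, investigate' if stale else 'ok'}")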

That should set you up with a solid, simple and relatively inexpensive analytics pipeline. But that's just the start. Now go ahead and build reports and dashboards to understand user behaviour and grow your product.

Some configuration and a bit of black magic are needed for each of the parts above to work. If you are planning to do this and run into setup issues, leave a comment or message; I will be happy to help.

