
Inexpensive Analytics for Cloud SaaS Product

Business - 17 Jul 2018
Anshuman Singh

So you care about the growth and metrics of your product? This guide will walk you through setting up an end-to-end data pipeline. It is written for AWS, but it should work on Google Cloud Platform (GCP) or Azure if you replace the AWS services with their equivalents.

Data Collection

It’s important to make sure that you are collecting data from all user and system interaction points. Create a spreadsheet and document all interactions, events and their destinations. Typically, they fall into these three categories:

  • Monitoring Data goes into Log Files
    These are system events, alerts, warnings and execution time logs.
  • User Events go to Google Analytics (GA)
    Data for all user interactions, from the UI, emails, services, chatbots and other integrations.
  • Application Data in the Database
    The actual user domain data, used by the application.

Collecting Log Files

Data written to log files is pushed to Logstash, which then pushes it to an AWS ElasticSearch service. When writing the logs, make sure you use consistent formatting and include enough detail.

  • Log Context
    The active user, team, request, and activity that was going on when the log was written. This makes it easy to trace a problem back to its origin and to keep track of which parts of the application cause the most problems.
  • Consistent Formatting
    Consistent formatting makes it easier to extract the information later. While some people use JSON, we have found it easier to use key-value pairs. Not ideal, but quick to write, small in size and easy to read (a sketch of the format follows this list).
  • Logging Size
    If you are doing something in a loop, maybe you just need to log a summary. Keeping the logs summarised and at the right level will make later analysis easier and keep a tab on data collection costs. It sounds obvious not to log 100 KB per request inside a loop, but it is easy to do by accident. We have!
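As an illustration of the key-value style, here is a minimal sketch of a logging helper in Python. The helper name and the context keys (user, team, request_id, activity) are hypothetical; adapt them to your own log context.

    import logging

    logger = logging.getLogger("app")

    def kv_log(level, message, **context):
        """Log a message with its context appended as key=value pairs for easy parsing later."""
        pairs = " ".join(f"{key}={value}" for key, value in sorted(context.items()))
        logger.log(level, "%s %s", message, pairs)

    # One summary line per request, instead of one line per loop iteration
    kv_log(logging.INFO, "import finished",
           user="u_123", team="t_9", request_id="req_42",
           activity="csv_import", rows_processed=5000, duration_ms=830)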

User Events

All user interactions are tracked and sent to Google Analytics (GA). This includes user events from the UI, which is straightforward. However, it also includes non-UI events, for example:

  • Email Opens
    You can use a single pixel image to track email opens. This image URL has a tracker key. When the image is fetched, send a custom event to the GA backend.
  • Chatbot Interactions
    That is another user interaction. Messages sent or actions taken typically call a webhook URL, and the application code handling the webhook sends a custom event to the GA backend.
  • Integrations
    Your SaaS product would have integrations with other services. Whenever a backend service is called, send a custom event to the GA backend to track that activity.

With GA, you should switch on Custom User ID tagging. This would be a UUID generated for every user in your app and sent to GA. It has multiple benefits and allows associating user activity from multiple sessions with a single signed-up user. Make sure you send the right User ID for all custom events tracked from a backend.
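For the backend-originated events (email opens, chatbot webhooks, integrations), a minimal sketch of sending such a custom event from Python is shown below. It assumes a Universal Analytics property and the GA Measurement Protocol; the tracking ID and the category/action values are placeholders.

    import uuid
    import requests

    GA_ENDPOINT = "https://www.google-analytics.com/collect"
    GA_TRACKING_ID = "UA-XXXXXXX-1"  # placeholder property ID

    def send_ga_event(user_id, category, action, label=None, value=None):
        """Send a custom event hit to Google Analytics via the Measurement Protocol."""
        payload = {
            "v": "1",                  # protocol version
            "tid": GA_TRACKING_ID,     # tracking / property ID
            "cid": str(uuid.uuid4()),  # client ID for this hit
            "uid": user_id,            # your app's custom User ID
            "t": "event",              # hit type
            "ec": category,            # event category, e.g. "email"
            "ea": action,              # event action, e.g. "open"
        }
        if label is not None:
            payload["el"] = label
        if value is not None:
            payload["ev"] = str(value)
        requests.post(GA_ENDPOINT, data=payload, timeout=5)

    # e.g. called from the tracking-pixel handler when an email is opened
    send_ga_event(user_id="user-6f1c", category="email", action="open", label="weekly_digest")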

Application Data

This is straightforward. You are storing user data anyway. Just store these extra fields for (almost) every table and data entity. It will help analytics later on.

  • createdDate: when the entity was created
  • createdBy: who created the entity
  • lastModifiedDate: when the entity was last modified
  • lastModifiedBy: who last modified the entity

This helps in analytics and auditing when things go wrong. For critical entities, you can add more fields, e.g. timesModified. Another option is to keep track of each and every modification; in such cases, a CQRS-like, command-based modification model will make sure you have a full change history for the entity.
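As a sketch of what these audit fields can look like in code, here is a mixin assuming SQLAlchemy as the ORM; the current_user_id lookup is a placeholder you would wire to your auth layer.

    from datetime import datetime
    from sqlalchemy import Column, DateTime, String

    def current_user_id():
        # Placeholder: return the ID of the authenticated user from your auth layer
        return "system"

    class AuditMixin:
        """Adds the four audit columns to any model that inherits from it."""
        createdDate = Column(DateTime, default=datetime.utcnow, nullable=False)
        createdBy = Column(String(64), default=current_user_id, nullable=False)
        lastModifiedDate = Column(DateTime, default=datetime.utcnow,
                                  onupdate=datetime.utcnow, nullable=False)
        lastModifiedBy = Column(String(64), default=current_user_id,
                                onupdate=current_user_id, nullable=False)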

ETL Pipeline

All your data is now collected. However, it is in multiple places. For analytics, you want it all in one place so you can run queries and get insights from it. This is what you can do.

ElasticSearch Data with Kibana & Elastalert

You shouldn’t be logging user events in logs; they go to GA. If you ensure that, that’s one less ETL activity. With that, there are two types of problems to solve using the ElasticSearch logs.

  • Monitoring Dashboards
    AWS provides Kibana out of the box on ElasticSearch service. Set it up and create dashboards for monitoring application health and metrics e.g. jobs processed, response times etc.
  • Alerting
    Just monitoring is not enough. You need to know when something is going wrong or about to go wrong. Yelp has an excellent utility called Elastalert. You can use it to write search queries to monitor, and if the results cross a certain threshold, it can send a message to your team's Slack channel. Just set it up on a t2.micro machine to run as a daemon.

DMS Pipeline for Importing Application Data

Reporting queries can’t run on the production database. AWS offers the Database Migration Service (DMS), which can keep two databases in sync. You can use DMS to set up a replication task that copies data into a data warehouse, which is just another AWS RDS instance. If you choose AWS Aurora as the database, the first 6 months of the DMS instance are free. DMS uses log replication for syncing, so the load it puts on the production database is minimal.

In this step, you should exclude certain tables and fields (a sketch of the table mappings follows this list).

  • Tables not needed for analytics e.g. Audit tables, Temporary data
  • Large text and binary fields not needed for Analytics
  • Personally identifiable information e.g. name, email, address
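A minimal sketch of what those exclusions might look like when creating the DMS replication task with boto3. The ARNs, schema, table and column names are placeholders, and the table-mapping rules follow DMS's selection and transformation rule format; verify the exact rules against the DMS documentation for your engine.

    import json
    import boto3

    dms = boto3.client("dms", region_name="us-east-1")  # region is a placeholder

    table_mappings = {
        "rules": [
            # Include everything by default...
            {"rule-type": "selection", "rule-id": "1", "rule-name": "include-all",
             "object-locator": {"schema-name": "app", "table-name": "%"},
             "rule-action": "include"},
            # ...but skip tables not needed for analytics, e.g. audit tables
            {"rule-type": "selection", "rule-id": "2", "rule-name": "skip-audit",
             "object-locator": {"schema-name": "app", "table-name": "audit_%"},
             "rule-action": "exclude"},
            # Drop a PII column from the replicated users table
            {"rule-type": "transformation", "rule-id": "3", "rule-name": "drop-email",
             "rule-target": "column", "rule-action": "remove-column",
             "object-locator": {"schema-name": "app", "table-name": "users",
                                "column-name": "email"}},
        ]
    }

    dms.create_replication_task(
        ReplicationTaskIdentifier="prod-to-warehouse",
        SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:source",     # placeholder
        TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:warehouse",  # placeholder
        ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:etl",        # placeholder
        MigrationType="full-load-and-cdc",
        TableMappings=json.dumps(table_mappings),
    )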

GA API for Importing User Event Data

The Google Analytics API can be used directly to import data. However, a different approach can work too. Metabase is an open-source project to create beautiful dashboards. It integrates with data sources like MySQL, Postgres, AWS Redshift and Google Analytics. You will be using Metabase for creating dashboards later, so you might as well utilize it for the GA integration. Here is how you can set it up:

  • Set up Metabase and integrate your GA account with it
  • Create reports in Metabase for the events and data you want to import
  • Metabase exposes an API to download each report's results as a CSV (see the script sketch below)
  • Set up an hourly Python script to download the CSVs and upload them to the Aurora DB
  • You would need an API key, which can be generated using a Metabase user

You would want the uploads to be overlapping and incremental. The easy way to do it is to run the script every 6 hours and have the Metabase report return data for the last 24 hours. The Python script then deletes the existing data for the date range it already has and re-uploads everything from the response. This does more work than necessary but is simple enough to implement.
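Here is a minimal sketch of that periodic script. It assumes Metabase's session and card-CSV endpoints, a hypothetical saved report (card) ID, a ga_events table in Aurora keyed by an event_date column, and pandas plus SQLAlchemy for the upload; all URLs, credentials and names are placeholders.

    import io
    import requests
    import pandas as pd
    from sqlalchemy import create_engine, text

    METABASE_URL = "https://metabase.example.com"   # placeholder
    CARD_ID = 42                                    # hypothetical saved report (card) ID
    AURORA_URI = "mysql+pymysql://etl:secret@warehouse.example.com/analytics"  # placeholder

    def fetch_report_csv():
        # Authenticate against Metabase and get a session token
        session = requests.post(f"{METABASE_URL}/api/session",
                                json={"username": "etl@example.com", "password": "secret"}).json()
        headers = {"X-Metabase-Session": session["id"]}
        # Download the saved report's results as CSV
        resp = requests.post(f"{METABASE_URL}/api/card/{CARD_ID}/query/csv", headers=headers)
        resp.raise_for_status()
        return pd.read_csv(io.StringIO(resp.text))

    def upload_incremental(df):
        engine = create_engine(AURORA_URI)
        start, end = df["event_date"].min(), df["event_date"].max()
        with engine.begin() as conn:
            # Delete the overlapping window, then re-insert everything from the response
            conn.execute(text("DELETE FROM ga_events WHERE event_date BETWEEN :s AND :e"),
                         {"s": start, "e": end})
            df.to_sql("ga_events", conn, if_exists="append", index=False)

    if __name__ == "__main__":
        upload_incremental(fetch_report_csv())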

Data Warehouse

Now you have all the useful data in the Aurora DB. Next, you need a reporting schema that is easy to write reports against as well as fast to query. The ETL pipeline brings in normalized data; we will next create a star schema for reporting. Here is how that is done.

Creating the Schema

  • Identify dimensions and create tables for them. For example, if user, date and country are dimensions, then you would have dim_users, dim_dates and dim_country as dimension tables. The way to think about dimensions is that they show up in your SQL query WHERE clauses.
  • Identify dimension attributes for the dimensions, e.g. is it a paid user, their signup date etc.
  • Identify generated attributes for the dimensions. Sometimes it's a calculated attribute that you want to use in your queries, e.g. user age in days since signup.
  • Create fact tables for combinations of dimensions, e.g. how many times a user logged in on a particular day would be a fact for dim_users and dim_dates. There are many combinations possible and the right ones depend on your domain and the reports you want to generate (a sketch follows this list).
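A minimal sketch of such a schema for the hypothetical dimensions above, written as MySQL/Aurora DDL run through SQLAlchemy; the table and column names are illustrative.

    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://etl:secret@warehouse.example.com/analytics")  # placeholder

    DDL = [
        # Dimension: one row per user, with plain and generated attributes
        """CREATE TABLE IF NOT EXISTS dim_users (
               user_id      BIGINT PRIMARY KEY,
               is_paid      BOOLEAN,
               signup_date  DATE,
               age_in_days  INT
           )""",
        # Dimension: one row per calendar date
        """CREATE TABLE IF NOT EXISTS dim_dates (
               date_id  DATE PRIMARY KEY,
               month    TINYINT,
               year     SMALLINT,
               weekday  TINYINT
           )""",
        # Fact: one row per (user, date) combination
        """CREATE TABLE IF NOT EXISTS fact_user_activity (
               user_id          BIGINT,
               date_id          DATE,
               logins_per_day   INT DEFAULT 0,
               invitations_sent INT DEFAULT 0,
               PRIMARY KEY (user_id, date_id)
           )""",
    ]

    with engine.begin() as conn:
        for statement in DDL:
            conn.execute(text(statement))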

Data Population

Once the schema is created, write queries to populate the data. This is straightforward, except that the queries will be run periodically, so you have to think about incremental updates.

Incremental Updates

Depending on the amount of data, there are multiple approaches. The simplest one is to wipe off the complete data set and repopulate it; it does more work than necessary, though, and becomes untenable for larger data sizes. Here is a quick take on making things incremental.

  • Dimension table rows are easily made incremental. Just insert rows not already present, based on primary keys.
  • Dimension table columns fall into two cases: those that never change, e.g. signup date, can be handled by the insert queries; those that change, e.g. total_time_spent, need to be updated on every run.
  • Fact table rows need additional rows based on newly inserted dimensions. It's a cross join of the dimensions, excluding rows that are already present.
  • Fact table columns again fall into two cases: those with a temporal quality that don't need to be updated once calculated, e.g. logins_per_day, which once calculated for a past date won't change; and attributes that have to be recalculated every time, e.g. invitations_sent. A sketch follows this list.
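A sketch of incremental population for the hypothetical tables above: new dimension rows come from an anti-join on the primary key, changing dimension columns are recomputed in place, and the fact refresh uses an upsert (ON DUPLICATE KEY) restricted to recent dates so already-calculated temporal values are left alone. The source tables (users from the DMS replica, ga_events from the GA import) and all queries are illustrative.

    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://etl:secret@warehouse.example.com/analytics")  # placeholder

    INCREMENTAL = [
        # Dimension rows: insert users not already present, based on the primary key
        """INSERT INTO dim_users (user_id, is_paid, signup_date, age_in_days)
           SELECT u.id, u.is_paid, DATE(u.createdDate), DATEDIFF(CURDATE(), u.createdDate)
           FROM users u
           LEFT JOIN dim_users d ON d.user_id = u.id
           WHERE d.user_id IS NULL""",
        # Dimension columns that change: recompute only those
        """UPDATE dim_users d
           JOIN users u ON u.id = d.user_id
           SET d.age_in_days = DATEDIFF(CURDATE(), u.createdDate)""",
        # Fact refresh for the last 2 days only; older rows are temporal and stay as-is
        """INSERT INTO fact_user_activity (user_id, date_id, logins_per_day)
           SELECT e.user_id, e.event_date, COUNT(*)
           FROM ga_events e
           WHERE e.event_date >= CURDATE() - INTERVAL 2 DAY AND e.event_action = 'login'
           GROUP BY e.user_id, e.event_date
           ON DUPLICATE KEY UPDATE logins_per_day = VALUES(logins_per_day)""",
    ]

    with engine.begin() as conn:
        for statement in INCREMENTAL:
            conn.execute(text(statement))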

Schema creation scripts are run once. And then incremental updates are run using an hourly cron job.

Ongoing Maintenance

Once you set this up, there isn't really any ongoing maintenance needed to keep it operating as is. However, as things change and new reports are needed, you would need to do the following.

  • Performance & Query Optimization
    Some reports will start running slowly as data grows. Create indexes and follow standard query optimization techniques to improve their runtimes.
  • Reporting Schema Changes
    Simple schema changes can be additive. But from time to time, you will need a more destructive change. If so, just make the change, wipe out the schema, recreate it with the changes and re-run the data population scripts.
  • Production DB Schema Changes
    Fix the queries and run them again. If it's a big change, recreate the warehouse as in the previous step.
  • AWS DMS Management
    DMS tasks fail from time to time. However, it's pretty infrequent, less than once a month. If that happens, just restart the tasks.

As a best practice, create a dashboard to monitor the pipeline itself. It should keep track of the last update from each source; if any of those become too old, something is broken.
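A minimal sketch of such a freshness check, assuming one "last update" timestamp query per source and a Slack incoming-webhook URL; the queries, column names, threshold and webhook are all placeholders.

    from datetime import datetime, timedelta

    import requests
    from sqlalchemy import create_engine, text

    SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
    engine = create_engine("mysql+pymysql://etl:secret@warehouse.example.com/analytics")  # placeholder

    # Hypothetical "last update" queries, one per source; each should return a timestamp
    FRESHNESS_CHECKS = {
        "dms_replica": "SELECT MAX(lastModifiedDate) FROM users",
        "ga_import":   "SELECT MAX(loaded_at) FROM ga_events",
        "warehouse":   "SELECT MAX(refreshed_at) FROM fact_user_activity",
    }
    MAX_AGE = timedelta(hours=12)

    with engine.connect() as conn:
        for source, query in FRESHNESS_CHECKS.items():
            last_update = conn.execute(text(query)).scalar()
            stale = last_update is None or datetime.utcnow() - last_update > MAX_AGE
            if stale:
                requests.post(SLACK_WEBHOOK,
                              json={"text": f"Analytics pipeline: '{source}' has not updated since {last_update}"})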

That should set you up with a solid, simple and relatively inexpensive analytics pipeline. But that's just the start. Now go ahead and build reports and dashboards to understand user behavior and grow your product.

Some configuration and black magic is needed for each of the parts above to work. If you are planning to do this and face setup issues, leave a comment or message. I will be happy to help.
