The Modern Data Stack

Data Sources

Companies generate large volumes of data from several kinds of sources:

  • OLTP Databases– OLTP (Online Transaction Processing) systems handle large volumes of transactional data: user information and operational data generated by user activity, such as e-commerce purchases and online banking. The database management system (DBMS) behind a typical application is an OLTP system. MySQL, MongoDB and PostgreSQL are some well-known examples.
  • SaaS tools– Companies run their business on many SaaS tools, such as CRM tools that store sales, marketing and customer success data (Salesforce, HubSpot) and payment/billing software (Stripe).
  • Event Collectors– Nearly every touch point with users is now recorded as an event for later analysis, including every click on websites and apps. Segment and Snowplow are popular choices for collecting events.

Extract and Load

Data from all these sources is extracted and loaded into a centralized data warehouse or data lake. Earlier, the sequence was ETL: data was first extracted, then transformed, and then loaded into the data warehouse. It has now evolved to ELT: data is extracted and loaded into the data warehouse, and transformed later within the warehouse itself.
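The ELT ordering can be sketched in a few lines. This is a minimal illustration, not a production pipeline: an in-memory SQLite database stands in for the cloud warehouse, and the source rows, table names and function names are all hypothetical.

```python
import sqlite3

# Hypothetical rows extracted from a source system (e.g. an OLTP orders table).
SOURCE_ROWS = [
    {"order_id": 1, "customer": "alice", "amount_cents": 1250, "status": "paid"},
    {"order_id": 2, "customer": "bob", "amount_cents": 800, "status": "refunded"},
    {"order_id": 3, "customer": "alice", "amount_cents": 2000, "status": "paid"},
]


def extract():
    """Extract: read rows from the source without reshaping them."""
    return list(SOURCE_ROWS)


def load(conn, rows):
    """Load: write the raw rows into the warehouse as-is."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, "
        "amount_cents INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES "
        "(:order_id, :customer, :amount_cents, :status)",
        rows,
    )


def transform(conn):
    """Transform: runs last, inside the warehouse, as plain SQL."""
    conn.execute(
        "CREATE TABLE customer_revenue AS "
        "SELECT customer, SUM(amount_cents) / 100.0 AS revenue "
        "FROM raw_orders WHERE status = 'paid' GROUP BY customer"
    )


def run_elt():
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
    load(conn, extract())               # E and L happen first
    transform(conn)                     # T happens afterwards, in place
    return conn.execute(
        "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(run_elt())
```

Because the raw table is kept untouched, the transformation can be rewritten and rerun later without extracting from the source again, which is the practical difference from ETL.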

Data Storage

Ingestion tools store the data in a cloud data warehouse or data lake. A data warehouse stores structured data (tables) that can be queried directly for analytics. Popular cloud data warehouses are Snowflake, Google BigQuery and Amazon Redshift.

Data Transformation

Once stored, the data is transformed directly in the warehouse into a structure ready for analysis, which data science and business teams use to run analytics and ML models. dbt, Airflow and LookML are among the most popular tools for this step.
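In-warehouse transformation usually means a chain of SQL steps, each building on the output of the previous one, so the steps have to run in dependency order. The sketch below illustrates that idea using only the Python standard library; it is not how dbt or Airflow are implemented, and the model names, SQL and sample data are hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical models: name -> (upstream models, SQL that builds the model).
MODELS = {
    "top_customer": (
        {"customer_revenue"},
        "CREATE TABLE top_customer AS "
        "SELECT customer FROM customer_revenue ORDER BY revenue DESC LIMIT 1",
    ),
    "customer_revenue": (
        {"stg_orders"},
        "CREATE TABLE customer_revenue AS "
        "SELECT customer, SUM(amount) AS revenue FROM stg_orders GROUP BY customer",
    ),
    "stg_orders": (
        set(),
        "CREATE TABLE stg_orders AS "
        "SELECT order_id, LOWER(customer) AS customer, "
        "amount_cents / 100.0 AS amount "
        "FROM raw_orders WHERE status = 'paid'",
    ),
}


def build_models(conn):
    """Run every model after the models it depends on; return the run order."""
    graph = {name: deps for name, (deps, _sql) in MODELS.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        conn.execute(MODELS[name][1])
    return order


def demo():
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, "
        "amount_cents INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [
            (1, "Alice", 1250, "paid"),
            (2, "Bob", 800, "paid"),
            (3, "alice", 2000, "paid"),
            (4, "Bob", 5000, "refunded"),
        ],
    )
    order = build_models(conn)
    top = conn.execute("SELECT customer FROM top_customer").fetchone()[0]
    return order, top


if __name__ == "__main__":
    print(demo())
```

The models are deliberately declared out of order; the topological sort recovers a valid run order from the declared dependencies.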

Analysis/ Output

The transformed data can be used for different purposes:

  • BI/ Visualisation– These tools enable business users to derive insights. They provide a dashboard view with graphs and pie charts, which gives the business visibility into its data. Tableau, Looker and Power BI are some popular BI tools.
  • Data Workspaces– These tools make it easier for different users to query, visualize and collaborate on data and to create dashboards. Some of the emerging data workspace tools are Hex, Deepnote, Mode and Noteable.
  • Data Science, AI/ ML– Data scientists can run ML models on the data with the help of these tools. Popular tools include SageMaker and Continual.
  • Reverse ETL– It syncs aggregated data back to the SaaS tools used by customer support, sales and marketing teams, giving business users a full view of the customer inside their primary software. Census and Hightouch are popular reverse ETL tools.
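As an illustration of the reverse ETL item above, the following sketch reads aggregated rows from the warehouse and pushes only the changed fields to a SaaS tool. Everything here is an assumption made for the example: `FakeCRM` is an in-memory stand-in rather than any vendor's API, and the `customer_metrics` table and its columns are hypothetical.

```python
import sqlite3


class FakeCRM:
    """Hypothetical in-memory stand-in for a SaaS tool's API."""

    def __init__(self, contacts):
        self.contacts = contacts  # email -> dict of contact fields

    def update_contact(self, email, fields):
        self.contacts.setdefault(email, {}).update(fields)


def reverse_etl(conn, crm):
    """Sync warehouse aggregates to the CRM; return how many contacts changed."""
    pushed = 0
    rows = conn.execute(
        "SELECT email, lifetime_value, order_count FROM customer_metrics"
    ).fetchall()
    for email, lifetime_value, order_count in rows:
        desired = {"lifetime_value": lifetime_value, "order_count": order_count}
        current = crm.contacts.get(email, {})
        # Send only fields whose value differs from what the CRM already holds.
        changed = {k: v for k, v in desired.items() if current.get(k) != v}
        if changed:
            crm.update_contact(email, changed)
            pushed += 1
    return pushed


def demo():
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
    conn.execute(
        "CREATE TABLE customer_metrics "
        "(email TEXT, lifetime_value REAL, order_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customer_metrics VALUES (?, ?, ?)",
        [("a@example.com", 32.5, 2), ("b@example.com", 8.0, 1)],
    )
    crm = FakeCRM(
        {"a@example.com": {"lifetime_value": 32.5, "order_count": 2, "owner": "sam"}}
    )
    first = reverse_etl(conn, crm)   # pushes the contact the CRM lacks
    second = reverse_etl(conn, crm)  # nothing changed, so nothing is pushed
    return first, second, crm.contacts


if __name__ == "__main__":
    print(demo())
```

Comparing against the current state before writing keeps repeated syncs cheap and leaves fields the warehouse does not own (here, `owner`) untouched.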

Data Monitoring and Governance

We also need to maintain operational data hygiene. Three major categories of data ops software help reduce the risk, operational complexity and cost of cloud data:

  • Data Observability– Testing and monitoring of pipelines to detect and resolve errors or issues. Monte Carlo, Acceldata and Great Expectations are popular choices.
  • Data Discovery– Data cataloguing, documentation and discovery, so that people can find the right tables for their use. Atlan, Amundsen and Alation are popular tools here.
  • Data Security– Access control and data security to safeguard the company’s data, including control over which employee has access to which data. Cyral and Immuta are emerging tools in this category.
  • Introduction of the data lakehouse by Databricks and Unistore by Snowflake– Databricks has introduced the data lakehouse, which combines the flexibility and cost efficiency of a data lake with the data management capabilities of a data warehouse. It is an open data management architecture that enables analytics, BI and ML on all data types.
https://www.databricks.com/glossary/data-lakehouse
https://www.snowflake.com/en/data-cloud/platform/
  • Data Marketplace– Snowflake has become a behemoth and is now taking a platform approach, enabling products to be built on top of it. The idea is that companies can use the native application framework to build native Snowflake apps and distribute them through Snowflake Marketplace. Snowflake customers can discover, evaluate and run the apps in their own accounts, which removes the need to move data and thereby improves privacy and security. Customers can bring apps to the data rather than moving data to different apps. This eliminates the delay and cost of traditional ETL by giving direct access to ready-to-query data and pre-built SaaS connectors.
https://www.snowflake.com/snowflake-marketplace/
  • MDSaaS– Modern Data Stack as a Service. The data stack is complex, and evaluating tools and setting up the entire stack can be a challenging, time-consuming process. Low/no-code platforms provide all the tools needed to go from data sources to interactive dashboards. Some of the emerging startups here are Selfr.io and Octolis.
