Jinal Desai

My thoughts and learnings

Database vs Data Warehouse vs Data Mart vs Data Lake vs Data Lakehouse

db-vs-warehouse-vs-mart-vs-catalog-vs-lake-vs-lakehouse

There are lots of terminologies around data store, based on its purpose, use cases and how it’s being built. Understanding the difference by putting these terminologies together side-by-side is a very effective way to clear the confusion. Let’s see it one by one.

Database

  • Used for OLTP (Online Transaction Processing)
  • RDBMS or Non-SQL (semi-structured or unstructured data) storage
  • Mostly single purpose data store
  • Stores real-time data
  • Handles massive volume of simple queries quickly
  • Data is processed, organized, managed, updated and then stored in tables

Data Warehouse

  • Specific data for specific use
  • Used for OLAP (Online Analytics Processing)
  • Pulls data from many sources within organization
  • Stores mainly structured data
  • Stores historical data about your business
  • More costly to update
  • Optimized for data analysis
  • Used by data analysts and business analysts
  • Ad-hoc, read-only queries
  • Supports complex queries for in-depth analysis
  • Suitable for analytics, reporting and BI workload
  • ACID compliant up to highest level of data integrity

Data Mart

  • It’s a subset of a data warehouse
  • Focused and tailored version of data warehouse
  • Data about specific set of users in business or business unit
  • May tailored to demography or persona
  • Often utilized in real-time for analysis and actionable results
  • Focused on single business line or unit
  • Used to quickly access critical insights
  • For example, many companies have data marts aligned with specific departments

Data Lake

  • Store all the raw data
  • Mostly semi-structured and unstructured data
  • Dumping ground for all sorts of data generated in various parts of business
  • No filtering on data
  • Can be petabytes (1 million GBs)
  • Stores all data structures
  • Cost-effective
  • Difficult to analyze
  • Requires an up-to-date data catalog
  • Used by data scientists
  • Big data, real-time analytics
  • Used for predictive and advanced analytics
  • Suitable for ML and AI workloads
  • Non-ACID compliance, updates and deletes are complex operations
  • Poor data quality controls

Data Catalog for Data Lake (bonus)

  • It’s a source of truth for data lakes
  • It answers
    • What is the source of this data?
    • Where is this data used?
    • Who is the owner of the data?
    • How often is this data updated?
  • Good practice in terms of data governance
  • Ensures reproducibility
  • No catalog results in data swamp

data-warehouse-vs-data-lake-vs-data-lakehouse

Data Lakehouse

  • Best of both, Data Lake + Data Warehouse
  • Supports structured, semi-structured and unstructured data
  • Suitable for analytics and AI+ML workloads
  • ACID compliant to confirm data consistency
  • Costly to build from scratch
  • Supports BI, analytics, ML, AI and data science use cases
  • Good data quality controls in place
  • Open standard exists like Delta Lake to transform data lake with structured transactional layer

Data Mining (bonus)

  • Knowledge Discovery in Data (KDD)
  • Process of uncovering patterns and valuable information in large data sets
  • Also used for identifying relationships between large data sets
  • Involves
    • Exploring data
    • Searching data
    • Analyzing data
    • Finding patterns
    • Finding anomalies
    • Finding trends
    • Finding relationships

Leave a Reply

Your email address will not be published. Required fields are marked *