Database vs Data Warehouse vs Data Mart vs Data Lake vs Data Lakehouse

There are lots of terminologies around data store, based on its purpose, use cases and how it’s being built. Understanding the difference by putting these terminologies together side-by-side is a very effective way to clear the confusion. Let’s see it one by one.

Database

  • Used for OLTP (Online Transaction Processing)
  • RDBMS or Non-SQL (semi-structured or unstructured data) storage
  • Mostly single purpose data store
  • Stores real-time data
  • Handles massive volume of simple queries quickly
  • Data is processed, organized, managed, updated and then stored in tables

Data Warehouse

  • Specific data for specific use
  • Used for OLAP (Online Analytics Processing)
  • Pulls data from many sources within organization
  • Stores mainly structured data
  • Stores historical data about your business
  • More costly to update
  • Optimized for data analysis
  • Used by data analysts and business analysts
  • Ad-hoc, read-only queries
  • Supports complex queries for in-depth analysis
  • Suitable for analytics, reporting and BI workload
  • ACID compliant up to highest level of data integrity

Data Mart

  • It’s a subset of a data warehouse
  • Focused and tailored version of data warehouse
  • Data about specific set of users in business or business unit
  • May tailored to demography or persona
  • Often utilized in real-time for analysis and actionable results
  • Focused on single business line or unit
  • Used to quickly access critical insights
  • For example, many companies have data marts aligned with specific departments

Data Lake

  • Store all the raw data
  • Mostly semi-structured and unstructured data
  • Dumping ground for all sorts of data generated in various parts of business
  • No filtering on data
  • Can be petabytes (1 million GBs)
  • Stores all data structures
  • Cost-effective
  • Difficult to analyze
  • Requires an up-to-date data catalog
  • Used by data scientists
  • Big data, real-time analytics
  • Used for predictive and advanced analytics
  • Suitable for ML and AI workloads
  • Non-ACID compliance, updates and deletes are complex operations
  • Poor data quality controls

Data Catalog for Data Lake (bonus)

  • It’s a source of truth for data lakes
  • It answers
    • What is the source of this data?
    • Where is this data used?
    • Who is the owner of the data?
    • How often is this data updated?
  • Good practice in terms of data governance
  • Ensures reproducibility
  • No catalog results in data swamp

data-warehouse-vs-data-lake-vs-data-lakehouse

Data Lakehouse

  • Best of both, Data Lake + Data Warehouse
  • Supports structured, semi-structured and unstructured data
  • Suitable for analytics and AI+ML workloads
  • ACID compliant to confirm data consistency
  • Costly to build from scratch
  • Supports BI, analytics, ML, AI and data science use cases
  • Good data quality controls in place
  • Open standard exists like Delta Lake to transform data lake with structured transactional layer

Data Mining (bonus)

  • Knowledge Discovery in Data (KDD)
  • Process of uncovering patterns and valuable information in large data sets
  • Also used for identifying relationships between large data sets
  • Involves
    • Exploring data
    • Searching data
    • Analyzing data
    • Finding patterns
    • Finding anomalies
    • Finding trends
    • Finding relationships


Leave a Reply

Your email address will not be published. Required fields are marked *

*