There are lots of terminologies around data store, based on its purpose, use cases and how it’s being built. Understanding the difference by putting these terminologies together side-by-side is a very effective way to clear the confusion. Let’s see it one by one.
Database
- Used for OLTP (Online Transaction Processing)
- RDBMS or Non-SQL (semi-structured or unstructured data) storage
- Mostly single purpose data store
- Stores real-time data
- Handles massive volume of simple queries quickly
- Data is processed, organized, managed, updated and then stored in tables
Data Warehouse
- Specific data for specific use
- Used for OLAP (Online Analytics Processing)
- Pulls data from many sources within organization
- Stores mainly structured data
- Stores historical data about your business
- More costly to update
- Optimized for data analysis
- Used by data analysts and business analysts
- Ad-hoc, read-only queries
- Supports complex queries for in-depth analysis
- Suitable for analytics, reporting and BI workload
- ACID compliant up to highest level of data integrity
Data Mart
- It’s a subset of a data warehouse
- Focused and tailored version of data warehouse
- Data about specific set of users in business or business unit
- May tailored to demography or persona
- Often utilized in real-time for analysis and actionable results
- Focused on single business line or unit
- Used to quickly access critical insights
- For example, many companies have data marts aligned with specific departments
Data Lake
- Store all the raw data
- Mostly semi-structured and unstructured data
- Dumping ground for all sorts of data generated in various parts of business
- No filtering on data
- Can be petabytes (1 million GBs)
- Stores all data structures
- Cost-effective
- Difficult to analyze
- Requires an up-to-date data catalog
- Used by data scientists
- Big data, real-time analytics
- Used for predictive and advanced analytics
- Suitable for ML and AI workloads
- Non-ACID compliance, updates and deletes are complex operations
- Poor data quality controls
Data Catalog for Data Lake (bonus)
- It’s a source of truth for data lakes
- It answers
- What is the source of this data?
- Where is this data used?
- Who is the owner of the data?
- How often is this data updated?
- Good practice in terms of data governance
- Ensures reproducibility
- No catalog results in data swamp
Data Lakehouse
- Best of both, Data Lake + Data Warehouse
- Supports structured, semi-structured and unstructured data
- Suitable for analytics and AI+ML workloads
- ACID compliant to confirm data consistency
- Costly to build from scratch
- Supports BI, analytics, ML, AI and data science use cases
- Good data quality controls in place
- Open standard exists like Delta Lake to transform data lake with structured transactional layer
Data Mining (bonus)
- Knowledge Discovery in Data (KDD)
- Process of uncovering patterns and valuable information in large data sets
- Also used for identifying relationships between large data sets
- Involves
- Exploring data
- Searching data
- Analyzing data
- Finding patterns
- Finding anomalies
- Finding trends
- Finding relationships