In the 1930s, around the time oil was found underground in the UAE region, companies all over the world were eager to set up ties with UAE oil companies one way or another. In the 2010s, the smartphone era became the norm and a new generation of computing began. The big tech companies realized how much data we generate year by year, and that the amount is continuously growing. Since then, they have been hungry for data: they understood its value and how it can be harnessed in a business to earn money and grow profit.
In 2013 the world generated 4.4 zettabytes of data, and by 2020 the figure had reached around 40 zettabytes. That is almost a tenfold increase in seven years, and it is still growing today. This is raw data, not directly useful. But the big techs now know how to use it properly to grow their businesses and to develop new products and technologies. Data is collected from various sources, then cleaned, organized, enriched, distributed, transformed, stored and maintained so that it stays usable and relevant to current market and business needs.
We have already seen some basics of data storage and utilization. In this article, we will look at some more terminology related to data.
Data collection is the first stage: gathering your data. There are many paths through which data is collected, directly or indirectly: smartphones, consent forms, emails, apps, products, user information, other companies, and so on.
A dataflow is a path for data to move from one system or process to another. The data source may be batch data or streaming data, flowing through various stages to become usable. The most common way to represent a dataflow is as a diagram, which makes it easier to understand. There are many products you can use to build such diagrams, like MS Visio.
Data pipeline is a more generic term you can use when you process data through a series of steps with a defined starting point and ending point. Dataflows sometimes run through data pipelines. Here, too, the data source can be batch data or streaming data. There are many products for building data pipelines, like Snowflake, Google Cloud Dataflow, Apache Beam, Pentaho Data Integration (PDI), etc.
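As a minimal sketch (plain Python, not tied to any of the products above), a batch pipeline is just a series of steps with a starting point and an ending point; the stage functions and record fields here are hypothetical:

```python
# Minimal batch data pipeline sketch: extract -> transform -> load.
# All names and records below are made up for illustration.

def extract():
    # Starting point: raw records from some source (hard-coded for the sketch).
    return [{"name": " Alice ", "amount": "10"}, {"name": "Bob", "amount": "25"}]

def transform(records):
    # Middle steps: clean and convert each record into a usable shape.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in records]

def load(records, sink):
    # Ending point: write the processed records to a destination.
    sink.extend(records)
    return sink

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # cleaned, typed records ready for use
```

Real pipelines add scheduling, retries and monitoring on top of this basic shape, which is what the products listed above provide.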
When data fulfils the purpose it was on-boarded for, it is considered high-quality data. Data is also considered high quality when it represents real-world constructs accurately. For example, in an e-commerce application, if a customer places an order but no contact details are attached to it, the data does not serve its purpose.
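Following that e-commerce example, a hypothetical quality rule might flag orders that cannot serve their purpose because contact details are missing (field names are invented for the sketch):

```python
# Hypothetical data-quality check: an order record is only "fit for purpose"
# if it carries the contact details the business needs to fulfil it.

def is_high_quality(order):
    contact = order.get("contact_email") or order.get("contact_phone")
    return bool(contact)

orders = [
    {"id": 1, "item": "book", "contact_email": "a@example.com"},
    {"id": 2, "item": "lamp"},  # no contact details attached
]
bad = [o["id"] for o in orders if not is_high_quality(o)]
print(bad)  # IDs of orders that fail the quality rule
```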
A data dictionary is a metadata repository for your data. It describes the contents, format and structure of the data it is about: it is data about data. There are many data dictionary tools, like Redgate SQL Doc and Doc xPress; the right one depends on your data storage.
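At its simplest, a data dictionary can be sketched as a table of metadata describing each field; the table and column names below are illustrative:

```python
# A toy data dictionary: metadata (type, format, description) about the
# columns of a hypothetical "orders" table. Real tools generate and
# maintain this automatically from your storage.
data_dictionary = {
    "order_id": {"type": "int", "format": "positive integer",
                 "description": "Unique order identifier"},
    "placed_at": {"type": "datetime", "format": "ISO 8601",
                  "description": "When the order was placed"},
    "contact_email": {"type": "str", "format": "email address",
                      "description": "Customer contact for this order"},
}

for column, meta in data_dictionary.items():
    print(f"{column}: {meta['type']} — {meta['description']}")
```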
A data model (or datamodel) is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car is composed of a number of other elements which, in turn, represent the color and size of the car and define its owner. – Wikipedia
Data modeling is the process of creating data models. It mostly transforms raw data into information that is useful from the business perspective (aligned with business requirements and goals) and can be turned into dynamic visualizations. It prepares the data for analysis: cleansing the data, defining measures and dimensions, and enhancing the data by establishing hierarchies, setting units and currencies, and adding formulas.
You can use tools like Hevo, Archi, ER/Studio, etc. for data modeling.
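The car example from the quoted definition can be sketched as a tiny data model, here using Python dataclasses (the class and field names are illustrative):

```python
from dataclasses import dataclass

# Sketch of the quoted example: a Car data element composed of other
# elements (color, size) with a defined owner relationship.

@dataclass
class Owner:
    name: str

@dataclass
class Car:
    color: str
    size: str
    owner: Owner  # relation to another real-world entity

car = Car(color="red", size="compact", owner=Owner(name="Alice"))
print(car.owner.name)  # the model makes relationships explicit
```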
Data insights come from understanding data from the problem's perspective, which in turn helps the business make important decisions. What counts as an insight varies by industry and by individual departments' needs.
Data = collection of facts
Analytics = organizing and examining data
Insights = discovering patterns in data
For example, a marketing team collects data via various email campaigns. The product head then analyzes the marketing and sales data for the quarter and derives an insight: did the email campaigns run last quarter yield any growth in sales or not?
There are many data insights tools out there like Qlik Sense, Microsoft Power BI, Google Looker, Tableau, etc.
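The data → analytics → insights ladder from the email-campaign example can be sketched as follows (all figures are made up):

```python
# Data: raw quarterly sales figures (hypothetical numbers).
sales = {"Q1": 100_000, "Q2": 112_000}
campaign_quarter = "Q2"  # the quarter the email campaign ran

# Analytics: organize and examine the data.
growth = (sales["Q2"] - sales["Q1"]) / sales["Q1"]

# Insight: a pattern a product head can act on.
insight = (f"Sales grew {growth:.0%} in {campaign_quarter}, "
           f"the quarter the email campaign ran.")
print(insight)
```

Tools like the ones listed above do this interactively and at scale, with dashboards instead of print statements.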
Data lineage is metadata about the origin of data: how it is derived or calculated and how it moves over time. It is a map of your data's journey. It helps in areas like business visibility, keeping track of data origins, keeping track of how data changes over time, and meeting certain governance requirements.
You can use products like Octopai, Collibra, etc. for data lineage.
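A toy way to picture lineage metadata: record, for each derived dataset, where it came from and which step produced it (the structure and dataset names are invented; real lineage tools capture far more):

```python
# Toy lineage log: each derived dataset records its origin and the
# transformation that produced it, building a map of the data's journey.
lineage = {}

def derive(name, source, step):
    lineage[name] = {"source": source, "step": step}
    return name

raw = "orders_raw"
clean = derive("orders_clean", raw, "drop rows with missing contact details")
report = derive("sales_report", clean, "aggregate amount by quarter")

# Walk the lineage back to the origin of the report.
trail = [report]
while trail[-1] in lineage:
    trail.append(lineage[trail[-1]]["source"])
print(" <- ".join(trail))  # sales_report <- orders_clean <- orders_raw
```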
Data governance is the practice of identifying important data across the business or organization and ensuring that it serves its purpose and remains high quality. It is driven by business policies and by the compliance requirements and regulations of authoritative bodies and governments. For example, the GDPR requires every organization in the EU to have a lawful purpose to store and process sensitive personal data.
There are many tools for data governance from IBM, Microsoft, Talend, Google, etc. available in the market.
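As a toy illustration of the GDPR point, a governance rule might require every stored sensitive field to have a documented lawful purpose; the field names and purposes below are hypothetical, and this is loosely inspired by the lawful-basis idea, not legal advice:

```python
# Toy governance check: sensitive personal data may only be stored when a
# lawful purpose is documented for it. All fields/purposes are invented.
SENSITIVE_FIELDS = {"email", "phone", "date_of_birth"}
documented_purposes = {"email": "order confirmation and delivery updates"}

def governance_violations(record):
    # Sensitive fields present in the record but lacking a documented purpose.
    return sorted(
        field for field in record
        if field in SENSITIVE_FIELDS and field not in documented_purposes
    )

customer = {"name": "Alice", "email": "a@example.com",
            "date_of_birth": "1990-01-01"}
print(governance_violations(customer))  # sensitive fields with no lawful purpose
```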
A data fabric is a facade on top of complex data platforms for connecting to and accessing data from various data sources. It is not just a connection interface over a storage array or a database: it provides a holistic connection to multiple, disparate data sources, hiding technology complexity and other details.
It provides a unified data environment, combines data from multiple systems and locations, offers high availability and reliability, enables seamless data ingestion and integration, and lets us connect to any data source via connectors and components. Products include Talend, Datamation, etc.
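The "facade over disparate sources" idea can be sketched as a uniform query interface that hides each source's technology details; every class and source here is an invented in-memory stand-in for a real system:

```python
# Sketch of the facade idea: one query interface, many disparate sources
# hidden behind it. The sources below are fake stand-ins for real systems.

class CsvSource:
    def __init__(self, rows):
        self._rows = rows
    def read(self):
        return list(self._rows)

class ApiSource:
    def __init__(self, payload):
        self._payload = payload
    def read(self):
        return list(self._payload)

class DataFabric:
    """Unified access layer: callers never see which system holds the data."""
    def __init__(self):
        self._sources = {}
    def register(self, name, source):
        self._sources[name] = source
    def query(self, name):
        return self._sources[name].read()

fabric = DataFabric()
fabric.register("sales", CsvSource([{"q": "Q1", "amount": 100}]))
fabric.register("crm", ApiSource([{"customer": "Alice"}]))
print(fabric.query("sales") + fabric.query("crm"))  # one call style for both
```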
A data mesh is a data platform architecture that lets users access data easily without moving it into a data warehouse or data lake. It makes data discoverable and easily accessible, removes silos, and so on. It federates data ownership among domain data owners, so that communication happens between distributed data across different locations.
Open data is data that is openly available: accessible, exploitable, editable and shareable by anyone for any purpose, even commercially. – Wikipedia
It is part of the open data movement, in which data is freely available for anyone to use and republish without restrictions such as copyright, patents or other forms of control.
Many datasets are available on the internet as open data, such as the COVID-19 datasets.