Many of the data engineering and Big Data offerings on Google Cloud are based on, or closely related to, the Hadoop ecosystem in one way or another. So, if you have a clear understanding of the Hadoop ecosystem, the GCP data engineering services become much easier to understand. Let’s delve into the details.
This mapping helps most when you have a Hadoop cluster running MapReduce jobs, or Apache Spark running analytics and data transformation jobs, that you want to migrate to Google Cloud. If you understand the mapping between the two ecosystems, you can easily migrate to the corresponding GCP tools.
Apache Hadoop
Apache Hadoop = HDFS + YARN + MapReduce + Hadoop Common
Apache Hadoop software is an open-source framework that allows for the distributed storage and processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage. In this way, Hadoop can efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.
Hadoop Distributed File System (HDFS): As the primary component of the Hadoop ecosystem, HDFS is a distributed file system that provides high-throughput access to application data with no need for schemas to be defined up front.
HDFS was originally derived from the Google File System (GFS). Google developed GFS to store big data across distributed machines and retrieve it efficiently, and Apache built an open-source version of it as HDFS. On Google’s side, GFS gradually evolved into the storage layer behind GCS (Google Cloud Storage).
Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.
MapReduce: MapReduce is a programming model for large-scale data processing. Using distributed and parallel computation algorithms, MapReduce brings the processing logic to where the data lives and helps developers write applications that transform big datasets into a single manageable result set.
Hadoop Common is a set of common utilities that support the other Hadoop modules.
MapReduce was also originally developed by Google, to solve distributed computing on demand; Apache then open-sourced its own implementation.
HDFS = Developed to solve distributed storage
MapReduce = Developed to solve distributed computing
The user interaction flows like this:
MapReduce -> YARN -> HDFS
- The user defines map and reduce tasks using the MapReduce API and submits them to Hadoop for computation (at that point it becomes a MapReduce job)
- YARN figures out where and how to run the job, and the results are stored in HDFS (see the word-count sketch below)
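To make that flow concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets you write the map and reduce steps as plain scripts; the file names mapper.py and reducer.py are illustrative, not part of any particular setup.

```python
#!/usr/bin/env python3
# mapper.py -- the "map" step: read raw lines from stdin, emit (word, 1) pairs.
# Hadoop Streaming pipes HDFS input splits into this script.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the "reduce" step: input arrives sorted by key, so we can sum
# counts per word and emit the totals. YARN schedules these tasks across the
# cluster, and the output is written back to HDFS.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```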
You can run Apache Hadoop clusters on the Google Cloud Dataproc service.
Apache Hadoop/Spark/Pig clusters -> Google Cloud Dataproc
Google Cloud Storage (GCS)
GCS is Google’s counterpart to HDFS, but instead of being paired with Apache MapReduce it is paired with Google’s own proprietary MapReduce implementation for large-scale data processing (the implementation of which Apache MapReduce is the open-source counterpart).
Google Cloud Storage (GCS) = HDFS + Google MapReduce
Why did Google develop GCS, instead of using HDFS?
HDFS is, by definition, a server-based system: the main (name) node that handles lookups must always be running in the background, which means an always-on cost for the user. So Google developed GCS as serverless object storage instead.
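As a quick illustration of that serverless model, here is a minimal sketch using the google-cloud-storage Python client; the bucket and object names are made up, and it assumes application-default credentials are configured.

```python
from google.cloud import storage  # assumes the google-cloud-storage package is installed

client = storage.Client()
bucket = client.bucket("my-example-bucket")  # hypothetical bucket name

# Write and read back an object -- no NameNode or always-on cluster involved.
blob = bucket.blob("data/sample.txt")
blob.upload_from_string("hello from GCS")
print(blob.download_as_text())
```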
Google Cloud Dataproc
It is a managed Hadoop/Spark cluster service.
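As a rough sketch of what “managed” means in practice, the snippet below creates a small cluster through the google-cloud-dataproc Python client; the project, region, and machine types are placeholders, and the exact request shape may vary with the client library version.

```python
from google.cloud import dataproc_v1  # assumes the google-cloud-dataproc package

region = "us-central1"  # placeholder region
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A minimal cluster spec: one master, two workers. Dataproc provisions the VMs
# with Hadoop, Spark, Hive, and Pig already installed.
cluster = {
    "project_id": "my-project",        # placeholder project
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```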
Apache Hive
It provides a SQL interface to Hadoop, and is the bridge to Hadoop for those who do not have exposure to object-oriented programming in Java.
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.
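To see what that SQL abstraction looks like from client code, here is a small sketch using the PyHive library against HiveServer2; the host, database, and the sales table are hypothetical.

```python
from pyhive import hive  # assumes PyHive is installed and a HiveServer2 endpoint is reachable

conn = hive.Connection(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL, but under the hood Hive compiles it into distributed
# jobs (MapReduce, Tez, or Spark) that run over data in HDFS.
cursor.execute("SELECT country, COUNT(*) AS orders FROM sales GROUP BY country")
for country, orders in cursor.fetchall():
    print(country, orders)
```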
RDBMS vs Apache Hive
Why should we never use Apache Hive or Google Cloud BigQuery for OLTP?
Why should we never use Cloud SQL or any other SQL DBMS for OLAP?
Check the table below for answers:
RDBMS | Apache Hive |
---|---|
Small datasets (megabytes or gigabytes) | Big datasets (gigabytes to petabytes) |
Serial computation (single computer with backup) | Parallel computation (distributed system with multiple machines) |
Low latency (records are indexed and can be accessed quickly; constraints such as NOT NULL and UNIQUE are enforced) | High latency (records are not indexed and cannot be accessed quickly) |
Indexes allowed | Minimal index support |
Row-level operations allowed in general | Row-level updates and deletes only as a special case; primarily built for read operations |
Read and write operations supported; optimized for schema-on-write | Read operations supported; optimized for schema-on-read, no constraints enforced |
ACID compliant (only data that satisfies the constraints is stored) | Not ACID compliant by default (data can be dumped into Hive from any source) |
SQL (optimized for transactional queries) | HiveQL (optimized for analytics queries) |
Basic built-in functions | Many more built-in functions |
No restriction on joins | Only equi-joins allowed |
Whole range of subqueries supported | Restricted subquery support |
Google Cloud BigQuery
Google Cloud BigQuery maps to Apache Hive, as it is semantically and logically an almost exact equivalent of Hive. From an implementation perspective, BigQuery is quite different from Hive, but it does for GCP what Hive does for the Hadoop ecosystem.
Apache Hive = Google Cloud BigQuery
Why is Hive suited to high-latency applications (like OLAP), in contrast to BigQuery, which can also be used in near-real-time applications?
Hive is MapReduce on top of HDFS, which is batch processing, while BigQuery is a SQL interface on top of Google’s storage layer (GCS, in this mapping) that handles batch as well as streaming data processing efficiently.
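For comparison with the Hive snippet above, the same kind of analytical query against BigQuery looks like this with the google-cloud-bigquery Python client; it assumes default credentials and uses a public dataset purely as an example.

```python
from google.cloud import bigquery  # assumes the google-cloud-bigquery package

client = bigquery.Client()

# An analytics-style (OLAP) query over a public dataset; no cluster to manage.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row["name"], row["total"])
```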
Apache HBase
HBase is an open-source, non-relational, distributed database modeled after Google’s Bigtable and written in Java. It is developed as part of the Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (the Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. It provides a fault-tolerant way of storing large quantities of sparse data.
It integrates with your application just like a traditional database.
Google Cloud BigTable = Apache HBase
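As a sketch of that “just like a traditional database” integration, here is a minimal write-and-read against Cloud Bigtable with the google-cloud-bigtable Python client; the project, instance, table, and the events column family are placeholders assumed to already exist.

```python
from google.cloud import bigtable  # assumes the google-cloud-bigtable package

client = bigtable.Client(project="my-project", admin=True)  # placeholder project
instance = client.instance("my-instance")                   # placeholder instance
table = instance.table("user-events")                       # placeholder table

# Write one cell, then read the row back -- conceptually the same
# row / column-family model you would use with HBase.
row = table.direct_row(b"user#1234")
row.set_cell("events", "last_login", "2024-01-01T00:00:00Z")  # "events" column family must exist
row.commit()

print(table.read_row(b"user#1234"))
```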
Apache Pig
It is a data manipulation language, which transforms unstructured data into a structured format. It is pre-installed on Google Cloud Dataproc machines.
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
Apache Spark
Apache Spark is a distributed computing engine used along with Hadoop. It provides an interactive shell with a REPL (Read-Evaluate-Print Loop) environment to quickly process datasets.
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Apache Spark has built in libraries for machine learning, stream processing, graph processing, etc.
Apache Spark = Google Cloud Dataflow
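Here is a minimal PySpark sketch of the classic word count; the input and output paths are illustrative, and it assumes a Spark installation (for example, on a Dataproc cluster).

```python
from pyspark.sql import SparkSession  # assumes PySpark is available

spark = SparkSession.builder.appName("word-count").getOrCreate()

# Classic word count with the RDD API; the gs:// paths are placeholders.
lines = spark.sparkContext.textFile("gs://my-bucket/input/*.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("gs://my-bucket/output/word-counts")
spark.stop()
```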
Apache Flink
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
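A minimal PyFlink sketch of a streaming-style pipeline is shown below; it uses a small bounded collection purely for illustration, and assumes the apache-flink (PyFlink) Python package is installed.

```python
from pyflink.datastream import StreamExecutionEnvironment  # assumes PyFlink is installed

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded collection stands in for a real stream here; the same pipeline
# shape applies to unbounded sources such as Kafka.
ds = env.from_collection(["error", "info", "error", "warn"])
ds.map(lambda level: (level, 1)).print()

env.execute("log-level-pairs")
```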
Apache Oozie
Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs.
Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow (start, end, and failure nodes) as well as a mechanism to control the workflow execution path (decision, fork, and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for different types of actions including Hadoop MapReduce, Hadoop distributed file system operations, Pig, SSH, and email. Oozie can also be extended to support additional types of actions.
Apache Airflow
Apache Airflow is an open-source workflow management platform for data engineering pipelines. Airflow is written in Python, and workflows are created via Python scripts. Airflow is designed under the principle of “configuration as code”. Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. Tasks and dependencies are defined in Python and then Airflow manages the scheduling and execution.
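A minimal sketch of that “configuration as code” idea, assuming Airflow 2.x: two hypothetical tasks wired into a daily DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # assumes Airflow 2.x


def extract():
    print("extract step")


def load():
    print("load step")


# The DAG, its schedule, and the task dependencies are all plain Python.
with DAG(dag_id="example_etl", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # extract runs before load
```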
Apache Beam
Apache Beam is an open-source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. Google released an open SDK implementation of the Dataflow model in 2014 and an environment to execute Dataflows locally (non-distributed) as well as in the Google Cloud Platform service. Apache Beam is one implementation of the Dataflow model paper.
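Here is a minimal Beam word-count sketch in Python; run as-is it uses the local DirectRunner, and the same pipeline could be submitted to Google Cloud Dataflow with runner-specific options.

```python
import apache_beam as beam  # assumes the apache-beam package

# The pipeline definition is runner-agnostic: the same code can execute
# locally (DirectRunner) or on Google Cloud Dataflow (DataflowRunner).
with beam.Pipeline() as pipeline:
    (pipeline
     | "Create" >> beam.Create(["apache beam", "apache hadoop", "apache spark"])
     | "Split" >> beam.FlatMap(str.split)
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```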
Summary
Apache Hadoop Ecosystem | Google Cloud Platform |
---|---|
HDFS + Apache MapReduce | Google Cloud Storage (= HDFS + Google MapReduce) |
Apache Hadoop clusters, Apache Spark, Apache Pig, Apache Hive, Apache Flink, Apache Presto, etc. | Google Cloud Dataproc (you can configure and use practically anything that works on a distributed cluster in Dataproc via SSH) |
Apache Hive (SQL interface to Hadoop, batch data processing) | Google Cloud BigQuery (batch + stream data processing) |
Apache HBase (DBMS on top of Hadoop) | Google Cloud BigTable |
Apache Spark or Apache Beam | Google Cloud Dataflow |
Apache Kafka | Google Cloud Pub/Sub |
Apache Flink | Google Cloud Pub/Sub + Google Cloud Dataproc, or Google Cloud Dataflow, or Google Cloud Data Fusion + Google Cloud Storage |
Apache Oozie | Google Cloud Dataproc + Google Cloud Scheduler |
Apache Airflow | Google Cloud Composer |
Conclusion
I hope this is helpful in understanding how the Apache Hadoop ecosystem maps closely to the GCP Big Data and data engineering offerings.