Hadoop Tutorial for Beginners | Learn Hadoop from A to Z
1. Hadoop Tutorial
This Hadoop tutorial is a comprehensive guide to Big Data Hadoop training that covers what Hadoop is, why Apache Hadoop is needed, why Apache Hadoop is so popular, and how Apache Hadoop works.
Apache Hadoop is an open-source, scalable, and fault-tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for large-scale data storage as well as processing. This Big Data Hadoop tutorial provides a thorough introduction to Hadoop.
In this Hadoop tutorial we will also learn about Hadoop architecture, Hadoop daemons, and the different flavors of Hadoop. Finally, we will introduce the Hadoop components HDFS, MapReduce, and YARN.
2. What is Hadoop Technology?
Hadoop is an open-source project from the ASF (Apache Software Foundation). Being open source means it is freely available, and we can even change its source code as required: if certain functionality does not fulfill your needs, you can modify it. Much of the Hadoop code has been contributed by companies such as Yahoo, IBM, Facebook, and Cloudera.
It provides an efficient framework for running jobs on clusters of multiple nodes; a cluster is a group of systems connected over a LAN. Apache Hadoop processes data in parallel, since it works on multiple machines simultaneously.
Hadoop was inspired by Google, which published papers on the technologies it uses: the MapReduce programming model and its file system (GFS). Hadoop was originally written for the Nutch search engine project by Doug Cutting and his team, and it soon became a top-level Apache project due to its huge popularity. Let us now understand the definition and meaning of Hadoop.
Apache Hadoop is an open-source framework written in Java. The native Hadoop programming language is Java, but this does not mean you can code only in Java; you can also write Hadoop jobs in C, C++, Perl, Python, Ruby, and other languages. You can use the Hadoop framework from any of these languages, but it is usually better to code in Java, as it gives you lower-level control over the code.
Hadoop efficiently processes large volumes of data on a cluster of commodity hardware. Commodity hardware means low-end, inexpensive machines, so running Hadoop is very economical.
Hadoop can be set up on a single machine (pseudo-distributed mode), but it shows its real power with a cluster of machines. It can be scaled to thousands of nodes on the fly, i.e., without any downtime, so we do not need to take any system down to add more machines to the cluster. Follow this guide to learn Hadoop installation on a multi-node cluster.
Hadoop consists of three key parts –
Hadoop Distributed File System (HDFS) – It is the storage
layer of Hadoop.
MapReduce – It is the data processing layer of Hadoop.
YARN – It is the resource management layer of Hadoop.
In this Hadoop tutorial for beginners, we will cover all three of these in detail, but first let's discuss the significance of Hadoop.
3. Why Hadoop?
Let us now understand why Big Data Hadoop is so popular and why Apache Hadoop has captured such a large share of the big data market.
Apache Hadoop is not only a storage system but a platform for both data storage and processing. It is scalable (we can add more nodes on the fly) and fault-tolerant (even if a node goes down, its data can be processed by another node).
The following characteristics of Hadoop make it a unique platform:
Flexibility to store and mine any type of data, whether structured, semi-structured, or unstructured. It is not bound to a single schema.
Excels at processing data of a complex nature. Its scale-out architecture divides workloads across many nodes, and its flexible file system eliminates ETL bottlenecks.
Scales economically: as discussed, it can be deployed on commodity hardware. Apart from this, its open-source nature guards against vendor lock-in.
Learn Hadoop features in detail.
4. What is Hadoop Architecture?
After understanding what Apache Hadoop is, let us now look at the Big Data Hadoop architecture in detail in this Hadoop tutorial.
Hadoop works in a master-slave fashion. There is one master node and n slave nodes, where n can be in the thousands. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes. In the Hadoop architecture, the master should be deployed on well-configured hardware, not just commodity hardware, as it is the centerpiece of the Hadoop cluster.
The master stores the metadata (data about data), while the slaves are the nodes that store the actual data, which is distributed across the cluster. The client connects to the master node to perform any task. Now, in this Hadoop tutorial for beginners, we will discuss the different components of Hadoop in detail.
5. Hadoop Components
There are three major Apache Hadoop components. In this Hadoop tutorial, you will learn what HDFS is, what Hadoop MapReduce is, and what Hadoop YARN is. Let us discuss them one by one.
5.1. What is HDFS?
Hadoop HDFS, the Hadoop Distributed File System, is the storage layer of Hadoop; it stores data across the cluster in a distributed fashion.
In the Hadoop architecture, a daemon called the NameNode runs for HDFS on the master node, and a daemon called the DataNode runs for HDFS on every slave. Hence, the slaves are also called DataNodes. The NameNode stores the metadata and manages the DataNodes, while the DataNodes store the data and do the actual work.
HDFS is a highly fault-tolerant, distributed, reliable, and scalable file system for data storage. First follow this guide to learn more about the features of HDFS, and then proceed further with the Hadoop tutorial.
HDFS is designed to handle huge volumes of data; expected file sizes range from gigabytes to terabytes. A file is split into blocks (128 MB by default) and stored across multiple machines. Each block is replicated according to the replication factor, and the replicas are stored on different nodes, which lets the cluster handle the failure of a node. So a 640 MB file, for example, breaks down into 5 blocks of 128 MB each (if we use the default block size).
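As a small illustration, the sketch below uses the HDFS Java client API to write a file. The NameNode address and file path are placeholder values, and the block-size and replication settings shown are the usual defaults, set explicitly here only to make them visible; this is a sketch of the idea, not a production recipe.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        // Block size and replication can be overridden per client.
        conf.set("dfs.blocksize", "134217728");   // 128 MB, the default block size
        conf.set("dfs.replication", "3");         // the usual default replication factor

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");   // hypothetical HDFS path
        try (FSDataOutputStream out = fs.create(file, true)) {
            // The client writes a stream; HDFS splits it into blocks transparently.
            out.writeUTF("Hello HDFS");
        }
        fs.close();
    }
}

Block size and replication are per-file settings, which is why a client can override them without reconfiguring the whole cluster.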
5.2. What is MapReduce?
In this Hadoop basics tutorial, it is now time to understand one of the most important pillars of Hadoop: Hadoop MapReduce. MapReduce is a programming model designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce is the heart of Hadoop; it moves computation close to the data, because moving huge volumes of data would be very costly. This allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
Hence, Hadoop MapReduce is a framework for the distributed processing of huge data sets over a cluster of nodes. Because the data is stored in a distributed manner in HDFS, MapReduce can process it in parallel: map tasks work on the data where it lives, and reduce tasks aggregate the results.
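To make the model concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API: the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts per word. The class and variable names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count example: map tasks run in parallel on HDFS blocks,
// reduce tasks aggregate the per-word counts.
public class WordCount {

    // Map: for each input line, emit (word, 1) for every word.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum all the counts emitted for the same word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}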
5.3. What is YARN Hadoop?
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. In a multi-node cluster it becomes very complex to manage, allocate, and release resources (CPU, memory, disk); Hadoop YARN manages these resources efficiently and allocates them on request from any application.
On the master node, the ResourceManager daemon runs for YARN, while on every slave node the NodeManager daemon runs.
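As a rough sketch of YARN's view of the cluster, the example below (assuming the YARN client libraries are on the classpath and yarn-site.xml points at your ResourceManager) asks the ResourceManager for the currently running NodeManagers and prints their resource capacities.

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration picks up yarn-site.xml from the classpath;
        // it must point at your cluster's ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for all NodeManagers that are currently running.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capacity=" + node.getCapability()
                    + "  used=" + node.getUsed());
        }
        yarnClient.stop();
    }
}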
Learn the differences between the two resource managers: YARN vs. Apache Mesos. The next topic in this Big Data Hadoop tutorial for beginners is a very important part of Hadoop, i.e., Hadoop daemons.
6. Hadoop Daemons
Daemons are processes that run in the background. There are mainly four daemons that run for Hadoop:
NameNode – It runs on the master node for HDFS.
DataNode – It runs on the slave nodes for HDFS.
ResourceManager – It runs on the master node for YARN.
NodeManager – It runs on the slave nodes for YARN.
These four daemons must run for Hadoop to be functional. Apart from these, there can also be a Secondary NameNode, a Standby NameNode, a Job HistoryServer, etc.
7. How Does Hadoop Work?
So far in this Hadoop tutorial we have studied the Hadoop introduction and Hadoop architecture in detail. Next, let us summarize how Apache Hadoop works, step by step (a job-driver sketch illustrating these steps follows the list):
i) The input data is broken into blocks of 128 MB (by default) and then distributed to different nodes.
ii) Once all the blocks of the file are stored on the DataNodes, the user can process the data.
iii) The master then schedules the program (submitted by the user) on the individual nodes.
iv) Once all the nodes have processed the data, the output is written back to HDFS.
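The sketch below shows how these steps line up with a MapReduce job submission, reusing the hypothetical TokenMapper and SumReducer classes from the word-count sketch in the MapReduce section; the HDFS paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Steps i-ii: the input directory already holds the file, split into HDFS blocks.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Step iii: the master schedules these map and reduce tasks
        // on the nodes that hold the data blocks.
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // hypothetical input path
        // Step iv: once all tasks finish, the output is written back to HDFS.
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));  // hypothetical output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A driver like this is typically packaged into a jar together with the mapper and reducer classes and submitted to the cluster with the hadoop jar command.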
8. Hadoop Flavors
This section of the Hadoop tutorial talks about the various flavors (distributions) of Hadoop.
Apache – The vanilla flavor; the actual code resides in the Apache repositories.
Hortonworks – A popular distribution in the industry.
Cloudera – The most popular distribution in the industry.
MapR – It has rewritten HDFS, and its file system is faster compared to the others.
IBM – Its proprietary distribution is known as BigInsights.
Most major databases provide native connectivity with Hadoop for fast data transfer, because to transfer data from a database such as Oracle to Hadoop, you need a connector.
All the flavors are almost the same, and if you know one, you can easily work with the other flavors as well.
9. Hadoop Ecosystem Components
In this section of the Hadoop tutorial, we will cover the Hadoop ecosystem components. Let us see which components form the Hadoop ecosystem:
Hadoop HDFS – Distributed storage layer for Hadoop.
Hadoop YARN – Resource management layer introduced in Hadoop 2.x.
Hadoop MapReduce – Parallel processing layer for Hadoop.
HBase – A column-oriented NoSQL database that runs on top of HDFS and does not use structured (SQL) queries. It suits sparse data sets well.
Hive – Apache Hive is a data warehousing infrastructure built on Hadoop that enables easy data summarization using SQL-like queries.
Pig – A high-level scripting platform used with Hadoop; Pig enables writing complex data processing without Java programming.
Flume – It is a reliable system for efficiently collecting
large amounts of log data from many different sources in real-time.
Sqoop – A tool designed to transfer huge volumes of data between Hadoop and relational databases (RDBMS).
Oozie – A Java web application used to schedule Apache Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.
Zookeeper – A centralized service for maintaining
configuration information, naming, providing distributed synchronization, and
providing group services.
Mahout – A library of scalable machine-learning algorithms,
implemented on top of Apache Hadoop and using the MapReduce paradigm.
Refer to this Hadoop Ecosystem Components tutorial for a detailed study of all the ecosystem components of Hadoop.
So, this was all about the Hadoop tutorial. We hope you liked our explanation.
10. Conclusion: Hadoop Tutorial
Hence, to conclude this Big Data tutorial, we can say that Apache Hadoop is among the most popular and powerful big data tools. Hadoop stores huge amounts of data in a distributed manner and processes it in parallel on a cluster of nodes. It provides a highly reliable storage layer (HDFS), a batch processing engine (MapReduce), and a resource management layer (YARN). Four daemons (NameNode, DataNode, NodeManager, ResourceManager) run in Hadoop to ensure its functionality.
[Source] https://data-flair.training/blogs/hadoop-tutorial/