Hadoop
Hadoop is an open source distributed processing framework
that manages data processing and storage for big data applications running in
clustered systems. It is at the center of a growing ecosystem of big data
technologies that are primarily used to support advanced analytics initiatives,
including predictive analytics, data mining and machine learning applications.
Hadoop can handle various forms of structured and unstructured data, giving
users more flexibility for collecting, processing and analyzing data than
relational databases and data warehouses provide.
Hadoop is primarily geared to analytics uses, and its
ability to process and store different types of data makes it a particularly
good fit for big data analytics applications. Big data environments typically
involve not only large amounts of data, but also various kinds, from structured
transaction data to semistructured and unstructured forms of information, such
as internet clickstream records, web server and mobile application logs, social
media posts, customer emails and sensor data from the internet of things (IoT).
Formally known as Apache Hadoop, the technology is developed
as part of an open source project within the Apache Software Foundation (ASF).
Commercial distributions of Hadoop are currently offered by four primary
vendors of big data platforms: Amazon Web Services (AWS), Cloudera, Hortonworks
and MapR Technologies. In addition, Google, Microsoft and other vendors offer
cloud-based managed services that are built on top of Hadoop and related
technologies.
Hadoop and big data
Hadoop runs on clusters of commodity servers and can scale
up to support thousands of hardware nodes and massive amounts of data. It uses
a namesake distributed file system that's designed to provide rapid data access
across the nodes in a cluster, plus fault-tolerant capabilities so applications
can continue to run if individual nodes fail. Consequently, Hadoop became a
foundational data management platform for big data analytics uses after it
emerged in the mid-2000s.
History of Hadoop
Hadoop was created by computer scientists Doug Cutting and
Mike Cafarella, initially to support processing in the Nutch open source search
engine and web crawler. After Google published technical papers detailing its
Google File System (GFS) and MapReduce programming framework in 2003 and 2004,
Cutting and Cafarella modified earlier technology plans and developed a
Java-based MapReduce implementation and a file system modeled on Google's.
In early 2006, those elements were split off from Nutch and
became a separate Apache subproject, which Cutting named Hadoop after his son's
stuffed elephant. At the same time, Cutting was hired by internet services
company Yahoo, which became the first production user of Hadoop later in 2006.
Use of the framework grew over the next few years, and three
independent Hadoop vendors were founded: Cloudera in 2008, MapR a year later
and Hortonworks as a Yahoo spinoff in 2011. In addition, AWS launched a Hadoop
cloud service called Elastic MapReduce in 2009. That was all before Apache
released Hadoop 1.0.0, which became available in December 2011 after a
succession of 0.x releases.
Components of Hadoop
The core components in the first iteration of Hadoop were
MapReduce, the Hadoop Distributed File System (HDFS) and Hadoop Common, a set
of shared utilities and libraries. As its name indicates, MapReduce uses map
and reduce functions to split processing jobs into multiple tasks that run at
the cluster nodes where data is stored and then to combine what the tasks
produce into a coherent set of results. MapReduce initially functioned as both
Hadoop's processing engine and cluster resource manager, which tied HDFS
directly to it and limited users to running MapReduce batch applications.
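To make the map and reduce split concrete, here is a minimal sketch of a word count job written against Hadoop's Java MapReduce API, modeled on the widely published WordCount example; the input and output HDFS paths are placeholders supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the nodes that hold each block of input
    // and emits a (word, 1) pair for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: combines all the counts emitted for a word into one total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Placeholder HDFS input and output paths passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}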
That changed in Hadoop 2.0, which became generally available
in October 2013 when version 2.2.0 was released. It introduced Apache Hadoop
YARN, a new cluster resource management and job scheduling technology that took
over those functions from MapReduce. YARN -- short for Yet Another Resource
Negotiator but typically referred to by the acronym alone -- ended the strict
reliance on MapReduce and opened up Hadoop to other processing engines and
various applications besides batch jobs.
[Figure: The core components of Hadoop]
The Hadoop 2.0 series of releases also added high
availability (HA) and federation features for HDFS, support for running Hadoop
clusters on Microsoft Windows servers and other capabilities designed to expand
the distributed processing framework's versatility for big data management and
analytics.
Hadoop 3.0.0 was the next major version of Hadoop. Released
by Apache in December 2017, it didn't expand Hadoop's set of core components.
However, it added a YARN Federation feature designed to enable YARN to support
tens of thousands of nodes or more in a single cluster, up from a previous
10,000-node limit. The new version also included support for GPUs and for erasure
coding, an alternative to data replication that requires significantly less
storage space; a Reed-Solomon (6, 3) erasure coding policy, for example, stores
data with 50% overhead, compared with the 200% overhead of HDFS' default
three-way replication.
How Hadoop works and its importance
Put simply, Hadoop has two main components. The first
component, the Hadoop Distributed File System, splits the data into blocks, places
them on different nodes, replicates them and manages them. The second component,
MapReduce, processes the data on each node in parallel and combines the tasks'
output into the results of the job. YARN, the cluster resource manager described
above, schedules the data processing jobs and allocates resources to them.
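As a minimal sketch of the HDFS side, assuming a cluster whose address is picked up from the core-site.xml configuration on the classpath, the following Java snippet writes a small file into HDFS and reads back its replication factor; the path and record contents are hypothetical.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from the Hadoop configuration files on the classpath;
        // the NameNode decides which DataNodes store each block and its replicas.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; HDFS splits the file into blocks and replicates them automatically.
        Path path = new Path("/data/example/readings.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("sensor-42,2017-12-01T00:00:00Z,21.5\n".getBytes(StandardCharsets.UTF_8));
        }

        // The replication factor shows how many copies of each block the cluster keeps.
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Replication factor: " + status.getReplication());
        fs.close();
    }
}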
Hadoop is important because:
it can store and process vast amounts of structured and
unstructured data quickly.
application and data processing are protected against
hardware failure. If one node goes down, jobs are automatically redirected
to other nodes so the distributed computation doesn't fail.
the data doesn't have to be preprocessed before it's stored.
Organizations can store as much data as they want, including unstructured data
such as text, videos and images, and decide how to use it later.
it's scalable, so companies can add nodes to enable their
systems to handle more data.
it can analyze data in real time to enable better decision-making.
Hadoop applications
YARN greatly expanded the applications that Hadoop clusters
can handle, including stream processing and real-time analytics applications
that run on processing engines such as Apache Spark and Apache Flink. For
example, some manufacturers are using real-time data that's streaming into
Hadoop in predictive maintenance applications to try to detect equipment
failures before they occur. Fraud detection, website personalization and
customer experience scoring are other real-time use cases.
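As a rough sketch of a non-MapReduce engine sharing the cluster, the Java snippet below creates a Spark session that requests resources from YARN and counts the lines of a log file stored in HDFS; the application name and file path are hypothetical, and in practice such a job is normally packaged and launched with spark-submit.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkOnYarnExample {
    public static void main(String[] args) {
        // "yarn" tells Spark to request executors from Hadoop's resource manager
        // instead of relying on MapReduce to run the work.
        SparkSession spark = SparkSession.builder()
                .appName("ClickstreamLineCount")   // hypothetical application name
                .master("yarn")
                .getOrCreate();

        // Hypothetical HDFS path; Spark reads the file directly from HDFS.
        Dataset<Row> logs = spark.read().text("hdfs:///data/clickstream/2017-12-01.log");
        System.out.println("Lines in log: " + logs.count());

        spark.stop();
    }
}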
Because Hadoop can process and store such a wide assortment
of data, it enables organizations to set up data lakes as expansive reservoirs
for incoming streams of information. In a Hadoop data lake, raw data is often
stored as is so data scientists and other analysts can access the full data
sets if need be; the data is then filtered and prepared by analytics or IT
teams as needed to support different applications.
Data lakes generally serve different purposes than
traditional data warehouses that hold cleansed sets of transaction data. But,
in some cases, companies view their Hadoop data lakes as modern-day data
warehouses. Either way, the growing role of big data analytics in business
decision-making has made effective data governance and data security processes
a priority in data lake deployments.
Hadoop use cases
Some use cases for Hadoop include:
Customer analytics -- examples include efforts to predict
customer churn, analyze clickstream data to better target online ads to web
users, and track customer sentiment based on comments about a company on social
networks. Insurers use Hadoop for applications such as analyzing policy pricing
and managing safe driver discount programs. Healthcare organizations look for
ways to improve treatments and patient outcomes with Hadoop's aid.
Risk management -- financial institutions use Hadoop
clusters to develop more accurate risk analysis models for their customers.
Financial services companies can use Hadoop to build and run applications to
assess risk, build investment models and develop trading algorithms.
Predictive maintenance -- with input from IoT devices
feeding data into big data programs, companies in the energy industry can use
Hadoop-powered analytics to help predict when equipment might fail to determine
when maintenance should be performed.
Operational intelligence -- Hadoop can help
telecommunications firms get a better understanding of switching, frequency
utilization and capacity use for capacity planning and management. By analyzing
how services are consumed as well as the bandwidth in specific regions, they
can determine the best places to locate new cell towers, for example. In
addition, by capturing and analyzing the data that’s produced by the
infrastructure and by sensors, telcos can more quickly respond to problems in
the network.
Supply chain risk management -- manufacturing companies, for
example, can track the movement of goods and vehicles so they can determine the
costs of various transportation options. Using Hadoop, manufacturers can
analyze large amounts of historical, time-stamped location data as well as map
out potential delays so they can optimize their delivery routes.
Big data tools associated with Hadoop
The ecosystem that has been built up around Hadoop includes
a range of other open source technologies that can complement and extend its
basic capabilities. The list of related big data tools includes:
Apache Flume: a tool used to collect, aggregate and move
large amounts of streaming data into HDFS;
Apache HBase: a distributed database that is often paired
with Hadoop;
Apache Hive: an SQL-on-Hadoop tool that provides data
summarization, query and analysis (see the sketch after this list);
Apache Oozie: a server-based workflow scheduling system to
manage Hadoop jobs;
Apache Phoenix: an SQL-based massively parallel processing
(MPP) database engine that uses HBase as its data store;
Apache Pig: a high-level platform for creating programs that
run on Hadoop clusters;
Apache Sqoop: a tool to help transfer bulk data between
Hadoop and structured data stores, such as relational databases; and
Apache ZooKeeper: a configuration, synchronization and
naming registry service for large distributed systems.
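As one sketch of how such tools are used, assuming a HiveServer2 endpoint, credentials and a clickstream table that are all hypothetical, the Java snippet below runs a query through Hive's JDBC driver; Hive compiles the SQL into jobs that execute on the Hadoop cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Registers the Hive JDBC driver (required for older driver versions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 host, port, database and credentials.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) AS visits FROM clickstream GROUP BY country")) {
            while (rs.next()) {
                // Each result row is produced by distributed work running on the cluster.
                System.out.println(rs.getString("country") + "\t" + rs.getLong("visits"));
            }
        }
    }
}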
Evolution of the Hadoop market
In addition to AWS, Cloudera, Hortonworks and MapR, several
other IT vendors -- most notably, IBM, Intel and Pivotal (a Dell Technologies
subsidiary) -- entered the Hadoop distribution market. However, those three
companies all later dropped out and aligned themselves with one of the
remaining vendors after failing to make much headway with Hadoop users. Intel
dropped its distribution and invested in Cloudera in 2014, while Pivotal and
IBM agreed to resell the Hortonworks version in 2016 and 2017, respectively.
Even the remaining vendors have hedged their bets on Hadoop
itself by expanding their big data platforms to also include Spark and numerous
other technologies. Spark, which runs both batch and real-time workloads, has
ousted MapReduce in many batch applications and can bypass HDFS to access data
from Amazon Simple Storage Service (S3) in the AWS cloud -- a capability
supported by Cloudera and Hortonworks, as well as AWS itself. In 2017, both
Cloudera and Hortonworks dropped the word Hadoop from the names of their rival
conferences for big data users.
[Figure: Timeline of Hadoop developments]
Nonetheless, the overall big data ecosystem -- or the Hadoop
ecosystem, as it's also still known -- continues to attract the attention of
users and vendors alike. And, increasingly, the focus is on cloud deployments.
To compete with Amazon EMR, as Elastic MapReduce is now called, Cloudera,
Hortonworks and MapR have all taken steps to make it easier to deploy and
manage their platforms in the cloud, including support for transient clusters
that can be shut down when no longer needed.
Source: https://searchdatamanagement.techtarget.com/definition/Hadoop
The big data Hadoop course at Asterix Solution is built around this same model of scaling out from single servers to thousands of machines, each offering local computation and storage. Although the cost of memory and storage has fallen steadily, the speed at which a single machine can load and process data has not kept pace, so working with very large data sets remains a major headache, and Hadoop comes in as the solution to it.