Hadoop Ecosystem Tutorial

Meaning of the Hadoop Ecosystem:


The Hadoop ecosystem is not a single service or program; it is a platform used to process large amounts of data. It uses HDFS and MapReduce to store and process that data, and Hive to query it. The Hadoop ecosystem deals with the following three types of data:

Structured Data – data with a clear, fixed schema, which can be stored in tabular form

Semi-Structured Data – data with some structure (for example, JSON or XML) that does not fit cleanly into tables

Unstructured Data – data with no predefined structure (for example, free text, images, or video), which cannot be stored in tabular form

Storage Layer of the Hadoop Ecosystem:


  • The Hadoop ecosystem uses HDFS (the Hadoop Distributed File System) to store large amounts of data; the file system runs across the machines of a Hadoop cluster.
  • HDFS stores all three types of data: structured, semi-structured, and unstructured.
  • A Hadoop cluster scales by adding nodes: the more nodes it has, the more data it can hold (see the sketch after this list).
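
As an illustration, here is a minimal sketch of writing a file to HDFS and reading it back with Hadoop's Java FileSystem API. The HDFS path is hypothetical, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml configuration files on the classpath.

  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  class HdfsSketch {
    public static void main(String[] args) throws Exception {
      // Picks up core-site.xml / hdfs-site.xml from the classpath.
      Configuration conf = new Configuration();
      try (FileSystem fs = FileSystem.get(conf)) {
        Path file = new Path("/user/hadoop/demo.txt");  // hypothetical HDFS path
        // Write a small file; HDFS replicates its blocks across cluster nodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        // Read the same file back.
        try (FSDataInputStream in = fs.open(file)) {
          System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
      }
    }
  }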

Computation Layer of the Hadoop Ecosystem:


  • The Hadoop ecosystem uses MapReduce to process large datasets in the form of key/value pairs.
  • MapReduce is a programming model that can be applied to structured, semi-structured, and unstructured data in Hadoop.
  • In each key/value pair, the key identifies and groups the elements being processed, while the value carries the data (the word-count sketch below illustrates this).
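
The classic word count shows the key/value model: the map step emits a (word, 1) pair for every word, and the reduce step sums the pairs per word. Here is a minimal sketch with the standard Hadoop MapReduce Java API (job-driver boilerplate omitted; both classes are shown together for brevity):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map step: emit (word, 1) for every word in the input split.
  class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);  // key = word, value = 1
        }
      }
    }
  }

  // Reduce step: sum the 1s emitted for each distinct word.
  class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));  // key = word, value = total
    }
  }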

Main Components of the Hadoop Ecosystem:



1. Data Transfer Components:


Sqoop: 


  • Sqoop is a tool used to transfer large amounts of data between relational databases (RDBMS) and HDFS.
  • It imports data from external data stores into HDFS and exports it back out again.
  • Internally it uses MapReduce to parallelize imports and exports of large datasets (an import example follows this list).
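
Sqoop is normally driven from the command line; the same arguments can also be passed to its Java entry point. Below is a minimal import sketch, assuming Sqoop 1.x (whose Sqoop.runTool accepts the same arguments as the sqoop command). The MySQL database, table, and credentials are hypothetical.

  import org.apache.sqoop.Sqoop;

  class SqoopImportSketch {
    public static void main(String[] args) {
      // Equivalent to: sqoop import --connect ... --table customers --target-dir ...
      String[] sqoopArgs = {
          "import",
          "--connect", "jdbc:mysql://dbhost/shop",   // hypothetical source database
          "--username", "etl_user",
          "--password", "secret",
          "--table", "customers",                    // hypothetical table to import
          "--target-dir", "/user/hadoop/customers"   // HDFS destination directory
      };
      int exitCode = Sqoop.runTool(sqoopArgs);       // launches MapReduce jobs under the hood
      System.exit(exitCode);
    }
  }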

Flume: 


  • Flume is a tool that collects data from live servers and places it on HDFS.
  • Flume reads data from log files and converts it into a format Hadoop can store.
  • Its main purpose is to collect and move huge amounts of streaming data (see the sketch after this list).
  • Flume is also known as a log collector.
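
Flume agents are normally wired together in an agent properties file. Purely as an illustration, the sketch below uses Flume's embedded-agent Java API to push one log event toward an Avro collector that would land the data in HDFS; the property keys follow the Flume developer guide for embedded agents, and the collector host, port, and agent name are assumptions.

  import java.nio.charset.StandardCharsets;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.flume.agent.embedded.EmbeddedAgent;
  import org.apache.flume.event.EventBuilder;

  class FlumeSketch {
    public static void main(String[] args) throws Exception {
      Map<String, String> props = new HashMap<>();
      props.put("channel.type", "memory");          // buffer events in memory
      props.put("channel.capacity", "200");
      props.put("sinks", "sink1");
      props.put("sink1.type", "avro");              // embedded agents ship to Avro sinks
      props.put("sink1.hostname", "collector");     // hypothetical collector host
      props.put("sink1.port", "4141");
      props.put("processor.type", "default");

      EmbeddedAgent agent = new EmbeddedAgent("logAgent");
      agent.configure(props);
      agent.start();
      // One log line, wrapped as a Flume event.
      agent.put(EventBuilder.withBody("GET /index.html 200", StandardCharsets.UTF_8));
      agent.stop();
    }
  }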


2. Application Programming Components:


Hive: 


  • The main purpose of Hive is to query and analyze data stored in HDFS.
  • Hive lets users query data with a language that is similar to SQL.
  • Queries run as MapReduce jobs over the data held in HDFS.
  • Hive's SQL-like query language is called HiveQL.
  • Hive is a data warehousing infrastructure built on top of Hadoop.
  • It serves three functions: data summarization, querying, and analysis (a JDBC example follows this list).
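
Here is a minimal sketch of running a HiveQL query over HDFS data through the HiveServer2 JDBC interface. It assumes the hive-jdbc driver is on the classpath; the host, credentials, and customers table are hypothetical.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
      // HiveServer2 JDBC URL; host and port are hypothetical.
      String url = "jdbc:hive2://localhost:10000/default";
      try (Connection conn = DriverManager.getConnection(url, "hive", "");
           Statement stmt = conn.createStatement();
           // HiveQL looks like SQL; this aggregates rows stored in HDFS.
           ResultSet rs = stmt.executeQuery(
               "SELECT country, COUNT(*) FROM customers GROUP BY country")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }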

Pig: 


  • Pig is a tool used to analyze large amounts of data stored in HDFS.
  • Pig's scripting language is Pig Latin.
  • Pig performs data manipulations, much like SQL does.
  • Pig translates its scripts into Map and Reduce tasks, which then run on Hadoop (see the PigServer sketch below).
  • Through Pig Latin it mainly deals with structured data.
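
A minimal sketch using Pig's Java entry point, PigServer; the Pig Latin lines are what Pig compiles into Map and Reduce tasks. The file names are hypothetical, and local mode is used only to keep the sketch self-contained (use ExecType.MAPREDUCE on a cluster).

  import org.apache.pig.ExecType;
  import org.apache.pig.PigServer;

  class PigSketch {
    public static void main(String[] args) throws Exception {
      PigServer pig = new PigServer(ExecType.LOCAL);
      // Pig Latin: load tab-separated records, filter them, and store the result.
      pig.registerQuery("logs = LOAD 'access.log' AS (ip:chararray, code:int);");
      pig.registerQuery("errors = FILTER logs BY code >= 500;");
      pig.store("errors", "error_output");  // compiled into Map/Reduce tasks and run
    }
  }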

3. Data Storage Components:


HBase: 


  • HBase is a NoSQL database that runs on top of HDFS.
  • Its main purpose is reading and writing very large datasets with random access.
  • HBase stores data in rows and columns (a client example follows this list).
  • HBase itself is written in Java.
  • HBase does not support SQL (Structured Query Language).
  • It is a column-oriented database within Hadoop.
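
Here is a minimal sketch of one write and one read with the standard HBase Java client; the users table and its info column family are hypothetical and assumed to already exist.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  class HBaseSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
      try (Connection connection = ConnectionFactory.createConnection(conf);
           Table table = connection.getTable(TableName.valueOf("users"))) {
        // Write one cell: row "u1", column family "info", qualifier "name".
        Put put = new Put(Bytes.toBytes("u1"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);
        // Read it back by row key; no SQL is involved.
        Result result = table.get(new Get(Bytes.toBytes("u1")));
        byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));
      }
    }
  }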

4. Analysis Components:


Mahout: 


  • Mahout is an open-source machine learning library written in Java.
  • Data scientists use Mahout to apply machine learning to large datasets.
  • Mahout provides ready-made machine learning algorithms for big data analytics.
  • Its core techniques are clustering, classification, and recommendation (a recommender sketch follows this list).
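
As a sketch of the recommendation side, here is the classic user-based recommender from Mahout's Taste API (Mahout 0.x); ratings.csv is a hypothetical file of userID,itemID,preference lines.

  import java.io.File;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;

  class MahoutRecommenderSketch {
    public static void main(String[] args) throws Exception {
      // Load user/item preferences from a hypothetical CSV file.
      DataModel model = new FileDataModel(new File("ratings.csv"));
      PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
      NearestNUserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
      GenericUserBasedRecommender recommender =
          new GenericUserBasedRecommender(model, neighborhood, similarity);
      // Top 3 item recommendations for user 1.
      List<RecommendedItem> recs = recommender.recommend(1, 3);
      for (RecommendedItem item : recs) {
        System.out.println(item.getItemID() + " -> " + item.getValue());
      }
    }
  }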

5. Workflow Management:


Oozie: 


  • Oozie is a Java web application used to schedule Hadoop jobs.
  • It manages and monitors Hadoop jobs.
  • Oozie chains multiple jobs together and runs them in sequential order to accomplish a larger task.
  • A single Oozie job can coordinate one or many underlying jobs.
  • It has three job types: workflow jobs, coordinator jobs, and bundle jobs (a submission sketch follows this list).
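
Here is a minimal sketch of submitting a workflow job through Oozie's Java client API; the server URL, HDFS application path, and cluster addresses are hypothetical, and the workflow definition itself is assumed to already sit at the application path.

  import java.util.Properties;
  import org.apache.oozie.client.OozieClient;

  class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
      OozieClient client = new OozieClient("http://oozie-host:11000/oozie");
      Properties conf = client.createConfiguration();
      // Where the workflow application (workflow.xml) lives in HDFS.
      conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/hadoop/wf-app");
      conf.setProperty("nameNode", "hdfs://namenode:8020");
      conf.setProperty("jobTracker", "resourcemanager:8032");
      String jobId = client.run(conf);  // submits and starts the workflow job
      System.out.println("Workflow job id: " + jobId);
      System.out.println(client.getJobInfo(jobId).getStatus());
    }
  }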

6. Administration and Coordination Components:


Ambari: 


  • Ambari is an open-source framework maintained and used by system administrators.
  • It provides administration tools for installing, maintaining, and monitoring a Hadoop cluster.

Hue: 


  • Hue is often described as the administrative web interface for Hadoop.
  • Its GUI lets users run Pig and Hive queries, browse HDFS files, and manage Oozie workflows.

ZooKeeper:


  • ZooKeeper is an open-source coordination service that synchronizes Hadoop tools and components.
  • Its main services are configuration management and synchronization for distributed applications (see the sketch below).
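
Here is a minimal sketch with the ZooKeeper Java client: one process publishes a configuration value as a znode, and any other process in the cluster can read the same value. The ensemble address and znode path are hypothetical, and the watcher is left empty for brevity.

  import java.nio.charset.StandardCharsets;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  class ZooKeeperSketch {
    public static void main(String[] args) throws Exception {
      // Connect to a ZooKeeper ensemble; 3000 ms session timeout.
      ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {});
      // Publish a piece of shared configuration as a znode (one-shot:
      // rerunning fails with NodeExists unless the znode is deleted first).
      zk.create("/demo-config", "replication=3".getBytes(StandardCharsets.UTF_8),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      // Any process connected to the ensemble reads the same value.
      byte[] data = zk.getData("/demo-config", false, null);
      System.out.println(new String(data, StandardCharsets.UTF_8));
      zk.close();
    }
  }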