Meaning of Hadoop Ecosystem:
The Hadoop ecosystem is not a single program or service. It is a platform made up of several tools that work together to store and process large amounts of data. At its core, HDFS stores the data and MapReduce processes it, while tools such as Hive are used to query it. The Hadoop ecosystem handles the following three types of data:
Structured Data – Data with a clear structure that can be stored in tabular form (for example, rows in a relational database)
Semi-Structured Data – Data with some structure that does not fit neatly into tabular form (for example, JSON or XML)
Unstructured Data – Data with no predefined structure that cannot be stored in tabular form (for example, free text or images)
Storage Layer of Hadoop Ecosystem:
- The Hadoop ecosystem uses HDFS (Hadoop Distributed File System) to store large amounts of data. The file system runs across the machines of a Hadoop cluster.
- HDFS can store all three types of data: structured, semi-structured, and unstructured.
- A Hadoop cluster grows by adding nodes. The more nodes are added, the more data the cluster can store.
Computation Layer of Hadoop Ecosystem:
- The Hadoop ecosystem uses MapReduce to process large data sets in the form of key/value pairs.
- MapReduce is a programming model that can be applied to structured, semi-structured, and unstructured data in Hadoop.
- In each key/value pair, the key acts as an identifier that maps to its value, so records with the same key can be grouped together.
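The key/value idea is easiest to see in a word count. The sketch below imitates the map, shuffle, and reduce steps in plain Python on a single machine; it is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Here the word is the key (the identifier) and the count is the value. In a real cluster the map and reduce steps would run on many machines at once.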
Main Components of Hadoop Ecosystem:
1. Data Transfer Components:
Sqoop:
- Sqoop is a tool used to transfer large amounts of data between relational databases (RDBMS) and HDFS.
- It imports data from external data stores into HDFS and exports data from HDFS back to them.
- It uses MapReduce to carry out these imports and exports, so large transfers run in parallel.
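One way to picture a MapReduce-based transfer is several map tasks, each copying its own slice of a table. The sketch below only illustrates that idea under the assumption of a numeric key column; `split_ranges` is a hypothetical helper, not Sqoop's actual code.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous
    sub-ranges, one per mapper, as evenly as possible."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` mappers take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than rows: nothing left to assign
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each tuple is the slice of rows one map task would copy.
print(split_ranges(1, 10, 4))
```

Each range could then be turned into a query such as "rows with id between start and end", so the map tasks never copy the same row twice.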
Flume:
- Flume is a tool that collects data from online servers and places it in HDFS.
- Flume reads data from log files and delivers it in a format that Hadoop can store.
- Its main purpose is collecting and moving huge amounts of data.
- Flume is also known as a log collector.
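A Flume agent is usually described as a source that reads events, a channel that buffers them, and a sink that writes them out. The sketch below mimics that flow in plain Python. It is a toy model, not the Flume API: the `source` and `sink` functions, the host name, and the sample log lines are all invented for illustration.

```python
from collections import deque


def source(log_lines, host):
    """Source: wrap each non-empty log line in an event (headers + body)."""
    for line in log_lines:
        line = line.strip()
        if line:
            yield {"headers": {"host": host}, "body": line}


def sink(channel, batch_size):
    """Sink: drain the channel in batches, as if writing files to HDFS."""
    batches = []
    while channel:
        count = min(batch_size, len(channel))
        batches.append([channel.popleft() for _ in range(count)])
    return batches


# Channel: a buffer that sits between the source and the sink.
channel = deque()
sample_log = ["GET /index 200\n", "\n", "GET /missing 404\n", "POST /login 302\n"]
channel.extend(source(sample_log, host="web01"))

batches = sink(channel, batch_size=2)
print([len(batch) for batch in batches])
```

The buffer in the middle is what lets the collector keep accepting log lines even when the storage side is briefly slower.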
2. Application Programming Components:
Hive:
- The main purpose of Hive is to query and analyze data stored in HDFS.
- Hive works with structured data through a query language that is similar to SQL.
- Hive runs on top of MapReduce: queries are turned into jobs that process the underlying data in HDFS.
- Its SQL-like query language is called HiveQL.
- It is a data warehousing infrastructure built on top of Hadoop.
- It has three main functions: data summarization, querying, and analysis.
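To give a feel for the kind of summarization query Hive is used for, the sketch below runs a `GROUP BY` query with Python's built-in `sqlite3` module. SQLite is only a stand-in here so the example can run anywhere; it is not Hive, and the `page_views` table and its rows are invented. A query of this shape should read much the same in HiveQL.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")],
)

# Data summarization: count the views per page, most viewed first.
query = """
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC, page
"""
summary = conn.execute(query).fetchall()
print(summary)
conn.close()
```

The difference in practice is scale: Hive would turn such a query into jobs that run across the cluster instead of on one machine.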
Pig:
- Pig is a tool used to analyze large amounts of data stored in HDFS.
- The scripting language of Pig is Pig Latin.
- Pig performs data manipulation operations similar to those in SQL.
- Pig converts its scripts into Map and Reduce tasks, which are then run on Hadoop.
- Pig Latin can work with structured data as well as semi-structured and unstructured data.
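A Pig Latin script is written as a series of data-flow steps, typically loading data, filtering it, grouping it, and generating a result per group. The sketch below walks through the same steps in plain Python. It is an analogy rather than Pig code, and the purchase records are invented.

```python
from collections import defaultdict

# LOAD-like step: (user, category, amount) records, invented for illustration.
records = [
    ("alice", "books", 12.0),
    ("bob", "games", 30.0),
    ("alice", "games", 8.0),
    ("carol", "books", 3.0),
]

# FILTER-like step: keep purchases of at least 5.0.
filtered = [record for record in records if record[2] >= 5.0]

# GROUP-like step: collect the amounts for each user.
grouped = defaultdict(list)
for user, _category, amount in filtered:
    grouped[user].append(amount)

# FOREACH ... GENERATE-like step: produce one total per group.
totals = {user: sum(amounts) for user, amounts in grouped.items()}
print(totals)
```

Pig would compile a script with these steps into Map and Reduce tasks, so the author of the script never writes those tasks by hand.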
3. Data Storage Components:
HBase:
- HBase is a NoSQL database that runs on top of HDFS.
- Its main purpose is reading and writing large data sets.
- HBase stores data in tables made up of rows and columns.
- HBase is written in Java.
- HBase does not natively support Structured Query Language (SQL).
- It is a column-oriented database for Hadoop.
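In a column-oriented store like HBase, each row is identified by a row key, and columns are named as `family:qualifier` within a column family. The sketch below models that layout with nested dictionaries. `ToyTable` is a hypothetical in-memory class, not the HBase client API, and it leaves out features such as cell versions.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a column-oriented table (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(dict)  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; the column family must have been declared."""
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Read one cell, or the whole row when no column is given."""
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


table = ToyTable(families=["info"])
table.put("user1", "info:name", "Ada")
table.put("user1", "info:email", "ada@example.com")
table.put("user2", "info:name", "Lin")  # rows need not share the same columns

print(table.get("user1"))
print(table.get("user2", "info:email"))
```

Reads and writes go through the row key instead of a SQL statement, which is the main practical difference from a relational table.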
4. Analysis Components:
Mahout:
- Mahout is an open-source machine learning library written in Java.
- Data scientists use Mahout to do machine learning.
- Mahout helps implement machine learning algorithms for big data analytics.
- Supported machine learning techniques include clustering, recommendation, and classification.
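As an example of one of these techniques, the sketch below runs k-means clustering on a handful of one-dimensional points. It is plain Python written for illustration, not Mahout's API; the function name `kmeans_1d`, the sample points, and the starting centroids are all assumptions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(point - centroids[i])
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

A library such as Mahout is useful when the data is too large for a loop like this on one machine.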
5. Workflow Management:
Oozie:
- Oozie is a workflow scheduler implemented as a Java web application.
- It manages Hadoop jobs.
- Oozie combines multiple jobs and runs them in sequential order to complete a larger task.
- One or more jobs can be scheduled at a time.
- It has three types of jobs: Workflow jobs, Coordinator jobs, and Bundle jobs.
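The idea of chaining jobs so that each one runs only after the jobs it depends on can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is an illustration of dependency ordering only. Real Oozie workflows are defined in XML, and the action names and the `run_workflow` helper below are hypothetical.

```python
from graphlib import TopologicalSorter

# Each hypothetical action lists the actions that must finish before it starts.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "export_report": {"aggregate"},
}


def run_workflow(workflow, actions):
    """Run every action once, in dependency order, and return the order used."""
    order = list(TopologicalSorter(workflow).static_order())
    for name in order:
        actions[name]()
    return order


# Stand-in actions that only record that they ran.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in workflow}

order = run_workflow(workflow, actions)
print(order)
```

Declaring dependencies instead of a fixed list means a new step can be added without rewriting the order by hand.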
6. Administration and Coordination Components:
Ambari:
- Ambari is an open-source framework used by system administrators.
- It provides administration tools for installing, maintaining, and monitoring a Hadoop cluster.
Hue:
- Hue serves as an administrative interface for Hadoop.
- It provides a GUI for working with Pig and Hive queries, browsing files, and managing Oozie workflows.
ZooKeeper:
- ZooKeeper is an open-source coordination service that synchronizes Hadoop tools and components.
- Its main services are configuration management and synchronization for distributed applications.
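Shared configuration with change notification can be pictured with a small in-memory store: components register interest in a path and are told when its value changes. `ToyConfigStore` below is a hypothetical single-process model written for illustration, not the ZooKeeper API, and the path and values are invented.

```python
class ToyConfigStore:
    """In-memory stand-in for a shared configuration tree (not the ZooKeeper API)."""

    def __init__(self):
        self._data = {}
        self._watchers = {}

    def watch(self, path, callback):
        """Register a callback to be told when the value at `path` changes."""
        self._watchers.setdefault(path, []).append(callback)

    def set(self, path, value):
        """Store a value and notify every watcher of that path."""
        self._data[path] = value
        for callback in self._watchers.get(path, []):
            callback(path, value)

    def get(self, path):
        """Return the current value at `path`, or None if it is unset."""
        return self._data.get(path)


store = ToyConfigStore()
seen_by_worker_a = []
seen_by_worker_b = []

# Two components watch the same setting so they stay in sync.
store.watch("/app/batch_size", lambda path, value: seen_by_worker_a.append(value))
store.watch("/app/batch_size", lambda path, value: seen_by_worker_b.append(value))

store.set("/app/batch_size", 100)
store.set("/app/batch_size", 250)
print(store.get("/app/batch_size"), seen_by_worker_a, seen_by_worker_b)
```

Because every component reads the setting from one place, they all see the same value, which is the point of a coordination service.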