Top 20 Hadoop Interview Questions and Answers

1. WHAT IS HADOOP?

Hadoop is an open-source framework used to store and process very large data sets. Hadoop provides data storage, data access, data processing, and security operations. Many organizations use Hadoop for storage because it can store large amounts of data quickly on clusters of commodity hardware.

2. WHAT ARE THE MAIN COMPONENTS OF HADOOP?

HDFS – HDFS stands for Hadoop Distributed File System, and it manages big data sets with high volume. HDFS allows files to be read and written, but files in HDFS cannot be updated in place. When a file is moved into HDFS, it is automatically split into blocks, and each block is replicated to three different servers by default.

YARN – YARN stands for Yet Another Resource Negotiator. YARN is the resource-management layer of Hadoop, responsible for managing resources in the cluster and scheduling applications.

MapReduce – MapReduce is a software framework used to process large amounts of data in parallel. A minimal word-count sketch is shown below.
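
As an illustration, here is a minimal sketch of the classic word-count job (the class names are our own; a real job would also need a driver that sets input/output paths and output types):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

      // Mapper: emit (word, 1) for every word in the input line.
      public static class WordCountMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              for (String token : value.toString().split("\\s+")) {
                  if (!token.isEmpty()) {
                      word.set(token);
                      context.write(word, ONE);   // runs in parallel across input splits
                  }
              }
          }
      }

      // Reducer: sum the counts emitted for each word.
      public static class WordCountReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }
  }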

3. MEANING OF HADOOP ECOSYSTEM:

The Hadoop ecosystem is not a service or a programming language; it is a platform used to process large amounts of Hadoop data. The ecosystem uses HDFS and MapReduce for storing and processing the data, and it uses tools such as Hive for querying the data.

4. MEANING OF HADOOP CLUSTER:

Cluster means many computers working together as one system, and a Hadoop cluster is a computer cluster used for Hadoop. Hadoop clusters are mainly designed for storing large amounts of unstructured data in a distributed file system. They are referred to as “shared-nothing” systems because the only thing the nodes share is the network that connects them. Hadoop clusters are arranged in racks and have three types of nodes: master nodes, worker nodes, and client nodes.

5. WHAT IS PIG IN HADOOP?

Pig is a tool used to analyze large amounts of data stored in HDFS. The scripting language of Pig is Pig Latin. Pig performs data manipulations and is similar to SQL. Pig converts its scripts into Map and Reduce tasks, which are then run on Hadoop. Using Pig Latin, it can deal with both structured and semi-structured data.

6. WHAT DO THE FOUR V’S OF BIG DATA DENOTE?

  • Volume – Scale of data
  • Velocity – Analysis of streaming data
  • Variety – Different forms of data
  • Veracity – Uncertainty of data

7. WHAT IS “SPECULATIVE EXECUTION” IN HADOOP?

If a node appears to be executing a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other one is killed. This process is called “speculative execution”.
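
Speculative execution is enabled by default; it can be toggled per job through the standard MRv2 properties, as in this minimal sketch (the job name is our own):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class SpeculativeConfigExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.setBoolean("mapreduce.map.speculative", true);     // speculate slow map tasks
          conf.setBoolean("mapreduce.reduce.speculative", false); // do not speculate reduce tasks
          Job job = Job.getInstance(conf, "speculative-demo");    // remaining job setup omitted
      }
  }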

8. HOW CAN YOU TRANSFER DATA FROM HIVE TO HDFS?

By writing the query:

hive> insert overwrite directory '/' select * from emp;

You can adapt the query for the data you want to export from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS directory.

9. WHAT IS A DATANODE?

DataNodes are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from clients.
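
As an illustration, here is a minimal sketch of a client copying a local file into HDFS through the Java FileSystem API (both paths are hypothetical); the NameNode handles the metadata, while the DataNodes store and serve the actual blocks:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsCopyExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();      // reads core-site.xml from the classpath
          FileSystem fs = FileSystem.get(conf);          // handle to the configured file system
          // The file is split into blocks, replicated, and stored on DataNodes.
          fs.copyFromLocalFile(new Path("/tmp/input.txt"),          // hypothetical local path
                               new Path("/user/demo/input.txt"));   // hypothetical HDFS path
          fs.close();
      }
  }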

10. WHAT IS A HEARTBEAT IN HDFS?

A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker does not receive heartbeats, it decides that the DataNode has failed or that the TaskTracker is unable to perform the assigned task.
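
For reference, the heartbeat interval is tunable through the standard dfs.heartbeat.interval HDFS property (3 seconds by default). A minimal sketch of setting it programmatically, though in practice it usually lives in hdfs-site.xml:

  import org.apache.hadoop.conf.Configuration;

  public class HeartbeatConfigExample {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          conf.setLong("dfs.heartbeat.interval", 3L);  // DataNode heartbeat interval in seconds
          System.out.println(conf.get("dfs.heartbeat.interval"));
      }
  }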

11. WHAT IS A DAEMON?

A daemon is a process or service that runs in the background. In general, the word is used in the UNIX environment. The equivalent of a daemon in Windows is a “service” and in DOS a “TSR”.

12. WHICH ARE THE THREE MODES IN WHICH HADOOP CAN BE RUN?

  • Standalone (local) mode
  • Pseudo-distributed mode
  • Fully distributed mode
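
The modes differ mainly in where the file system and daemons live. As a minimal sketch, the standard fs.defaultFS property distinguishes standalone mode (local file system) from the distributed modes (an HDFS NameNode); normally this is set in core-site.xml rather than in code, and the localhost address below assumes a single-node setup:

  import org.apache.hadoop.conf.Configuration;

  public class ModeConfigExample {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Standalone mode uses the local file system (file:///), the default.
          // Pseudo-distributed mode runs all daemons on one machine and points at a local NameNode:
          conf.set("fs.defaultFS", "hdfs://localhost:9000");
          System.out.println(conf.get("fs.defaultFS"));
      }
  }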

13. WHAT ARE THE STABLE VERSIONS OF HADOOP?

  • Release 2.7.1 (stable)
  • Release 2.4.1
  • Release 1.2.1 (stable)

14. WHAT ARE THE FEATURES OF APACHE FLUME?

  • The main feature of Flume is to collect data from multiple web servers.
  • It imports large amounts of data, such as that produced by Facebook and Twitter.
  • It supports fan-in and fan-out flows and a large number of source and destination types.
  • It collects data from multiple sources and moves it to a destination.

15. CAP THEOREM IN HADOOP:

The CAP theorem is designed for distributed systems (collections of interconnected nodes). Also known as Brewer’s theorem, it states that a distributed system can provide at most two of the following three guarantees at the same time:

  • C – Consistency
  • A – Availability
  • P – Partition Tolerance

16. WHAT ARE THE MOST COMMONLY DEFINED INPUT FORMATS IN HADOOP?

  • Text Input Format – the default input format; files are broken down into lines, and each line becomes a record.
  • Key Value Input Format – used for plain text files where each line is split into a key and a value by a separator (tab by default).
  • Sequence File Input Format – used for reading Hadoop sequence files (binary key/value files).
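
As an illustration, here is a minimal sketch of selecting an input format in a job driver (the job name is our own); TextInputFormat is the default, so it needs no explicit call:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

  public class InputFormatExample {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "input-format-demo");
          // Parse each line into a key and a value, split on the first tab by default.
          job.setInputFormatClass(KeyValueTextInputFormat.class);
      }
  }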

17. WHAT ARE THE COMPONENTS OF APACHE HBASE?

  • Region Server: a table can be divided into several regions, and the regions are served to clients by a Region Server.
  • HMaster: coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
  • ZooKeeper: acts as a coordinator inside the HBase distributed environment.
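
As an illustration, here is a minimal sketch of a client write using the standard HBase Java API (the table, column family, and values are hypothetical); the client locates regions through ZooKeeper and talks to Region Servers directly, so the HMaster is not on the read/write path:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBasePutExample {
      public static void main(String[] args) throws Exception {
          try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
               Table table = conn.getTable(TableName.valueOf("emp"))) {   // hypothetical table
              Put put = new Put(Bytes.toBytes("row1"));                   // hypothetical row key
              put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
              table.put(put);   // served by the Region Server that owns this row's region
          }
      }
  }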

18. WHAT IS APACHE FLUME?

Apache Flume is a tool used to move data from one place to another. Flume is a distributed system that transports data reliably, and it is an important part of the Hadoop ecosystem. In Apache Flume, each unit of data is treated as an event. A typical use is collecting log data from various web servers into HDFS.

19. IS IT POSSIBLE TO DO AN INCREMENTAL IMPORT USING SQOOP?

Yes, Sqoop supports two types of incremental imports:

  • Append
  • Last Modified

20. WHAT IS DISTRIBUTED CACHE IN HADOOP?

The MapReduce framework provides Distributed Cache functionality to cache the files (text, JARs, archives, etc.) required by applications during job execution.
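
As an illustration, here is a minimal sketch using the standard MRv2 API (the file name and paths are hypothetical): the driver registers a file with job.addCacheFile, and each task reads its local copy, symlinked under the URI fragment name, in setup():

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;

  public class DistributedCacheExample {

      public static class CachingMapper extends Mapper<LongWritable, Text, Text, Text> {
          @Override
          protected void setup(Context context) throws IOException {
              // The cached file appears in the task's working directory
              // under the fragment name from the URI ("stopwords.txt").
              try (BufferedReader in = new BufferedReader(new FileReader("stopwords.txt"))) {
                  String line;
                  while ((line = in.readLine()) != null) {
                      // load lookup data for use in map()
                  }
              }
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");
          // Ship a small HDFS file to every task node before the job starts.
          job.addCacheFile(new URI("/user/demo/stopwords.txt#stopwords.txt"));
      }
  }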