Hadoop Interview Questions and Answers Set 6

51. What is Cloudera and why it is used?

Cloudera is a company that ships a commercial distribution of Hadoop (CDH). In the Cloudera QuickStart VM, a default user named cloudera is created.

CDH packages Apache Hadoop together with related ecosystem projects and is used to deploy and manage clusters for data processing. (Cloudera is a vendor; the Hadoop project itself belongs to Apache.)

52. How can we check whether Namenode is working or not?

To check whether the NameNode is running, use the command

/etc/init.d/hadoop-namenode status

(on packaged installations that ship init scripts). Alternatively, run jps on the master node and look for a NameNode process in the output.

53. Which files are used by the startup and shutdown commands?

The slaves and masters files in the Hadoop conf directory are used by the start-up and shutdown scripts: slaves lists the hosts that run the slave daemons (DataNode/TaskTracker), and masters lists the host that runs the Secondary NameNode.

54. Can we create a Hadoop cluster from scratch?

Yes, we can, once we are familiar with the Hadoop environment: install Hadoop on each node, configure the master and slave roles, format the NameNode, and then start the daemons.

55. How can you transfer data from Hive to HDFS?

By writing the query:

hive> insert overwrite directory '/' select * from emp;

You can adapt the query for the data you want to export from Hive to HDFS. The output will be stored as part files in the specified HDFS path.
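Each reducer writes one part file (part-00000, part-00001, …) into that directory. After copying the directory out of HDFS (for example with hdfs dfs -get), you could stitch the part files back together. A minimal Python sketch; the function name and the local directory layout are illustrative assumptions:

```python
from pathlib import Path

def merge_part_files(directory):
    """Concatenate Hadoop-style part files (part-00000, part-00001, ...)
    from a local copy of the output directory, in filename order."""
    rows = []
    for part in sorted(Path(directory).glob("part-*")):
        rows.extend(part.read_text().splitlines())
    return rows
```

Sorting by filename preserves the reducer ordering, which matters when the job's output is globally sorted.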

56. What is Job Tracker role in Hadoop?

Job Tracker's primary functions are resource management (managing the Task Trackers), tracking resource availability, and task life-cycle management (tracking task progress and providing fault tolerance).

  • It is a process that runs on a separate node, often not on a DataNode.
  • It communicates with the NameNode to identify data locations.
  • It finds the best Task Tracker nodes to execute tasks, preferring nodes close to the data.
  • It monitors individual Task Trackers and submits the overall job status back to the client.
  • It tracks the execution of MapReduce workloads, which run locally on the slave nodes.

57. What are the core methods of a Reducer?

 The three core methods of a Reducer are:

setup(): called once at the start of the task; it is used to configure parameters such as the input data size or the distributed cache.

protected void setup(Context context)

reduce(): the heart of the Reducer; called once per key, with an iterable of all the values associated with that key.

protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)

cleanup(): called only once, at the end of the task, to clean up temporary files and release resources.

protected void cleanup(Context context)
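The same lifecycle can be mimicked outside Java with Hadoop Streaming, where the reducer is any executable that reads key/value lines from stdin, already sorted by key. A minimal Python sketch of the reduce step (the function name and the tab-separated, sum-the-counts format are assumptions for illustration, in the style of a word-count job):

```python
import sys
from itertools import groupby

def streaming_reduce(lines):
    """Mimic Reducer.reduce() for Hadoop Streaming: each input line is
    'key<TAB>value', and lines arrive grouped/sorted by key (as the
    framework guarantees). Sums the integer values for each key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(v) for _, v in group)  # the per-key reduce step
        results.append((key, total))
    return results

if __name__ == "__main__":
    # a setup() analogue would run here, before the loop;
    # a cleanup() analogue would run after it
    for key, total in streaming_reduce(sys.stdin):
        print(f"{key}\t{total}")
```

Because Streaming guarantees that all values for a key arrive contiguously, a single pass with groupby is enough; no in-memory map of all keys is needed.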

58. Compare Hadoop & Spark

Criteria            | Hadoop                   | Spark
--------------------|--------------------------|------------------------------------------
Dedicated storage   | HDFS                     | None
Speed of processing | Average                  | Excellent
Libraries           | Separate tools available | Spark Core, SQL, Streaming, MLlib, GraphX

 

59. Can I access Hive without Hadoop?

Yes. Hive can be used with other data storage systems in place of HDFS, such as Amazon S3, GPFS (IBM), and the MapR file system.

60. What is Apache Spark?

Spark is a fast, easy-to-use and flexible data processing framework. It has an advanced execution engine supporting cyclic data flow and in-memory computing. Spark can run on Hadoop, standalone, or in the cloud, and is capable of accessing diverse data sources including HDFS, HBase, Cassandra and others.