Hadoop Interview Questions and Answers Set 5

41. What is a Record Reader?

A RecordReader takes the data within the boundaries defined by an input split and turns it into key/value pairs. Each generated key/value pair is passed, one at a time, to the mapper.
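To make the idea concrete, here is a minimal Python sketch of what a line-oriented record reader does. It is a simulation, not the Hadoop API: the function name `read_records` and the sample data are made up for illustration. It assumes newline-delimited records, uses the byte offset of each line as the key and the line text as the value, and follows the usual convention that a split skips a leading partial line and reads past its end to finish its last line.

```python
def read_records(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) pairs for the split [start, start + length)."""
    end = start + length
    pos = start
    # A split that does not begin at offset 0 may start in the middle of a
    # line. That line is handled by the previous split, so skip past the
    # first newline found at or after `start`.
    if start != 0:
        newline = data.find(b"\n", start)
        if newline == -1:
            return
        pos = newline + 1
    # Emit every line that starts at or before the end of the split, reading
    # beyond the split boundary if needed to complete the last line.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        yield pos, data[pos:stop].decode("utf-8")
        pos = stop + 1


# Hypothetical 26-byte input cut into splits of 10 bytes each.
sample = b"alpha\nbravo\ncharlie\ndelta\n"
for split_start in range(0, len(sample), 10):
    pairs = list(read_records(sample, split_start, 10))
    print(split_start, pairs)
```

Every line is emitted exactly once across the three splits, even though the split boundaries fall in the middle of lines.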

42. What is a sequence file in Hadoop?

A sequence file stores binary key/value pairs. Sequence files can be split even when the data inside them is compressed, which is not possible with a regular file compressed by a non-splittable codec such as gzip. You can choose record-level compression, in which only the value of each key/value pair is compressed, or block-level compression, in which multiple records are compressed together.
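The difference between the two compression modes can be sketched in a few lines of Python. This is only an illustration of the idea using zlib, not the real SequenceFile format: the function names, the tab/newline framing inside a block, and the sample records are all assumptions made for the example.

```python
import zlib


def record_level(records):
    """Compress each value on its own; keys are left uncompressed."""
    return [(key, zlib.compress(value)) for key, value in records]


def block_level(records, records_per_block):
    """Compress groups of records together, keys and values alike."""
    blocks = []
    for i in range(0, len(records), records_per_block):
        chunk = records[i:i + records_per_block]
        # Simplified framing: one "key<TAB>value" line per record.
        payload = b"".join(key + b"\t" + value + b"\n" for key, value in chunk)
        blocks.append(zlib.compress(payload))
    return blocks


# Hypothetical records: 100 short, very similar values.
records = [(str(i).encode(), b"status=OK;region=eu-west;retries=0") for i in range(100)]

record_bytes = sum(len(k) + len(v) for k, v in record_level(records))
block_bytes = sum(len(b) for b in block_level(records, 50))
print("record-level:", record_bytes, "bytes")
print("block-level: ", block_bytes, "bytes")
```

With many small, similar records, block-level compression usually produces far less output, because the compressor can exploit repetition across records instead of starting from scratch for each value.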

43. How do you override the replication factor?

There are a few ways to do this. The first command below changes the replication factor of existing files, and the second sets it for a new file at copy time.

Illustration

hadoop fs -setrep -R -w 5 hadoop-test

hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv

44. How do you do a file system check in HDFS?

The fsck command runs a file system check in HDFS. It is very useful for checking the health of files and for listing their block names and block locations.

Illustration

hdfs fsck /dir/hadoop-test -files -blocks -locations

45. Is the NameNode also commodity hardware?

No. The NameNode should not run on commodity hardware because the whole of HDFS relies on it. It is the single point of failure in HDFS, so the NameNode has to be a high-availability machine.

46. What is the difference between an InputSplit and a Block?

A block is a physical division of data and does not take into account the logical boundaries of records, so a record can start in one block and end in another. An InputSplit, by contrast, is a logical division that respects record boundaries.
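The following Python sketch illustrates the distinction on a tiny input. It is a simulation under stated assumptions, not Hadoop code: the function names, the 12-byte block size, and the sample records are hypothetical, and the rule "a record belongs to the split in whose block it starts" is a simplification chosen for the example.

```python
def physical_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def records_crossing_blocks(data: bytes, block_size: int):
    """Return newline-delimited records whose bytes span more than one block."""
    crossing = []
    pos = 0
    for line in data.splitlines(keepends=True):
        first_block = pos // block_size
        last_block = (pos + len(line) - 1) // block_size
        if first_block != last_block:
            crossing.append(line.rstrip(b"\n").decode("utf-8"))
        pos += len(line)
    return crossing


def logical_splits(data: bytes, block_size: int):
    """Group whole records by the block in which each record starts."""
    splits = {}
    pos = 0
    for line in data.splitlines(keepends=True):
        record = line.rstrip(b"\n").decode("utf-8")
        splits.setdefault(pos // block_size, []).append(record)
        pos += len(line)
    return [splits[k] for k in sorted(splits)]


sample = b"id=1,ann\nid=2,bob\nid=3,cy\n"
print(physical_blocks(sample, 12))          # raw blocks cut through records
print(records_crossing_blocks(sample, 12))  # records that straddle a block boundary
print(logical_splits(sample, 12))           # whole records per logical split
```

The physical blocks contain fragments such as `id=` with no value, while each logical split contains only complete records.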

47. What is the difference between SORT BY and ORDER BY in Hive?

ORDER BY performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.

SORT BY orders the data only within each reducer, so each reducer’s output is sorted locally but the dataset as a whole is not totally ordered. Total ordering is given up in exchange for better performance.
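The difference can be simulated in plain Python. This is a sketch of the semantics only, not how Hive executes a query: the function names are made up, and the round-robin assignment of rows to reducers is an assumption standing in for whatever distribution Hive actually uses.

```python
def order_by(rows):
    """ORDER BY: every row goes through one reducer, giving a total order."""
    return sorted(rows)


def sort_by(rows, num_reducers):
    """SORT BY: rows are spread over reducers; each reducer sorts only its share."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Round-robin assignment, an assumption made for this illustration.
        reducers[i % num_reducers].append(row)
    return [sorted(part) for part in reducers]


rows = [9, 3, 7, 1, 8, 2]
print(order_by(rows))    # [1, 2, 3, 7, 8, 9]
print(sort_by(rows, 2))  # [[7, 8, 9], [1, 2, 3]]
```

Each reducer's output in the second result is sorted, but concatenating the outputs gives 7, 8, 9, 1, 2, 3, which is not a total order.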

48. In which directory is Hadoop installed?

Cloudera and Apache have the same directory structure. Hadoop is installed in:

cd /usr/lib/hadoop/

49. What are the port numbers of Namenode, job tracker and task tracker?

The default web UI port for the NameNode is 50070, for the JobTracker it is 50030, and for the TaskTracker it is 50060.

50. What are the Hadoop configuration files at present?

There are 3 configuration files in Hadoop:

1. core-site.xml

2. hdfs-site.xml

3. mapred-site.xml

These files are located in the hadoop/conf/ subdirectory.
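These files share a simple XML layout: a `configuration` root element containing `property` entries, each with a `name` and a `value`. As a rough illustration, the Python sketch below parses a made-up hdfs-site.xml fragment into a dictionary. The helper name `read_properties` and the sample content are assumptions for the example; `dfs.replication` is the same property used in the replication question above.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content, shortened for the example.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> in the configuration."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


print(read_properties(SAMPLE_HDFS_SITE))
```

Values come back as strings, so a caller that needs a number has to convert it, for example with `int(...)`.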