Hadoop Interview Questions and Answers Set 8

71. Which operating system(s) are supported for production Hadoop deployment?

The main supported operating system is Linux. However, with some additional software Hadoop can be deployed on Windows.

72. What is the best practice to deploy the secondary namenode?

Deploy the secondary namenode on a separate, standalone machine so that it does not interfere with primary namenode operations. The secondary namenode has the same memory requirements as the primary namenode.

73. What are the side effects of not running a secondary name node?

If the secondary namenode is not running, nothing merges the edit log into the filesystem checkpoint image, so the edit log grows bigger and bigger and slows the system down over time. The effect is most visible on restart: the namenode has to combine the whole edit log with the current checkpoint image, so the system stays in safemode for an extended time.
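The cost of a long edit log can be illustrated with a toy model. This is only a sketch: the operation names, the set-of-paths image, and the helper functions are hypothetical and do not reflect Hadoop's real on-disk formats or APIs.

```python
def apply_edits(fsimage, edit_log):
    """Replay logged operations on top of a checkpoint image and return the merged image."""
    merged = set(fsimage)
    for op, path in edit_log:
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
    return merged


def checkpoint(fsimage, edit_log):
    """Merge the log into a new image and start a fresh, empty log."""
    return apply_edits(fsimage, edit_log), []


# Hypothetical namespace: one file in the last image, three operations logged since then.
fsimage = {"/data/a.txt"}
edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt"), ("create", "/data/c.txt")]

# Without checkpoints, a restart has to replay every entry logged since the last image.
state_after_restart = apply_edits(fsimage, edit_log)

# With a checkpoint, the same state is stored in the image and the log is empty again.
new_image, new_log = checkpoint(fsimage, edit_log)
```

The replay loop runs once per logged operation, so in this model restart time grows with the length of the log, which is the work a periodic checkpoint keeps small.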

74. What daemons run on Master nodes?

NameNode, Secondary NameNode and JobTracker

Classic (1.x) Hadoop comprises five separate daemons, each of which runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on master nodes. DataNode and TaskTracker run on each slave node.

75. Explain about the BloomMapFile.

BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a quick membership test for keys using dynamic Bloom filters.
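The idea behind a Bloom-filter membership test can be sketched in a few lines. This is a minimal illustration, not BloomMapFile's implementation: it uses a fixed-size filter rather than a dynamic one, and the class, the `lookup` helper, and the sample keys are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Toy fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def lookup(store, bloom, key):
    """Check the filter first and only touch the store on a possible hit."""
    if not bloom.might_contain(key):
        return None  # definitely absent, no store access needed
    return store.get(key)  # possibly present; the store gives the final answer


# Hypothetical key/value data standing in for an on-disk map file.
store = {"row-001": "alice", "row-002": "bob"}
bloom = BloomFilter()
for k in store:
    bloom.add(k)
```

A key that was added always passes `might_contain`, while a key that was never added is usually rejected without reading the store. The occasional false positive costs one wasted lookup but never returns a wrong value.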

76. What is the usage of foreach operation in Pig scripts?

The FOREACH operation in Apache Pig applies a transformation to each element of a data bag and generates new data items from the results.

Syntax: FOREACH data_bagname GENERATE exp1, exp2;
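The semantics of FOREACH ... GENERATE can be modeled in Python as a map over a list of tuples. This is only a sketch of the behavior, not Pig code: the `foreach_generate` helper, the `orders` relation, and its fields are hypothetical.

```python
def foreach_generate(bag, *expressions):
    """Apply every expression to each tuple in the bag and emit one output tuple per input tuple."""
    return [tuple(expr(row) for expr in expressions) for row in bag]


# Hypothetical input relation with fields (name, price, quantity).
orders = [("pen", 2.0, 10), ("book", 12.5, 2)]

# Rough counterpart of: totals = FOREACH orders GENERATE name, price * quantity;
totals = foreach_generate(
    orders,
    lambda row: row[0],           # exp1: pass the name through unchanged
    lambda row: row[1] * row[2],  # exp2: compute a new field from two existing ones
)
```

Each input tuple yields exactly one output tuple whose fields are the results of the listed expressions, evaluated in order.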

77. Explain about the different complex data types in Pig.

Apache Pig supports three complex data types:

Maps: sets of key-value pairs, where each key is joined to its value with #.

Tuples: ordered sets of fields, similar to a row in a table, with fields separated by commas.

Bags: unordered collections of tuples. A bag can contain duplicate tuples.
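As a rough analogy, the three types can be mapped onto Python structures. The values below are hypothetical, and the Pig constant notation in the comments is shown only for comparison.

```python
from collections import Counter

# Tuple: an ordered set of fields, like one row of a table.
# Pig constant notation: ('john', 25)
person = ("john", 25)

# Bag: a collection of tuples that keeps duplicates and has no defined order.
# Pig constant notation: {('john', 25), ('mary', 30), ('john', 25)}
# A Counter acts as a multiset here: order is ignored, duplicates are counted.
people = Counter([("john", 25), ("mary", 30), ("john", 25)])

# Map: key-value pairs, with # joining each key to its value in Pig.
# Pig constant notation: ['city'#'paris', 'zip'#'75001']
address = {"city": "paris", "zip": "75001"}
```

The Counter is chosen over a Python set because a set would silently drop the duplicate tuple, which a Pig bag keeps.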

78. Differentiate between PigLatin and HiveQL

  • It is necessary to specify the schema in HiveQL, whereas it is optional in PigLatin.
  • HiveQL is a declarative language, whereas PigLatin is procedural.
  • HiveQL follows a flat relational data model, whereas PigLatin has a nested relational data model.

79. Is the Pig Latin language case-sensitive?

Pig Latin is only partly case-sensitive. Keywords are not case-sensitive: for example, LOAD is equivalent to load.

Relation names (aliases) are case-sensitive: A = load 'b' is not equivalent to a = load 'b'.

UDF names are also case-sensitive: count is not equivalent to COUNT.

80. What are the use cases of Apache Pig? 

Apache Pig is used for ad-hoc analysis and processing of large data sets. Typical use cases include:

Research on large raw data sets like data processing for search platforms. For example, Yahoo uses Apache Pig to analyse data gathered from Yahoo search engines and Yahoo News Feeds.

Processing huge data sets like Web logs, streaming online data, etc.

Building customer behavior prediction models, for example for e-commerce websites.