Hadoop Interview Questions and Answers Set 9

81. What does Apache Mahout do?

Mahout supports four main data science use cases:

Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations)

Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other

Classification – learns from existing categorizations and then assigns unclassified items to the best category

Frequent item-set mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together
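To make the frequent item-set idea concrete, here is a minimal sketch in plain Scala that counts how often pairs of items share a basket. It does not use Mahout's API; the object name `PairCounts`, the method `coOccurrences`, the sample baskets and the minimum support of 2 are all assumptions made for illustration.

```scala
object PairCounts {
  // Count how often each unordered pair of items appears in the same basket.
  def coOccurrences(baskets: Seq[Set[String]]): Map[(String, String), Int] =
    baskets
      .flatMap { basket =>
        // Sort so that every pair is always emitted in the same order.
        val items = basket.toSeq.sorted
        for {
          i <- items.indices
          j <- (i + 1) until items.size
        } yield (items(i), items(j))
      }
      .groupBy(identity)
      .map { case (pair, hits) => pair -> hits.size }

  def main(args: Array[String]): Unit = {
    // Hypothetical shopping carts.
    val baskets = Seq(
      Set("bread", "butter", "milk"),
      Set("bread", "butter"),
      Set("milk", "eggs")
    )
    // Keep only pairs seen in at least 2 baskets (an arbitrary minimum support).
    coOccurrences(baskets).filter(_._2 >= 2).foreach(println)
  }
}
```

With these sample baskets only the bread and butter pair reaches the threshold. Enumerating every pair in memory like this does not scale to large catalogues, which is one reason a distributed implementation is used on real data.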

82. Mention some machine learning algorithms exposed by Mahout.

The machine learning algorithms exposed by Mahout include the following.

Collaborative Filtering

  • Item-based Collaborative Filtering
  • Matrix Factorization with Alternating Least Squares
  • Matrix Factorization with Alternating Least Squares on Implicit Feedback

Classification

  • Naive Bayes
  • Complementary Naive Bayes
  • Random Forest

Clustering

  • Canopy Clustering
  • k-Means Clustering
  • Fuzzy k-Means
  • Streaming k-Means
  • Spectral Clustering

83. What is Apache Flume?

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. One published use case describes how Mozilla collects and analyses its logs using Flume and Hive.

Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers and mobile devices, for example) to collect data and integrate it into Hadoop.

84. Explain about the different channel types in Flume. Which channel type is faster?

Three built-in channel types available in Flume are:

MEMORY Channel – Events are read from the source into memory and passed to the sink.

JDBC Channel – JDBC Channel stores the events in an embedded Derby database.

FILE Channel – File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.

MEMORY Channel is the fastest of the three, but it carries the risk of data loss because events held only in memory are lost if the agent fails. The channel you choose depends on the nature of the big data application and the value of each event.
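As a rough illustration of how the trade-off shows up in practice, the fragment below sketches an agent configuration that declares one memory channel and one file channel. The agent name `agent1`, the channel names `memCh` and `fileCh`, the capacity and the directory paths are all hypothetical; sources and sinks are left out.

```properties
# Hypothetical agent with two channels (sources and sinks omitted)
agent1.channels = memCh fileCh

# Memory channel: fast, but queued events are lost if the agent process dies
agent1.channels.memCh.type = memory
agent1.channels.memCh.capacity = 10000

# File channel: slower, but events survive an agent restart
agent1.channels.fileCh.type = file
agent1.channels.fileCh.checkpointDir = /var/flume/checkpoint
agent1.channels.fileCh.dataDirs = /var/flume/data
```

Each source and sink would then be bound to whichever channel matches the value of the events it carries.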

85. Why do we use Flume?

Hadoop developers most often use this tool to get data from social media sites. It was originally developed by Cloudera for aggregating and moving very large amounts of data. Its primary use is to gather log files from different sources and persist them asynchronously in the Hadoop cluster.

86. Which Scala library is used for functional programming?

The Scalaz library provides purely functional data structures that complement the standard Scala library. It has a pre-defined set of foundational type classes such as Monad and Functor.

87. What do you understand by “Unit” and “()” in Scala?

Unit is a subtype of scala.AnyVal and is the Scala equivalent of Java's void: it is the result type of methods that are called only for their side effects. The literal (), sometimes described as the empty tuple, is the single value of type Unit.
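A small sketch of this, using a hypothetical `log` method in an object named `UnitDemo`:

```scala
object UnitDemo {
  // A method called only for its side effect declares Unit as its result type.
  def log(message: String): Unit = println(message)

  def main(args: Array[String]): Unit = {
    // The call still yields a value: the single Unit value, written ().
    val result: Unit = log("job started")
    println(result) // prints ()
  }
}
```

Unlike Java's void, Unit is a real type with a real value, so it can be assigned to a variable or used as a type argument.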

88. What do you understand by a closure in Scala?

A closure is a function in Scala whose return value depends on the value of one or more variables declared outside the function.
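For instance, in the sketch below (the names `ClosureDemo`, `bonus` and `addBonus` are made up for illustration) the function literal refers to `bonus`, which is declared outside it:

```scala
object ClosureDemo {
  // `bonus` is declared outside the function literal below.
  var bonus = 10

  // addBonus is a closure: its result depends on the free variable `bonus`.
  val addBonus: Int => Int = (salary: Int) => salary + bonus

  def main(args: Array[String]): Unit = {
    println(addBonus(100)) // 110
    bonus = 20             // the closure sees the updated value
    println(addBonus(100)) // 120
  }
}
```

The closure captures the variable itself rather than a copy of its value, which is why the second call returns a different result.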

89. List some use cases where classification machine learning algorithms can be used.

  • Natural language processing (the best example is Spoken Language Understanding)
  • Market Segmentation
  • Text Categorization (Spam Filtering)
  • Bioinformatics (Classifying proteins according to their function)
  • Fraud Detection
  • Face detection

90. What is data cleansing?

Data cleaning, also referred to as data cleansing, deals with identifying and removing errors and inconsistencies from data in order to improve data quality.
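A minimal sketch of what such a cleansing step might look like in Scala. The `Customer` record, the `cleanse` method and the validity rule (a non-empty name and an email containing "@") are assumptions chosen for illustration, not a standard.

```scala
object CleansingDemo {
  // Hypothetical record type for the example.
  final case class Customer(name: String, email: String)

  // Trim whitespace, normalise email case, drop records that fail the
  // (deliberately simple) validity rule, and remove exact duplicates.
  def cleanse(records: Seq[Customer]): Seq[Customer] =
    records
      .map(r => Customer(r.name.trim, r.email.trim.toLowerCase))
      .filter(r => r.name.nonEmpty && r.email.contains("@"))
      .distinct

  def main(args: Array[String]): Unit = {
    val raw = Seq(
      Customer(" Ann ", "ANN@example.com"),
      Customer("Ann", "ann@example.com"),
      Customer("Bob", "no-email")
    )
    cleanse(raw).foreach(println)
  }
}
```

Here the two Ann records collapse into one after normalisation, and the Bob record is dropped because its email fails the check.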