61. How does Spark use Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop (HDFS) for storage.
62. What is Spark SQL?
Spark SQL, the successor to the earlier Shark project, is a Spark module for working with structured data and performing structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component is a special kind of RDD called SchemaRDD (renamed DataFrame in Spark 1.3), composed of row objects and a schema that defines the data type of each column in the row. It is similar to a table in a relational database.
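To make the rows-plus-schema idea concrete, here is a minimal plain-Python sketch. It does not use Spark at all; the names `SCHEMA`, `ROWS`, `validate`, and `select_where` are hypothetical and only illustrate how a schema gives each column a name and a type so that a relational-style query can run over the rows.

```python
# Plain-Python illustration of "rows + schema"; this is NOT Spark's API.
# All names here (SCHEMA, ROWS, validate, select_where) are hypothetical.

# The schema defines the name and data type of each column in a row.
SCHEMA = [("name", str), ("age", int)]

# Each row is a tuple whose positions line up with the schema.
ROWS = [("Alice", 34), ("Bob", 19), ("Carol", 52)]


def validate(rows, schema):
    """Return True if every row matches the schema's column count and types."""
    for row in rows:
        if len(row) != len(schema):
            return False
        for value, (_, col_type) in zip(row, schema):
            if not isinstance(value, col_type):
                return False
    return True


def select_where(rows, schema, column, predicate, where_column):
    """Rough equivalent of: SELECT column FROM rows WHERE predicate(where_column)."""
    names = [col_name for col_name, _ in schema]
    out_idx = names.index(column)
    where_idx = names.index(where_column)
    return [row[out_idx] for row in rows if predicate(row[where_idx])]


# Roughly: SELECT name FROM people WHERE age > 30
names_over_30 = select_where(ROWS, SCHEMA, "name", lambda age: age > 30, "age")
```

Because the schema is known up front, a query can refer to columns by name rather than by position, which is what makes the table analogy work.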
63. What are the additional benefits YARN brings in to Hadoop?
YARN makes more effective use of cluster resources, because multiple applications can run on YARN and share a common pool of resources. YARN is backward compatible, so existing MapReduce jobs continue to run. With YARN, one can also run applications that are not based on the MapReduce model.
64. Compare Sqoop and Flume
|Criteria||Sqoop||Flume|
|Application||Importing data from RDBMS||Moving bulk streaming data into HDFS|
|Architecture||Connector-based: connectors know how to connect to the respective data source||Agent-based: agents fetch the data|
|Loading of data||Not event driven||Event driven|
65. What is Sqoop metastore?
The Sqoop metastore is a shared metadata repository that lets remote users define and execute saved jobs created with the sqoop job command. Clients must be configured to connect to the metastore in sqoop-site.xml.
66. What are the main elements of Kafka?
The most important elements of Kafka:
Topic – a named category of messages of a similar kind
Producer – publishes messages to a topic
Consumer – subscribes to one or more topics and reads data from the brokers
Broker – the server where the published messages are stored
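How these four elements relate can be sketched with a toy in-memory model. This is not the Kafka client API: the classes `Broker`, `Producer`, and `Consumer` and their methods are hypothetical, and partitions, replication, and networking are left out entirely.

```python
# Toy in-memory model of topic, producer, consumer, and broker.
# NOT the Kafka client API; every class and method name is hypothetical.

class Broker:
    """Stores published messages, one append-only list per topic."""

    def __init__(self):
        self.topics = {}

    def append(self, topic, message):
        log = self.topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1  # offset of the message just stored

    def read(self, topic, offset):
        # Return every message at or after the given offset.
        return self.topics.get(topic, [])[offset:]


class Producer:
    """Publishes messages to a topic on the broker."""

    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        return self.broker.append(topic, message)


class Consumer:
    """Subscribes to topics and pulls unread messages from the broker."""

    def __init__(self, broker, topics):
        self.broker = broker
        self.positions = {topic: 0 for topic in topics}

    def poll(self):
        received = []
        for topic, offset in self.positions.items():
            messages = self.broker.read(topic, offset)
            received.extend((topic, m) for m in messages)
            # Advance the position so the same messages are not returned twice.
            self.positions[topic] = offset + len(messages)
        return received


broker = Broker()
producer = Producer(broker)
consumer = Consumer(broker, ["clicks", "orders"])

producer.send("clicks", "page=/home")
producer.send("orders", "order=42")
first_poll = consumer.poll()
second_poll = consumer.poll()  # nothing new has been published
```

The point of the sketch is the division of labour: the broker only stores and serves messages, while each consumer tracks how far it has read.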
67. What is Kafka?
Wikipedia defines Kafka as “an open-source message broker project developed by the Apache Software Foundation written in Scala, where the design is heavily influenced by transaction logs”. It is essentially a distributed publish-subscribe messaging system.
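68. What is the role of ZooKeeper in Kafka?
Kafka uses ZooKeeper for cluster coordination. In older Kafka versions, ZooKeeper also stores the offsets of messages consumed for a specific topic and partition by a specific consumer group; newer versions keep these offsets in an internal Kafka topic instead.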
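The offset bookkeeping described here can be pictured as a map from (consumer group, topic, partition) to an offset. The sketch below uses a plain dict as a stand-in for the offset store; the functions `commit` and `fetch` are hypothetical names, not ZooKeeper or Kafka calls.

```python
# Sketch of offset bookkeeping keyed by (consumer group, topic, partition).
# A plain dict stands in for the offset store; all names are hypothetical.

offsets = {}


def commit(group, topic, partition, offset):
    """Record the next offset this group should read from the partition."""
    offsets[(group, topic, partition)] = offset


def fetch(group, topic, partition):
    """Return the committed offset, or 0 if the group has none yet."""
    return offsets.get((group, topic, partition), 0)


partition_log = ["m0", "m1", "m2", "m3"]  # messages in topic "clicks", partition 0

# Group "billing" consumes two messages, then commits its position.
start = fetch("billing", "clicks", 0)
batch = partition_log[start:start + 2]
commit("billing", "clicks", 0, start + len(batch))

# After a restart, "billing" resumes where it left off, while another
# group, "audit", has no committed offset and starts from the beginning.
resumed = partition_log[fetch("billing", "clicks", 0):]
audit_view = partition_log[fetch("audit", "clicks", 0):]
```

Keying by consumer group is what lets several groups read the same partition independently, each at its own pace.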
69. What are the key benefits of using Storm for Real Time Processing?
Easy to operate : Operating Storm is quite easy.
Real fast : A benchmark cited by the Storm project clocked it at over a million tuples processed per second per node.
Fault tolerant : It detects failures automatically and restarts the workers that died.
Reliable : It guarantees that each unit of data will be processed at least once or exactly once.
Scalable : It runs across a cluster of machines.
70. List the different stream groupings in Apache Storm.
- Shuffle grouping
- Fields grouping
- Global grouping
- All grouping
- None grouping
- Direct grouping
- Local or shuffle grouping
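As a rough illustration of how a grouping decides which task receives a tuple, the sketch below models four of the groupings listed above (shuffle, fields, global, all). It is not Storm's API: the function names, the dict-based tuples, and the integer task ids are assumptions made for the example, and random choice is only an approximation of Storm's even distribution for shuffle grouping.

```python
# Sketch of how four stream groupings choose target tasks for a tuple.
# NOT Storm's API; function names and the task-id model are hypothetical.
import random
import zlib


def shuffle_grouping(tasks, rng=random):
    """Pick one task at random so tuples spread roughly evenly over time."""
    return [rng.choice(tasks)]


def fields_grouping(tasks, tup, fields):
    """Tuples with the same values in the chosen fields go to the same task."""
    key = "|".join(str(tup[f]) for f in fields).encode("utf-8")
    # crc32 is stable across runs, unlike the built-in hash() on strings.
    return [tasks[zlib.crc32(key) % len(tasks)]]


def global_grouping(tasks):
    """The whole stream goes to a single task: the one with the lowest id."""
    return [min(tasks)]


def all_grouping(tasks):
    """Every tuple is replicated to all tasks."""
    return list(tasks)


tasks = [3, 1, 2]
tuple_a = {"user": "alice", "url": "/home"}
tuple_b = {"user": "alice", "url": "/cart"}

# Both tuples share the "user" value, so fields grouping picks the same task.
same_task = (
    fields_grouping(tasks, tuple_a, ["user"])
    == fields_grouping(tasks, tuple_b, ["user"])
)
```

Fields grouping is the one to reach for when per-key state matters (for example, counting events per user), since it keeps all tuples for a key on one task.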