Apache Spark Tutorial

What is Apache Spark?

Apache Spark is also an open-source framework, used mainly for data analytics. It runs much faster than Hadoop MapReduce and is designed to run on top of Hadoop. Spark has no separate file system of its own; it integrates with an external one, such as HDFS. A notable feature of Spark is that it does not need YARN in order to function.

Because Spark has no file system of its own for processing data, it is typically installed on top of Hadoop and uses HDFS to store data. Spark also keeps more of its data in memory rather than repeatedly copying it from the physical servers, which reduces the time spent interacting with them.
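
As a minimal sketch of this idea, the Scala snippet below reads a file from HDFS and caches the filtered result in memory, so later actions skip the physical read; the application name and path are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
val sc = spark.sparkContext

// Read a text file from HDFS (placeholder path).
val lines = sc.textFile("hdfs:///data/logs/events.txt")

// cache() keeps the RDD in executor memory after the first action,
// so subsequent actions avoid re-reading from HDFS.
val errors = lines.filter(_.contains("ERROR")).cache()

println(errors.count())                  // first action: reads HDFS, fills the cache
println(errors.take(5).mkString("\n"))   // served from memory
```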

[Figure: Apache Spark, lightning-fast cluster computing]

Spark Built on Hadoop:

The following are the three ways in which Spark can be built on Hadoop:

Standalone – In a standalone deployment, Spark sits on top of HDFS, and space is allocated for HDFS explicitly. Spark and MapReduce then run side by side to cover all Spark jobs on the Hadoop cluster.

YARN – Spark simply runs on YARN with no pre-installation and no root access required. YARN is what integrates Spark into the Hadoop stack (see the sketch after this list).

MapReduce – Spark in MapReduce (SIMR) is used to launch Spark jobs from within MapReduce, in addition to the standalone deployment.
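
As a rough illustration of the first two options, the master URL chosen when building a SparkSession selects the deployment mode. The host name and port below are placeholders; in practice the master is usually passed via spark-submit --master rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

// Standalone: connect directly to the standalone master (placeholder host/port).
val spark = SparkSession.builder
  .appName("DeployDemo")
  .master("spark://master-host:7077")
  .getOrCreate()

// For YARN, use .master("yarn") instead; the cluster location is read from the
// Hadoop configuration (HADOOP_CONF_DIR), which is why no root access is needed.
```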

The Two Abstractions of Spark:

1. Resilient Distributed Dataset (RDD):

An RDD is a collection of data that is split into small parts (partitions) and stored on the worker nodes. RDDs support two kinds of operations: transformations, which lazily describe a new RDD, and actions, which trigger the computation and return a result. An RDD can also be created as a Hadoop dataset from data in HDFS.
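
A minimal sketch of the two kinds of operations; the numbers and partition count are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RddDemo").getOrCreate()
val sc = spark.sparkContext

// An RDD split into 4 partitions across the worker nodes.
val nums = sc.parallelize(1 to 100, numSlices = 4)

// Transformations are lazy: they only describe new RDDs.
val squares = nums.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

// Actions trigger the actual computation and return results to the driver.
println(evens.count())   // action
println(evens.first())   // action
```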

2. Directed Acyclic Graph (DAG):

In the DAG, the nodes are the RDD partitions and the edges are the transformations applied to the data. The DAG eliminates the intermediate storage writes that MapReduce performs between steps, which is what gives Spark its performance advantage over Hadoop.
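
The DAG of an RDD, its lineage, can be inspected with toDebugString. In this sketch the input path is a placeholder; the narrow transformations (flatMap, map) are pipelined inside a single stage, and only the shuffle for reduceByKey starts a new one, with no intermediate results written to HDFS.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DagDemo").getOrCreate()
val sc = spark.sparkContext

val counts = sc.textFile("hdfs:///data/book.txt")    // placeholder path
  .flatMap(_.split("\\s+"))                          // narrow: same stage
  .map(word => (word, 1))                            // narrow: same stage
  .reduceByKey(_ + _)                                // shuffle: new stage

// Prints the lineage (the DAG of transformations) that the scheduler
// splits into stages at the shuffle boundary.
println(counts.toDebugString)
```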

Components of Spark:

Spark Core – The general execution engine of Spark. It provides in-memory computing and can reference datasets held in external storage systems.

Spark SQL – A component on top of Spark Core that introduces a new data abstraction (the DataFrame) and supports both structured and semi-structured data (sketched in code after this list).

Machine Learning Library (MLlib) – A distributed machine-learning framework built on Spark's memory-based architecture. MLlib is about nine times as fast as the disk-based version of Apache Mahout on Hadoop.
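
Here is the Spark SQL sketch referenced in the list above: it loads semi-structured JSON into a DataFrame and queries it both with SQL and with the DataFrame API. The file path and column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SqlDemo").getOrCreate()
import spark.implicits._

// JSON is a typical semi-structured source (placeholder path).
val people = spark.read.json("hdfs:///data/people.json")

// Query through SQL...
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

// ...or through the equivalent DataFrame API.
people.filter($"age" > 30).select("name", "age").show()
```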
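
And a hedged MLlib sketch: clustering a tiny in-memory dataset with KMeans. A real job would load feature vectors from HDFS; the sample points and parameters here are arbitrary.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder.appName("MLlibDemo").getOrCreate()

// A tiny in-memory dataset of feature vectors (arbitrary sample points).
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(1.0, 1.0)),
  Tuple1(Vectors.dense(9.0, 8.0)),
  Tuple1(Vectors.dense(8.0, 9.0))
)).toDF("features")

// Training runs in memory across the cluster, which is what makes MLlib
// much faster than disk-based MapReduce learners such as Mahout.
val model = new KMeans().setK(2).setSeed(1L).fit(df)
model.clusterCenters.foreach(println)
```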

Role of the Driver in Spark:

The Spark driver is the entry and exit point of a Spark application. It runs the application's main function and acts as the master node of the application. The driver splits the execution graph into a number of stages, and it holds the metadata about the small units of execution, which are called tasks.
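
A minimal sketch of a driver program; the application name and data are placeholders. The main method below is the driver: it creates the SparkSession (entry point), builds the execution graph, receives the result, and stops the session (exit point).

```scala
import org.apache.spark.sql.SparkSession

object DriverDemo {
  def main(args: Array[String]): Unit = {
    // Creating the SparkSession makes this process the driver.
    val spark = SparkSession.builder.appName("DriverDemo").getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 1000000, 8)

    // The driver splits this graph into stages and schedules their
    // tasks (the small units of execution) on the executors.
    val total = data.map(_.toLong).reduce(_ + _)

    println(s"total = $total")   // the result comes back to the driver
    spark.stop()                 // driver exit ends the application
  }
}
```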

Role of Executor in Spark:

The main purpose of an executor is to execute Spark tasks. Every Spark application has its own executor processes, and they run for the whole lifetime of the application; this scheme is called the "Static Allocation of Executors".
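
Under static allocation, the number and size of the executors are fixed when the application starts. A hedged sketch of the relevant configuration keys, with example values:

```scala
import org.apache.spark.sql.SparkSession

// Example values only; each executor is a separate JVM process that
// lives for the whole application unless dynamic allocation is enabled.
val spark = SparkSession.builder
  .appName("ExecutorDemo")
  .config("spark.executor.instances", "4")   // number of executor processes
  .config("spark.executor.cores", "2")       // concurrent tasks per executor
  .config("spark.executor.memory", "2g")     // heap size per executor
  .getOrCreate()
```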

Memory Management of Spark:

A Spark application runs as a JVM process, so its memory space is bounded by the heap size. Within that heap, Spark splits memory into several regions. Four regions are distinguished (see the sketch after this list):

  • Execution Memory
  • Storage Memory
  • User Memory
  • Reserved Memory
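
A hedged sketch of how these regions are carved out of the executor heap under the unified memory manager (Spark 1.6 and later). The heap size is an example; the fractions are the documented defaults of spark.memory.fraction and spark.memory.storageFraction.

```scala
// Plain Scala arithmetic illustrating the default split of a 4 GB heap.
val heap      = 4L * 1024 * 1024 * 1024   // e.g. spark.executor.memory=4g
val reserved  = 300L * 1024 * 1024        // Reserved Memory: fixed 300 MB
val usable    = heap - reserved
val unified   = (usable * 0.6).toLong     // spark.memory.fraction = 0.6
val storage   = (unified * 0.5).toLong    // spark.memory.storageFraction = 0.5
val execution = unified - storage         // can borrow unused storage space
val user      = usable - unified          // User Memory: your own objects, UDFs

println(f"storage=${storage / 1e6}%.0f MB, execution=${execution / 1e6}%.0f MB, user=${user / 1e6}%.0f MB")
```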