Introduction to Spark SQL

Meaning of Spark SQL:


Spark SQL is a Spark module for working with structured data through the DataFrame and Dataset abstractions. Queries written against these abstractions pass through Spark SQL's own optimizer, so the same work runs efficiently whether it is expressed as SQL text or through the API. External tools can also connect to Spark through JDBC and ODBC connectors, with Spark SQL acting as a distributed SQL query engine.
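
As a minimal sketch of the idea (assuming a recent Spark release; the input file people.json is hypothetical), a SparkSession is the entry point, and a loaded DataFrame can be queried with plain SQL:

    import org.apache.spark.sql.SparkSession

    // Entry point for Spark SQL functionality.
    val spark = SparkSession.builder()
      .appName("SparkSqlIntro")
      .master("local[*]")
      .getOrCreate()

    // Load structured data into a DataFrame, expose it as a view, query it.
    val people = spark.read.json("people.json")   // hypothetical input file
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()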

Features of Spark SQL:


Integrated – Spark SQL mixes SQL queries with Spark programs, so we can combine declarative queries and complex analytics in one application through this tight integration.
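
As a sketch of that mix (the data and column names are made up), a SQL result is itself a DataFrame, so declarative and programmatic steps chain freely:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("Integrated").master("local[*]").getOrCreate()
    import spark.implicits._

    // Start declaratively with SQL, continue programmatically on the result.
    Seq(("alice", 34), ("bob", 19)).toDF("name", "age").createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.withColumn("name_upper", upper(col("name"))).show()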

Unified Data Access – In Spark SQL we can load and query data from a variety of sources through a single interface.
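
One way to see this (a sketch; the file paths are hypothetical) is that the same read API, and even plain SQL, can reach different storage formats:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("UnifiedAccess").master("local[*]").getOrCreate()

    // One read API covers many sources; only the format changes.
    val fromJson    = spark.read.json("events.json")       // hypothetical paths
    val fromParquet = spark.read.parquet("events.parquet")

    // SQL can also query a file in place, without registering a table first.
    spark.sql("SELECT * FROM parquet.`events.parquet`").show()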

Standard Connectivity – Spark SQL includes a server mode with industry-standard JDBC and ODBC connectivity.
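
A sketch of the client side (assuming the Thrift JDBC/ODBC server has been started with sbin/start-thriftserver.sh and a Hive JDBC driver is on the classpath; host, port, and credentials are placeholders) shows that any JDBC client can talk to Spark SQL:

    import java.sql.DriverManager

    // Connect to the Spark Thrift server like any other JDBC endpoint.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("SELECT 1")
    while (rs.next()) println(rs.getInt(1))
    conn.close()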

Scalability – Spark SQL uses the same engine for both interactive and long-running queries, so it scales from ad-hoc exploration to large batch jobs.

Spark SQL DataFrames:


A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but it can be constructed from a wider array of sources, such as Hive tables. We can create a DataFrame in the following ways (see the sketch after this list):

  • Structured data files
  • Tables in Hive
  • External databases
  • Existing RDDs
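
A sketch of the last route (the data is made up): an existing RDD becomes a DataFrame once we name its columns:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DfFromRdd").master("local[*]").getOrCreate()
    import spark.implicits._

    // Turn an existing RDD of tuples into a DataFrame by naming its columns.
    val rdd = spark.sparkContext.parallelize(Seq(("alice", 34), ("bob", 19)))
    val df  = rdd.toDF("name", "age")
    df.printSchema()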

Main Layers of Spark SQL:


Language API – Spark SQL is accessible from several language APIs (Python, Scala, Java) and also understands HiveQL.

Schema RDD – Spark is built around a core data structure called the RDD. Because Spark SQL works on tables and records, it adds schema information to the RDD (historically the SchemaRDD, now the DataFrame), which we can register as a temporary table and query like any other table.
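
A sketch of that duality (made-up data): the same data answers both as a temporary table and as a DataFrame:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("TempView").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 34), ("bob", 19)).toDF("name", "age")

    // As a temporary table, queried through SQL...
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    // ...and as a DataFrame, queried through the API -- same engine either way.
    df.filter($"age" > 30).select("name").show()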

Data Sources – Spark SQL works with a range of data sources: text and Avro files, Parquet files, JSON documents, Hive tables, and (through a connector) Cassandra databases.
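
A sketch with the generic source API (hypothetical paths; Avro and Cassandra additionally need their connector packages on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("Sources").master("local[*]").getOrCreate()

    // The generic format/load API selects a source by name.
    val parquetDf = spark.read.format("parquet").load("data.parquet")
    val jsonDf    = spark.read.format("json").load("data.json")

    // Writing uses the mirror-image API.
    jsonDf.write.format("parquet").save("data_out.parquet")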

Spark SQL Functions:


Built-in Functions – Built-in functions operate on column values. We make them available with a single import.
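
In Scala that import is org.apache.spark.sql.functions._; a sketch with made-up data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._   // brings the built-in functions into scope

    val spark = SparkSession.builder().appName("Builtins").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 34), ("bob", 19)).toDF("name", "age")
    // upper() and col() are built-ins operating on column values.
    df.select(upper(col("name")), col("age") + 1).show()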

User Defined Functions – When the built-in functions are not enough, Spark SQL lets us wrap ordinary Scala functions (or Python/Java functions) as user defined functions and apply them to columns.
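
A sketch (the function squared is made up): a plain Scala function becomes a UDF either for the DataFrame API or, registered by name, for SQL text:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().appName("Udfs").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 34)).toDF("name", "age")

    // Wrap a Scala function for use with the DataFrame API...
    val squared = udf((x: Int) => x * x)
    df.select(squared(col("age"))).show()

    // ...or register it by name for use inside SQL.
    spark.udf.register("squared", (x: Int) => x * x)
    df.createOrReplaceTempView("people")
    spark.sql("SELECT squared(age) FROM people").show()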

Aggregate Functions – These operate on a group of rows and calculate a single return value per group.
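
For instance (made-up sales data), grouping collapses each group to one output row:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("Aggregates").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("east", 100), ("east", 250), ("west", 75)).toDF("region", "amount")
    // One result row per region.
    sales.groupBy("region").agg(sum("amount"), avg("amount")).show()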

Window Aggregate Functions – These also operate on a group of rows, but they calculate a return value for every row in the group.
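
By contrast (same made-up sales data as above), a window aggregate keeps one output row per input row:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("Windows").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("east", 100), ("east", 250), ("west", 75)).toDF("region", "amount")
    // Every row keeps its identity and gains its group's total alongside it.
    val byRegion = Window.partitionBy("region")
    sales.withColumn("region_total", sum("amount").over(byRegion)).show()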

Uses of Apache Spark SQL:


  • It executes SQL queries.
  • We can read data from an existing Hive installation using Spark SQL.
  • When we run SQL from within another programming language, the result comes back as a Dataset/DataFrame (see the sketch after this list).
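A sketch of the Hive case (assuming an existing Hive installation whose hive-site.xml is on the classpath; the table name is a placeholder):

    import org.apache.spark.sql.SparkSession

    // enableHiveSupport() connects the session to the Hive metastore.
    val spark = SparkSession.builder()
      .appName("HiveRead")
      .enableHiveSupport()
      .getOrCreate()

    // SQL against a Hive table; the result comes back as a DataFrame.
    val df = spark.sql("SELECT * FROM some_hive_table")   // placeholder table
    df.show()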

Conclusion:


Spark SQL is the Apache Spark module for analyzing structured data. It scales from interactive to long-running queries, keeps high compatibility with existing Hive data and tools, and offers standard connectivity through JDBC and ODBC. Together, these make it a natural way to express computations over structured data.