Apache Hadoop Hive Tutorial

What is Hive?


Hive is Data warehousing tool and used to process the data in Hadoop and HDFS.Hive is similar to SQL because it analyzes and processes the data with querying language. Hive runs on MapReduce and top of the Hadoop. Hive also knew as HiveQL.Main Functions of the hive is data summarization, querying, and analysis.

Recommended Reading – Why and Benefits of Hadoop

Apache Hadoop Hive Tutorial

Features of Hive: 


  • Hive processed data from Hadoop and HDFS
  • Hive designed by OLAP
  • It Supports SQL and HiveQl

Data Storage of Hive: 


Metastore: 


It stores all metadata database, tables, and datatypes.Metastore also tracks HDFS data.

Tables:

There are two types of table.

1.First or Normal table

2.Second or External table

HDFS has used the external table for stores the data.The main difference between two tables are when deleting normal table all data also deleted in a normal table but in External table data are not deleted when deleting table of external.

Tables contains following columns

  • Timestamp
  • user id
  • Page_url
  • referrer_url
  • IP address

Partitions: 


Partition is a slice of the table and table contains many partitions.It mainly used to identify the rows and how is data stored.Partitions data are not stored but delivered the result on load. The syntax of partitions declaration is “WHERE” clause in a select statement.

Buckets: 


Partitions data are divided into buckets based on a value of a hash function.Main Purpose of buckets is speeding up the join operation.

Hive is Not: 


  • Hive is not relational database
  • Hive is not a design for OLTP
  • Hive is not a programming language

HiveQL: 


It is a traditional database.HQL stores data in table format and all table contains columns and columns contains rows.All columns having own datatypes and HQL supports all datatypes like primitive and complex.

Primitive Datatype – Integer,Boolean,Bigint,smallint,TinyInt,Double,String,Binary

Complex Datatype – Map, Structs, and list

It supports Data Definition Statements like Create table, Deletes table, Drops table.It also supports Data Manipulation Statements like Load, select, insert

A load is used to take data from HDFS and move to Hive. Insert used to move data from hive to another one. Select is querying data.

Basic Query Statements: 


Create Table: 


CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name

[(col_name data_type [COMMENT col_comment], …)] [COMMENT table_comment]

[PARTITIONED BY (col_name data_type [COMMENT col_comment], …)]

[CLUSTERED BY (col_name, col_name, …) [SORTED BY (col_name [ASC|DESC], …)] INTO num_buckets BUCKETS]

[ [ROW FORMAT row_format] [STORED AS file_format]]

[LOCATION hdfs_path]

Insert Table: 


FROM from_statement

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 …) [IF NOT EXISTS]] select_statement1

[INSERT OVERWRITE TABLE tablename2 [PARTITION … [IF NOT EXISTS]] select_statement2]

[INSERT INTO TABLE tablename2 [PARTITION …] select_statement2] …;

Select Table: 


SELECT [ALL | DISTINCT] select_expr, select_expr, …

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list]

[CLUSTER BY col_list]

[SORT BY col_list]]

[LIMIT number]