Loading HDFS Folder as a Partition of Hive External Table without Data Moving

Hive is a great big-data tool thanks to its capability of translating SQL queries into a series of Map-Reduce tasks. In some cases, however, loading data into Hive becomes a problem of its own. A typical situation goes as follows: a large amount of structured data has been generated by some process and saved on HDFS, and now you are looking for a way to build a database upon that data, so that it can be handled by SQL queries. The problem is that, given the large volume of the data, the high cost of moving it from its birthplace to the Hive data directory seems unavoidable.

In the rest of this post, a practical solution is presented.

The Generation of Data

First of all, you'll have to dump your data onto HDFS in a specific, structured format.

For example, I dumped gzip-compressed, tab-separated, 4-column data, generated by Spark Streaming tasks, to the HDFS location hdfs://namenode/path/to/data/<date>/<hour>/<dstreamid>. Here, <date> and <hour> are the date and hour of generation, respectively, and <dstreamid> is the ID of the direct stream in Spark Streaming. As a consequence, the folder structure will look similar to the listing below:

hdfs://namenode/path/to/data/2019-11-01/13/123456/_SUCCESS
hdfs://namenode/path/to/data/2019-11-01/13/123456/part-00000.gz
hdfs://namenode/path/to/data/2019-11-01/13/123456/part-00001.gz
hdfs://namenode/path/to/data/2019-11-01/13/123456/part-00002.gz
hdfs://namenode/path/to/data/2019-11-01/13/123456/part-00003.gz
hdfs://namenode/path/to/data/2019-11-01/13/123456/part-00004.gz
hdfs://namenode/path/to/data/2019-11-01/13/123456/part-00005.gz

Creating the Hive Table

With the data already on HDFS, we are going to build a Hive table that is compatible with its format. Since moving the data while loading it into the Hive table is not desired, an external table shall be created.

CREATE EXTERNAL TABLE `table_name` (
  `field_1` string,
  `field_2` string,
  `field_3` string,
  `field_4` string)
PARTITIONED BY (
  `date` string,
  `hour` string,
  `dstreamid` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://namenode/path/to/data/';

Here:

  • there are four fields, named field_1 through field_4 (you can rename them on demand);
  • there are three partition fields: date, hour and dstreamid; normally, they should be consistent with the sub-folder levels used when saving the data onto HDFS (date is a keyword in Hive, hence the backticks);
  • the field delimiter is \t, i.e. the tab separator;
  • the input format is org.apache.hadoop.mapred.TextInputFormat;
  • the table location is hdfs://namenode/path/to/data/, the root HDFS path under which all the partition folders reside;
  • the name of the table is table_name, which can be modified on demand.
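
To sanity-check the result, Hive's built-in DESCRIBE and SHOW statements can be used (table_name is the placeholder name from above):

-- Show the columns, partition keys, SerDe and root location of the table.
DESCRIBE FORMATTED `table_name`;
-- No partitions are attached yet, so this returns an empty list for now.
SHOW PARTITIONS `table_name`;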

Loading Data into the Hive Table

With the data generated and the table created, we are ready to load the data into the table. Here, an ALTER TABLE statement will be applied.

ALTER TABLE
  table_name
ADD IF NOT EXISTS PARTITION
  (`date`='2019-11-01', `hour`='13', `dstreamid`='123456')
LOCATION
  'hdfs://namenode/path/to/data/2019-11-01/13/123456';

This instructs Hive to:

  • alter the table named table_name;
  • add the partition only if it does not already exist (the partition keys must match those declared in the CREATE statement: date/hour/dstreamid);
  • map the partition to the data located at hdfs://namenode/path/to/data/2019-11-01/13/123456.
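
By the way, many folders can be attached in one go: a single ALTER TABLE ... ADD statement accepts several PARTITION ... LOCATION pairs. A minimal sketch, assuming a second hourly folder exists for hour 14 of the same day:

ALTER TABLE table_name ADD IF NOT EXISTS
PARTITION (`date`='2019-11-01', `hour`='13', `dstreamid`='123456')
LOCATION 'hdfs://namenode/path/to/data/2019-11-01/13/123456'
PARTITION (`date`='2019-11-01', `hour`='14', `dstreamid`='123456')
LOCATION 'hdfs://namenode/path/to/data/2019-11-01/14/123456';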

After executing this SQL, the HDFS folder is loaded as a partition of the Hive external table, without any data being moved.
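
The result can be verified with SHOW PARTITIONS and an ordinary SELECT; filtering on the partition keys makes Hive scan only the corresponding HDFS folder:

-- List all partitions currently attached to the table.
SHOW PARTITIONS table_name;
-- Query the newly attached partition; the WHERE clause on the partition
-- keys restricts the scan to that single HDFS folder.
SELECT field_1, field_2, field_3, field_4
FROM table_name
WHERE `date`='2019-11-01' AND `hour`='13' AND dstreamid='123456'
LIMIT 10;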

If this post helps you, please consider buying me a cup of coffee.