Flume has a built-in HDFS sink. Importing data into Hive is almost the same as saving data to HDFS directories, with a few small differences. This is a guide to the Flume configuration and the corresponding HiveQL needed to load the data into a table.
Example
The events from the source have headers and bodies in the following format (see another article if you are interested in how to customize events):

Headers: {table: 'TableA', timestamp: 1415912506}
Body: "key1|123|345|2,1,3"
Events can carry different table names, each with its own body format. We want to store all events in different Hive tables based on the table name in their headers. The timestamp header is required if we want to partition the data by date, because the date escape sequences in the configuration file are resolved from the timestamp header.
The Flume configuration for the sink is as below:
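The original embedded snippet is not available here, so the following is a sketch of such a sink configuration. The agent, channel, and sink names (a1, c1, k1), the NameNode address, and the warehouse path are assumptions; the key pieces are the %{table} escape, which expands to the table header of each event, and the %Y-%m-%d date escapes, which require the timestamp header:

```
# Hypothetical agent/channel/sink names; adjust to your topology.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# %{table} is replaced by the "table" header of each event;
# %Y-%m-%d is resolved from the "timestamp" header.
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/user/hive/warehouse/%{table}/dt=%Y-%m-%d
# Write plain text files so Hive can read them directly.
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Roll files every 5 minutes instead of by size or event count.
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
```

Writing each table's events under its own dt=YYYY-MM-DD subdirectory lets Hive treat those directories as date partitions.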
Create the table in Hive:
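The original DDL is not available, so the schema below is a guess inferred from the sample body "key1|123|345|2,1,3" (a string key, two integers, and a comma-separated list kept as a string); the column names and warehouse location are placeholders. The table must be external, partitioned on the same dt key used in the sink path, and delimited by the '|' character used in the event body:

```
-- Hypothetical schema; adjust columns to your actual body format.
CREATE EXTERNAL TABLE TableA (
  row_key STRING,
  col1 INT,
  col2 INT,
  tags STRING  -- comma-separated list, e.g. "2,1,3"
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/hive/warehouse/TableA';
```

Because the table is external and its location matches the sink's hdfs.path, Hive reads the files exactly where Flume writes them, with no extra load step.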
Also, you need to add partitions periodically:
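Hive does not discover new partition directories automatically, so each day's partition must be registered before it is queryable. A statement along these lines (the date 2014-11-13 is the UTC day of the sample timestamp 1415912506; in practice you would substitute the current date from a daily cron or scheduler job) would be:

```
ALTER TABLE TableA ADD IF NOT EXISTS PARTITION (dt='2014-11-13')
LOCATION '/user/hive/warehouse/TableA/dt=2014-11-13';
```

IF NOT EXISTS makes the statement idempotent, so the scheduled job can safely re-run without failing on partitions that were already added.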