When importing data with Flume, you may want to route Flume events to multiple destinations (for example, different directories in HDFS) based on their content. Flume provides a multiplexing channel selector for exactly this purpose; this article is a guide to its configuration.
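As a quick preview, the multiplexing selector is configured on the source. A minimal sketch might look like the following; the agent name `a1`, the header name `type`, and the mapping values are illustrative placeholders, not taken from the article:

```properties
# Hypothetical agent "a1": route events to c1 or c2 based on the "type" header
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.order = c1
a1.sources.r1.selector.mapping.invoice = c2
# Events with any other (or no) "type" header value fall back to c1
a1.sources.r1.selector.default = c1
```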
Flume ships with a built-in HDFS sink. Importing data into Hive is almost the same as saving data to HDFS directories, with one small difference. This is a guide to the Flume configuration and the corresponding HiveQL needed to load the data into a table.
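As a rough sketch of the idea (the agent name, paths, and table definition below are assumptions for illustration), the HDFS sink writes events into a directory, and a Hive external table is then pointed at that directory:

```properties
# Hypothetical agent "a1" persisting events as plain text under /flume/events
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.fileType = DataStream
```

```sql
-- Hypothetical HiveQL: expose the files written by the sink as a queryable table
CREATE EXTERNAL TABLE flume_events (line STRING)
LOCATION '/flume/events';
```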
Flume is an open-source Apache project: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. This article shows how to import XML files with Flume, including the development of a deserializer plugin and the corresponding Flume configuration. We are using Flume 1.5.0 as integrated in MapR.
The scenario is that XML files are synchronized to a directory periodically; we need to configure a Spooling Directory Source to load these XML files into Flume.
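A minimal Spooling Directory Source configuration for this scenario could look like the following (the agent name and the directory path are placeholders):

```properties
# Hypothetical agent "a1" watching a directory for incoming XML files
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /data/incoming/xml
# Fully ingested files are renamed with this suffix so they are not re-read
a1.sources.r1.fileSuffix = .COMPLETED
```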
The default deserializer of Flume's Spooling Directory Source is LineDeserializer, which simply parses each line as a Flume event. In our case, we need to implement a custom deserializer that splits the files into events based on their XML structure.
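The skeleton below is a minimal sketch of what such a plugin involves, not the article's actual implementation: a class implementing Flume's `EventDeserializer` interface plus the nested `Builder` that Flume uses to instantiate it. The class name `XmlDeserializer` and the `readNextRecord()` helper are hypothetical, and the XML-specific parsing is left as a stub.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.serialization.EventDeserializer;
import org.apache.flume.serialization.ResettableInputStream;

/**
 * Hypothetical skeleton of an XML deserializer. The real parsing logic
 * depends on your XML structure; readNextRecord() is a stub where you
 * would extract one logical record (e.g. with a StAX parser).
 */
public class XmlDeserializer implements EventDeserializer {

  private final ResettableInputStream in;
  private volatile boolean isOpen;

  XmlDeserializer(Context context, ResettableInputStream in) {
    this.in = in;       // context could carry custom parser settings
    this.isOpen = true;
  }

  @Override
  public Event readEvent() throws IOException {
    String record = readNextRecord(); // assumption: your XML-chunking logic
    return record == null ? null
        : EventBuilder.withBody(record.getBytes(StandardCharsets.UTF_8));
  }

  @Override
  public List<Event> readEvents(int numEvents) throws IOException {
    List<Event> events = new ArrayList<Event>();
    for (int i = 0; i < numEvents; i++) {
      Event e = readEvent();
      if (e == null) break; // end of file reached
      events.add(e);
    }
    return events;
  }

  @Override
  public void mark() throws IOException { in.mark(); }

  @Override
  public void reset() throws IOException { in.reset(); }

  @Override
  public void close() throws IOException {
    if (isOpen) { in.close(); isOpen = false; }
  }

  // Placeholder: pull the next logical XML record out of the stream,
  // or return null when the file is exhausted.
  private String readNextRecord() throws IOException {
    return null;
  }

  /** Builder required by Flume to instantiate the deserializer. */
  public static class Builder implements EventDeserializer.Builder {
    @Override
    public EventDeserializer build(Context context, ResettableInputStream in) {
      return new XmlDeserializer(context, in);
    }
  }
}
```

It would then be wired into the source configuration via something like `a1.sources.r1.deserializer = com.example.XmlDeserializer$Builder` (a hypothetical fully qualified class name).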
If you need a single-node MapR cluster and cannot use the official MapR sandbox image, you can use this guide to install MapR on a CentOS machine.