When importing data with Flume, you might want to route events to multiple destinations (e.g., different directories in HDFS) based on their content. Flume provides a feature called multiplexing for this purpose; this article is a guide to its configuration.
Implement a Flume Deserializer Plugin to Import XML Files
Background
Flume is an open-source Apache project: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. This article shows how to import XML files with Flume, including the development of a deserializer plugin and the corresponding Flume configuration. We are using Flume 1.5.0 as integrated in MapR.
The scenario is that XML files are synchronized to a directory periodically, so we need to configure a Spooling Directory Source to load these XML files into Flume.
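As a rough illustration, a Spooling Directory Source for this scenario might be declared along the following lines. This is a minimal sketch and not the article's actual configuration: the agent and component names (agent1, xmlSource, memChannel), the directory path, and the deserializer class name are all hypothetical placeholders, and the sink definition is omitted.

```properties
# Hypothetical agent and component names; adjust to your own setup.
agent1.sources = xmlSource
agent1.channels = memChannel

# Spooling Directory Source watching the directory the XML files are synced to.
agent1.sources.xmlSource.type = spooldir
agent1.sources.xmlSource.spoolDir = /path/to/xml/incoming
agent1.sources.xmlSource.channels = memChannel

# Fully qualified name of the custom deserializer's Builder class (placeholder).
agent1.sources.xmlSource.deserializer = com.example.flume.XmlDeserializer$Builder

agent1.channels.memChannel.type = memory
```

The deserializer property accepts either a built-in alias such as LINE or the fully qualified name of a class implementing EventDeserializer.Builder, which is how a custom plugin is wired in.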
Implement a Flume Deserializer
The default deserializer of Flume's Spooling Directory Source is LineDeserializer, which simply parses each line as a Flume event. In our case, we need to implement a deserializer for XML files that follows their structure.
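The core of such a deserializer is splitting an XML stream into one record per event. The sketch below shows only that parsing step, using the JDK's StAX API; it deliberately does not implement Flume's EventDeserializer interface so that it stays self-contained. The class name XmlRecordSplitter, the record tag "event", and the sample document are assumptions made for illustration, and it assumes record elements are not nested inside each other.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: extracts one string per record element from an XML stream.
public class XmlRecordSplitter {

    /** Returns the trimmed text content of each recordTag element, in document order. */
    public static List<String> split(InputStream in, String recordTag) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTDs and external entities, since spooled files may be untrusted.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null only while inside a record element
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && recordTag.equals(reader.getLocalName())) {
                    current = new StringBuilder();
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && current != null) {
                    // Text of nested child elements is concatenated into the record.
                    current.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && recordTag.equals(reader.getLocalName()) && current != null) {
                    records.add(current.toString().trim());
                    current = null;
                }
            }
        } finally {
            reader.close();
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<events><event>first</event><event>second</event></events>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(split(in, "event")); // prints [first, second]
    }
}
```

In an actual plugin, each extracted record would presumably become the body of a Flume event inside the deserializer's readEvent() method, and the class would also need to support the mark and reset operations that the Spooling Directory Source relies on to track its position in the file.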
MapR M3 Single-node Cluster Installation on CentOS 6
If you need a single-node MapR cluster and cannot use the official MapR sandbox image, you can use this guide to install MapR on CentOS.