Hello All,
I'm also new to Flume and was hoping someone could point me in the right
direction or tell me what silly little piece of the puzzle I'm missing. I
apologize if this has been covered, but after searching for a few days I
couldn't find anything that helped. Also, if there's a better-suited group for
this post, just let me know.
I have Flume configured to read from a log4j log file using a tail source and
send the data into an HDFS sink. All of the plumbing seems to work fine: I'm able
to query the data using a quick map reduce job and verify that the entries are
in fact getting into Hadoop. What's interesting (annoying) is that some additional
characters are being added to each entry. Running hadoop dfs -cat
somefile, I get something like this (where [Data_From_The_Log_Here] is properly
formatted and looks valid from what I can tell):
SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.TextY] õpµ^R÷ﳬÕ
;*j 7[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ﳬÕ
;*j
[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ﳬÕ
;*j%
Î[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ﳬÕ
;*jF
½[Data_From_The_Log_Here]
Here's the Flume config:
requestToHDFS.channels = MemoryChannel
requestToHDFS.sinks = HDFS
requestToHDFS.sources = Tail
requestToHDFS.sources.Tail.channels = MemoryChannel
requestToHDFS.sources.Tail.interceptors = ts
requestToHDFS.sources.Tail.interceptors.ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
requestToHDFS.sources.Tail.type = exec
requestToHDFS.sources.Tail.command = tail -F /path/to/someLogFile.log
requestToHDFS.sinks.HDFS.channel = MemoryChannel
requestToHDFS.sinks.HDFS.type = hdfs
requestToHDFS.sinks.HDFS.hdfs.path = hdfs://somehadoopserver:9000/logs/%Y/%m/%d/%H
requestToHDFS.sinks.HDFS.hdfs.file.Type = DataStream
# also tried...
#requestToHDFS.sinks.HDFS.hdfs.file.Type = SequenceFile
requestToHDFS.sinks.HDFS.hdfs.writeFormat = Text
requestToHDFS.sinks.HDFS.hdfs.batchSize = 10
requestToHDFS.sinks.HDFS.hdfs.rollSize = 0
requestToHDFS.sinks.HDFS.hdfs.rollCount = 10000
requestToHDFS.sinks.HDFS.hdfs.rollInterval = 600
requestToHDFS.channels.MemoryChannel.type = memory
requestToHDFS.channels.MemoryChannel.capacity = 10000
requestToHDFS.channels.transactionCapacity = 100
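One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the SEQ header and the LongWritable/Text class names in the output above are what a Hadoop SequenceFile looks like when dumped with -cat, which suggests the DataStream setting is not taking effect. The Flume user guide spells the HDFS sink property as hdfs.fileType (no dot before Type), and lists transactionCapacity as a property of the named memory channel, so the hdfs.file.Type and channels.transactionCapacity lines above may simply be ignored. A minimal sketch of those lines with the documented spelling, reusing the agent, sink, and channel names from the config above:

```properties
# Sketch only: property names as spelled in the Flume user guide
# (HDFS sink and memory channel sections). Not verified against this setup.

# Documented key is hdfs.fileType; when it is not set, the sink defaults to SequenceFile.
requestToHDFS.sinks.HDFS.hdfs.fileType = DataStream
requestToHDFS.sinks.HDFS.hdfs.writeFormat = Text

# transactionCapacity is set per channel, under the channel's own name.
requestToHDFS.channels.MemoryChannel.transactionCapacity = 100
```

If the files are meant to stay in SequenceFile format, hadoop dfs -text somefile (instead of -cat) should decode them into readable key/value lines.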
I'm able to get around the issue by doing some parsing in a map reduce job to
isolate the log entries I want, but it seems like I'm missing something. The
additional characters around each entry seem to contain some data
that Flume uses for sending events across the wire. Is there a way to
strip this out before a record is written to HDFS? Or is this just the way records
are stored in HDFS, so that I need to account for the additional characters when
querying the data? Ideally the entries in Hadoop would look something like
this:
[Data_From_The_Log_Here]
[Data_From_The_Log_Here]
[Data_From_The_Log_Here]
Versions are as follows:
Flume 1.2.0
Subversion https://svn.apache.org/repos/asf/flume/tags/flume-1.2.0-rc1 -r 1360090
Hadoop 1.1.1
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108
Thanks in advance!
Chris