Flume/HDFS Encoding

Cormier, Christopher Fri, 14 Dec 2012 12:48:56 -0800

Hello All,
I'm also a new user of Flume and was hoping someone could point me in the right 
direction or tell me what silly little piece of the puzzle I'm missing.  I 
apologize if this has been covered already, but after searching for a few days 
I couldn't find anything that helped.  Also, if there's a better-suited group 
for this question, just let me know.


I have Flume configured to read from a log4j log file using an exec source 
running tail -F and to send the data into an HDFS sink.  All of the plumbing 
seems to work fine: I'm able to query the data using a quick MapReduce job and 
verify that the entries are in fact getting into Hadoop.  What's interesting 
(annoying) is that some additional characters are being added to each entry.  
Running hadoop dfs -cat somefile, I get something like this (where 
[Data_From_The_Log_Here] is properly formatted and looks valid from what I can 
tell):

SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.TextY] õpµ^R÷ï³¬Õ
;*j     7[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ï³¬Õ ;*j
  [Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ï³¬Õ ;*j%
Î[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ï³¬Õ ;*jF
½[Data_From_The_Log_Here]

Here's the flume config:

requestToHDFS.channels = MemoryChannel
requestToHDFS.sinks = HDFS
requestToHDFS.sources = Tail

requestToHDFS.sources.Tail.channels = MemoryChannel
requestToHDFS.sources.Tail.interceptors = ts
requestToHDFS.sources.Tail.interceptors.ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
requestToHDFS.sources.Tail.type = exec
requestToHDFS.sources.Tail.command = tail -F /path/to/someLogFile.log

requestToHDFS.sinks.HDFS.channel = MemoryChannel
requestToHDFS.sinks.HDFS.type = hdfs
requestToHDFS.sinks.HDFS.hdfs.path = hdfs://somehadoopserver:9000/logs/%Y/%m/%d/%H

requestToHDFS.sinks.HDFS.hdfs.file.Type = DataStream
# also tried...
#requestToHDFS.sinks.HDFS.hdfs.file.Type = SequenceFile

requestToHDFS.sinks.HDFS.hdfs.writeFormat=Text
requestToHDFS.sinks.HDFS.hdfs.batchSize = 10
requestToHDFS.sinks.HDFS.hdfs.rollSize = 0
requestToHDFS.sinks.HDFS.hdfs.rollCount = 10000
requestToHDFS.sinks.HDFS.hdfs.rollInterval = 600

requestToHDFS.channels.MemoryChannel.type = memory
requestToHDFS.channels.MemoryChannel.capacity = 10000
requestToHDFS.channels.MemoryChannel.transactionCapacity = 100
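
One thing worth checking in the sink section above: the Flume user guide 
spells the HDFS sink's file type property hdfs.fileType (default 
SequenceFile), not hdfs.file.Type.  If the dotted spelling is simply ignored, 
which is an assumption here and has not been verified against 1.2.0, the sink 
would fall back to SequenceFile, and that would be consistent with the SEQ 
header and the LongWritable/Text class names in the output.  A minimal sketch 
of the two relevant lines with the documented spelling, everything else left 
as it is:

# Sketch only: property names as spelled in the Flume user guide.
# DataStream asks the sink to write raw event bodies instead of a SequenceFile.
requestToHDFS.sinks.HDFS.hdfs.fileType = DataStream
# Text is one of the documented writeFormat values (the other is Writable).
requestToHDFS.sinks.HDFS.hdfs.writeFormat = Text

Files that were already written as SequenceFiles should still be readable 
without custom parsing: hadoop fs -text somefile decodes SequenceFile records, 
whereas hadoop dfs -cat prints the raw bytes.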

I'm able to get around the issue by doing some parsing in a MapReduce job to 
isolate the log entries I want, but it seems like I'm missing something.  The 
additional characters/encoding/whatever around each entry seem to contain some 
data that Flume uses for sending events across the wire.  Is there a way to 
eliminate this before a record is sent into HDFS?  Or is this just the way 
records are stored in HDFS, so that I need to account for the additional 
characters when querying the data?  Ideally the entries in Hadoop would look 
something like this:

[Data_From_The_Log_Here]
[Data_From_The_Log_Here]
[Data_From_The_Log_Here]

Versions are as follows:
Flume 1.2.0
Subversion https://svn.apache.org/repos/asf/flume/tags/flume-1.2.0-rc1 -r 
1360090
Hadoop 1.1.1
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 
-r 1411108

Thanks in advance!

Chris
