Thomas,

It looks like your data is written out as text. It is possible that even though Flume wrote out the entire event, your HDFS cluster failed to allocate a fresh block after persisting half of it. In that case a dangling partial event can be left in the file - and Flume will retry the whole event because HDFS throws an exception, so you end up with both a truncated copy and a complete one. Either use a binary format in which malformed records can be easily identified and discarded, or make the job that reads the data able to ignore malformed rows. I am not a Hive expert, but I know you can select only the rows from a table that match certain criteria - so making sure your last column is never legitimately null is a good check: if the last column is null, the row may not have been written out completely and can be skipped (SELECT * FROM table WHERE last_col IS NOT NULL).
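As a sketch of the "ignore malformed data" option (the record layout and field names here are just assumptions, not your actual schema): a small pre-filter can drop any line that does not parse as a JSON object before the Hive job ever sees it, which would also catch stray HTTP header lines like the ones you pasted.

```python
import json

def keep_valid_json(lines):
    """Yield only the lines that parse as JSON objects.

    Anything else (e.g. a leaked HTTP request line such as
    'POST / HTTP/1.0') is silently discarded.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except ValueError:
            continue  # malformed row - skip it
        if isinstance(obj, dict):
            yield line

# Example: two good rows with a leaked HTTP request line between them
raw = [
    '{"user": "alice", "event": "login"}',
    'POST / HTTP/1.0',
    '{"user": "bob", "event": "logout"}',
]
print(list(keep_valid_json(raw)))  # keeps only the two JSON rows
```

You could run something like this over the files before loading them, or fold the same check into the job itself.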
Hope this helps.

Hari

--
Hari Shreedharan


On Wednesday, February 27, 2013 at 3:25 AM, Thomas Adam wrote:

> Hi,
>
> I have an issue with my Flume agents, which collect JSON data and save
> it to an HDFS store for Hive. Today my daily job broke because of
> malformed rows. I looked into the files to see what happened, and I
> found something like this:
>
> ...
> POST / HTTP/1.0
> Host: localhost:50000
> Content-Length: 185
> Content-Type: application/x-www-form-urlencoded
> ...
>
> This breaks my JSON SerDe in Hive. IMHO the Flume agents log this
> data themselves; I'm sure that I don't send anything like this.
>
> I have two Flume agents.
> The first one collects data from my application with the HTTPSource:
>
> http.sources = user_events
> http.channels = user_events
> http.sinks = user_events
>
> http.sources.user_events.type = org.apache.flume.source.http.HTTPSource
> http.sources.user_events.port = 50000
> http.sources.user_events.interceptors = timestamp
> http.sources.user_events.interceptors.timestamp.type = timestamp
> http.sources.user_events.channels = user_events
>
> http.channels.user_events.type = memory
> http.channels.user_events.capacity = 100000
> http.channels.user_events.transactionCapacity = 1000
>
> http.sinks.user_events.type = avro
> http.sinks.user_events.channel = user_events
> http.sinks.user_events.hostname = 10.2.0.190
> http.sinks.user_events.port = 20000
> http.sinks.user_events.batch-size = 100
>
> And the second agent puts the data into HDFS:
>
> hdfs.sources = user_events
> hdfs.channels = user_events
> hdfs.sinks = user_events
>
> hdfs.sources.user_events.type = avro
> hdfs.sources.user_events.channels = user_events
> hdfs.sources.user_events.bind = 10.2.0.190
> hdfs.sources.user_events.port = 20000
>
> hdfs.channels.user_events.type = memory
> hdfs.channels.user_events.capacity = 100000
> hdfs.channels.user_events.transactionCapacity = 1000
>
> hdfs.sinks.user_events.type = hdfs
> hdfs.sinks.user_events.channel = user_events
> hdfs.sinks.user_events.hdfs.path = hdfs://10.2.0.190:8020/user/beeswax/warehouse/user_events/dt=%Y-%m-%d/hour=%H
> hdfs.sinks.user_events.hdfs.filePrefix = flume
> hdfs.sinks.user_events.hdfs.rollInterval = 600
> hdfs.sinks.user_events.hdfs.rollSize = 134217728
> hdfs.sinks.user_events.hdfs.rollCount = 0
> hdfs.sinks.user_events.hdfs.batchSize = 1000
> hdfs.sinks.user_events.hdfs.fileType = DataStream
>
> It has worked for 3 months without any problems, and I haven't changed
> anything in that time.
> I use Flume 1.3.0 and CDH 4.1.2.
>
> I hope someone can help me resolve this issue.
>
> Thanks & Regards
> Thomas
>
>
