Thomas,

It looks like your data is written out as text. It is possible that even though Flume wrote out the entire event, your HDFS cluster failed to allocate a fresh block after persisting half of it. In that case a dangling partial event can be left in the file - and Flume will retry the whole event because HDFS throws an exception, so you end up with both a truncated copy and a complete one. Either use a binary format in which malformed records can be easily identified and discarded, or make the job that reads the data able to ignore malformed rows. I am not a Hive expert, but I know you can select only the rows from a table that match certain criteria - so making sure your last column is never legitimately null is a good check: if the last column is null, the row may not have been written out completely and can be skipped (SELECT * FROM table WHERE last_col IS NOT NULL).
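As a sketch of the "ignore malformed data" option (the record layout and field names here are just assumptions, not your actual schema): a small pre-filter can drop any line that does not parse as a JSON object before the Hive job ever sees it, which would also catch stray HTTP header lines like the ones you pasted.

```python
import json

def keep_valid_json(lines):
    """Yield only the lines that parse as JSON objects.

    Anything else (e.g. a leaked HTTP request line such as
    'POST / HTTP/1.0') is silently discarded.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except ValueError:
            continue  # malformed row - skip it
        if isinstance(obj, dict):
            yield line

# Example: two good rows with a leaked HTTP request line between them
raw = [
    '{"user": "alice", "event": "login"}',
    'POST / HTTP/1.0',
    '{"user": "bob", "event": "logout"}',
]
print(list(keep_valid_json(raw)))  # keeps only the two JSON rows
```

You could run something like this over the files before loading them, or fold the same check into the job itself.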
Hope this helps.

Hari

--
Hari Shreedharan


On Wednesday, February 27, 2013 at 3:25 AM, Thomas Adam wrote:

> Hi,
>
> I have an issue with my Flume agents, which collect JSON data and save
> it to an HDFS store for Hive. Today my daily job broke because of
> malformed rows. I looked into the files to see what happened, and I
> found something like this:
>
> ...
> POST / HTTP/1.0
> Host: localhost:50000
> Content-Length: 185
> Content-Type: application/x-www-form-urlencoded
> ...
>
> This breaks my JSON SerDe in Hive. IMHO the Flume agents log this
> data themselves; I'm sure that I don't send anything like this.
>
> I have two Flume agents.
> The first one collects data from my application with the HTTPSource:
>
> http.sources = user_events
> http.channels = user_events
> http.sinks = user_events
>
> http.sources.user_events.type = org.apache.flume.source.http.HTTPSource
> http.sources.user_events.port = 50000
> http.sources.user_events.interceptors = timestamp
> http.sources.user_events.interceptors.timestamp.type = timestamp
> http.sources.user_events.channels = user_events
>
> http.channels.user_events.type = memory
> http.channels.user_events.capacity = 100000
> http.channels.user_events.transactionCapacity = 1000
>
> http.sinks.user_events.type = avro
> http.sinks.user_events.channel = user_events
> http.sinks.user_events.hostname = 10.2.0.190
> http.sinks.user_events.port = 20000
> http.sinks.user_events.batch-size = 100
>
> And the second agent puts the data into HDFS:
>
> hdfs.sources = user_events
> hdfs.channels = user_events
> hdfs.sinks = user_events
>
> hdfs.sources.user_events.type = avro
> hdfs.sources.user_events.channels = user_events
> hdfs.sources.user_events.bind = 10.2.0.190
> hdfs.sources.user_events.port = 20000
>
> hdfs.channels.user_events.type = memory
> hdfs.channels.user_events.capacity = 100000
> hdfs.channels.user_events.transactionCapacity = 1000
>
> hdfs.sinks.user_events.type = hdfs
> hdfs.sinks.user_events.channel = user_events
> hdfs.sinks.user_events.hdfs.path = hdfs://10.2.0.190:8020/user/beeswax/warehouse/user_events/dt=%Y-%m-%d/hour=%H
> hdfs.sinks.user_events.hdfs.filePrefix = flume
> hdfs.sinks.user_events.hdfs.rollInterval = 600
> hdfs.sinks.user_events.hdfs.rollSize = 134217728
> hdfs.sinks.user_events.hdfs.rollCount = 0
> hdfs.sinks.user_events.hdfs.batchSize = 1000
> hdfs.sinks.user_events.hdfs.fileType = DataStream
>
> It has worked for 3 months without any problems, and I haven't changed
> anything in that time.
> I use Flume 1.3.0 and CDH 4.1.2.
>
> I hope someone can help me resolve this issue.
>
> Thanks & Regards
> Thomas
>
>
