Hi,

I have an issue with my Flume agents, which collect JSON data and save it to an HDFS store for Hive. Today my daily job broke because of malformed rows. I looked into the files to see what happened, and I found something like this:
    ... POST / HTTP/1.0 Host: localhost:50000 Content-Length: 185 Content-Type: application/x-www-form-urlencoded ...

This breaks my JSON SerDe in Hive. IMHO the Flume agents are logging this data themselves; I'm sure that I don't send anything like it.

I have two Flume agents. The first one collects data from my application with the HTTPSource:

http.sources = user_events
http.channels = user_events
http.sinks = user_events
http.sources.user_events.type = org.apache.flume.source.http.HTTPSource
http.sources.user_events.port = 50000
http.sources.user_events.interceptors = timestamp
http.sources.user_events.interceptors.timestamp.type = timestamp
http.sources.user_events.channels = user_events
http.channels.user_events.type = memory
http.channels.user_events.capacity = 100000
http.channels.user_events.transactionCapacity = 1000
http.sinks.user_events.type = avro
http.sinks.user_events.channel = user_events
http.sinks.user_events.hostname = 10.2.0.190
http.sinks.user_events.port = 20000
http.sinks.user_events.batch-size = 100

And the second agent puts the data into HDFS:

hdfs.sources = user_events
hdfs.channels = user_events
hdfs.sinks = user_events
hdfs.sources.user_events.type = avro
hdfs.sources.user_events.channels = user_events
hdfs.sources.user_events.bind = 10.2.0.190
hdfs.sources.user_events.port = 20000
hdfs.channels.user_events.type = memory
hdfs.channels.user_events.capacity = 100000
hdfs.channels.user_events.transactionCapacity = 1000
hdfs.sinks.user_events.type = hdfs
hdfs.sinks.user_events.channel = user_events
hdfs.sinks.user_events.hdfs.path = hdfs://10.2.0.190:8020/user/beeswax/warehouse/user_events/dt=%Y-%m-%d/hour=%H
hdfs.sinks.user_events.hdfs.filePrefix = flume
hdfs.sinks.user_events.hdfs.rollInterval = 600
hdfs.sinks.user_events.hdfs.rollSize = 134217728
hdfs.sinks.user_events.hdfs.rollCount = 0
hdfs.sinks.user_events.hdfs.batchSize = 1000
hdfs.sinks.user_events.hdfs.fileType = DataStream

This setup has worked for three months without any problems, and I haven't changed anything in that time. I use Flume 1.3.0 and CDH 4.1.2.

I hope someone can help me resolve this issue.

Thanks & Regards
Thomas
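For context on what a well-formed request to this source looks like: the default handler for HTTPSource, JSONHandler, expects the POST body to be a JSON array of events, each with a "headers" map and a string "body". Below is a minimal sketch of building such a payload and posting it to the agent; the helper names and the localhost:50000 endpoint are my own illustration, not part of the config above.

```python
import json
from urllib import request

def make_flume_payload(records):
    """Wrap application records in the envelope Flume's default
    JSONHandler expects: a JSON array of {"headers": ..., "body": ...}
    events, where "body" must be a string (here, the record as JSON)."""
    return json.dumps(
        [{"headers": {}, "body": json.dumps(rec)} for rec in records]
    )

def post_events(records, url="http://localhost:50000"):
    # Hypothetical endpoint; adjust host/port to match the HTTPSource.
    req = request.Request(
        url,
        data=make_flume_payload(records).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

if __name__ == "__main__":
    # Only build and show the payload here; post_events needs a live agent.
    print(make_flume_payload([{"user": 42, "event": "click"}]))
```

A payload like the raw "POST / HTTP/1.0 ..." text above would not match this envelope, which is why comparing what lands in HDFS against this expected shape can help narrow down where the stray bytes come from.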
