Eric,
  I found the bug just a little bit ago... 

>  agent.sinks.s3.hdfs.batchSize = 10000
> -agent.sinks.s3.hdfs.serializer = avro_event
> -agent.sinks.s3.hdfs.fileType = SequenceFile
> +agent.sinks.s3.hdfs.writeFormat = Text
> +agent.sinks.s3.hdfs.fileType = DataStream
>  agent.sinks.s3.hdfs.timeZone = UTC
> +agent.sinks.s3.hdfs.filePrefix = FlumeData
> +agent.sinks.s3.hdfs.fileSuffix = .avro
> +agent.sinks.s3.serializer = avro_event

  Essentially I was setting the serializer in the wrong part of the 
configuration (agent.sinks.s3.hdfs.serializer instead of 
agent.sinks.s3.serializer), and Flume silently ignored it without any warning. 
Once I fixed that, the avro-tools package reads the files created by this sink 
just fine. It's terribly undocumented, but it does seem to work now.
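For anyone else who hits this, the sanity check is just pointing avro-tools at 
one of the new files; with the corrected config it prints the schema instead of 
throwing "Not a data file". (The file name below is a placeholder for whatever 
timestamped file the sink rolls.)

    java -jar avro-tools-1.7.4.jar getschema FlumeData.<timestamp>.avro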

--Matt

On May 8, 2013, at 1:12 PM, Eric Sammer <[email protected]> wrote:

> Matt:
> 
> This is because what you're actually doing is writing Avro records into 
> Hadoop Sequence Files. The Avro tools only know how to read Avro Data Files 
> (which are, effectively, meant to supersede Sequence Files). The serializer 
> plugin only says "write each event as an Avro record." It doesn't say "write 
> these Avro records as an Avro Data File." It's all very confusing, 
> admittedly. I don't think we support writing Avro Data Files with the HDFS 
> sink today. In other words, you need to use the Sequence File APIs to read 
> the *files* produced by Flume. The records within those files will, however, 
> be Avro records.
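
[For anyone who does need to read the older SequenceFile output Eric describes 
above, a rough, untested sketch using the Hadoop 1.x SequenceFile API is below. 
It assumes the sink's default Writable format (a LongWritable timestamp key and 
a BytesWritable event body); the class name is made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class DumpFlumeSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        // Hadoop 1.x-style reader; the file header declares the key/value classes.
        SequenceFile.Reader reader =
            new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
        try {
            LongWritable key = new LongWritable();   // event timestamp
            BytesWritable val = new BytesWritable(); // raw event body
            while (reader.next(key, val)) {
                // The body bytes are whatever the configured serializer wrote.
                System.out.println(key.get() + "\t"
                    + new String(val.getBytes(), 0, val.getLength(), "UTF-8"));
            }
        } finally {
            reader.close();
        }
    }
}
--Matt]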
> 
> 
> 
> On Wed, May 8, 2013 at 10:42 AM, Matt Wise <[email protected]> wrote:
> We're still working on getting our POC of Flume up and running... right now 
> we have log events that pass through our Flume nodes via a Syslog input and 
> are happily sent off to ElasticSearch for indexing. We're also sending these 
> events to S3, but we're finding that they seem to be unreadable with the avro 
> tools.
> 
>> # S3 Output Sink
>> agent.sinks.s3.type = hdfs
>> agent.sinks.s3.channel = fc1
>> agent.sinks.s3.hdfs.path = s3n://XXX:XXX@our_bucket/flume/events/%y-%m-%d/%H
>> agent.sinks.s3.hdfs.rollInterval = 600
>> agent.sinks.s3.hdfs.rollSize = 0
>> agent.sinks.s3.hdfs.rollCount = 10000
>> agent.sinks.s3.hdfs.batchSize = 10000
>> agent.sinks.s3.hdfs.serializer = avro_event
>> agent.sinks.s3.hdfs.fileType = SequenceFile
>> agent.sinks.s3.hdfs.timeZone = UTC
> 
> 
> When we try to look at the avro-serialized files, we get this error:
> 
>> [localhost avro]$ java -jar avro-tools-1.7.4.jar getschema FlumeData.1367857371493
>> Exception in thread "main" java.io.IOException: Not a data file.
>>         at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
>>         at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
>>         at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:89)
>>         at org.apache.avro.tool.DataFileGetSchemaTool.run(DataFileGetSchemaTool.java:48)
>>         at org.apache.avro.tool.Main.run(Main.java:80)
>>         at org.apache.avro.tool.Main.main(Main.java:69)
> 
> At this point we're a bit unclear: how are we supposed to use these FlumeData 
> files with the normal Avro tools?
> 
> --Matt
> 
> 
> 
> -- 
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com