Eric, I found the bug just a little bit ago...

> agent.sinks.s3.hdfs.batchSize = 10000
> -agent.sinks.s3.hdfs.serializer = avro_event
> -agent.sinks.s3.hdfs.fileType = SequenceFile
> +agent.sinks.s3.hdfs.writeFormat = Text
> +agent.sinks.s3.hdfs.fileType = DataStream
> agent.sinks.s3.hdfs.timeZone = UTC
> +agent.sinks.s3.hdfs.filePrefix = FlumeData
> +agent.sinks.s3.hdfs.fileSuffix = .avro
> +agent.sinks.s3.serializer = avro_event
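Pieced together from the diff above and the original sink config quoted further down in the thread, the working S3 sink presumably ends up looking something like this (the s3n credentials stay redacted, exactly as in the original mail):

  agent.sinks.s3.type = hdfs
  agent.sinks.s3.channel = fc1
  agent.sinks.s3.hdfs.path = s3n://XXX:XXX@our_bucket/flume/events/%y-%m-%d/%H
  agent.sinks.s3.hdfs.rollInterval = 600
  agent.sinks.s3.hdfs.rollSize = 0
  agent.sinks.s3.hdfs.rollCount = 10000
  agent.sinks.s3.hdfs.batchSize = 10000
  agent.sinks.s3.hdfs.writeFormat = Text
  agent.sinks.s3.hdfs.fileType = DataStream
  agent.sinks.s3.hdfs.timeZone = UTC
  agent.sinks.s3.hdfs.filePrefix = FlumeData
  agent.sinks.s3.hdfs.fileSuffix = .avro
  agent.sinks.s3.serializer = avro_event

The key difference is that the serializer is set on the sink itself rather than under the hdfs.* namespace, and the file type is DataStream instead of SequenceFile, so the serializer's output lands in the file as-is.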
Essentially I was setting the serializer in the wrong part of the configuration, and Flume wasn't letting me know. Once I fixed that, using the avro-tools package on the files created by this Sink seems to work just fine. It's terribly undocumented, but it does seem to work now.

--Matt

On May 8, 2013, at 1:12 PM, Eric Sammer <[email protected]> wrote:

> Matt:
>
> This is because what you're actually doing is writing Avro records into
> Hadoop Sequence Files. The Avro tools only know how to read Avro Data Files
> (which are, effectively, meant to supercede Sequence Files). The serializer
> plugin only says "write each event as an Avro record." It doesn't say "write
> these Avro records as an Avro Data File." It's all very confusing,
> admittedly. I don't think we support writing Avro Data Files with the HDFS
> sink today. In other words, you need to use the Sequence File APIs to read
> the *files* produced by Flume. The records within those files will, however,
> be Avro records.
>
>
> On Wed, May 8, 2013 at 10:42 AM, Matt Wise <[email protected]> wrote:
> We're still working on getting our POC of Flume up and running... right now
> we have log events that pass through our Flume nodes via a Syslog input and
> are happily sent off to ElasticSearch for indexing. We're also sending these
> events to S3, but we're finding that they seem to be unreadable with the
> avro tools.
>
>> # S3 Output Sink
>> agent.sinks.s3.type = hdfs
>> agent.sinks.s3.channel = fc1
>> agent.sinks.s3.hdfs.path = s3n://XXX:XXX@our_bucket/flume/events/%y-%m-%d/%H
>> agent.sinks.s3.hdfs.rollInterval = 600
>> agent.sinks.s3.hdfs.rollSize = 0
>> agent.sinks.s3.hdfs.rollCount = 10000
>> agent.sinks.s3.hdfs.batchSize = 10000
>> agent.sinks.s3.hdfs.serializer = avro_event
>> agent.sinks.s3.hdfs.fileType = SequenceFile
>> agent.sinks.s3.hdfs.timeZone = UTC
>
> When we try to look at the avro-serialized files, we get this error:
>
>> [localhost avro]$ java -jar avro-tools-1.7.4.jar getschema FlumeData.1367857371493
>> Exception in thread "main" java.io.IOException: Not a data file.
>>     at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
>>     at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
>>     at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:89)
>>     at org.apache.avro.tool.DataFileGetSchemaTool.run(DataFileGetSchemaTool.java:48)
>>     at org.apache.avro.tool.Main.run(Main.java:80)
>>     at org.apache.avro.tool.Main.main(Main.java:69)
>
> At this point we're a bit unclear on how we're supposed to use these
> FlumeData files with normal Avro tools.
>
> --Matt
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
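For files that were already written with the original fileType = SequenceFile config, Eric's suggestion of going through the Sequence File APIs could look roughly like the sketch below. It is only an illustration: it assumes the Hadoop client jars are on the classpath, makes no assumption about the key/value classes, and just prints whatever the sink actually wrote (the class name DumpFlumeSeqFile is made up for the example).

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.util.ReflectionUtils;

  public class DumpFlumeSeqFile {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Path path = new Path(args[0]);   // a FlumeData.* file, local or on HDFS/S3
          FileSystem fs = path.getFileSystem(conf);
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
          try {
              // The file header records what the sink actually serialized.
              System.out.println("key class:   " + reader.getKeyClassName());
              System.out.println("value class: " + reader.getValueClassName());
              Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
              Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
              while (reader.next(key, val)) {
                  System.out.println(key + "\t" + val);
              }
          } finally {
              reader.close();
          }
      }
  }

A quicker check along the same lines is "hadoop fs -text <file>", which also knows how to dump SequenceFiles.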
