Thanks. I am not generating the avro files with hadoop MR, but with a different process. I plan to just store the files on S3 for potential archive processing with EMR.

Can I use AvroSequenceFile from a non-M/R process to generate sequence files that have my avro records as the values and null keys?

________________________________
From: graham sanderson <[email protected]>
Sent: Sunday, October 13, 2013 9:16 PM
To: [email protected]
Subject: Re: Generating snappy compressed avro files as hadoop map reduce input files

If you're using hadoop, why not use AvroSequenceFileOutputFormat? This works fine with snappy (block-level compression may be best, depending on your data).

On Oct 13, 2013, at 10:58 AM, David Ginzburg <[email protected]> wrote:

As mentioned on Stack Overflow (http://stackoverflow.com/a/15821136), Hadoop's snappy codec just doesn't work with externally generated files.

Can files generated by DataFileWriter (http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29) serve as input files for a map reduce job, especially EMR jobs?

________________________________
From: Bertrand Dechoux <[email protected]>
Sent: Sunday, October 13, 2013 6:36 PM
To: [email protected]
Subject: Re: Generating snappy compressed avro files as hadoop map reduce input files

I am not sure I understand the relation between your problem and the way the temporary data are stored after the map phase. However, I guess you are looking for a DataFileWriter and its setCodec function.

http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29

Regards

Bertrand

PS: A snappy-compressed avro file is not a standard file which has been compressed afterwards, but really a specific file containing compressed blocks. This principle is similar to the SequenceFile's. Maybe that's what you mean by a different snappy codec?

On Sun, Oct 13, 2013 at 5:16 PM, David Ginzburg <[email protected]> wrote:

Hi,

I am writing an application that produces avro record files, to be stored on AWS S3 as possible input to EMR. I would like to compress them with the snappy codec before storing them on S3. It is my understanding that hadoop currently uses a different snappy codec, mostly used as an intermediate map output format.
My question is how can I generate within my application logic (not MR) snappy compressed avro files?
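
To illustrate the point in Bertrand's PS (a container file holding independently compressed blocks, as opposed to a whole file compressed afterwards), here is a rough, JDK-only sketch. It is not the Avro container format and does not use Avro's API: the class name, the magic bytes and the block layout are all made up for illustration, and java.util.zip.Deflater stands in for snappy, which the JDK does not ship. In a real application the DataFileWriter and setCodec call mentioned in the thread would presumably do this work.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockCompressedFile {
    // Hypothetical magic bytes for this toy format (NOT Avro's header).
    static final byte[] MAGIC = {'B', 'L', 'K', 1};

    // Compress one block's bytes. Deflate is an assumption standing in for snappy.
    public static byte[] deflate(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    // Decompress one block back to exactly rawLength bytes.
    public static byte[] inflate(byte[] packed, int rawLength) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(packed);
        byte[] raw = new byte[rawLength];
        int off = 0;
        while (off < rawLength) {
            int n = inf.inflate(raw, off, rawLength - off);
            if (n == 0 && (inf.finished() || inf.needsInput() || inf.needsDictionary())) {
                throw new DataFormatException("truncated block");
            }
            off += n;
        }
        inf.end();
        return raw;
    }

    // Write an uncompressed header, then a sequence of separately compressed blocks.
    public static byte[] write(List<String> records, int recordsPerBlock) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(file);
        out.write(MAGIC);
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            List<String> block = records.subList(i, Math.min(i + recordsPerBlock, records.size()));
            ByteArrayOutputStream rawBytes = new ByteArrayOutputStream();
            DataOutputStream raw = new DataOutputStream(rawBytes);
            for (String r : block) {
                byte[] b = r.getBytes(StandardCharsets.UTF_8);
                raw.writeInt(b.length); // length-prefixed record
                raw.write(b);
            }
            byte[] packed = deflate(rawBytes.toByteArray());
            out.writeInt(block.size());     // records in this block
            out.writeInt(rawBytes.size());  // uncompressed size
            out.writeInt(packed.length);    // compressed size
            out.write(packed);
        }
        return file.toByteArray();
    }

    // Read the header, then decompress block by block.
    public static List<String> read(byte[] file) throws IOException, DataFormatException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("bad magic");
        List<String> records = new ArrayList<>();
        while (in.available() > 0) {
            int count = in.readInt();
            int rawLength = in.readInt();
            byte[] packed = new byte[in.readInt()];
            in.readFully(packed);
            DataInputStream raw =
                new DataInputStream(new ByteArrayInputStream(inflate(packed, rawLength)));
            for (int i = 0; i < count; i++) {
                byte[] b = new byte[raw.readInt()];
                raw.readFully(b);
                records.add(new String(b, StandardCharsets.UTF_8));
            }
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e");
        byte[] file = write(records, 2); // three blocks: 2 + 2 + 1 records
        System.out.println(read(file).equals(records));
    }
}
```

Because the header is left uncompressed and each block is compressed on its own, a generic decompressor run over the whole file would not understand it; a reader has to know the container layout. That may be the "different snappy codec" impression described in the original question.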
