I haven't actually tried writing one that way myself, but take a look at 
AvroSequenceFileOutputFormat (and obviously make sure you have the native 
snappy libraries on your box).

Also, the javadoc on AvroJob setup is IMHO a bit ambiguous - you can totally 
use NullWritable (or any other Hadoop-serializable type) as a key.
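
Untested sketch of what that could look like from a plain (non-MR) Java 
process, assuming the avro-mapred 1.7.x artifact; the schema file and output 
path arguments are placeholders:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.hadoop.io.AvroSequenceFile;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class WriteAvroSequenceFile {
  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(new File(args[0])); // your .avsc
    Configuration conf = new Configuration();
    // Instantiate the codec through ReflectionUtils so it picks up the conf.
    CompressionCodec snappy = ReflectionUtils.newInstance(SnappyCodec.class, conf);
    // NullWritable keys + Avro values, block-compressed with snappy.
    SequenceFile.Writer writer = AvroSequenceFile.createWriter(
        new AvroSequenceFile.Writer.Options()
            .withFileSystem(FileSystem.get(conf))
            .withConfiguration(conf)
            .withOutputPath(new Path(args[1]))
            .withKeyClass(NullWritable.class)
            .withValueSchema(schema)
            .withCompressionType(SequenceFile.CompressionType.BLOCK)
            .withCompressionCodec(snappy));
    try {
      GenericRecord record = new GenericData.Record(schema);
      // ... populate fields with record.put(...) ...
      writer.append(NullWritable.get(), new AvroValue<GenericRecord>(record));
    } finally {
      writer.close();
    }
  }
}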

On Oct 13, 2013, at 2:23 PM, David Ginzburg <[email protected]> wrote:

> Thanks,
> I am not generating the Avro files with Hadoop MR, but with a different 
> process.
> I plan to just store the files on S3 for potential archive processing with 
> EMR.
> Can I use AvroSequenceFile from a non-M/R process to generate the sequence 
> files, with my Avro records as the values and null keys?
> From: graham sanderson <[email protected]>
> Sent: Sunday, October 13, 2013 9:16 PM
> To: [email protected]
> Subject: Re: Generating snappy compressed avro files as hadoop map reduce 
> input files
>  
> If you're using Hadoop, why not use AvroSequenceFileOutputFormat - it works 
> fine with snappy (block-level compression may be best, depending on your data).
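> 
> e.g. a driver setup along these lines (untested sketch; assumes the new-API 
> AvroSequenceFileOutputFormat and AvroJob from avro-mapred 1.7.x on Hadoop 2; 
> valueSchema is a placeholder for your record schema):
> 
> import org.apache.avro.Schema;
> import org.apache.avro.mapreduce.AvroJob;
> import org.apache.avro.mapreduce.AvroSequenceFileOutputFormat;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.compress.SnappyCodec;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
> 
> Job job = Job.getInstance(new Configuration());
> job.setOutputFormatClass(AvroSequenceFileOutputFormat.class);
> job.setOutputKeyClass(NullWritable.class);       // plain Writable key is fine
> AvroJob.setOutputValueSchema(job, valueSchema);  // Avro records as values
> FileOutputFormat.setCompressOutput(job, true);
> FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
> SequenceFileOutputFormat.setOutputCompressionType(job,
>     SequenceFile.CompressionType.BLOCK);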
> 
> On Oct 13, 2013, at 10:58 AM, David Ginzburg <[email protected]> wrote:
> 
>> As mentioned in http://stackoverflow.com/a/15821136, Hadoop's snappy codec 
>> just doesn't work with externally generated files.
>> 
>> Can files generated by DataFileWriter serve as input files for a map-reduce 
>> job, especially EMR jobs?
>> From: Bertrand Dechoux <[email protected]>
>> Sent: Sunday, October 13, 2013 6:36 PM
>> To: [email protected]
>> Subject: Re: Generating snappy compressed avro files as hadoop map reduce 
>> input files
>>  
>> I am not sure I understand the relationship between your problem and the way 
>> the temporary data is stored after the map phase.
>> 
>> However, I guess you are looking for DataFileWriter and its setCodec 
>> method:
>> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29
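>> 
>> For example (minimal sketch; the schema file and output name are 
>> placeholders):
>> 
>> import java.io.File;
>> import java.io.IOException;
>> import org.apache.avro.Schema;
>> import org.apache.avro.file.CodecFactory;
>> import org.apache.avro.file.DataFileWriter;
>> import org.apache.avro.generic.GenericData;
>> import org.apache.avro.generic.GenericDatumWriter;
>> import org.apache.avro.generic.GenericRecord;
>> 
>> Schema schema = new Schema.Parser().parse(new File("record.avsc"));
>> DataFileWriter<GenericRecord> writer =
>>     new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
>> writer.setCodec(CodecFactory.snappyCodec());  // must be set before create()
>> writer.create(schema, new File("records.avro"));
>> GenericRecord record = new GenericData.Record(schema);
>> // ... populate fields with record.put(...) ...
>> writer.append(record);
>> writer.close();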
>> 
>> Regards
>> 
>> Bertrand
>> 
>> PS: A snappy-compressed Avro file is not an ordinary file that has been 
>> compressed afterwards, but a specific container file whose blocks are 
>> compressed. The principle is similar to the SequenceFile's. Maybe that's 
>> what you meant by a different snappy codec?
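>> 
>> You can verify that: the codec name is stored in the container metadata 
>> under the "avro.codec" key, and the reader decompresses each block 
>> transparently, e.g. (sketch):
>> 
>> import java.io.File;
>> import org.apache.avro.file.DataFileReader;
>> import org.apache.avro.generic.GenericDatumReader;
>> import org.apache.avro.generic.GenericRecord;
>> 
>> DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
>>     new File("records.avro"), new GenericDatumReader<GenericRecord>());
>> System.out.println(reader.getMetaString("avro.codec"));  // prints "snappy"
>> for (GenericRecord r : reader) {
>>   // records come back decompressed, block by block
>> }
>> reader.close();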
>> 
>> On Sun, Oct 13, 2013 at 5:16 PM, David Ginzburg <[email protected]> 
>> wrote:
>> Hi,
>> 
>> I am writing an application that produces Avro record files, to be stored 
>> on AWS S3 as possible input to EMR.
>> I would like to compress them with the snappy codec before storing them on S3.
>> It is my understanding that Hadoop currently uses a different snappy codec, 
>> mostly as an intermediate map output format.
>> My question is: how can I generate snappy-compressed Avro files within my 
>> application logic (not MR)?
>> 
> 
