I think I've figured out how to make this work.

Initially I have a file "data.avro". I gzip the whole file into "data.avro.gz" and try to feed it to Hadoop. This does not work.

Avro, however, supports the "deflate" codec natively. So I transcode the file into "data_deflate.avro", feed it to Hadoop, and it works correctly. The file size is slightly larger than if I gzip it as a whole.

I was using avro-tools to do the transcoding. Its command-line handling is irregular, and it took me a lot of trial and error to get it working. The command that works for me is:

java -jar avro-tools-1.7.6.jar recodec --codec=deflate input.avro output.avro
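
For what it's worth, roughly the same recompression can be done in Python, assuming the fastavro package is available (avro-tools is what I actually used, so treat this as a sketch):

import fastavro

# Read the existing container file and rewrite it block-compressed with
# Avro's built-in "deflate" codec. The reader yields records lazily, so
# this streams rather than loading the whole file into memory.
with open('data.avro', 'rb') as src, open('data_deflate.avro', 'wb') as dst:
    reader = fastavro.reader(src)
    fastavro.writer(dst, reader.writer_schema, reader, codec='deflate')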

Wai Yip

[email protected]
Wednesday, July 23, 2014 5:07 PM

I have successfully streamed an Avro data file to Python mrjob using the input format class AvroAsTextInputFormat:


-inputformat org.apache.avro.mapred.AvroAsTextInputFormat


However, unlike plain text files, it does not seem to handle gzipped files automatically. What can I do to stream a gzipped Avro file?
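
For reference, a minimal sketch of this kind of job with mrjob (the class name and the field being counted are placeholders, not my actual job, and the avro-mapred jar still has to be made available to Hadoop, e.g. via -libjars):

from mrjob.job import MRJob
import json

class AvroFieldCount(MRJob):
    # AvroAsTextInputFormat hands each Avro record to the streaming
    # mapper as a single line of JSON text.
    HADOOP_INPUT_FORMAT = 'org.apache.avro.mapred.AvroAsTextInputFormat'

    def mapper(self, _, line):
        record = json.loads(line)
        yield record.get('some_field'), 1   # 'some_field' is a placeholder

    def reducer(self, key, counts):
        yield key, sum(counts)

if __name__ == '__main__':
    AvroFieldCount.run()

Saved as, say, avro_field_count.py, it is launched with the hadoop runner, e.g. "python avro_field_count.py -r hadoop <input path>".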

Wai Yip
