I think I've figured out how to make this work.
Initially I have a file "data.avro". I gzip it as "data.avro.gz" and try
to feed it to Hadoop. This does not work.
Instead, Avro supports the "deflate" codec natively. So I transcode it into
"data_deflate.avro", feed it to Hadoop, and it works correctly. The
file size is slightly larger than if I gzip the whole file.
I was using avro-tools to do the transcoding. Its command-line handling
is irregular, and it took me several tries to get it to work. The
command that works for me is

java -jar avro-tools-1.7.6.jar recodec --codec=deflate input.avro output.avro
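
If you would rather do the transcode in code instead of with avro-tools, here is a minimal sketch using the Apache "avro" Python package. Treat it as an outline, not a tested recipe: the file names are placeholders, and the metadata/schema accessors (get_meta, avro.schema.parse) can differ slightly between avro versions.

# Sketch: re-encode an Avro container file with the built-in "deflate" codec.
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open('data.avro', 'rb'), DatumReader())
# Recover the writer's schema from the container file metadata.
schema = avro.schema.parse(reader.get_meta('avro.schema'))

writer = DataFileWriter(open('data_deflate.avro', 'wb'), DatumWriter(),
                        schema, codec='deflate')
for record in reader:
    writer.append(record)

writer.close()
reader.close()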
Wai Yip
[email protected]
Wednesday, July 23, 2014 5:07 PM
I have successfully streamed an Avro data file to a Python mrjob job
using the AvroAsTextInputFormat input format:
-inputformat org.apache.avro.mapred.AvroAsTextInputFormat
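
For context, the job itself is wired up roughly like the sketch below. It assumes mrjob's HADOOP_INPUT_FORMAT hook, uses 'some_field' as a placeholder field name, and the avro-mapred jar still has to be on the job's classpath; details will vary with your mrjob version.

# Rough sketch of an mrjob job that consumes Avro via AvroAsTextInputFormat.
# AvroAsTextInputFormat hands each Avro record to the mapper as a JSON text line.
import json
from mrjob.job import MRJob

class CountBySomeField(MRJob):
    HADOOP_INPUT_FORMAT = 'org.apache.avro.mapred.AvroAsTextInputFormat'

    def mapper(self, _, line):
        record = json.loads(line)           # one Avro record, JSON-encoded
        yield record.get('some_field'), 1   # 'some_field' is a placeholder

    def reducer(self, key, counts):
        yield key, sum(counts)

if __name__ == '__main__':
    CountBySomeField.run()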
However, unlike text files, it does not seem to handle gzipped files
automatically. What can I do to stream a gzipped Avro file?
Wai Yip