Hi Dan,

You're stepping off the documented path here, but I think that, although it might be a bit of work, it should be possible.
Things to watch out for: you might not be able to use AvroMapper/AvroReducer so easily, and you may have to mess around with the job conf a bit (Avro-configured jobs use their own shuffle config with AvroKeyComparator, which may not be what you want if you're also trying to use Writables). I'd suggest simply reading the code in org.apache.avro.mapred[uce] -- it's not too complicated. (I've pasted a rough, untested sketch of what such a mixed job conf might look like at the bottom of this message, below your quoted mail.)

Whether Avro files or Writables (i.e. Hadoop sequence files) are better for you depends mostly on which format you'd rather have your data in. If you want to read the data files with something other than Hadoop, Avro is definitely a good option. Also, Avro data files are self-describing (thanks to their embedded schema), which makes them pleasant to use with tools like Pig and Hive.

Martin

On 3 July 2013 10:12, Dan Filimon <[email protected]> wrote:
> Hi!
>
> I'm working on integrating Avro into our data processing pipeline.
> We're using quite a few standard Hadoop and Mahout writables (IntWritable,
> VectorWritable).
>
> I'm first going to replace the custom Writables with Avro, but in terms of
> the other ones, how important would you say it is to use AvroKey<Integer>
> instead of IntWritable for example?
>
> The changes will happen gradually but are they even worth it?
>
> Thanks!
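
P.S. Here's roughly what I had in mind -- a sketch only, not something I've tested, and the class names (MixedJob, LineMapper, SumReducer) plus the choice of Mahout's VectorWritable as the value type are just placeholders for illustration:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;

public class MixedJob {

  // Map output: Avro int key, plain Writable value.
  public static class LineMapper
      extends Mapper<LongWritable, Text, AvroKey<Integer>, VectorWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Dummy logic just to show the types: key = line length, value = a vector.
      ctx.write(new AvroKey<Integer>(line.getLength()),
                new VectorWritable(new RandomAccessSparseVector(10)));
    }
  }

  // The reducer sees AvroKey<Integer> keys (sorted/grouped by AvroKeyComparator)
  // alongside ordinary Writable values.
  public static class SumReducer
      extends Reducer<AvroKey<Integer>, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(AvroKey<Integer> key, Iterable<VectorWritable> values, Context ctx)
        throws IOException, InterruptedException {
      for (VectorWritable v : values) {
        ctx.write(new IntWritable(key.datum()), v);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro-key-writable-value");
    job.setJarByClass(MixedJob.class);

    // The important bit: declare the map output key as an Avro int.
    // AvroJob registers AvroSerialization and AvroKeyComparator for the
    // shuffle, while the map output value stays an ordinary Writable.
    AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.INT));
    job.setMapOutputValueClass(VectorWritable.class);

    job.setMapperClass(LineMapper.class);
    job.setReducerClass(SumReducer.class);

    // Final output here is a plain sequence file of Writables.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(VectorWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job above still writes an ordinary sequence file of Writables at the end; if you wanted Avro data files out the other end instead, you'd swap in AvroKeyOutputFormat and call AvroJob.setOutputKeySchema in the same way.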
