Hadoop's writables use Java's java.io.Data{Input,Output}Stream by default
(see org.apache.hadoop.io.serializer.WritableSerialization). This uses a
fixed-length encoding: 4 bytes for an int, 8 bytes for a long.
http://docs.oracle.com/javase/6/docs/api/java/io/DataOutputStream.html#writeInt(int)

Avro-encoded numbers are always variable-length (if you want fixed-length,
use a 'fixed' type in the schema).
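A quick way to see the difference for yourself is a throwaway sketch like the
one below (the class name and the value 1 are arbitrary; it assumes Avro 1.5+
for EncoderFactory and the Hadoop client jars on the classpath):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.IntWritable;

public class IntSizeDemo {
  public static void main(String[] args) throws IOException {
    // Writable path: IntWritable.write() calls DataOutput.writeInt(),
    // so even the value 1 takes 4 bytes.
    ByteArrayOutputStream writableOut = new ByteArrayOutputStream();
    new IntWritable(1).write(new DataOutputStream(writableOut));
    System.out.println("IntWritable(1): " + writableOut.size() + " bytes"); // 4

    // Avro path: the binary encoding uses zig-zag varints,
    // so the value 1 fits in a single byte.
    ByteArrayOutputStream avroOut = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(avroOut, null);
    encoder.writeInt(1);
    encoder.flush();
    System.out.println("Avro int 1: " + avroOut.size() + " bytes"); // 1
  }
}

The Writable side prints 4 regardless of the value; the Avro side prints 1 for
small values and grows to at most 5 bytes for large ints.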
Martin


On 4 July 2013 11:14, Dan Filimon <[email protected]> wrote:

> The documentation for IntWritable doesn't explicitly mention it being
> fixed-length or not [1]. But, given there's also a VIntWritable [2], I
> think IntWritable is always 4 bytes.
>
> [1] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/IntWritable.html
> [2] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/VIntWritable.html
>
> On Thu, Jul 4, 2013 at 1:02 PM, Pradeep Gollakota <[email protected]> wrote:
>
>> Not sure about Avro<Integer> is 4 bytes or not. But IntWritable is
>> variable length. If the number can be represented in less than 4 bytes, it
>> will.
>>
>> On Jul 4, 2013 2:22 AM, "Dan Filimon" <[email protected]> wrote:
>>
>>> Well, I got it working eventually. :)
>>>
>>> First of all, I'll mention that I'm using the new MapReduce API, so no
>>> AvroMapper/AvroReducer voodoo for me. I'm just using the AvroKey<> and
>>> AvroValue<> wrappers and once I set the right properties using AvroJob's
>>> static methods (AvroJob.setMapOutputValueSchema() for example) and set the
>>> input to be an AvroKeyInputFormat, everything worked out fine.
>>>
>>> About the writables, I'm interested to know whether it'd be better to
>>> use Avro equivalent classes: AvroKey<Integer> or IntWritable. I assume the
>>> speed/size of these two should be the same 4 bytes?
>>>
>>> On Thu, Jul 4, 2013 at 2:48 AM, Martin Kleppmann <[email protected]> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> You're stepping off the documented path here, but I think that although
>>>> it might be a bit of work, it should be possible.
>>>>
>>>> Things to watch out for: you might not be able to use
>>>> AvroMapper/AvroReducer so easily, and you may have to mess around with the
>>>> job conf a bit (Avro-configured jobs use their own shuffle config with
>>>> AvroKeyComparator, which may not be what you want if you're also trying to
>>>> use writables). I'd suggest simply reading the code in
>>>> org.apache.avro.mapred[uce] -- it's not too complicated.
>>>>
>>>> Whether Avro files or writables (i.e. Hadoop sequence files) are better
>>>> for you depends mostly on which format you'd rather have your data in. If
>>>> you want to read the data files with something other than Hadoop, Avro is
>>>> definitely a good option. Also, Avro data files are self-describing (due to
>>>> their embedded schema) which makes them pleasant to use with tools like Pig
>>>> and Hive.
>>>>
>>>> Martin
>>>>
>>>> On 3 July 2013 10:12, Dan Filimon <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I'm working on integrating Avro into our data processing pipeline.
>>>>> We're using quite a few standard Hadoop and Mahout writables
>>>>> (IntWritable, VectorWritable).
>>>>>
>>>>> I'm first going to replace the custom Writables with Avro, but in
>>>>> terms of the other ones, how important would you say it is to use
>>>>> AvroKey<Integer> instead of IntWritable for example?
>>>>>
>>>>> The changes will happen gradually but are they even worth it?
>>>>>
>>>>> Thanks!
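For reference, the new-API setup Dan describes above (AvroKey/AvroValue
wrappers configured through AvroJob's static setters, with AvroKeyInputFormat
for the input) boils down to roughly the sketch below. Treat it as an
illustration only: the INT and DOUBLE schemas, the class name, and the use of
Hadoop 2's Job.getInstance are placeholders, not details from the thread.

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvroJobSetup {
  public static Job configure(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "avro-with-writables");

    // Read Avro data files; AvroKeyInputFormat hands the mapper
    // AvroKey<T> keys and NullWritable values.
    job.setInputFormatClass(AvroKeyInputFormat.class);
    AvroJob.setInputKeySchema(job, Schema.create(Schema.Type.INT));

    // Declare schemas only for the Avro-wrapped parts of the shuffle.
    // INT and DOUBLE are placeholder schemas for this sketch.
    AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.INT));
    AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.DOUBLE));

    return job;
  }
}

Anything not covered by an AvroJob setter (output format, reducer output
types, plain Writable values) is configured on the Job in the usual way.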
