The documentation for IntWritable doesn't explicitly say whether it's fixed-length [1], but given that there's also a VIntWritable [2], I think IntWritable is always 4 bytes.

[1] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/IntWritable.html
[2] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/VIntWritable.html
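A quick way to check is to serialize one of each and compare byte counts. Untested sketch, but it only uses standard org.apache.hadoop.io classes:

import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;

public class WritableSizes {
  public static void main(String[] args) throws Exception {
    // IntWritable.write() always emits exactly 4 bytes.
    DataOutputBuffer fixed = new DataOutputBuffer();
    new IntWritable(1).write(fixed);
    System.out.println("IntWritable(1):  " + fixed.getLength() + " bytes"); // 4

    // VIntWritable uses Hadoop's variable-length encoding: values in
    // [-112, 127] take 1 byte; larger values take up to 5 bytes.
    DataOutputBuffer variable = new DataOutputBuffer();
    new VIntWritable(1).write(variable);
    System.out.println("VIntWritable(1): " + variable.getLength() + " bytes"); // 1
  }
}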
On Thu, Jul 4, 2013 at 1:02 PM, Pradeep Gollakota <[email protected]> wrote:

> Not sure whether AvroKey<Integer> is 4 bytes or not, but IntWritable is
> variable length. If the number can be represented in less than 4 bytes,
> it will be.
>
> On Jul 4, 2013 2:22 AM, "Dan Filimon" <[email protected]> wrote:
>
>> Well, I got it working eventually. :)
>>
>> First of all, I'll mention that I'm using the new MapReduce API, so no
>> AvroMapper/AvroReducer voodoo for me. I'm just using the AvroKey<> and
>> AvroValue<> wrappers, and once I set the right properties using AvroJob's
>> static methods (AvroJob.setMapOutputValueSchema(), for example) and set
>> the input format to AvroKeyInputFormat, everything worked out fine.
>>
>> About the writables, I'm interested to know whether it'd be better to
>> use the Avro equivalent classes: AvroKey<Integer> or IntWritable. I
>> assume the speed/size of these two should be the same 4 bytes?
>>
>> On Thu, Jul 4, 2013 at 2:48 AM, Martin Kleppmann <[email protected]> wrote:
>>
>>> Hi Dan,
>>>
>>> You're stepping off the documented path here, but I think that although
>>> it might be a bit of work, it should be possible.
>>>
>>> Things to watch out for: you might not be able to use
>>> AvroMapper/AvroReducer so easily, and you may have to mess around with
>>> the job conf a bit (Avro-configured jobs use their own shuffle config
>>> with AvroKeyComparator, which may not be what you want if you're also
>>> trying to use writables). I'd suggest simply reading the code in
>>> org.apache.avro.mapred[uce] -- it's not too complicated.
>>>
>>> Whether Avro files or writables (i.e. Hadoop sequence files) are better
>>> for you depends mostly on which format you'd rather have your data in.
>>> If you want to read the data files with something other than Hadoop,
>>> Avro is definitely a good option. Also, Avro data files are
>>> self-describing (due to their embedded schema), which makes them
>>> pleasant to use with tools like Pig and Hive.
>>>
>>> Martin
>>>
>>> On 3 July 2013 10:12, Dan Filimon <[email protected]> wrote:
>>>
>>>> Hi!
>>>>
>>>> I'm working on integrating Avro into our data processing pipeline.
>>>> We're using quite a few standard Hadoop and Mahout writables
>>>> (IntWritable, VectorWritable).
>>>>
>>>> I'm first going to replace the custom Writables with Avro, but in
>>>> terms of the other ones, how important would you say it is to use
>>>> AvroKey<Integer> instead of IntWritable, for example?
>>>>
>>>> The changes will happen gradually, but are they even worth it?
>>>>
>>>> Thanks!
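For completeness, the new-API wiring Dan describes would look roughly like
this. An untested sketch: the schemas are placeholders, and only the
AvroJob/AvroKey/AvroKeyInputFormat classes come from the thread itself.

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvroJobSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro-new-api");

    // Read Avro container files; the mapper sees input as AvroKey<T>.
    job.setInputFormatClass(AvroKeyInputFormat.class);
    AvroJob.setInputKeySchema(job, Schema.create(Schema.Type.INT)); // placeholder

    // Schemas for the intermediate data; AvroJob also wires up the Avro
    // serialization (and, per Martin's note, the Avro shuffle config)
    // for the map output key/value types.
    AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING)); // placeholder
    AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.INT));  // placeholder

    // ... set mapper/reducer classes and the output format as usual ...
  }
}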
