Hadoop's writables use Java's java.io.Data{Input,Output}Stream by default
(see org.apache.hadoop.io.serializer.WritableSerialization). This uses a
fixed-length encoding: 4 bytes for an int, 8 bytes for a long.
http://docs.oracle.com/javase/6/docs/api/java/io/DataOutputStream.html#writeInt(int)

Avro-encoded numbers are always variable-length (if you want fixed-length,
use a 'fixed' type in the schema).
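A quick way to see the difference for yourself is a throwaway sketch like the
one below (the class name and the value 1 are arbitrary; it assumes Avro 1.5+
for EncoderFactory and the Hadoop client jars on the classpath):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.IntWritable;

public class IntSizeDemo {
  public static void main(String[] args) throws IOException {
    // Writable path: IntWritable.write() calls DataOutput.writeInt(),
    // so even the value 1 takes 4 bytes.
    ByteArrayOutputStream writableOut = new ByteArrayOutputStream();
    new IntWritable(1).write(new DataOutputStream(writableOut));
    System.out.println("IntWritable(1): " + writableOut.size() + " bytes"); // 4

    // Avro path: the binary encoding uses zig-zag varints,
    // so the value 1 fits in a single byte.
    ByteArrayOutputStream avroOut = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(avroOut, null);
    encoder.writeInt(1);
    encoder.flush();
    System.out.println("Avro int 1: " + avroOut.size() + " bytes"); // 1
  }
}

The Writable side prints 4 regardless of the value; the Avro side prints 1 for
small values and grows to at most 5 bytes for large ints.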
Martin


On 4 July 2013 11:14, Dan Filimon <[email protected]> wrote:

> The documentation for IntWritable doesn't explicitly mention it being
> fixed-length or not [1]. But, given there's also a VIntWritable [2], I
> think IntWritable is always 4 bytes.
>
> [1] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/IntWritable.html
> [2] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/VIntWritable.html
>
> On Thu, Jul 4, 2013 at 1:02 PM, Pradeep Gollakota <[email protected]> wrote:
>
>> Not sure about Avro<Integer> is 4 bytes or not. But IntWritable is
>> variable length. If the number can be represented in less than 4 bytes, it
>> will.
>>
>> On Jul 4, 2013 2:22 AM, "Dan Filimon" <[email protected]> wrote:
>>
>>> Well, I got it working eventually. :)
>>>
>>> First of all, I'll mention that I'm using the new MapReduce API, so no
>>> AvroMapper/AvroReducer voodoo for me. I'm just using the AvroKey<> and
>>> AvroValue<> wrappers and once I set the right properties using AvroJob's
>>> static methods (AvroJob.setMapOutputValueSchema() for example) and set the
>>> input to be an AvroKeyInputFormat, everything worked out fine.
>>>
>>> About the writables, I'm interested to know whether it'd be better to
>>> use Avro equivalent classes: AvroKey<Integer> or IntWritable. I assume the
>>> speed/size of these two should be the same 4 bytes?
>>>
>>> On Thu, Jul 4, 2013 at 2:48 AM, Martin Kleppmann <[email protected]> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> You're stepping off the documented path here, but I think that although
>>>> it might be a bit of work, it should be possible.
>>>>
>>>> Things to watch out for: you might not be able to use
>>>> AvroMapper/AvroReducer so easily, and you may have to mess around with the
>>>> job conf a bit (Avro-configured jobs use their own shuffle config with
>>>> AvroKeyComparator, which may not be what you want if you're also trying to
>>>> use writables). I'd suggest simply reading the code in
>>>> org.apache.avro.mapred[uce] -- it's not too complicated.
>>>>
>>>> Whether Avro files or writables (i.e. Hadoop sequence files) are better
>>>> for you depends mostly on which format you'd rather have your data in. If
>>>> you want to read the data files with something other than Hadoop, Avro is
>>>> definitely a good option. Also, Avro data files are self-describing (due to
>>>> their embedded schema) which makes them pleasant to use with tools like Pig
>>>> and Hive.
>>>>
>>>> Martin
>>>>
>>>> On 3 July 2013 10:12, Dan Filimon <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I'm working on integrating Avro into our data processing pipeline.
>>>>> We're using quite a few standard Hadoop and Mahout writables
>>>>> (IntWritable, VectorWritable).
>>>>>
>>>>> I'm first going to replace the custom Writables with Avro, but in
>>>>> terms of the other ones, how important would you say it is to use
>>>>> AvroKey<Integer> instead of IntWritable for example?
>>>>>
>>>>> The changes will happen gradually but are they even worth it?
>>>>>
>>>>> Thanks!
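For reference, the new-API setup Dan describes above (AvroKey/AvroValue
wrappers configured through AvroJob's static setters, with AvroKeyInputFormat
for the input) boils down to roughly the sketch below. Treat it as an
illustration only: the INT and DOUBLE schemas, the class name, and the use of
Hadoop 2's Job.getInstance are placeholders, not details from the thread.

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvroJobSetup {
  public static Job configure(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "avro-with-writables");

    // Read Avro data files; AvroKeyInputFormat hands the mapper
    // AvroKey<T> keys and NullWritable values.
    job.setInputFormatClass(AvroKeyInputFormat.class);
    AvroJob.setInputKeySchema(job, Schema.create(Schema.Type.INT));

    // Declare schemas only for the Avro-wrapped parts of the shuffle.
    // INT and DOUBLE are placeholder schemas for this sketch.
    AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.INT));
    AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.DOUBLE));

    return job;
  }
}

Anything not covered by an AvroJob setter (output format, reducer output
types, plain Writable values) is configured on the Job in the usual way.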
