Can you share the schema? How big is it? The schema itself is not compressed, so given your small data size it might be dominating.
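To see why that matters: Avro's codecs compress only the data blocks, while the JSON schema sits uncompressed in the container-file header; an external zip compresses the whole file, schema text included. The sketch below models this with a hypothetical reflect-style schema and a single small record, using only `java.util.zip` (no Avro dependency); the schema text, field count, and record bytes are all made-up illustration values, not your actual data.

```java
import java.util.zip.Deflater;

public class CodecVsZip {
    // Compress bytes with DEFLATE and return the compressed size.
    static int deflatedSize(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);
        }
        d.end();
        return total;
    }

    // Returns { header size, codec-style total, zip-style total }.
    static int[] sizes() {
        // Hypothetical reflect-generated schema: large and repetitive, like the
        // JSON schema Avro writes uncompressed into the container-file header.
        StringBuilder sb = new StringBuilder(
            "{\"type\":\"record\",\"name\":\"Product\",\"fields\":[");
        for (int i = 0; i < 200; i++) {
            sb.append("{\"name\":\"field").append(i)
              .append("\",\"type\":[\"null\",\"string\"]},");
        }
        sb.append("{\"name\":\"last\",\"type\":\"int\"}]}");
        byte[] header = sb.toString().getBytes();

        // One small binary record (the single appended object, already compact).
        byte[] record = new byte[64];
        for (int i = 0; i < record.length; i++) record[i] = (byte) (i * 31);

        // Avro-codec style: only the data block is compressed, header stays plain.
        int codecLike = header.length + deflatedSize(record);

        // Zip style: the whole file, schema text included, is compressed.
        byte[] whole = new byte[header.length + record.length];
        System.arraycopy(header, 0, whole, 0, header.length);
        System.arraycopy(record, 0, whole, header.length, record.length);
        int zipLike = deflatedSize(whole);

        return new int[] { header.length, codecLike, zipLike };
    }

    public static void main(String[] args) {
        int[] s = sizes();
        System.out.println("header bytes      : " + s[0]);
        System.out.println("codec-style total : " + s[1]);
        System.out.println("zip-style total   : " + s[2]);
    }
}
```

Because JSON schema text is extremely repetitive, the zip-style total comes out far below the codec-style total, which is the pattern in your measurements: the codecs shaved only the data blocks off 57 KB, while zip also collapsed the schema.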
On Wed, Jul 9, 2014 at 1:20 AM, Sachin Goyal <[email protected]> wrote:

> Hi,
>
> I have been trying to use Avro compression codecs to reduce the size of
> the Avro output.
> The Java object being serialized is pretty large, and here are the results
> of applying different codecs:
>
> Serialization   : Kilobytes
> --------------- : ---------
> Avro (no codec) : 57.3
> Avro (Snappy)   : 52.0
> Avro (Bzip2)    : 51.6
> Avro (Deflate)  : 51.1
> Avro (XZ)       : 51.0
> Direct JSON     : 23.6 (just for comparison, since we also use JSON
>                   heavily; this was done using Jackson)
>
> The Java code I used to try the codecs is as follows:
> ---------------------------------------------------------------------------
> ReflectDatumWriter<Object> datumWriter =
>     new ReflectDatumWriter<>(productObj.getClass(), rdata);
> DataFileWriter<Object> fileWriter = new DataFileWriter<>(datumWriter);
>
> // Try each one of these codecs, one at a time
> fileWriter.setCodec(CodecFactory.snappyCodec());
> fileWriter.setCodec(CodecFactory.bzip2Codec());
> fileWriter.setCodec(CodecFactory.deflateCodec(9));
> fileWriter.setCodec(CodecFactory.xzCodec(5)); // using 9 here caused
>                                               // out-of-memory
>
> // Now check the output size
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> fileWriter.create(schema, baos);
> fileWriter.append(productObj);
> fileWriter.close();
> System.out.println("Avro bytes = " + baos.toByteArray().length);
> ---------------------------------------------------------------------------
>
> Then, on the command line, I applied the normal zip command:
> $ zip output.zip output.avr
> $ ls -l output.*
> This gives me the following output:
>
> 57339 output.avr
>  9081 output.zip (about 16% of the original size!)
>
> So my questions are:
> --------------------
> 1) Why am I not seeing a big reduction in size when applying a codec? Am
> I using the API correctly?
> 2) I understand that the compression achieved by the normal zip command
> would be better than applying codecs inside Avro, but is such a huge
> difference expected?
>
> One thing I expected, and did notice, is that Avro truly shines when the
> number of appended objects grows beyond about 10, because the schema is
> written only once and all the actual objects are appended in binary form.
> That was expected, but the compression-codec output looked a bit
> questionable.
>
> Please suggest if I am doing something wrong.
>
> Thanks
> Sachin

-- 
Sean
