This thread looks useful. Are you flushing too often? http://apache-avro.679487.n3.nabble.com/avro-compression-using-snappy-and-deflate-td3870167.html
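
For what it's worth, here's a rough, untested sketch of what I mean using the Avro Java API (the one-field "Line" schema is just a stand-in for your real one). Each sync interval starts a new compressed block, so a larger interval gives deflate more data to work with per block:

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class AvroDeflateWrite {
  public static void main(String[] args) throws IOException {
    // Stand-in schema: one string field per record, like lines of text.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Line\",\"fields\":"
        + "[{\"name\":\"value\",\"type\":\"string\"}]}");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(9)); // deflate level 9
    // The default sync interval is fairly small; bumping it up means
    // fewer, bigger blocks, which usually compress better.
    writer.setSyncInterval(1 << 20); // ~1MB between sync markers
    writer.create(schema, new File("lines.avro"));

    GenericRecord record = new GenericData.Record(schema);
    record.put("value", "example line");
    writer.append(record);
    writer.close();
  }
}
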
Russell Jurney
http://datasyndrome.com

On Jul 4, 2012, at 6:33 AM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hello,
>
> In my organization we are currently evaluating Avro as a format. Our
> concern is file size. I've done some comparisons on a piece of our
> data.
> Say we have compressed sequence files. The payload (the values) is
> just lines of text. As far as I know, we use line numbers as keys and
> the default codec for compression inside the sequence files. The size
> is 1.6G; when I put the data into Avro with the deflate codec at
> deflate level 9, it becomes 2.2G.
> This is interesting, because the values in the sequence files are just
> strings, whereas Avro has a proper schema with primitive types, and
> those are stored in binary. Shouldn't Avro be smaller?
> I also took another dataset of 28G (gzip files, plain tab-delimited
> text, I don't know the deflate level), put it into Avro, and it became
> 38G.
> Why is Avro so large? Am I missing some size optimization?
>
> Thanks in advance!
