Hello,

In my organization we are currently evaluating Avro as a storage format, and our main concern is file size. I've run some comparisons on a piece of our data.

We have compressed sequence files whose payload (the values) is just text lines. As far as I know, the keys are line numbers and the files use the default codec for compression inside the sequence files. One such dataset is 1.6 GB; when I convert it to Avro with the deflate codec at deflate level 9, it grows to 2.2 GB. This is surprising, because the values in the sequence files are plain strings, while the Avro files have a proper schema with primitive types, which are stored in binary. Shouldn't the Avro files be smaller?

I also took another dataset of 28 GB (gzip files, plain tab-delimited text; I don't know what deflate level was used) and converted it to Avro, and it became 38 GB.

Why are the Avro files so much bigger? Am I missing some size optimization?
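For reference, the conversion writer is set up roughly like this (a minimal sketch using the standard Avro Java API; the schema and file name here are placeholders, the real schema has several primitive-typed fields):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroConvert {
    public static void main(String[] args) throws IOException {
        // Placeholder schema standing in for our real record schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"line\",\"type\":\"string\"}]}");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        // Deflate codec at level 9; must be set before create().
        writer.setCodec(CodecFactory.deflateCodec(9));
        writer.create(schema, new File("output.avro"));

        // One record appended per input line.
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("line", "example value");
        writer.append(rec);

        writer.close();
    }
}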
Thanks in advance!