Hey Doug,

Here is a little more explanation:
http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E

I'll answer your questions later, after some investigation.
Thank you!

On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <cutt...@apache.org> wrote:
> Ruslan,
>
> This is unexpected. Perhaps we can understand it if we have more information.
>
> What Writable class are you using for keys and values in the SequenceFile?
>
> What schema are you using in the Avro data file?
>
> Can you provide small sample files of each and/or code that will reproduce
> this?
>
> Thanks,
>
> Doug
>
> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <metarus...@gmail.com> wrote:
>> Hello,
>>
>> In my organization currently we are evaluating Avro as a format. Our
>> concern is file size. I've done some comparisons of a piece of our
>> data.
>> Say we have sequence files, compressed. The payload (values) are just
>> lines. As far as I know we use line number as keys and we use the
>> default codec for compression inside sequence files. The size is 1.6G,
>> when I put it to avro with deflate codec with deflate level 9 it
>> becomes 2.2G.
>> This is interesting, because the values in seq files are just string,
>> but Avro has a normal schema with primitive types. And those are kept
>> binary. Shouldn't Avro be less in size?
>> Also I took another dataset which is 28G (gzip files, plain
>> tab-delimited text, don't know what is the deflate level) and put it
>> to Avro and it became 38G
>> Why Avro is so big in size? Am I missing some size optimization?
>>
>> Thanks in advance!
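
For anyone trying to reproduce the size comparison in the quoted message, here is a minimal sketch of one thing that can be measured in isolation: how deflate output size changes when the same bytes are compressed as one stream versus as independently compressed blocks. Avro data files apply the codec to each file block separately, so block size is one factor worth checking. This is not a diagnosis of the numbers above. The sketch uses only Python's zlib, not Avro or SequenceFile code, and the sample data, the function names, and the block sizes are all made up for illustration.

```python
import zlib


def compressed_size(data: bytes, level: int = 9) -> int:
    """Size in bytes of `data` compressed as a single zlib/deflate stream."""
    return len(zlib.compress(data, level))


def blockwise_compressed_size(data: bytes, block_size: int, level: int = 9) -> int:
    """Total size when `data` is cut into fixed-size blocks and each block
    is compressed on its own, with no shared history between blocks."""
    total = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        total += len(zlib.compress(block, level))
    return total


# Hypothetical sample: tab-delimited text lines (not the data from the thread).
lines = [f"{i}\tuser{i % 50}\tpage{i % 7}\t{i * 3}\n" for i in range(20000)]
sample = "".join(lines).encode("utf-8")

whole = compressed_size(sample)
large_blocks = blockwise_compressed_size(sample, block_size=64 * 1024)
small_blocks = blockwise_compressed_size(sample, block_size=1024)

print("raw bytes:          ", len(sample))
print("single stream:      ", whole)
print("64 KiB blocks total:", large_blocks)
print("1 KiB blocks total: ", small_blocks)
```

Running the same measurement on a small extract of the real data, at the block size the writer is actually configured with, would show whether block size explains any of the difference or whether the cause lies elsewhere (for example in the schema or in how the records are encoded).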