I changed to the maximal compression level, i.e. 9, but the size is the same.

Ey-Chih Chow
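(Editor's note: a minimal sketch of how that level is typically applied to an old-API mapred job, via the AvroOutputFormat helpers linked as [1] and [2] below; the JobConf setup and the sync-interval value here are illustrative, not the poster's actual configuration:)

    import org.apache.avro.mapred.AvroOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class AvroDeflateJobSettings {
      public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Maximal deflate level for the job's Avro output files ([2] below).
        AvroOutputFormat.setDeflateLevel(conf, 9);

        // Larger blocks between sync markers give the codec more data to
        // compress at once ([1] below); 1 MB is purely illustrative.
        AvroOutputFormat.setSyncInterval(conf, 1 << 20);

        // ... remaining job setup (paths, mapper, AvroJob wiring) omitted.
      }
    }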
On Jul 19, 2012, at 7:07 PM, Harsh J wrote:

> Snappy is known to have lower compression rates than Gzip, but
> perhaps you can try larger blocks in the Avro DataFiles as indicated
> in the thread, via a higher sync-interval? [1] What Snappy is really
> good at is fast decompression, though, so perhaps your reads are
> going to be comparable with gzip plaintext?
>
> P.s. What do you get if you use deflate compression on the data files,
> with maximal compression level (9)? [2]
>
> [1] -
> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setSyncInterval(org.apache.hadoop.mapred.JobConf,%20int)
> or
> http://avro.apache.org/docs/1.7.1/api/java/index.html?org/apache/avro/mapred/AvroOutputFormat.html
>
> [2] -
> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setDeflateLevel(org.apache.hadoop.mapred.JobConf,%20int)
> or via
> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/CodecFactory.html#deflateCodec(int)
> coupled with
> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)
>
> On Thu, Jul 19, 2012 at 5:29 AM, Ey-Chih chow <[email protected]> wrote:
>> We are converting our compression scheme from gzip to Snappy for our JSON
>> logs. In one case, the size of a gzip file is 715 MB and the corresponding
>> Snappy file is 1.885 GB. The schema of the Snappy file is "bytes"; in other
>> words, we compress our JSON logs line by line, and each line is a JSON
>> string. Is there any way we can optimize our compression with Snappy?
>>
>> Ey-Chih Chow
>>
>>
>> On Jul 5, 2012, at 3:19 PM, Doug Cutting wrote:
>>
>>> You can use the Avro command-line tool to dump the metadata, which
>>> will show the schema and codec:
>>>
>>> java -jar avro-tools.jar getmeta <file>
>>>
>>> Doug
>>>
>>> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <[email protected]>
>>> wrote:
>>>> Hey Doug,
>>>>
>>>> Here is a little more explanation:
>>>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>>>> I'll answer your questions later, after some investigation.
>>>>
>>>> Thank you!
>>>>
>>>>
>>>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <[email protected]> wrote:
>>>>> Ruslan,
>>>>>
>>>>> This is unexpected. Perhaps we can understand it if we have more
>>>>> information.
>>>>>
>>>>> What Writable class are you using for keys and values in the SequenceFile?
>>>>>
>>>>> What schema are you using in the Avro data file?
>>>>>
>>>>> Can you provide small sample files of each and/or code that will
>>>>> reproduce this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Doug
>>>>>
>>>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <[email protected]>
>>>>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> In my organization we are currently evaluating Avro as a format. Our
>>>>>> concern is file size. I've done some comparisons on a piece of our
>>>>>> data.
>>>>>> Say we have sequence files, compressed. The payload (values) are just
>>>>>> lines. As far as I know, we use the line number as the key and the
>>>>>> default codec for compression inside the sequence files. The size is
>>>>>> 1.6G; when I put it into Avro with the deflate codec at deflate level 9,
>>>>>> it becomes 2.2G.
>>>>>> This is interesting, because the values in the seq files are just
>>>>>> strings, but Avro has a normal schema with primitive types, and those
>>>>>> are kept binary. Shouldn't Avro be smaller?
>>>>>> Also, I took another dataset, which is 28G (gzip files, plain
>>>>>> tab-delimited text; I don't know what the deflate level is), put it
>>>>>> into Avro, and it became 38G.
>>>>>> Why is Avro so big? Am I missing some size optimization?
>>>>>>
>>>>>> Thanks in advance!
>>
>
>
> --
> Harsh J
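(Editor's note: for reference, a minimal sketch of the standalone file-API path from [2] above, applied to the "bytes"-schema, one-JSON-line-per-record layout described in the thread. It assumes Avro 1.7.x; CodecFactory.snappyCodec() and DataFileWriter.setSyncInterval are taken to be available in that release, and the file name, sample lines, and 1 MB sync interval are purely illustrative:)

    import java.io.File;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;

    public class WriteJsonLinesAsBytes {
      public static void main(String[] args) throws Exception {
        // The "bytes" schema from the thread: each record is one raw JSON line.
        Schema schema = Schema.create(Schema.Type.BYTES);

        DataFileWriter<ByteBuffer> writer =
            new DataFileWriter<>(new GenericDatumWriter<ByteBuffer>(schema));

        // Codec must be set before create(); swap in CodecFactory.snappyCodec()
        // to compare output sizes against deflate level 9.
        writer.setCodec(CodecFactory.deflateCodec(9));

        // Bigger blocks between sync markers give the codec more data to work
        // with per block (1 MB here is illustrative, not a recommendation).
        writer.setSyncInterval(1 << 20);

        writer.create(schema, new File("logs.avro"));
        for (String line : new String[] {"{\"a\": 1}", "{\"b\": 2}"}) {
          writer.append(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)));
        }
        writer.close();
      }
    }

Worth noting: the data-file codecs compress a whole block (everything between two sync markers) at once rather than each record, so with a per-line "bytes" layout the sync interval tends to matter more for the final size than the schema does.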
