We use the Avro tool fromtext to compress a JSON log file. I didn't find an option to set the sync interval.
Ey-Chih Chow

On Jul 20, 2012, at 10:02 AM, Ey-Chih chow wrote:

> I changed to use the maximal compression level, i.e. 9, but the size is the
> same.
>
> Ey-Chih Chow
>
> On Jul 19, 2012, at 7:07 PM, Harsh J wrote:
>
>> Snappy is known to have lower compression rates against Gzip, but
>> perhaps you can try larger blocks in the Avro DataFiles as indicated
>> in the thread, via a higher sync-interval? [1] What snappy is really
>> good at is a fast decompression rate though, so perhaps your reads are
>> going to be comparable with gzip plaintext?
>>
>> P.s. What do you get if you use deflate compression on the data files,
>> with maximal compression level (9)? [2]
>>
>> [1] -
>> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setSyncInterval(org.apache.hadoop.mapred.JobConf,%20int)
>> or
>> http://avro.apache.org/docs/1.7.1/api/java/index.html?org/apache/avro/mapred/AvroOutputFormat.html
>>
>> [2] -
>> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setDeflateLevel(org.apache.hadoop.mapred.JobConf,%20int)
>> or via
>> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/CodecFactory.html#deflateCodec(int)
>> coupled with
>> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)
>>
>> On Thu, Jul 19, 2012 at 5:29 AM, Ey-Chih chow <[email protected]> wrote:
>>> We are converting our compression scheme from gzip to snappy for our json
>>> logs. In one case, the size of a gzip file is 715MB and the corresponding
>>> snappy file is 1.885GB. The schema of the snappy file is "bytes". In
>>> other words, we compress line by line of our json logs and each line is a
>>> json string. Is there any way we can optimize our compression with snappy?
>>>
>>> Ey-Chih Chow
>>>
>>>
>>> On Jul 5, 2012, at 3:19 PM, Doug Cutting wrote:
>>>
>>>> You can use the Avro command-line tool to dump the metadata, which
>>>> will show the schema and codec:
>>>>
>>>> java -jar avro-tools.jar getmeta <file>
>>>>
>>>> Doug
>>>>
>>>> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <[email protected]>
>>>> wrote:
>>>>> Hey Doug,
>>>>>
>>>>> Here is a little more of explanation
>>>>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>>>>> I'll answer your questions later after some investigation
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <[email protected]> wrote:
>>>>>> Rusian,
>>>>>>
>>>>>> This is unexpected. Perhaps we can understand it if we have more
>>>>>> information.
>>>>>>
>>>>>> What Writable class are you using for keys and values in the
>>>>>> SequenceFile?
>>>>>>
>>>>>> What schema are you using in the Avro data file?
>>>>>>
>>>>>> Can you provide small sample files of each and/or code that will
>>>>>> reproduce this?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Doug
>>>>>>
>>>>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <[email protected]>
>>>>>> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> In my organization currently we are evaluating Avro as a format. Our
>>>>>>> concern is file size. I've done some comparisons of a piece of our
>>>>>>> data.
>>>>>>> Say we have sequence files, compressed. The payload (values) are just
>>>>>>> lines. As far as I know we use line number as keys and we use the
>>>>>>> default codec for compression inside sequence files. The size is 1.6G,
>>>>>>> when I put it to avro with deflate codec with deflate level 9 it
>>>>>>> becomes 2.2G.
>>>>>>> This is interesting, because the values in seq files are just string,
>>>>>>> but Avro has a normal schema with primitive types. And those are kept
>>>>>>> binary. Shouldn't Avro be less in size?
>>>>>>> Also I took another dataset which is 28G (gzip files, plain
>>>>>>> tab-delimited text, don't know what is the deflate level) and put it
>>>>>>> to Avro and it became 38G
>>>>>>> Why Avro is so big in size? Am I missing some size optimization?
>>>>>>>
>>>>>>> Thanks in advance!
>>>
>>
>>
>>
>> --
>> Harsh J
>
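
A note on the "larger blocks" suggestion quoted above: a compressor can only exploit repetition inside the chunk it is handed, so many small chunks tend to compress worse in total than a few large ones. The sketch below illustrates that general effect only. It is not Avro code and it does not reproduce the file sizes reported in the thread: it uses java.util.zip.Deflater from the JDK in place of Avro's codecs, and the class name BlockSizeDemo, its helper methods, and the synthetic JSON-like lines are all assumptions made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class BlockSizeDemo {

    // Deflate one buffer at the given level and return the compressed byte count.
    public static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // only the output length matters here
        }
        deflater.end();
        return total;
    }

    // Synthetic JSON-like log lines (hypothetical data, not the logs from the thread).
    public static String[] sampleLines(int n) {
        String[] lines = new String[n];
        for (int i = 0; i < n; i++) {
            lines[i] = "{\"ts\":" + (1342700000L + i)
                    + ",\"level\":\"INFO\",\"user\":\"u" + (i % 97)
                    + "\",\"msg\":\"request served\"}";
        }
        return lines;
    }

    // Total compressed size when every `linesPerBlock` lines are compressed as one chunk.
    public static long totalCompressed(String[] lines, int linesPerBlock, int level) {
        long total = 0;
        StringBuilder block = new StringBuilder();
        int inBlock = 0;
        for (String line : lines) {
            block.append(line).append('\n');
            if (++inBlock == linesPerBlock) {
                total += deflatedSize(block.toString().getBytes(StandardCharsets.UTF_8), level);
                block.setLength(0);
                inBlock = 0;
            }
        }
        if (inBlock > 0) { // flush the final, partially filled chunk
            total += deflatedSize(block.toString().getBytes(StandardCharsets.UTF_8), level);
        }
        return total;
    }

    public static void main(String[] args) {
        String[] lines = sampleLines(10000);
        System.out.println("1 line per chunk:     " + totalCompressed(lines, 1, 9));
        System.out.println("1000 lines per chunk: " + totalCompressed(lines, 1000, 9));
    }
}
```

Running main prints the two totals side by side. In an Avro data file the corresponding knobs are the ones linked in the thread: AvroOutputFormat.setSyncInterval for MapReduce jobs, and CodecFactory.deflateCodec(int) coupled with DataFileWriter.setCodec when writing files directly.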
