We use the avro-tools fromtext command to compress a JSON log file.  I didn't 
find an option that sets the sync interval.
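
If it helps, here is a minimal sketch of doing the same thing programmatically 
with DataFileWriter.setSyncInterval (the class name, output path, and the 1MB 
figure are just placeholders, and it assumes snappy-java is on the classpath):

import java.io.File;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;

public class SnappyWithSyncInterval {
  public static void main(String[] args) throws Exception {
    // Same "bytes" schema that fromtext produces, one record per log line.
    Schema schema = Schema.create(Schema.Type.BYTES);
    DataFileWriter<ByteBuffer> writer =
        new DataFileWriter<ByteBuffer>(new GenericDatumWriter<ByteBuffer>(schema));
    writer.setCodec(CodecFactory.snappyCodec());
    writer.setSyncInterval(1 << 20);  // ~1MB blocks, so snappy sees more data per block
    writer.create(schema, new File("out.avro"));
    writer.append(ByteBuffer.wrap("{\"example\":\"json line\"}".getBytes("UTF-8")));
    writer.close();
  }
}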

Ey-Chih Chow

On Jul 20, 2012, at 10:02 AM, Ey-Chih chow wrote:

> I changed to use the maximal compression level, i.e. 9, but the size is the 
> same.
> 
> Ey-Chih Chow
> 
> On Jul 19, 2012, at 7:07 PM, Harsh J wrote:
> 
>> Snappy is known to have lower compression ratios than Gzip, but
>> perhaps you can try larger blocks in the Avro data files, as indicated
>> in the thread, via a higher sync interval? [1] What Snappy is really
>> good at is fast decompression, though, so perhaps your reads will be
>> comparable with gzipped plaintext?
>> 
>> P.S. What do you get if you use deflate compression on the data files,
>> with maximal compression level (9)? [2]
>> 
>> [1] - 
>> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setSyncInterval(org.apache.hadoop.mapred.JobConf,%20int)
>> or 
>> http://avro.apache.org/docs/1.7.1/api/java/index.html?org/apache/avro/mapred/AvroOutputFormat.html
>> 
>> [2] - 
>> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setDeflateLevel(org.apache.hadoop.mapred.JobConf,%20int)
>> or via 
>> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/CodecFactory.html#deflateCodec(int)
>> coupled with 
>> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)
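>> 
>> A rough sketch of wiring both of those into a job's setup (the class
>> name is just a placeholder; it assumes avro-mapred and Hadoop on the
>> classpath):
>> 
>> import org.apache.avro.mapred.AvroOutputFormat;
>> import org.apache.hadoop.mapred.JobConf;
>> 
>> public class AvroCompressionSettings {
>>   public static JobConf apply(JobConf conf) {
>>     AvroOutputFormat.setSyncInterval(conf, 1 << 20); // larger blocks between sync markers
>>     AvroOutputFormat.setDeflateLevel(conf, 9);       // maximal deflate compression
>>     return conf;
>>   }
>> }
>> 
>> Outside MapReduce, the same effect comes from CodecFactory.deflateCodec(9)
>> passed to DataFileWriter.setCodec, per the second set of links.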
>> 
>> On Thu, Jul 19, 2012 at 5:29 AM, Ey-Chih chow <[email protected]> wrote:
>>> We are converting our compression scheme from gzip to snappy for our JSON 
>>> logs.  In one case, the size of a gzip file is 715MB and the corresponding 
>>> snappy file is 1.885GB.  The schema of the snappy file is "bytes".  In 
>>> other words, we compress our JSON logs line by line, and each line is a 
>>> JSON string.  Is there any way we can optimize our compression with snappy?
>>> 
>>> Ey-Chih Chow
>>> 
>>> 
>>> On Jul 5, 2012, at 3:19 PM, Doug Cutting wrote:
>>> 
>>>> You can use the Avro command-line tool to dump the metadata, which
>>>> will show the schema and codec:
>>>> 
>>>> java -jar avro-tools.jar getmeta <file>
>>>> 
>>>> Doug
>>>> 
>>>> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <[email protected]> 
>>>> wrote:
>>>>> Hey Doug,
>>>>> 
>>>>> Here is a little more explanation:
>>>>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>>>>> I'll answer your questions later, after some investigation.
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> 
>>>>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <[email protected]> wrote:
>>>>>> Ruslan,
>>>>>> 
>>>>>> This is unexpected.  Perhaps we can understand it if we have more 
>>>>>> information.
>>>>>> 
>>>>>> What Writable class are you using for keys and values in the 
>>>>>> SequenceFile?
>>>>>> 
>>>>>> What schema are you using in the Avro data file?
>>>>>> 
>>>>>> Can you provide small sample files of each and/or code that will 
>>>>>> reproduce this?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Doug
>>>>>> 
>>>>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <[email protected]> 
>>>>>> wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> In my organization we are currently evaluating Avro as a format. Our
>>>>>>> concern is file size. I've done some comparisons on a piece of our
>>>>>>> data.
>>>>>>> Say we have sequence files, compressed. The payloads (values) are just
>>>>>>> lines. As far as I know we use line numbers as keys, and we use the
>>>>>>> default codec for compression inside the sequence files. The size is 1.6G;
>>>>>>> when I put it into Avro with the deflate codec at deflate level 9 it
>>>>>>> becomes 2.2G.
>>>>>>> This is interesting, because the values in the sequence files are just
>>>>>>> strings, but Avro has a normal schema with primitive types, and those
>>>>>>> are stored in binary. Shouldn't Avro be smaller?
>>>>>>> I also took another dataset which is 28G (gzip files, plain
>>>>>>> tab-delimited text, I don't know what the deflate level is), put it
>>>>>>> into Avro, and it became 38G.
>>>>>>> Why is Avro so big? Am I missing some size optimization?
>>>>>>> 
>>>>>>> Thanks in advance!
>>> 
>> 
>> 
>> 
>> -- 
>> Harsh J
> 
