I changed to the maximal compression level, i.e. 9, but the size is the same.

Ey-Chih Chow
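(Editor's note: a minimal sketch of how that level is typically applied to an old-API mapred job, via the AvroOutputFormat helpers linked as [1] and [2] below; the JobConf setup and the sync-interval value here are illustrative, not the poster's actual configuration:)

    import org.apache.avro.mapred.AvroOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class AvroDeflateJobSettings {
      public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Maximal deflate level for the job's Avro output files ([2] below).
        AvroOutputFormat.setDeflateLevel(conf, 9);

        // Larger blocks between sync markers give the codec more data to
        // compress at once ([1] below); 1 MB is purely illustrative.
        AvroOutputFormat.setSyncInterval(conf, 1 << 20);

        // ... remaining job setup (paths, mapper, AvroJob wiring) omitted.
      }
    }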
On Jul 19, 2012, at 7:07 PM, Harsh J wrote:

> Snappy is known to have lower compression rates than Gzip, but
> perhaps you can try larger blocks in the Avro DataFiles as indicated
> in the thread, via a higher sync-interval? [1] What Snappy is really
> good at is fast decompression, though, so perhaps your reads are
> going to be comparable with gzip plaintext?
>
> P.s. What do you get if you use deflate compression on the data files,
> with maximal compression level (9)? [2]
>
> [1] -
> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setSyncInterval(org.apache.hadoop.mapred.JobConf,%20int)
> or
> http://avro.apache.org/docs/1.7.1/api/java/index.html?org/apache/avro/mapred/AvroOutputFormat.html
>
> [2] -
> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setDeflateLevel(org.apache.hadoop.mapred.JobConf,%20int)
> or via
> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/CodecFactory.html#deflateCodec(int)
> coupled with
> http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)
>
> On Thu, Jul 19, 2012 at 5:29 AM, Ey-Chih chow <[email protected]> wrote:
>> We are converting our compression scheme from gzip to Snappy for our JSON
>> logs. In one case, the size of a gzip file is 715 MB and the corresponding
>> Snappy file is 1.885 GB. The schema of the Snappy file is "bytes"; in other
>> words, we compress our JSON logs line by line, and each line is a JSON
>> string. Is there any way we can optimize our compression with Snappy?
>>
>> Ey-Chih Chow
>>
>>
>> On Jul 5, 2012, at 3:19 PM, Doug Cutting wrote:
>>
>>> You can use the Avro command-line tool to dump the metadata, which
>>> will show the schema and codec:
>>>
>>> java -jar avro-tools.jar getmeta <file>
>>>
>>> Doug
>>>
>>> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <[email protected]>
>>> wrote:
>>>> Hey Doug,
>>>>
>>>> Here is a little more explanation:
>>>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>>>> I'll answer your questions later, after some investigation.
>>>>
>>>> Thank you!
>>>>
>>>>
>>>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <[email protected]> wrote:
>>>>> Ruslan,
>>>>>
>>>>> This is unexpected. Perhaps we can understand it if we have more
>>>>> information.
>>>>>
>>>>> What Writable class are you using for keys and values in the SequenceFile?
>>>>>
>>>>> What schema are you using in the Avro data file?
>>>>>
>>>>> Can you provide small sample files of each and/or code that will
>>>>> reproduce this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Doug
>>>>>
>>>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <[email protected]>
>>>>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> In my organization we are currently evaluating Avro as a format. Our
>>>>>> concern is file size. I've done some comparisons on a piece of our
>>>>>> data.
>>>>>> Say we have sequence files, compressed. The payload (values) are just
>>>>>> lines. As far as I know, we use the line number as the key and the
>>>>>> default codec for compression inside the sequence files. The size is
>>>>>> 1.6G; when I put it into Avro with the deflate codec at deflate level 9,
>>>>>> it becomes 2.2G.
>>>>>> This is interesting, because the values in the seq files are just
>>>>>> strings, but Avro has a normal schema with primitive types, and those
>>>>>> are kept binary. Shouldn't Avro be smaller?
>>>>>> Also, I took another dataset, which is 28G (gzip files, plain
>>>>>> tab-delimited text; I don't know what the deflate level is), put it
>>>>>> into Avro, and it became 38G.
>>>>>> Why is Avro so big? Am I missing some size optimization?
>>>>>>
>>>>>> Thanks in advance!
>>
>
>
> --
> Harsh J
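(Editor's note: for reference, a minimal sketch of the standalone file-API path from [2] above, applied to the "bytes"-schema, one-JSON-line-per-record layout described in the thread. It assumes Avro 1.7.x; CodecFactory.snappyCodec() and DataFileWriter.setSyncInterval are taken to be available in that release, and the file name, sample lines, and 1 MB sync interval are purely illustrative:)

    import java.io.File;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;

    public class WriteJsonLinesAsBytes {
      public static void main(String[] args) throws Exception {
        // The "bytes" schema from the thread: each record is one raw JSON line.
        Schema schema = Schema.create(Schema.Type.BYTES);

        DataFileWriter<ByteBuffer> writer =
            new DataFileWriter<>(new GenericDatumWriter<ByteBuffer>(schema));

        // Codec must be set before create(); swap in CodecFactory.snappyCodec()
        // to compare output sizes against deflate level 9.
        writer.setCodec(CodecFactory.deflateCodec(9));

        // Bigger blocks between sync markers give the codec more data to work
        // with per block (1 MB here is illustrative, not a recommendation).
        writer.setSyncInterval(1 << 20);

        writer.create(schema, new File("logs.avro"));
        for (String line : new String[] {"{\"a\": 1}", "{\"b\": 2}"}) {
          writer.append(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)));
        }
        writer.close();
      }
    }

Worth noting: the data-file codecs compress a whole block (everything between two sync markers) at once rather than each record, so with a per-line "bytes" layout the sync interval tends to matter more for the final size than the schema does.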
