You can use the Avro command-line tool to dump the metadata, which will show the schema and codec:
java -jar avro-tools.jar getmeta <file>

Doug

On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <[email protected]> wrote:
> Hey Doug,
>
> Here is a little more explanation:
> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
> I'll answer your questions later, after some investigation.
>
> Thank you!
>
>
> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <[email protected]> wrote:
>> Ruslan,
>>
>> This is unexpected. Perhaps we can understand it if we have more
>> information.
>>
>> What Writable class are you using for keys and values in the SequenceFile?
>>
>> What schema are you using in the Avro data file?
>>
>> Can you provide small sample files of each and/or code that will
>> reproduce this?
>>
>> Thanks,
>>
>> Doug
>>
>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <[email protected]>
>> wrote:
>>> Hello,
>>>
>>> In my organization we are currently evaluating Avro as a format. Our
>>> concern is file size. I've done some comparisons on a piece of our
>>> data.
>>> Say we have sequence files, compressed. The payload (the values) is
>>> just lines of text. As far as I know, we use the line number as the
>>> key, and we use the default codec for compression inside the sequence
>>> files. The size is 1.6G; when I convert it to Avro with the deflate
>>> codec at deflate level 9, it becomes 2.2G.
>>> This is interesting, because the values in the sequence files are just
>>> strings, while Avro has a proper schema with primitive types, and those
>>> are stored in binary. Shouldn't the Avro files be smaller?
>>> I also took another dataset, which is 28G (gzip files, plain
>>> tab-delimited text; I don't know the deflate level), converted it to
>>> Avro, and it became 38G.
>>> Why are the Avro files so big? Am I missing some size optimization?
>>>
>>> Thanks in advance!
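
For illustration, here is roughly what getmeta prints for a
deflate-compressed data file. The file name and schema below are
invented; avro.schema and avro.codec are the metadata keys to look for:

$ java -jar avro-tools.jar getmeta lines.avro
avro.schema	{"type":"record","name":"Line","fields":[{"name":"value","type":"string"}]}
avro.codec	deflate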
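
Along the lines of Doug's request for code that reproduces the
comparison, here is a minimal, self-contained Java sketch: it writes the
same lines once as a block-compressed SequenceFile (LongWritable keys,
Text values, default codec) and once as an Avro data file with deflate
level 9. The file names, the one-string-field schema, and the sample
data are assumptions for illustration, not taken from the actual
datasets in this thread:

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SizeComparison {
  // Assumed schema: one record per input line, with a single string field.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Line\",\"fields\":"
      + "[{\"name\":\"value\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws IOException {
    String[] lines = {"first line", "second line", "third line"};

    // SequenceFile: line number as key, line as value, default codec,
    // block compression (adjust if your files use record compression).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    SequenceFile.Writer seqWriter = SequenceFile.createWriter(
        fs, conf, new Path("lines.seq"),
        LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK);
    long lineNo = 0;
    for (String line : lines) {
      seqWriter.append(new LongWritable(lineNo++), new Text(line));
    }
    seqWriter.close();

    // Avro data file: deflate codec at level 9, one record per line.
    // setCodec must be called before create.
    DataFileWriter<GenericRecord> avroWriter =
        new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(SCHEMA));
    avroWriter.setCodec(CodecFactory.deflateCodec(9));
    avroWriter.create(SCHEMA, new File("lines.avro"));
    for (String line : lines) {
      GenericRecord record = new GenericData.Record(SCHEMA);
      record.put("value", line);
      avroWriter.append(record);
    }
    avroWriter.close();
  }
}

Running this over a real sample of the data and comparing the sizes of
lines.seq and lines.avro, together with the getmeta output above, would
provide the kind of reproduction Doug asked for.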
