Not sure if it's helpful, but Python and Ruby cat_avro utilities are
linked here:
http://hortonworks.com/blog/the-data-lifecycle-part-one-avroizing-the-enron-emails/

These print the schema and a sample of records. Should I modify them to
also print the meta information?
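
In case it's useful before I touch the utilities, here is a minimal sketch of reading the meta information without Avro 1.6's getmeta. It uses only the Python standard library and assumes the container layout from the Avro spec: the 4-byte magic "Obj" + 0x01, then a metadata map of string keys to byte values, written in blocks that end with a zero count. The names read_long, read_bytes and read_meta are my own, not part of cat_avro or avro-tools.

```python
import io

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length)."""
    shift = 0
    accum = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of data while reading a long")
        byte = chunk[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo the zigzag encoding


def read_bytes(stream):
    """Decode one Avro bytes/string value: a long length, then the payload."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data while reading bytes")
    return data


def read_meta(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro data file (bad magic)")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)  # the size is not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta
```

Opening a data file in binary mode and passing it to read_meta should give a dict whose keys include avro.codec and avro.schema, the two names visible at the start of Ruslan's file. I have not run this against his data.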

Russell Jurney
twitter.com/rjurney
[email protected]
datasyndrome.com

On Jul 5, 2012, at 3:32 PM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hey
>
> Sorry, I couldn't use getmeta. It is in Avro 1.6, and I have only 1.5 in
> my CDH distro.
> -bash-3.2$ java -jar avro-tools-1.5.4.jar getschema 000000_0.avro
> {
>  "type" : "record",
>  "name" : "TUPLE_0",
>  "fields" : [ {
>    "name" : "EventDateIgnore",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  }, {
>    "name" : "DatranClientIDIgnore",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "CreativeID",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "AgencyID",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "PlacementID",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "CookieID",
>    "type" : [ "null", "long" ],
>    "doc" : ""
>  }, {
>    "name" : "WebProfileID",
>    "type" : [ "null", "long" ],
>    "doc" : ""
>  }, {
>    "name" : "IPAddress",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  }, {
>    "name" : "ZipCode",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  }, {
>    "name" : "DMAID",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "Impressions",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "Clicks",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "PostImpressions",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "PostClicks",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "ApertureDataID",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  }, {
>    "name" : "ApertureCategoryID",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  } ]
> }
>
> Also, I can see that the file starts with:
> Objavro.codecdeflateavro.schema�{"type":"record","name":"TUPLE_0","fields"
>
> Hope that helps.
>
> Thanks
>
> On Fri, Jul 6, 2012 at 2:19 AM, Doug Cutting <[email protected]> wrote:
>> You can use the Avro command-line tool to dump the metadata, which
>> will show the schema and codec:
>>
>>  java -jar avro-tools.jar getmeta <file>
>>
>> Doug
>>
>> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <[email protected]> 
>> wrote:
>>> Hey Doug,
>>>
>>> Here is a little more explanation:
>>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>>> I'll answer your questions later, after some investigation.
>>>
>>> Thank you!
>>>
>>>
>>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <[email protected]> wrote:
>>>> Ruslan,
>>>>
>>>> This is unexpected.  Perhaps we can understand it if we have more 
>>>> information.
>>>>
>>>> What Writable class are you using for keys and values in the SequenceFile?
>>>>
>>>> What schema are you using in the Avro data file?
>>>>
>>>> Can you provide small sample files of each and/or code that will reproduce 
>>>> this?
>>>>
>>>> Thanks,
>>>>
>>>> Doug
>>>>
>>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <[email protected]> 
>>>> wrote:
>>>>> Hello,
>>>>>
>>>>> My organization is currently evaluating Avro as a format. Our
>>>>> concern is file size. I've done some comparisons on a piece of our
>>>>> data.
>>>>> Say we have compressed sequence files. The payload (the values) is
>>>>> just lines. As far as I know, we use the line number as the key and
>>>>> the default codec for compression inside the sequence files. The size
>>>>> is 1.6G; when I convert it to Avro with the deflate codec at deflate
>>>>> level 9, it becomes 2.2G.
>>>>> This is surprising, because the values in the sequence files are just
>>>>> strings, while Avro has a normal schema with primitive types, and
>>>>> those are kept in binary. Shouldn't the Avro file be smaller?
>>>>> I also took another dataset of 28G (gzip files of plain tab-delimited
>>>>> text; I don't know the deflate level) and converted it to Avro, and
>>>>> it became 38G.
>>>>> Why is the Avro output so large? Am I missing some size optimization?
>>>>>
>>>>> Thanks in advance!
