Not sure if it's helpful, but Python and Ruby cat_avro utilities are linked here: http://hortonworks.com/blog/the-data-lifecycle-part-one-avroizing-the-enron-emails/
These do schema and sample. Should I modify these to get meta information?

Russell Jurney
twitter.com/rjurney
[email protected]
datasyndrome.com

On Jul 5, 2012, at 3:32 PM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hey
>
> Sorry, couldn't use getmeta, It is in Avro 1.6, I have only 1.5 in my CDH
> distro
> -bash-3.2$ java -jar avro-tools-1.5.4.jar getschema 000000_0.avro
> {
>   "type" : "record",
>   "name" : "TUPLE_0",
>   "fields" : [ {
>     "name" : "EventDateIgnore",
>     "type" : [ "null", "string" ],
>     "doc" : ""
>   }, {
>     "name" : "DatranClientIDIgnore",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "CreativeID",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "AgencyID",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "PlacementID",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "CookieID",
>     "type" : [ "null", "long" ],
>     "doc" : ""
>   }, {
>     "name" : "WebProfileID",
>     "type" : [ "null", "long" ],
>     "doc" : ""
>   }, {
>     "name" : "IPAddress",
>     "type" : [ "null", "string" ],
>     "doc" : ""
>   }, {
>     "name" : "ZipCode",
>     "type" : [ "null", "string" ],
>     "doc" : ""
>   }, {
>     "name" : "DMAID",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "Impressions",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "Clicks",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "PostImpressions",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "PostClicks",
>     "type" : [ "null", "int" ],
>     "doc" : ""
>   }, {
>     "name" : "ApertureDataID",
>     "type" : [ "null", "string" ],
>     "doc" : ""
>   }, {
>     "name" : "ApertureCategoryID",
>     "type" : [ "null", "string" ],
>     "doc" : ""
>   } ]
> }
>
> Also I can see that the file starts with
> Objavro.codecdeflateavro.schema�{"type":"record","name":"TUPLE_0","fields"
>
> Hope that helps.
>
> Thanks
>
> On Fri, Jul 6, 2012 at 2:19 AM, Doug Cutting <[email protected]> wrote:
>> You can use the Avro command-line tool to dump the metadata, which
>> will show the schema and codec:
>>
>> java -jar avro-tools.jar getmeta <file>
>>
>> Doug
>>
>> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <[email protected]>
>> wrote:
>>> Hey Doug,
>>>
>>> Here is a little more of explanation
>>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>>> I'll answer your questions later after some investigation
>>>
>>> Thank you!
>>>
>>>
>>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <[email protected]> wrote:
>>>> Ruslan,
>>>>
>>>> This is unexpected. Perhaps we can understand it if we have more
>>>> information.
>>>>
>>>> What Writable class are you using for keys and values in the SequenceFile?
>>>>
>>>> What schema are you using in the Avro data file?
>>>>
>>>> Can you provide small sample files of each and/or code that will reproduce
>>>> this?
>>>>
>>>> Thanks,
>>>>
>>>> Doug
>>>>
>>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <[email protected]>
>>>> wrote:
>>>>> Hello,
>>>>>
>>>>> In my organization currently we are evaluating Avro as a format. Our
>>>>> concern is file size. I've done some comparisons of a piece of our
>>>>> data.
>>>>> Say we have sequence files, compressed. The payload (values) are just
>>>>> lines. As far as I know we use line number as keys and we use the
>>>>> default codec for compression inside sequence files. The size is 1.6G,
>>>>> when I put it to avro with deflate codec with deflate level 9 it
>>>>> becomes 2.2G.
>>>>> This is interesting, because the values in seq files are just string,
>>>>> but Avro has a normal schema with primitive types. And those are kept
>>>>> binary. Shouldn't Avro be less in size?
>>>>> Also I took another dataset which is 28G (gzip files, plain
>>>>> tab-delimited text, don't know what is the deflate level) and put it
>>>>> to Avro and it became 38G
>>>>> Why Avro is so big in size? Am I missing some size optimization?
>>>>>
>>>>> Thanks in advance!
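
For anyone on Avro 1.5 without getmeta: the header Ruslan quotes above (Objavro.codec, deflate, avro.schema, then the schema JSON) can be decoded by hand. Below is a minimal Python sketch, assuming the standard Avro object container layout: the 4-byte magic "Obj" plus 0x01, then a metadata map of string keys to byte values, then a 16-byte sync marker. The function names (read_long, read_bytes, read_header_meta, describe) are illustrative only; they are not part of avro-tools or of the cat_avro utilities linked at the top of the thread.

```python
import io
import json

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded variable-length integer)."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a variable-length integer")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode one length-prefixed Avro bytes/string value."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a bytes value")
    return data


def read_header_meta(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count is followed by the block size in bytes
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def describe(path):
    """Return (codec name, parsed schema) for the Avro file at path."""
    with open(path, "rb") as handle:
        meta = read_header_meta(handle)
    # A missing avro.codec entry means the "null" (uncompressed) codec.
    codec = meta.get("avro.codec", b"null").decode("utf-8")
    schema = json.loads(meta["avro.schema"].decode("utf-8"))
    return codec, schema
```

Calling describe("000000_0.avro") on the file from this thread should return the codec name together with the parsed schema, which is roughly the information getmeta prints.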
