Does anyone have any experience with Elephant Bird? It seems like it could handle these cases with ease.
Sent from my iPhone

On Mar 30, 2012, at 12:59 AM, Bill Graham <[email protected]> wrote:

> In the TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
> there's an example:
>
> STORE avro2 INTO 'output_dir'
> USING org.apache.pig.piggybank.storage.avro.AvroStorage (
> '{"schema_file": "/path/to/schema/file" ,
> "field0": "def:member_id",
> "field1": "def:browser_id",
> "field3": "def:act_content" }'
> );
>
> You specify the file that contains the schema, then you have to map the
> tuple fields to the name of the field in the avro schema. This mapping is a
> drag, but it's currently required.
>
> Note that only the json-style constructor (as opposed to the string array
> approach) supports schema_file without this uncommitted patch:
> https://issues.apache.org/jira/browse/PIG-2257
>
> thanks,
> Bill
>
> On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <[email protected]> wrote:
>
>> That's nice! Can you give me an example of how to use it? I am not able to
>> figure it out from the code. The schemaManager is only used at one place
>> after that, and that is when the params contains a "field<number>" key. I
>> don't understand that part. Is there a way I can call it simply like STORE
>> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
>>
>> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <[email protected]> wrote:
>>
>>> Yes, the schema can be in HDFS but the documentation for this is lacking.
>>> Search for 'schema_file' here:
>>>
>>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>>>
>>> and here:
>>>
>>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>>>
>>> And be aware of this open JIRA:
>>> https://issues.apache.org/jira/browse/PIG-2257
>>>
>>> And this closed one:
>>> https://issues.apache.org/jira/browse/PIG-2195
>>>
>>> :)
>>>
>>> thanks,
>>> Bill
>>>
>>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[email protected]> wrote:
>>>
>>>> The schema has to be written in the script, right? I don't think there is
>>>> any way the schema can be in a file outside the script. That was the
>>>> messiness I was talking about. Or is there a way I can write the schema in
>>>> a separate file? One way I see is to create and store a dummy file with the
>>>> schema
>>>>
>>>> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[email protected]> wrote:
>>>>
>>>>> The default value will be part of the new Avro schema definition and
>>>>> Avro should return it to you, so there shouldn't be any code messiness
>>>>> with that approach.
>>>>>
>>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[email protected]> wrote:
>>>>>
>>>>>> Ok.. you mean I can just use the newer schema to read the old schema
>>>>>> as well, by populating some default value for the missing field. I think
>>>>>> that should work, messy code though!
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:
>>>>>>
>>>>>>> If you evolved your schema to just add fields, then you should be able to
>>>>>>> use a single schema descriptor file to read both pre- and post-evolved
>>>>>>> data objects. This is because one of the rules of new fields in Avro is
>>>>>>> that they have to have a default value and be non-null. AvroStorage
>>>>>>> should pick that default field up for the old objects. If it doesn't,
>>>>>>> then that's a bug.
>>>>>>>
>>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]> wrote:
>>>>>>>
>>>>>>>> @Bill,
>>>>>>>> I did look at the option of providing input as a parameter while
>>>>>>>> initializing AvroStorage(). But even then, I'll still need to change my
>>>>>>>> script to handle the two files because I'll still need to have separate
>>>>>>>> schemas, right?
>>>>>>>>
>>>>>>>> @Stan,
>>>>>>>> Thanks for pointing me to it, it is a useful feature. But in my case, I
>>>>>>>> would never have two input files with different schemas. The input will
>>>>>>>> always have only one of the schemas, but I want my new script (with the
>>>>>>>> additional column) to be able to process the old data as well, even if
>>>>>>>> the input only contains data with the older schema.
>>>>>>>>
>>>>>>>> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> There is a patch for Avro to deal with this use case:
>>>>>>>>> https://issues.apache.org/jira/browse/PIG-2579
>>>>>>>>> (See the attached pig example which loads two avro input files with
>>>>>>>>> different schemas.)
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> stan
>>>>>>>>>
>>>>>>>>> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> I use Pig to process some clickstream data. I need to track a new
>>>>>>>>>> field, so I added a new field to my avro schema, and changed my Pig
>>>>>>>>>> script accordingly. It works fine with the new files (which have that
>>>>>>>>>> new column) but it breaks when I run it on my old files which do not
>>>>>>>>>> have that column in the schema (since avro stores the schema in the
>>>>>>>>>> data files themselves). I was expecting that Pig would assume the
>>>>>>>>>> field to be null if that particular field does not exist. But now I am
>>>>>>>>>> having to maintain separate scripts to process the old and new files.
>>>>>>>>>> Is there any workaround for this? Because I figure I'll have to add
>>>>>>>>>> new columns frequently and I don't want to maintain a separate script
>>>>>>>>>> for each window where the schema is constant.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>
>>>>>>> --
>>>>>>> *Note that I'm no longer using my Yahoo! email address. Please email
>>>>>>> me at [email protected] going forward.*
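
For anyone skimming the thread, here is a rough sketch of the default-value rule Bill describes in the quoted messages: when data written with the old schema is read with the new one, a field the old data lacks takes the default declared in the new schema. This is plain Python on dicts, not the Avro library and not AvroStorage. The "referrer" field and the resolve_record helper are made up for illustration; only member_id, browser_id and act_content come from the STORE example.

```python
# Simplified illustration only: this is NOT the Avro library, just the
# reader-schema default rule applied to plain dicts.

_MISSING = object()  # sentinel: tells "no default declared" apart from a default of None


def resolve_record(reader_fields, old_record):
    """Project a record written with an older schema onto the reader's fields."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in old_record:
            # The writer knew this field: keep the value as written.
            resolved[name] = old_record[name]
            continue
        default = field.get("default", _MISSING)
        if default is _MISSING:
            # No value and no default: the read cannot succeed.
            raise ValueError("field %r is missing and has no default" % name)
        resolved[name] = default
    return resolved


# Hypothetical reader schema: the three fields from the STORE example plus
# one invented field, "referrer", added later with a default.
NEW_FIELDS = [
    {"name": "member_id", "type": "long"},
    {"name": "browser_id", "type": "string"},
    {"name": "act_content", "type": "string"},
    {"name": "referrer", "type": "string", "default": "unknown"},
]

old_row = {"member_id": 42, "browser_id": "b-1", "act_content": "click"}
new_row = resolve_record(NEW_FIELDS, old_row)
print(new_row["referrer"])
```

If the new field had no default, the same call would raise instead of filling in a value, which is presumably why adding fields without defaults breaks reads of old files.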
