Yes, the schema can be in HDFS but the documentation for this is lacking. Search for 'schema_file' here:
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java

and here:
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java

And be aware of this open JIRA:
https://issues.apache.org/jira/browse/PIG-2257

And this closed one:
https://issues.apache.org/jira/browse/PIG-2195

:)

thanks,
Bill

On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[email protected]> wrote:

> The schema has to be written in the script, right? I don't think there is
> any way the schema can be in a file outside the script. That was the
> messiness I was talking about. Or is there a way I can write the schema in
> a separate file? One way I see is to create and store a dummy file with the
> schema
>
> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[email protected]> wrote:
>
>> The default value will be part of the new Avro schema definition and Avro
>> should return it to you, so there shouldn't be any code messiness with
>> that approach.
>>
>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[email protected]> wrote:
>>
>>> OK, you mean I can just use the newer schema to read the old schema as
>>> well, by populating some default value for the missing field. I think
>>> that should work, messy code though!
>>>
>>> Thanks!
>>>
>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:
>>>
>>>> If you evolved your schema to just add fields, then you should be able
>>>> to use a single schema descriptor file to read both pre- and
>>>> post-evolved data objects. This is because one of the rules of new
>>>> fields in Avro is that they have to have a default value and be
>>>> non-null. AvroStorage should pick that default field up for the old
>>>> objects. If it doesn't, then that's a bug.
>>>>
>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]> wrote:
>>>>
>>>> > @Bill,
>>>> > I did look at the option of providing input as a parameter while
>>>> > initializing AvroStorage(). But even then, I'll still need to change
>>>> > my script to handle the two files, because I'll still need to have
>>>> > separate schemas, right?
>>>> >
>>>> > @Stan,
>>>> > Thanks for pointing me to it, it is a useful feature. But in my case,
>>>> > I would never have two input files with different schemas. The input
>>>> > will always have only one of the schemas, but I want my new script
>>>> > (with the additional column) to be able to process the old data as
>>>> > well, even if the input only contains data with the older schema.
>>>> >
>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
>>>> >
>>>> > > There is a patch for Avro to deal with this use case:
>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>> > > (See the attached pig example which loads two avro input files with
>>>> > > different schemas.)
>>>> > >
>>>> > > Best,
>>>> > >
>>>> > > stan
>>>> > >
>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]> wrote:
>>>> > > > Hi guys,
>>>> > > >
>>>> > > > I use Pig to process some clickstream data. I need to track a new
>>>> > > > field, so I added a new field to my Avro schema and changed my Pig
>>>> > > > script accordingly. It works fine with the new files (which have
>>>> > > > that new column), but it breaks when I run it on my old files,
>>>> > > > which do not have that column in the schema (since Avro stores the
>>>> > > > schema in the data files themselves). I was expecting that Pig
>>>> > > > would assume the field to be null if that particular field does
>>>> > > > not exist. But now I am having to maintain separate scripts to
>>>> > > > process the old and new files. Is there any workaround for this?
>>>> > > > I figure I'll have to add new columns frequently, and I don't want
>>>> > > > to maintain a separate script for each window where the schema is
>>>> > > > constant.
>>>> > > >
>>>> > > > Thanks,
>>>>
>>>> --
>>>> *Note that I'm no longer using my Yahoo! email address. Please email me
>>>> at [email protected] going forward.*
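
For anyone following the thread, the default-value behaviour discussed above can be sketched in a few lines of plain Python. This is only an illustration of the rule (a field that is in the reader schema but missing from the old data gets the reader's default), not AvroStorage or the Avro library itself. The record name, the field names, and the resolve() helper are all made up for the example.

```python
import json

# Hypothetical evolved (reader) schema: "referrer" is the newly added field.
# All names here are assumptions for illustration, not taken from the thread.
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Click",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "user_id", "type": "long"},
    {"name": "referrer", "type": "string", "default": "unknown"}
  ]
}
""")


def resolve(record, reader_schema):
    """Project an already-decoded record onto the reader schema.

    A field missing from the record takes the reader's default. A missing
    field with no default is an error, which is why added fields need one.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError("no default for missing field: " + name)
    return resolved


# A record written before the field was added, and one written after.
old_record = {"url": "http://example.com/", "user_id": 42}
new_record = {"url": "http://example.com/a", "user_id": 7, "referrer": "search"}

print(resolve(old_record, READER_SCHEMA))
print(resolve(new_record, READER_SCHEMA))
```

With a single evolved schema, both the old and the new record come out with the same set of fields, so the script reading them should not need a separate code path for the old files.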
