That's nice! Can you give me an example of how to use it? I am not able to
figure it out from the code. The schemaManager is only used in one place
after that, and that is when the params contain a "field<number>" key. I
don't understand that part. Is there a way I can call it simply, like STORE
xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
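From a quick read of TestAvroStorage, my guess is that the options are passed as
separate key/value arguments rather than a single 'key=value' string, so the call
would look something like the sketch below. The path is just a placeholder and the
exact syntax is only my guess:

    -- guessed syntax: point AvroStorage at a schema file kept in HDFS
    STORE xyz INTO 'abc'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'schema_file', 'hdfs://namenode/path/to/schema/file');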
On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <[email protected]> wrote:

> Yes, the schema can be in HDFS, but the documentation for this is lacking.
> Search for 'schema_file' here:
>
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>
> and here:
>
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>
> And be aware of this open JIRA:
> https://issues.apache.org/jira/browse/PIG-2257
>
> And this closed one:
> https://issues.apache.org/jira/browse/PIG-2195
>
> :)
>
> thanks,
> Bill
>
> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[email protected]> wrote:
>
>> The schema has to be written in the script, right? I don't think there is
>> any way the schema can be in a file outside the script. That was the
>> messiness I was talking about. Or is there a way I can write the schema
>> in a separate file? One way I see is to create and store a dummy file
>> with the schema.
>>
>> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[email protected]> wrote:
>>
>>> The default value will be part of the new Avro schema definition and
>>> Avro should return it to you, so there shouldn't be any code messiness
>>> with that approach.
>>>
>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[email protected]> wrote:
>>>
>>>> OK, you mean I can just use the newer schema to read the old data as
>>>> well, by populating some default value for the missing field. I think
>>>> that should work, messy code though!
>>>>
>>>> Thanks!
>>>>
>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:
>>>>
>>>>> If you evolved your schema to just add fields, then you should be
>>>>> able to use a single schema descriptor file to read both pre- and
>>>>> post-evolved data objects. This is because one of the rules of new
>>>>> fields in Avro is that they have to have a default value and be
>>>>> non-null. AvroStorage should pick that default field up for the old
>>>>> objects. If it doesn't, then that's a bug.
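To illustrate the rule Bill describes above (the record and field names here are
invented), the evolved schema would declare the added field with a non-null default.
The same JSON could live in the file named by 'schema_file', or be passed inline via
the 'schema' parameter that TestAvroStorage exercises; roughly:

    -- sketch: evolved schema with a defaulted new field, passed inline
    -- (the same JSON could instead live in the file named by 'schema_file')
    STORE clicks INTO '/data/clicks_v2'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'schema',
            '{"type":"record","name":"Click","fields":[{"name":"url","type":"string"},{"name":"campaign","type":"string","default":""}]}');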
>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]> wrote:
>>>>>
>>>>> > @Bill,
>>>>> > I did look at the option of providing input as a parameter while
>>>>> > initializing AvroStorage(). But even then, I'll still need to change
>>>>> > my script to handle the two files, because I'll still need to have
>>>>> > separate schemas, right?
>>>>> >
>>>>> > @Stan,
>>>>> > Thanks for pointing me to it; it is a useful feature. But in my
>>>>> > case, I would never have two input files with different schemas.
>>>>> > The input will always have only one of the schemas, but I want my
>>>>> > new script (with the additional column) to be able to process the
>>>>> > old data as well, even if the input only contains data with the
>>>>> > older schema.
>>>>> >
>>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
>>>>> >
>>>>> > > There is a patch for Avro to deal with this use case:
>>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>>> > > (See the attached pig example, which loads two avro input files
>>>>> > > with different schemas.)
>>>>> > >
>>>>> > > Best,
>>>>> > >
>>>>> > > stan
>>>>> > >
>>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]> wrote:
>>>>> > > >
>>>>> > > > Hi guys,
>>>>> > > >
>>>>> > > > I use Pig to process some clickstream data. I need to track a
>>>>> > > > new field, so I added a new field to my avro schema and changed
>>>>> > > > my Pig script accordingly. It works fine with the new files
>>>>> > > > (which have that new column), but it breaks when I run it on my
>>>>> > > > old files, which do not have that column in the schema (since
>>>>> > > > avro stores the schema in the data files itself). I was
>>>>> > > > expecting that Pig would assume the field to be null if that
>>>>> > > > particular field does not exist. But now I am having to maintain
>>>>> > > > separate scripts to process the old and new files. Is there any
>>>>> > > > way to work around this? Because I figure I'll have to add new
>>>>> > > > columns frequently, and I don't want to maintain a separate
>>>>> > > > script for each window where the schema is constant.
>>>>> > > >
>>>>> > > > Thanks,
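Tying the thread together, the single script asked for in the original question might
then look roughly like the sketch below. The paths and aliases are invented, and it
assumes the loader honors the same 'schema_file' option; the open JIRA Bill mentions
above (PIG-2257) may affect this:

    -- sketch: one script over a mix of old and new files
    clicks = LOAD '/clickstream/*'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'schema_file', 'hdfs://namenode/schemas/click_v2.avsc');

    -- records written before 'campaign' existed should surface the schema's
    -- default value for it instead of breaking the script
    by_campaign = GROUP clicks BY campaign;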
