The schema has to be written in the script, right? I don't think there is any way the schema can live in a file outside the script; that was the messiness I was talking about. Or is there a way to keep the schema in a separate file? One option I can see is to create and store a dummy Avro file that carries the schema.
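
To make it concrete, what I'd like is something along these lines: keep the schema in its own .avsc file and feed it to the script, rather than pasting the JSON into the Pig script. This is only a sketch; the file name, paths and relation name are made up, and it assumes the piggybank AvroStorage will accept the output schema as a JSON string argument:

-- Sketch only: click_event.avsc, the $INPUT/$OUTPUT paths and the relation
-- name are all made up, and the piggybank and Avro jars are assumed to be
-- REGISTERed already. The schema stays in its own file and is injected at
-- launch time via parameter substitution, e.g.:
--
--   pig -param AVRO_SCHEMA="$(tr -d '\n' < click_event.avsc)" \
--       -param INPUT=/data/clicks -param OUTPUT=/data/clicks_out process_clicks.pig

clicks = LOAD '$INPUT'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- ... transformations on clicks ...

-- Assumes AvroStorage takes an output schema passed as a JSON argument
-- (its documented '{"schema": ...}' form); $AVRO_SCHEMA expands to the raw
-- contents of click_event.avsc, so the JSON never has to live in this script.
STORE clicks INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage('{"schema": $AVRO_SCHEMA}');

That way, adding a column would only mean editing click_event.avsc (plus whatever the script actually does with the new field).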
On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[email protected]> wrote:

> The default value will be part of the new Avro schema definition and Avro
> should return it to you, so there shouldn't be any code messiness with that
> approach.
>
> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[email protected]> wrote:
>
>> Ok.. you mean I can just use the newer schema to read the old data as
>> well, by populating some default value for the missing field. I think
>> that should work, messy code though!
>>
>> Thanks!
>>
>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:
>>
>>> If you evolved your schema to just add fields, then you should be able
>>> to use a single schema descriptor file to read both pre- and post-evolved
>>> data objects. This is because one of the rules for new fields in Avro is
>>> that they have to have a default value and be non-null. AvroStorage
>>> should pick that default up for the old objects. If it doesn't, then
>>> that's a bug.
>>>
>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]> wrote:
>>>
>>>> @Bill,
>>>> I did look at the option of providing the schema as a parameter while
>>>> initializing AvroStorage(). But even then, I'll still need to change my
>>>> script to handle the two kinds of files, because I'll still need to have
>>>> separate schemas, right?
>>>>
>>>> @Stan,
>>>> Thanks for pointing me to it; it is a useful feature. But in my case, I
>>>> would never have two input files with different schemas. The input will
>>>> always have only one of the schemas, but I want my new script (with the
>>>> additional column) to be able to process the old data as well, even if
>>>> the input only contains data with the older schema.
>>>>
>>>> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
>>>>
>>>>> There is a patch for Avro to deal with this use case:
>>>>> https://issues.apache.org/jira/browse/PIG-2579
>>>>> (See the attached pig example, which loads two Avro input files with
>>>>> different schemas.)
>>>>>
>>>>> Best,
>>>>>
>>>>> stan
>>>>>
>>>>> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I use Pig to process some clickstream data. I need to track a new
>>>>>> field, so I added a new field to my Avro schema and changed my Pig
>>>>>> script accordingly. It works fine with the new files (which have that
>>>>>> new column), but it breaks when I run it on my old files, which do not
>>>>>> have that column in the schema (since Avro stores the schema in the
>>>>>> data files themselves). I was expecting that Pig would assume the
>>>>>> field to be null if that particular field does not exist. Instead, I
>>>>>> am now having to maintain separate scripts to process the old and new
>>>>>> files. Is there any workaround for this? I figure I'll be adding new
>>>>>> columns frequently, and I don't want to maintain a separate script for
>>>>>> each window where the schema is constant.
>>>>>>
>>>>>> Thanks,
>>>
>>> --
>>> Note that I'm no longer using my Yahoo! email address. Please email me
>>> at [email protected] going forward.
>
> --
> Note that I'm no longer using my Yahoo! email address. Please email me
> at [email protected] going forward.
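
P.S. For concreteness, my reading of what Bill describes is an evolved schema along these lines (the record and field names are just an example, not my real schema), where the added field carries a default:

{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "url",       "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer",  "type": "string", "default": ""}
  ]
}

When this newer schema is used to read an old file that has no referrer field, Avro is supposed to fill in the default ("") rather than fail, which is the behavior Bill says AvroStorage should give me.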
