Re: Working with changing schemas (avro) in Pig

IGZ Nick Wed, 28 Mar 2012 16:02:25 -0700

OK, so you mean I can just use the newer schema to read the old data as
well, with a default value populated for the missing field. I think that
should work, though the code will be messy!
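
To check my understanding, here is a minimal sketch of the idea in plain
Python. It is not Avro or AvroStorage code: the `Click` schema, its field
names, and the `resolve` helper are all made up for illustration, and it
only mimics the one rule discussed here (a field missing from an old
record takes the default declared in the newer schema).

```python
# Hypothetical reader (newer) schema. "referrer" is the newly added field;
# it declares a default so that records written before it existed still resolve.
READER_SCHEMA = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, reader_schema):
    """Project an already-decoded record (a dict) onto the reader schema.

    Fields present in the record are copied; fields absent from the record
    take the reader schema's default; a missing field with no default is
    an error.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError("no value and no default for field %r" % name)
    return resolved


# An "old" record written before the field was added, and a "new" one.
old_record = {"user_id": "u1", "url": "/home"}
new_record = {"user_id": "u2", "url": "/cart", "referrer": "/home"}

old_resolved = resolve(old_record, READER_SCHEMA)
new_resolved = resolve(new_record, READER_SCHEMA)
```

If that is roughly what happens when reading, then a single script written
against the newer schema should see the `referrer` column in both the old
and the new files.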


Thanks!

On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:

> If you evolved your schema to just add fields, then you should be able to
> use a single schema descriptor file to read both pre- and post-evolved data
> objects. This is because Avro's schema resolution requires a new field to
> declare a default value if old data is to remain readable (the default can
> be null only when the field's type is a union that includes null).
> AvroStorage should pick that default value up for the old objects. If it
> doesn't, then that's a bug.
>
>
> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]> wrote:
>
> > @Bill,
> > I did look at the option of providing input as a parameter while
> > initializing AvroStorage(). But even then, I'll still need to change my
> > script to handle the two files because I'll still need to have separate
> > schemas right?
> >
> > @Stan,
> > Thanks for pointing me to it, it is a useful feature. But in my case, I
> > would never have two input files with different schemas. The input will
> > always have only one of the schemas, but I want my new script (with the
> > additional column) to be able to process the old data as well, even if
> > the input only contains data with the older schema.
> >
> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
> >
> > > There is a patch for Avro to deal with this use case:
> > > https://issues.apache.org/jira/browse/PIG-2579
> > > (See the attached pig example which loads two avro input files with
> > > different schemas.)
> > >
> > > Best,
> > >
> > > stan
> > >
> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]> wrote:
> > > > Hi guys,
> > > >
> > > > I use Pig to process some clickstream data. I need to track a new
> > > > field, so I added a new field to my Avro schema and changed my Pig
> > > > script accordingly. It works fine with the new files (which have that
> > > > new column), but it breaks when I run it on my old files, which do not
> > > > have that column in the schema (since Avro stores the schema in the
> > > > data files themselves). I was expecting that Pig would assume the
> > > > field to be null if that particular field does not exist. But now I am
> > > > having to maintain separate scripts to process the old and new files.
> > > > Is there any workaround for this? I figure I'll have to add new
> > > > columns frequently, and I don't want to maintain a separate script for
> > > > each window where the schema is constant.
> > > >
> > > > Thanks,
> > >
> >
>
>
>
> --
> *Note that I'm no longer using my Yahoo! email address. Please email me at
> [email protected] going forward.*
>
