The schema has to be written in the script, right? I don't think there is any way the schema can live in a file outside the script; that was the messiness I was talking about. Or is there a way to keep the schema in a separate file? One option I can see is to create and store a dummy Avro file that carries the schema.
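
To make it concrete, what I'd like is something along these lines: keep the schema in its own .avsc file and feed it to the script, rather than pasting the JSON into the Pig script. This is only a sketch; the file name, paths and relation name are made up, and it assumes the piggybank AvroStorage will accept the output schema as a JSON string argument:

-- Sketch only: click_event.avsc, the $INPUT/$OUTPUT paths and the relation
-- name are all made up, and the piggybank and Avro jars are assumed to be
-- REGISTERed already. The schema stays in its own file and is injected at
-- launch time via parameter substitution, e.g.:
--
--   pig -param AVRO_SCHEMA="$(tr -d '\n' < click_event.avsc)" \
--       -param INPUT=/data/clicks -param OUTPUT=/data/clicks_out process_clicks.pig

clicks = LOAD '$INPUT'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- ... transformations on clicks ...

-- Assumes AvroStorage takes an output schema passed as a JSON argument
-- (its documented '{"schema": ...}' form); $AVRO_SCHEMA expands to the raw
-- contents of click_event.avsc, so the JSON never has to live in this script.
STORE clicks INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage('{"schema": $AVRO_SCHEMA}');

That way, adding a column would only mean editing click_event.avsc (plus whatever the script actually does with the new field).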
On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[email protected]> wrote:

> The default value will be part of the new Avro schema definition and Avro
> should return it to you, so there shouldn't be any code messiness with that
> approach.
>
> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[email protected]> wrote:
>
>> Ok.. you mean I can just use the newer schema to read the old data as
>> well, by populating some default value for the missing field. I think
>> that should work, messy code though!
>>
>> Thanks!
>>
>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:
>>
>>> If you evolved your schema to just add fields, then you should be able
>>> to use a single schema descriptor file to read both pre- and post-evolved
>>> data objects. This is because one of the rules for new fields in Avro is
>>> that they have to have a default value and be non-null. AvroStorage
>>> should pick that default up for the old objects. If it doesn't, then
>>> that's a bug.
>>>
>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]> wrote:
>>>
>>>> @Bill,
>>>> I did look at the option of providing the schema as a parameter while
>>>> initializing AvroStorage(). But even then, I'll still need to change my
>>>> script to handle the two kinds of files, because I'll still need to have
>>>> separate schemas, right?
>>>>
>>>> @Stan,
>>>> Thanks for pointing me to it; it is a useful feature. But in my case, I
>>>> would never have two input files with different schemas. The input will
>>>> always have only one of the schemas, but I want my new script (with the
>>>> additional column) to be able to process the old data as well, even if
>>>> the input only contains data with the older schema.
>>>>
>>>> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
>>>>
>>>>> There is a patch for Avro to deal with this use case:
>>>>> https://issues.apache.org/jira/browse/PIG-2579
>>>>> (See the attached pig example, which loads two Avro input files with
>>>>> different schemas.)
>>>>>
>>>>> Best,
>>>>>
>>>>> stan
>>>>>
>>>>> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I use Pig to process some clickstream data. I need to track a new
>>>>>> field, so I added a new field to my Avro schema and changed my Pig
>>>>>> script accordingly. It works fine with the new files (which have that
>>>>>> new column), but it breaks when I run it on my old files, which do not
>>>>>> have that column in the schema (since Avro stores the schema in the
>>>>>> data files themselves). I was expecting that Pig would assume the
>>>>>> field to be null if that particular field does not exist. Instead, I
>>>>>> am now having to maintain separate scripts to process the old and new
>>>>>> files. Is there any workaround for this? I figure I'll be adding new
>>>>>> columns frequently, and I don't want to maintain a separate script for
>>>>>> each window where the schema is constant.
>>>>>>
>>>>>> Thanks,
>>>
>>> --
>>> Note that I'm no longer using my Yahoo! email address. Please email me
>>> at [email protected] going forward.
>
> --
> Note that I'm no longer using my Yahoo! email address. Please email me
> at [email protected] going forward.
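
P.S. For concreteness, my reading of what Bill describes is an evolved schema along these lines (the record and field names are just an example, not my real schema), where the added field carries a default:

{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "url",       "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer",  "type": "string", "default": ""}
  ]
}

When this newer schema is used to read an old file that has no referrer field, Avro is supposed to fill in the default ("") rather than fail, which is the behavior Bill says AvroStorage should give me.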
