That's nice! Can you give me an example of how to use it? I am not able to
figure it out from the code. The schemaManager is only used in one place
after that, and that is when the params contain a "field<number>" key. I
don't understand that part. Is there a way I can call it simply, like STORE
xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
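From a quick read of TestAvroStorage, my guess is that the options are passed as
separate key/value arguments rather than a single 'key=value' string, so the call
would look something like the sketch below. The path is just a placeholder and the
exact syntax is only my guess:

    -- guessed syntax: point AvroStorage at a schema file kept in HDFS
    STORE xyz INTO 'abc'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'schema_file', 'hdfs://namenode/path/to/schema/file');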
On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <[email protected]> wrote:

> Yes, the schema can be in HDFS, but the documentation for this is lacking.
> Search for 'schema_file' here:
>
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>
> and here:
>
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>
> And be aware of this open JIRA:
> https://issues.apache.org/jira/browse/PIG-2257
>
> And this closed one:
> https://issues.apache.org/jira/browse/PIG-2195
>
> :)
>
> thanks,
> Bill
>
> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[email protected]> wrote:
>
>> The schema has to be written in the script, right? I don't think there is
>> any way the schema can be in a file outside the script. That was the
>> messiness I was talking about. Or is there a way I can write the schema
>> in a separate file? One way I see is to create and store a dummy file
>> with the schema.
>>
>> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[email protected]> wrote:
>>
>>> The default value will be part of the new Avro schema definition and
>>> Avro should return it to you, so there shouldn't be any code messiness
>>> with that approach.
>>>
>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[email protected]> wrote:
>>>
>>>> OK, you mean I can just use the newer schema to read the old data as
>>>> well, by populating some default value for the missing field. I think
>>>> that should work, messy code though!
>>>>
>>>> Thanks!
>>>>
>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:
>>>>
>>>>> If you evolved your schema to just add fields, then you should be
>>>>> able to use a single schema descriptor file to read both pre- and
>>>>> post-evolved data objects. This is because one of the rules of new
>>>>> fields in Avro is that they have to have a default value and be
>>>>> non-null. AvroStorage should pick that default field up for the old
>>>>> objects. If it doesn't, then that's a bug.
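To illustrate the rule Bill describes above (the record and field names here are
invented), the evolved schema would declare the added field with a non-null default.
The same JSON could live in the file named by 'schema_file', or be passed inline via
the 'schema' parameter that TestAvroStorage exercises; roughly:

    -- sketch: evolved schema with a defaulted new field, passed inline
    -- (the same JSON could instead live in the file named by 'schema_file')
    STORE clicks INTO '/data/clicks_v2'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'schema',
            '{"type":"record","name":"Click","fields":[{"name":"url","type":"string"},{"name":"campaign","type":"string","default":""}]}');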
>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]> wrote:
>>>>>
>>>>> > @Bill,
>>>>> > I did look at the option of providing input as a parameter while
>>>>> > initializing AvroStorage(). But even then, I'll still need to change
>>>>> > my script to handle the two files, because I'll still need to have
>>>>> > separate schemas, right?
>>>>> >
>>>>> > @Stan,
>>>>> > Thanks for pointing me to it; it is a useful feature. But in my
>>>>> > case, I would never have two input files with different schemas.
>>>>> > The input will always have only one of the schemas, but I want my
>>>>> > new script (with the additional column) to be able to process the
>>>>> > old data as well, even if the input only contains data with the
>>>>> > older schema.
>>>>> >
>>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
>>>>> >
>>>>> > > There is a patch for Avro to deal with this use case:
>>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>>> > > (See the attached pig example, which loads two avro input files
>>>>> > > with different schemas.)
>>>>> > >
>>>>> > > Best,
>>>>> > >
>>>>> > > stan
>>>>> > >
>>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]> wrote:
>>>>> > > >
>>>>> > > > Hi guys,
>>>>> > > >
>>>>> > > > I use Pig to process some clickstream data. I need to track a
>>>>> > > > new field, so I added a new field to my avro schema and changed
>>>>> > > > my Pig script accordingly. It works fine with the new files
>>>>> > > > (which have that new column), but it breaks when I run it on my
>>>>> > > > old files, which do not have that column in the schema (since
>>>>> > > > avro stores the schema in the data files itself). I was
>>>>> > > > expecting that Pig would assume the field to be null if that
>>>>> > > > particular field does not exist. But now I am having to maintain
>>>>> > > > separate scripts to process the old and new files. Is there any
>>>>> > > > way to work around this? Because I figure I'll have to add new
>>>>> > > > columns frequently, and I don't want to maintain a separate
>>>>> > > > script for each window where the schema is constant.
>>>>> > > >
>>>>> > > > Thanks,
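Tying the thread together, the single script asked for in the original question might
then look roughly like the sketch below. The paths and aliases are invented, and it
assumes the loader honors the same 'schema_file' option; the open JIRA Bill mentions
above (PIG-2257) may affect this:

    -- sketch: one script over a mix of old and new files
    clicks = LOAD '/clickstream/*'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'schema_file', 'hdfs://namenode/schemas/click_v2.avsc');

    -- records written before 'campaign' existed should surface the schema's
    -- default value for it instead of breaking the script
    by_campaign = GROUP clicks BY campaign;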
