In the TestAvroStorage test testRecordWithFieldSchemaFromTextWithSchemaFile
there's an example:

STORE avro2 INTO 'output_dir'
USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    '{"schema_file": "/path/to/schema/file",
      "field0": "def:member_id",
      "field1": "def:browser_id",
      "field3": "def:act_content"}'
);

You specify the file that contains the schema, and then you have to map each
tuple field to the name of the corresponding field in the Avro schema. This
mapping is a drag, but it's currently required.
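
For reference, the file pointed to by "schema_file" is just a plain Avro
record schema in JSON. A minimal sketch (the record name and field types here
are hypothetical; only the field names line up with the "def:..." values in
the mapping above) might look like:

```json
{
  "type": "record",
  "name": "Member",
  "fields": [
    {"name": "member_id",   "type": "int"},
    {"name": "browser_id",  "type": "string"},
    {"name": "act_content", "type": ["null", "string"], "default": null}
  ]
}
```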

Note that only the JSON-style constructor (as opposed to the string-array
approach) supports schema_file without this uncommitted patch:
https://issues.apache.org/jira/browse/PIG-2257
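
For completeness, a minimal end-to-end sketch using the JSON-style
constructor (the relation name and paths are hypothetical placeholders):

```pig
-- Load the input; on the load side AvroStorage can read the schema
-- embedded in the .avro data files, so no arguments are needed here.
avro2 = LOAD 'input_dir'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Store with an explicit output schema file plus the positional
-- field mapping described above.
STORE avro2 INTO 'output_dir'
USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    '{"schema_file": "/path/to/schema/file",
      "field0": "def:member_id",
      "field1": "def:browser_id",
      "field3": "def:act_content"}'
);
```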


thanks,
Bill

On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <[email protected]> wrote:

> That's nice! Can you give me an example of how to use it? I am not able to
> figure it out from the code. The schemaManager is only used at one place
> after that, and that is when the params contain a "field<number>" key. I
> don't understand that part. Is there a way I can call it simply like STORE
> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
>
>
>
> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <[email protected]> wrote:
>
>> Yes, the schema can be in HDFS but the documentation for this is lacking.
>> Search for 'schema_file' here:
>>
>>
>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>>
>> and here:
>>
>>
>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>>
>> And be aware of this open JIRA:
>> https://issues.apache.org/jira/browse/PIG-2257
>>
>> And this closed one:
>> https://issues.apache.org/jira/browse/PIG-2195
>>
>> :)
>>
>> thanks,
>> Bill
>>
>>
>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[email protected]> wrote:
>>
>>> The schema has to be written in the script, right? I don't think there is
>>> any way the schema can be in a file outside the script. That was the
>>> messiness I was talking about. Or is there a way I can write the schema in
>>> a separate file? One way I see is to create and store a dummy file with
>>> the schema.
>>>
>>>
>>> Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[email protected]> wrote:
>>>
>>>> The default value will be part of the new Avro schema definition and
>>>> Avro should return it to you, so there shouldn't be any code messiness
>>>> with that approach.
>>>>
>>>>
>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[email protected]> wrote:
>>>>
>>>>> OK, you mean I can just use the newer schema to read the old schema
>>>>> as well, by populating some default value for the missing field. I think
>>>>> that should work, messy code though!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:
>>>>>
>>>>>> If you evolved your schema to just add fields, then you should be
>>>>>> able to use a single schema descriptor file to read both pre- and
>>>>>> post-evolved data objects. This is because one of the rules of new
>>>>>> fields in Avro is that they have to have a default value and be
>>>>>> non-null. AvroStorage should pick that default field up for the old
>>>>>> objects. If it doesn't, then that's a bug.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> > @Bill,
>>>>>> > I did look at the option of providing input as a parameter while
>>>>>> > initializing AvroStorage(). But even then, I'll still need to change
>>>>>> > my script to handle the two files because I'll still need to have
>>>>>> > separate schemas, right?
>>>>>> >
>>>>>> > @Stan,
>>>>>> > Thanks for pointing me to it, it is a useful feature. But in my
>>>>>> > case, I would never have two input files with different schemas. The
>>>>>> > input will always have only one of the schemas, but I want my new
>>>>>> > script (with the additional column) to be able to process the old
>>>>>> > data as well, even if the input only contains data with the older
>>>>>> > schema.
>>>>>> >
>>>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
>>>>>> >
>>>>>> > > There is a patch for AvroStorage to deal with this use case:
>>>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>>>> > > (See the attached pig example which loads two avro input files
>>>>>> with
>>>>>> > > different schemas.)
>>>>>> > >
>>>>>> > > Best,
>>>>>> > >
>>>>>> > > stan
>>>>>> > >
>>>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]>
>>>>>> wrote:
>>>>>> > > > Hi guys,
>>>>>> > > >
>>>>>> > > > I use Pig to process some clickstream data. I need to track a
>>>>>> > > > new field, so I added a new field to my Avro schema and changed
>>>>>> > > > my Pig script accordingly. It works fine with the new files
>>>>>> > > > (which have that new column) but it breaks when I run it on my
>>>>>> > > > old files, which do not have that column in the schema (since
>>>>>> > > > Avro stores the schema in the data files themselves). I was
>>>>>> > > > expecting that Pig would assume the field to be null if that
>>>>>> > > > particular field does not exist. But now I am having to maintain
>>>>>> > > > separate scripts to process the old and new files. Is there any
>>>>>> > > > workaround for this? I figure I'll have to add new columns
>>>>>> > > > frequently, and I don't want to maintain a separate script for
>>>>>> > > > each window where the schema is constant.
>>>>>> > > >
>>>>>> > > > Thanks,
>>>>>> > >
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>


-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
[email protected] going forward.*
