Re: Working with changing schemas (avro) in Pig

Alex Rovner Sun, 01 Apr 2012 09:33:11 -0700

Does anyone have any experience with Elephant Bird? It seems like it can
handle these cases with ease.


On Mar 30, 2012, at 12:59 AM, Bill Graham <[email protected]> wrote:

> In the TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
> there's an example:
> 
> STORE avro2 INTO 'output_dir'
> USING org.apache.pig.piggybank.storage.avro.AvroStorage (
> '{"schema_file": "/path/to/schema/file" ,
> "field0": "def:member_id",
> "field1": "def:browser_id",
> "field3": "def:act_content" }'
> );
> 
> You specify the file that contains the schema, then you have to map each
> tuple field (field0, field1, ...) to the name of the matching field in the
> avro schema. This mapping is a drag, but it's currently required.
> 
> Note that only the json-style constructor (as opposed to the string array
> approach) supports schema_file without this uncommitted patch:
> https://issues.apache.org/jira/browse/PIG-2257
> 
> 
> thanks,
> Bill
> 
> On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <[email protected]> wrote:
> 
>> That's nice! Can you give me an example of how to use it? I am not able to
>> figure it out from the code. The schemaManager is only used at one place
>> after that, and that is when the params contains a "field<number>" key. I
>> don't understand that part. Is there a way I can call it simply like STORE
>> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
>> 
>> 
>> 
>> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <[email protected]> wrote:
>> 
>>> Yes, the schema can be in HDFS but the documentation for this is lacking.
>>> Search for 'schema_file' here:
>>> 
>>> 
>>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>>> 
>>> and here:
>>> 
>>> 
>>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>>> 
>>> And be aware of this open JIRA:
>>> https://issues.apache.org/jira/browse/PIG-2257
>>> 
>>> And this closed one:
>>> https://issues.apache.org/jira/browse/PIG-2195
>>> 
>>> :)
>>> 
>>> thanks,
>>> Bill
>>> 
>>> 
>>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[email protected]> wrote:
>>> 
>>>> The schema has to be written in the script, right? I don't think there is
>>>> any way the schema can be in a file outside the script. That was the
>>>> messiness I was talking about. Or is there a way I can write the schema in
>>>> a separate file? One way I see is to create and store a dummy file with
>>>> the schema.
>>>> 
>>>> 
>>>> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[email protected]> wrote:
>>>> 
>>>>> The default value will be part of the new Avro schema definition and
>>>>> Avro should return it to you, so there shouldn't be any code messiness
>>>>> with that approach.
>>>>> 
>>>>> 
>>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[email protected]> wrote:
>>>>> 
>>>>>> Ok.. you mean I can just use the newer schema to read the old schema
>>>>>> as well, by populating some default value for the missing field. I think
>>>>>> that should work, messy code though!
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[email protected]> wrote:
>>>>>> 
>>>>>>> If you evolved your schema to just add fields, then you should be able
>>>>>>> to use a single schema descriptor file to read both pre- and
>>>>>>> post-evolved data objects. This is because a field added to an Avro
>>>>>>> schema has to declare a default value for older data to stay readable.
>>>>>>> AvroStorage should pick that default value up for the old objects. If
>>>>>>> it doesn't, then that's a bug.
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[email protected]> wrote:
>>>>>>> 
>>>>>>>> @Bill,
>>>>>>>> I did look at the option of providing input as a parameter while
>>>>>>>> initializing AvroStorage(). But even then, I'll still need to change my
>>>>>>>> script to handle the two files because I'll still need to have separate
>>>>>>>> schemas right?
>>>>>>>> 
>>>>>>>> @Stan,
>>>>>>>> Thanks for pointing me to it, it is a useful feature. But in my case, I
>>>>>>>> would never have two input files with different schemas. The input will
>>>>>>>> always have only one of the schemas, but I want my new script (with the
>>>>>>>> additional column) to be able to process the old data as well, even if
>>>>>>>> the input only contains data with the older schema.
>>>>>>>> 
>>>>>>>> On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> There is a patch for Avro to deal with this use case:
>>>>>>>>> https://issues.apache.org/jira/browse/PIG-2579
>>>>>>>>> (See the attached pig example which loads two avro input files with
>>>>>>>>> different schemas.)
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> stan
>>>>>>>>> 
>>>>>>>>> On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[email protected]> wrote:
>>>>>>>>>> Hi guys,
>>>>>>>>>> 
>>>>>>>>>> I use Pig to process some clickstream data. I need to track a new
>>>>>>>>>> field, so I added a new field to my avro schema, and changed my Pig
>>>>>>>>>> script accordingly. It works fine with the new files (which have that
>>>>>>>>>> new column) but it breaks when I run it on my old files which do not
>>>>>>>>>> have that column in the schema (since avro stores schema in the data
>>>>>>>>>> files itself). I was expecting that Pig will assume the field to be
>>>>>>>>>> null if that particular field does not exist. But now I am having to
>>>>>>>>>> maintain separate scripts to process the old and new files. Is there
>>>>>>>>>> any workaround for this? Because I figure I'll have to add new columns
>>>>>>>>>> frequently and I don't want to maintain a separate script for each
>>>>>>>>>> window where the schema is constant.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> *Note that I'm no longer using my Yahoo! email address. Please email
>>>>>>> me at [email protected] going forward.*
>>>>>>> 
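
A minimal sketch of the default-value behaviour discussed in the quoted thread, written as a toy Python model rather than with the Avro library or AvroStorage. The field names member_id, browser_id and act_content come from the STORE example in the thread; the record name Click, the added field referrer and its empty-string default are hypothetical. The rule it mirrors is Avro's schema resolution: a field the writer did not store is filled from the default in the reader's schema, and a missing default is an error.

```python
import json

# Hypothetical reader (new) schema. Only the first three field names appear
# in the thread; "Click", "referrer" and its default are made up for this sketch.
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Click",
  "fields": [
    {"name": "member_id", "type": "int"},
    {"name": "browser_id", "type": "string"},
    {"name": "act_content", "type": "string"},
    {"name": "referrer", "type": "string", "default": ""}
  ]
}
""")


def resolve_record(written, reader_schema):
    """Toy model of reading a record through a newer reader schema.

    Fields present in the written record are copied; fields missing from it
    take the reader schema's default; a missing field with no default fails.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            resolved[name] = written[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError("no default for missing field: " + name)
    return resolved


# A record as it would have been written before "referrer" was added.
old_record = {"member_id": 7, "browser_id": "b-1", "act_content": "click"}
new_view = resolve_record(old_record, READER_SCHEMA)
print(new_view)
```

With a default declared on the added field, one schema file can describe both old and new data, so a single script suffices. Whether AvroStorage applies the default in a given Pig version is a separate question that this sketch does not answer.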
