Re: AvroStorage Default values are set to null even if they are specified

Cheolsoo Park Thu, 02 May 2013 22:01:20 -0700

Hi Viraj,

Yes, that's a known bug. Here is what happens:


1) Let's say there are two schemas, X and Y.
2) AvroStorage creates a tuple whose size == max( sizeOf(X), sizeOf(Y) ).
3) Fields are filled in as values are read. If a record has no value for a
field, that field is left as null; the default from the schema is not applied.
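
To make the steps above concrete, here is a minimal Python sketch. It is not
the AvroStorage code: plain dicts and tuples stand in for Avro records and Pig
tuples, and merged_fields and read_record are names made up for illustration.
The field order is taken from the describe output in your message.

```python
# Illustrative only: plain Python stands in for Avro records and Pig tuples.
# merged_fields and read_record are hypothetical names, not AvroStorage APIs.

# Field order copied from the `describe employee` output in the quoted message.
merged_fields = ["name", "age", "dept", "lastname", "salary", "office"]


def read_record(record, fields):
    """Build a tuple with one slot per field; fields the record lacks stay None."""
    proto = [None] * len(fields)          # step 2: allocate the full-size tuple
    for i, field in enumerate(fields):    # step 3: fill in the values that exist
        if field in record:
            proto[i] = record[field]
    return tuple(proto)


# A record written with the three-field schema (Employee3.avro).
row = read_record({"name": "Milo", "age": 30, "dept": "DH"}, merged_fields)
print(row)  # ('Milo', 30, 'DH', None, None, None)
```

The three trailing None values correspond to the empty columns in
(Milo,30,DH,,,).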

If you'd like to fix it, please take a look at PigAvroRecordReader.java:
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java

In particular, see how mProtoTuple is initialized and updated.
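
One possible shape for a fix is to seed the tuple with the schema defaults
instead of nulls before the values are read. The Python below is only a sketch
of that idea under my own assumptions; it does not reproduce the Java in
PigAvroRecordReader, and merged_defaults and read_record_with_defaults are
hypothetical names. The defaults are copied from the schemas in your message.

```python
# Illustrative only: a sketch of the idea, not the actual PigAvroRecordReader code.
# merged_defaults and read_record_with_defaults are hypothetical names.

# (field, default) pairs; defaults copied from the schemas in the quoted message.
merged_defaults = [
    ("name", "NU"), ("age", 0), ("dept", "DU"),
    ("lastname", "LNU"), ("salary", 0), ("office", "OU"),
]


def read_record_with_defaults(record, defaults):
    """Start from the schema defaults, then overwrite with values actually read."""
    proto = [default for _, default in defaults]   # seed with defaults, not None
    for i, (field, _) in enumerate(defaults):
        if field in record:
            proto[i] = record[field]
    return tuple(proto)


row = read_record_with_defaults(
    {"name": "Milo", "age": 30, "dept": "DH"}, merged_defaults
)
print(row)  # ('Milo', 30, 'DH', 'LNU', 0, 'OU')
```

Values present in the record still win; the defaults only show up in the slots
that would otherwise have been null.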

Thanks,
Cheolsoo





On Thu, May 2, 2013 at 8:34 PM, Viraj Bhat <vi...@yahoo-inc.com> wrote:

> Hi Cheolsoo/Pig User Group,
>   I am using the Pig 0.11 piggybank AvroStorage. When merging multiple
> schemas where default values have been specified in the Avro schema,
> AvroStorage puts nulls in the merged data set.
> Is this a known bug in the current implementation of AvroStorage?
> Here is an example provided by one of my colleagues. The final dataset should
> contain "LNU", 0, "OU" for all values where the columns do not exist.
> ==> Employee3.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "age", "type" : "int", "default" : 0 },
>         {"name" : "dept", "type": "string", "default" : "DU"}
> ]
> }
>
> ==> Employee4.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "age", "type" : "int", "default" : 0},
>         {"name" : "dept", "type": "string", "default" : "DU"},
>         {"name" : "office", "type": "string", "default" : "OU"}
> ]
> }
>
> ==> Employee6.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "lastname", "type": "string", "default" : "LNU"},
>         {"name" : "age", "type" : "int","default" : 0},
>         {"name" : "salary", "type": "int", "default" : 0},
>         {"name" : "dept", "type": "string","default" : "DU"},
>         {"name" : "office", "type": "string","default" : "OU"}
> ]
> }
>
> The pig script:
> employee = load '$input' using
> org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
> describe employee;
> dump employee;
>
> The call:
> dump_employees.pig employee{3,4,6}.ser
>
> The output:
> employee: {name: chararray,age: int,dept: chararray,lastname:
> chararray,salary: int,office: chararray}
>
> (Milo,30,DH,,,)
> (Asmya,34,PQ,,,)
> (Baljit,23,RS,,,)
> (Pune,60,Astrophysics,Warriors,5466,UTA)
> (Rajsathan,20,Biochemistry,Royals,1378,Stanford)
> (Chennai,50,Microbiology,Superkings,7338,Hopkins)
> (Mumbai,20,Applied Math,Indians,4468,UAH)
> (Praj,54,RMX,,,Champaign)
> (Buba,767,HD,,,Sunnyvale)
> (Manku,375,MS,,,New York)
> Regards
> Viraj
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:piaozhe...@gmail.com]
> Sent: Tuesday, April 30, 2013 9:10 PM
> To: user@pig.apache.org
> Cc: Qi, Runping
> Subject: Re: Override input schema in AvroStorage
>
> Hi Steven,
>
> The new AvroStorage will let you specify the input schema:
> https://issues.apache.org/jira/browse/PIG-3015
>
> In fact, somebody made the same request in a comment on the JIRA, which I am
> copying and pasting below:
>
> > Furthermore, we occasionally have issues with pig jobs picking the old
> > schema when we have a schema update. Manually specifying the schema
> > would fix this and give us more flexibility in defining the data we
> > want pig to pull from a file.
>
>
> This JIRA is a work in progress, but hopefully it will be in the next major
> release.
>
> Thanks,
> Cheolsoo
>
>
>
> On Sat, Apr 27, 2013 at 3:24 PM, Enns, Steven <sae...@a9.com> wrote:
>
> > Resending now that I am subscribed :)
> >
> > On 4/25/13 4:01 PM, "Enns, Steven" <sae...@a9.com> wrote:
> >
> > >Hi everyone,
> > >
> > >I would like to override the input schema in AvroStorage to make a
> > >pig script robust to schema evolution.  For example, suppose a new
> > >field is added to an avro schema with a default value of null.  If
> > >the input to a pig script using this field includes both old and new
> > >data, AvroStorage will merge the input schemas from the old and new
> > >data.  However, if the input includes only old data, the new schema
> > >will not be available to AvroStorage and pig will fail to interpret
> > >the script with an error such as "projected field [newField] does not
> > >exist in schema".  If AvroStorage accepted an input schema, the
> > >script would be valid for both the new and old data.  Is there any plan
> > >to implement this?
> > >
> > >Thanks,
> > >Steve
> > >
> >
> >
>
