Hi Viraj,

Yes, that's a known bug. Here is what happens:

1) Let's say there are two schemas, X and Y.
2) AvroStorage creates a tuple whose size == max( sizeOf(X), sizeOf(Y) ).
3) Fields are filled in as values are read. But if no value is found for a
   field, that field is left as null, even when the schema declares a default.

If you'd like to fix it, please take a look at PigAvroRecordReader.java:
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java

In particular, see how mProtoTuple is initialized and updated.

Thanks,
Cheolsoo

On Thu, May 2, 2013 at 8:34 PM, Viraj Bhat <vi...@yahoo-inc.com> wrote:

> Hi Cheolsoo/Pig User Group,
> I am using the Pig 0.11 piggybank - AvroStorage. When merging multiple
> schemas where default values have been specified in the avro schema; The
> AvroStorage puts nulls in the merged data set.
> Is this a known bug in the current implementation of the AvroStorage.
> Using an example provided by one of my colleagues. The final dataset should
> contain "NU", 0, "OU" for all values where the columns do not exist.
> ==> Employee3.avro <==
> {
>   "type" : "record",
>   "name" : "employee",
>   "fields":[
>     {"name" : "name", "type" : "string", "default" : "NU"},
>     {"name" : "age", "type" : "int", "default" : 0 },
>     {"name" : "dept", "type": "string", "default" : "DU"}
>   ]
> }
>
> ==> Employee4.avro <==
> {
>   "type" : "record",
>   "name" : "employee",
>   "fields":[
>     {"name" : "name", "type" : "string", "default" : "NU"},
>     {"name" : "age", "type" : "int", "default" : 0},
>     {"name" : "dept", "type": "string", "default" : "DU"},
>     {"name" : "office", "type": "string", "default" : "OU"}
>   ]
> }
>
> ==> Employee6.avro <==
> {
>   "type" : "record",
>   "name" : "employee",
>   "fields":[
>     {"name" : "name", "type" : "string", "default" : "NU"},
>     {"name" : "lastname", "type": "string", "default" : "LNU"},
>     {"name" : "age", "type" : "int","default" : 0},
>     {"name" : "salary", "type": "int", "default" : 0},
>     {"name" : "dept", "type": "string","default" : "DU"},
>     {"name" : "office", "type": "string","default" : "OU"}
>   ]
> }
>
> The pig script:
> employee = load '$input' using
> org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
> describe employee;
> dump employee;
>
> The call:
> dump_employees.pig employee{3,4,6}.ser
>
> The output:
> employee: {name: chararray,age: int,dept: chararray,lastname:
> chararray,salary: int,office: chararray}
>
> (Milo,30,DH,,,)
> (Asmya,34,PQ,,,)
> (Baljit,23,RS,,,)
> (Pune,60,Astrophysics,Warriors,5466,UTA)
> (Rajsathan,20,Biochemistry,Royals,1378,Stanford)
> (Chennai,50,Microbiology,Superkings,7338,Hopkins)
> (Mumbai,20,Applied Math,Indians,4468,UAH)
> (Praj,54,RMX,,,Champaign)
> (Buba,767,HD,,,Sunnyvale)
> (Manku,375,MS,,,New York)
> Regards
> Viraj
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:piaozhe...@gmail.com]
> Sent: Tuesday, April 30, 2013 9:10 PM
> To: user@pig.apache.org
> Cc: Qi, Runping
> Subject: Re: Override input schema in AvroStorage
>
> Hi Steven,
>
> The new AvroStorage will let you specify the input schema:
> https://issues.apache.org/jira/browse/PIG-3015
>
> In fact, somebody made the same request in a comment of the jira that I am
> copying and pasting below:
>
> > Furthermore, we occasionally have issues with pig jobs picking the old
> > schema when we have a schema update. Manually specifying the schema
> > would fix this and give us more flexibility in defining the data we
> > want pig to pull from a file.
>
> This jira is work in progress, but hopefully it will be in next major
> released.
>
> Thanks,
> Cheolsoo
>
> On Sat, Apr 27, 2013 at 3:24 PM, Enns, Steven <sae...@a9.com> wrote:
>
> > Resending now that I am subscribed :)
> >
> > On 4/25/13 4:01 PM, "Enns, Steven" <sae...@a9.com> wrote:
> >
> > >Hi everyone,
> > >
> > >I would like to override the input schema in AvroStorage to make a
> > >pig script robust to schema evolution. For example, suppose a new
> > >field is added to an avro schema with a default value of null. If
> > >the input to a pig script using this field includes both old and new
> > >data, AvroStorage will merge the input schemas from the old and new
> > >data. However, if the input includes only old data, the new schema
> > >will not be available to AvroStorage and pig will fail to interpret
> > >the script with an error such as "projected field [newField] does not
> > >exist in schema". If AvroStorage accepted an input schema, the
> > >script would be valid for both the new and old data. Is there any plan
> > >to implement this?
> > >
> > >Thanks,
> > >Steve
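
For anyone picking up the fix: the behaviour Viraj expected, as opposed to the null-filling described in steps 1) to 3) of the reply at the top of this thread, can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in PigAvroRecordReader.java and not how mProtoTuple is actually handled there. The class name DefaultFillSketch, the method toTuple, and the use of plain maps in place of Avro records and schemas are all hypothetical; the field order and default values are taken from the describe output and the schemas quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for Avro records and schema defaults.
final class DefaultFillSketch {

    /**
     * Lays a record out in merged-schema field order. A field the record
     * does not carry gets the merged schema's default instead of null.
     */
    static List<Object> toTuple(List<String> mergedFields,
                                Map<String, Object> defaults,
                                Map<String, Object> record) {
        List<Object> tuple = new ArrayList<>(mergedFields.size());
        for (String field : mergedFields) {
            if (record.containsKey(field)) {
                tuple.add(record.get(field));
            } else {
                // The reported bug corresponds to leaving this slot null.
                tuple.add(defaults.get(field));
            }
        }
        return tuple;
    }

    public static void main(String[] args) {
        // Field order as printed by "describe employee" in the report.
        List<String> mergedFields =
            Arrays.asList("name", "age", "dept", "lastname", "salary", "office");
        // Defaults as declared in the Employee6 schema.
        Map<String, Object> defaults = Map.of(
            "name", "NU", "age", 0, "dept", "DU",
            "lastname", "LNU", "salary", 0, "office", "OU");
        // A record written with the three-field Employee3 schema.
        Map<String, Object> milo = Map.of("name", "Milo", "age", 30, "dept", "DH");

        // Prints [Milo, 30, DH, LNU, 0, OU]
        System.out.println(toTuple(mergedFields, defaults, milo));
    }
}
```

With this approach the first row of the dump would read (Milo,30,DH,LNU,0,OU) rather than (Milo,30,DH,,,). A real fix would need to take the defaults from the merged Avro schema rather than from a hand-built map.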