Moving it into core makes sense to me, as Avro is a format we should be 
supporting.

Alan.

On Aug 21, 2012, at 6:03 PM, Cheolsoo Park wrote:

> Hi Dan,
> 
> Glad to hear that it worked. I totally agree that AvroStorage can be
> improved. In fact, it was written for Pig 0.7, so it could be written much
> more cleanly now.
> 
> My only concern is backward compatibility. That is, if I change the
> syntax (which I badly wanted to do while working on AvroStorage recently),
> it will break backward compatibility. What I have been thinking is to
> rewrite AvroStorage in core Pig, like HBaseStorage. For
> backward compatibility, we could keep the old version in Piggybank for a
> while and eventually retire it.
> 
> I am wondering what other people think. Please let me know if it is not a
> good idea to move AvroStorage to core Pig from Piggybank.
> 
> Thanks,
> Cheolsoo
> 
> On Tue, Aug 21, 2012 at 5:47 PM, Danfeng Li <[email protected]> wrote:
> 
>> Thanks, Cheolsoo. That solved my problem.
>> 
>> It would be nice if Pig could do this automatically when there are multiple
>> AvroStorage stores in the script. Otherwise, we have to track the index
>> numbers manually.
>> 
>> Dan
>> 
>> -----Original Message-----
>> From: Cheolsoo Park [mailto:[email protected]]
>> Sent: Tuesday, August 21, 2012 5:06 PM
>> To: [email protected]
>> Subject: Re: runtime exception when load and store multiple files using
>> avro in pig
>> 
>> Hi Danfeng,
>> 
>> The "long" comes from the 1st AvroStorage store in your script.
>> AvroStorage has a very funny syntax for multiple stores. To apply
>> different Avro schemas to multiple stores, you have to specify their
>> "index" as follows:
>> 
>> set1 = load 'input1.txt' using PigStorage('|') as ( ... );
>> store set1 into 'set1' using
>>     org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');
>> 
>> set2 = load 'input2.txt' using PigStorage('|') as ( .. );
>> store set2 into 'set2' using
>>     org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');
>> 
>> As can be seen, I added the 'index' parameters.
>> 
>> What AvroStorage does is to construct the following string in the frontend:
>> 
>> "1#<1st avro schema>,2#<2nd avro schema>"
>> 
>> and pass it to the backend via UDFContext. In the backend, tasks parse
>> this string to get the output schema for each store.
>> 
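For illustration only, the index-to-schema string described above could be built and parsed roughly as follows. This is a Python sketch, not the actual AvroStorage code (which is Java). The function names build_schema_string and parse_schema_string are hypothetical, and the "N#schema" entries joined by commas simply follow the format quoted above. Because a JSON schema can itself contain commas, the sketch decodes each schema with a JSON parser instead of splitting on commas.

```python
import json


def build_schema_string(schemas):
    """Join {index: schema} into "1#<schema>,2#<schema>" (frontend side)."""
    return ",".join(
        # Compact JSON so that no whitespace precedes or follows a schema.
        f"{index}#{json.dumps(schema, separators=(',', ':'))}"
        for index, schema in sorted(schemas.items())
    )


def parse_schema_string(text):
    """Recover {index: schema} from the joined string (backend side)."""
    decoder = json.JSONDecoder()
    result = {}
    pos = 0
    while pos < len(text):
        hash_at = text.index("#", pos)      # end of the numeric index
        index = int(text[pos:hash_at])
        # raw_decode reads exactly one JSON value and reports where it ended,
        # so commas inside the schema are never mistaken for separators.
        schema, end = decoder.raw_decode(text, hash_at + 1)
        result[index] = schema
        pos = end + 1                       # skip the "," between entries
    return result


if __name__ == "__main__":
    # Hypothetical schemas for two stores, keyed by their 'index' value.
    joined = build_schema_string({1: ["null", "long"], 2: ["null", "string"]})
    print(joined)
    print(parse_schema_string(joined)[2])
```

With a lookup like this, each store picks its own schema by index, which is why a store without an index can end up being checked against another store's schema.
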
>> Thanks,
>> Cheolsoo
>> 
>> On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li <[email protected]>
>> wrote:
>> 
>>> I ran into this strange problem when trying to load multiple text
>>> files and convert them into Avro format using Pig. However,
>>> if I read and convert one file at a time in separate runs, everything
>>> is fine. The error message is as follows:
>>> 
>>> 2012-08-21 19:15:32,964 [main] ERROR
>>> org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to
>>> recreate exception from backed error:
>>> org.apache.avro.file.DataFileWriter$AppendWriteException:
>>> java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in
>>> union ["null","long"]
>>>         at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
>>>         at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
>>>         at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>>>         at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
>>>         at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapB
>>> 
>>> my code is
>>> set1 = load '$input_dir/set1.txt' using PigStorage('|') as (
>>>   id:long,
>>>   f1:long,
>>>   f2:chararray,
>>>   f3:float,
>>>   f4:float,
>>>   f5:float,
>>>   f6:float,
>>>   f7:float,
>>>   f8:float,
>>>   f9:float,
>>>   f10:float,
>>>   f11:float,
>>>   f12:float);
>>> store set1 into '$output_dir/set1.avro'
>>> using org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> 
>>> set2 = load '$input_dir/set2.txt' using PigStorage('|') as (
>>>   id : int,
>>>   date : chararray);
>>> store set2 into '$output_dir/set2.avro'
>>> using org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> 
>>> The first file is converted fine, but the 2nd one fails. The error
>>> comes from the 2nd field in the 2nd file, but the strange thing is
>>> that I don't even have "long" in my schema, yet the error message
>>> shows ["null","long"].
>>> 
>>> I use Pig 0.10.0 and avro-1.7.1.jar.
>>> 
>>> I wonder if this is a bug or I missed something.
>>> 
>>> Thanks.
>>> Dan
>>> 
>>> Here's set1.txt
>>> 
>>> 827352|740214|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
>>> 827353|740214|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
>>> 827354|740214|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
>>> 827193|739576|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
>>> 827194|739576|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
>>> 827195|739576|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
>>> 827355|740215|Long|51|1.776868012839072|113652088.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.164818333642054|0
>>> 827356|740215|Short|34|-2.360589090333165|-150988074.9471841|0|0|-0.00868330219442376|-0.008616238065508337|-0.008616238065508375|-0.5943698959308671|-0.02690679230502523
>>> 827357|740215|Total|85|-0.5837210774940929|63962032.00527128|0|0|0.01084217439253325|0.01095407961533095|0.0109540796153309|0.5704484377111866|-0.02690679230502523
>>> 827202|739590|Long|53|1.777568428360522|113696888.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.156653489849146|0
>>> 
>>> 
>>> Here's the set2.txt
>>> 1|1980-01-01 00:00:00.000
>>> 2|1980-01-02 00:00:00.000
>>> 3|1980-01-03 00:00:00.000
>>> 4|1980-01-04 00:00:00.000
>>> 5|1980-01-07 00:00:00.000
>>> 6|1980-01-08 00:00:00.000
>>> 7|1980-01-09 00:00:00.000
>>> 8|1980-01-10 00:00:00.000
>>> 9|1980-01-11 00:00:00.000
>>> 10|1980-01-14 00:00:00.000
>>> 
>>> 
>> 
