Adding user@avro

From: Suraj Satishkumar Sheth [mailto:[email protected]]
Sent: Wednesday, May 28, 2014 8:17 PM
To: [email protected]
Subject: RE: Issue with AvroPathperKeyTarget in crunch while writing data to 
multiple files for each of the keys of the PTable

Hi Josh,
Thanks for the quick response

Here are the logs:
org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
    at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
    at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.d

Even when we read the output of AvroPathPerKeyTarget into a PCollection and try 
to count the number of records in it, we get the same error.
The strange thing is that the failure is intermittent (roughly once in every 3-4 
runs), even when we run against the same data multiple times.


The versions being used :
Avro – 1.7.5
Crunch - 0.8.2-hadoop2

Thanks and Regards,
Suraj Sheth

From: Josh Wills [mailto:[email protected]]
Sent: Wednesday, May 28, 2014 7:56 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Issue with AvroPathperKeyTarget in crunch while writing data to 
multiple files for each of the keys of the PTable

That sounds super annoying. Which version are you using? There was this issue 
that is fixed in master, but not in any release yet. (I'm trying to get one out 
this week if at all possible.)

https://issues.apache.org/jira/browse/CRUNCH-316

Can you check your logs for that in-memory buffer error?

On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth 
<[email protected]<mailto:[email protected]>> wrote:
Hi,
We have a use case where a PTable contains 30 keys and millions of values per 
key, and we want to write the values for each key into a separate file.
Creating 30 different PTables using filter and then writing each of them to HDFS 
works for us, but it is highly inefficient.

I have been trying to write data from a PTable into multiple files 
corresponding to the values of the keys using AvroPathPerKeyTarget.

So, the usage is something like this:
finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));

where finalRecords is a PTable whose keys are Strings and whose values are Avro 
records.
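To make the expected output layout concrete, here is a plain-Java sketch of what a path-per-key target produces: one subdirectory per key, containing that key's values. This is a hypothetical analogue written for illustration, not the Crunch API; the class, method, and file names are my own.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical plain-Java analogue of AvroPathPerKeyTarget's output layout:
// each key gets its own directory, and that key's values go into a part file
// under it. Illustration only; not the Crunch implementation.
public class PathPerKeyDemo {

    // Writes each key's values to <outDir>/<key>/part-00000 and returns
    // the number of values written per key.
    static Map<String, Integer> writePerKey(Map<String, List<String>> table,
                                            Path outDir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : table.entrySet()) {
            Path keyDir = outDir.resolve(e.getKey());
            Files.createDirectories(keyDir);
            // One line per value, standing in for Avro records.
            Files.write(keyDir.resolve("part-00000"), e.getValue());
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> table = new HashMap<>();
        table.put("keyA", Arrays.asList("rec1", "rec2"));
        table.put("keyB", Arrays.asList("rec3"));
        Path out = Files.createTempDirectory("per-key-out");
        System.out.println(writePerKey(table, out)); // {keyA=2, keyB=1}
    }
}
```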

It is verified that the data contains exactly 30 unique keys. Some keys have a 
few million records each, while others have only a few thousand.

Expectation : It will divide the data into 30 parts and write them to the 
specified location in HDFS, creating a directory for each key. We will then be 
able to read the data back as a PCollection<Avro> for our next job.

Issue : It creates 30 different directories for the keys, and all of the 
directories contain data of non-zero size.
       But, occasionally, a few files get corrupted. When we try to read such a 
file into a PCollection<Avro> and use it, it throws an error:
       Caused by: java.io.IOException: Invalid sync!
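For context on what "Invalid sync!" means: an Avro container file separates data blocks with a fixed 16-byte sync marker chosen when the file is written, and a reader reports this error when the marker is not found where expected, which is exactly what happens if bytes are lost or duplicated mid-file. The snippet below is a simplified, self-contained illustration of that check, not the Avro library's actual code; the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.Random;

// Simplified illustration (not the Avro library) of the sync-marker check
// behind "Invalid sync!": each data block in an Avro container file is
// followed by a fixed 16-byte marker, so a corrupted or truncated block
// shifts the marker out of its expected position.
public class SyncMarkerDemo {

    static final int SYNC_SIZE = 16;

    // Returns true if the bytes at `offset` match the file's sync marker.
    static boolean validSync(byte[] file, int offset, byte[] sync) {
        if (offset < 0 || offset + SYNC_SIZE > file.length) return false;
        byte[] found = Arrays.copyOfRange(file, offset, offset + SYNC_SIZE);
        return Arrays.equals(found, sync);
    }

    public static void main(String[] args) {
        byte[] sync = new byte[SYNC_SIZE];
        new Random(42).nextBytes(sync);

        // A toy "file": one 8-byte data block followed by the sync marker.
        byte[] block = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] file = new byte[block.length + SYNC_SIZE];
        System.arraycopy(block, 0, file, 0, block.length);
        System.arraycopy(sync, 0, file, block.length, SYNC_SIZE);

        System.out.println(validSync(file, block.length, sync)); // true

        // Corrupt one byte where the marker should be, as a partial or
        // duplicated write might; the reader would now raise Invalid sync.
        file[block.length] ^= 0xFF;
        System.out.println(validSync(file, block.length, sync)); // false
    }
}
```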

Symptoms : The issue occurs intermittently, once in 3-4 runs, and only one or 
two of the 30 files get corrupted in that run.
           The size of a corrupted Avro file is either much larger or much 
smaller than expected. E.g. if we expect a file of 100MB, we may get a file of 
30MB or 250MB when it is corrupted by AvroPathPerKeyTarget.

We increased the number of reducers to 500 so that no two of the 30 keys go to 
the same reducer. In spite of this change, we still see the error.

Any ideas or suggestions for fixing this issue, or an explanation of it, would 
be helpful.


Thanks and Regards,
Suraj Sheth



--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
