Hi Josh/Gabriel,

This problem has been confounding us for a while. Do we need to pass a custom Partitioner or specific GroupingOptions into the groupByKey to make it work with the AvroPathPerKeyTarget? I assume there is no need for that.
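(In case it is useful to see what that would look like, here is a rough sketch against the Crunch 0.8.x API as I understand it; HashPartitioner is only a stand-in for whatever custom partitioner we might use, and finalRecords/outPath are the names from Suraj's mail below.)

    import org.apache.crunch.GroupingOptions;
    import org.apache.crunch.io.avro.AvroPathPerKeyTarget;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    // Sketch only: route the groupBy through explicit GroupingOptions so that a
    // custom Partitioner (HashPartitioner as a stand-in) and a fixed reducer
    // count are applied before the per-key write.
    GroupingOptions opts = GroupingOptions.builder()
        .partitionerClass(HashPartitioner.class)
        .numReducers(30)
        .build();

    finalRecords.groupByKey(opts).write(new AvroPathPerKeyTarget(outPath));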
Thanks,
Som

On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth <[email protected]> wrote:

> Hi Josh,
>
> Thanks for the quick response.
>
> Here are the logs:
>
> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
>     at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
>     at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
>     at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
>     at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
>     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.d
>
> Even when we read the output of AvroPathPerKeyTarget into a PCollection and try to count the number of records in the PCollection, we get the same error.
>
> The strange thing is that this occurs only rarely (once every 3-4 times), even when we try it on the same data multiple times.
>
> The versions being used:
> Avro: 1.7.5
> Crunch: 0.8.2-hadoop2
>
> Thanks and Regards,
> Suraj Sheth
>
> From: Josh Wills [mailto:[email protected]]
> Sent: Wednesday, May 28, 2014 7:56 PM
> To: [email protected]
> Subject: Re: Issue with AvroPathPerKeyTarget in crunch while writing data to multiple files for each of the keys of the PTable
>
> That sounds super annoying. Which version are you using? There was this issue that is fixed in master, but not in any release yet. (I'm trying to get one out this week if at all possible.)
>
> https://issues.apache.org/jira/browse/CRUNCH-316
>
> Can you check your logs for that in-memory buffer error?
>
> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth <[email protected]> wrote:
>
> Hi,
>
> We have a use case where we have a PTable with 30 keys and millions of values per key. We want to write the values for each of the keys into separate files.
>
> Although creating 30 different PTables using filter and then writing each of them to HDFS works for us, it is highly inefficient.
>
> I have been trying to write data from a PTable into multiple files, one per key, using AvroPathPerKeyTarget.
>
> So, the usage is something like this:
>
>     finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
>
> where finalRecords is a PTable whose keys are Strings and values are Avro records.
>
> It is verified that the data contains exactly 30 unique keys. A few keys have a few million records each, while the others have a few thousand each.
>
> Expectation: it will divide the data into 30 parts and write them to the specified place in HDFS, creating a directory for each key. We will be able to read the data as a PCollection<Avro> later for our next job.
>
> Issue: it is able to create 30 different directories for the keys, and all the directories have data of non-zero size.
> But, occasionally, a few files get corrupted. When we try to read one into a PCollection<Avro> and use it, it throws an error:
>
>     Caused by: java.io.IOException: Invalid sync!
>
> Symptoms: the issue occurs intermittently. It shows up once every 3-4 runs, and only one or two files among the 30 get corrupted in that run.
>
> The size of a corrupted Avro file is either much larger or much smaller than expected. E.g. if we are expecting a file of 100MB, we will get a file of 30MB or 250MB when it is corrupted by AvroPathPerKeyTarget.
>
> We increased the number of reducers to 500 so that no two keys (among the 30 keys) go to the same reducer. In spite of this change, we still see the error.
>
> Any ideas/suggestions to fix this issue, or an explanation of it, would be helpful.
>
> Thanks and Regards,
> Suraj Sheth
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
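(For reference, a minimal sketch of the read-back-and-count step where the "Invalid sync!" surfaces, assuming generic Avro records; the method name, the per-key sub-directory, and the schema argument are placeholders, while From.avroFile, Avros.generics, and PCollection.length are the standard Crunch calls as I understand them.)

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.avro.Avros;

    // Sketch: read one key's output directory back as generic Avro records and
    // count them -- this is where the "Invalid sync!" error shows up on the
    // corrupted files.
    static long countRecords(Pipeline pipeline, String keyDir, Schema recordSchema) {
      PCollection<GenericData.Record> records =
          pipeline.read(From.avroFile(keyDir, Avros.generics(recordSchema)));
      return records.length().getValue();  // forces the pipeline to run
    }

    // e.g. countRecords(pipeline, outPath + "/someKey", recordSchema);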
