Hey Som,

No, there's no need for a custom partitioner or special GroupByOptions when you're using the AvroPathPerKeyTarget. As you probably know, it's definitely a good idea to have all values under the same key next to each other in the PTable that is being output.
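For reference, a minimal sketch of the write side under those assumptions (finalRecords and outPath are placeholders, and I'm assuming the values are Avro GenericData.Record objects):

    // finalRecords: PTable<String, GenericData.Record>; the String key becomes
    // the per-key sub-directory name under outPath.
    // A plain groupByKey() is enough to bring the values for each key together;
    // no custom Partitioner or GroupByOptions are needed.
    PGroupedTable<String, GenericData.Record> grouped = finalRecords.groupByKey();
    grouped.write(new AvroPathPerKeyTarget(outPath));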
Any chance you could try this with a build from the current head of the 0.8 branch? It's named apache-crunch-0.8 in git. This really sounds like it's related to CRUNCH-316, so it would be good if we could check whether that fix corrects this issue or not.

- Gabriel

On Thu, May 29, 2014 at 7:46 PM, Som Satpathy <[email protected]> wrote:
> Hi Josh/Gabriel,
>
> This problem has been confounding us for a while. Do we need to pass a
> custom Partitioner or pass specific GroupByOptions into the groupBy to make
> it work with the AvroPathPerKeyTarget? I assume there is no need for that.
>
> Thanks,
> Som
>
> On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth
> <[email protected]> wrote:
>>
>> Hi Josh,
>>
>> Thanks for the quick response.
>>
>> Here are the logs:
>>
>> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
>>   at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
>>   at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
>>   at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
>>   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
>>   at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
>>   at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
>>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
>>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.d
>>
>> Even when we read the output of AvroPathPerKeyTarget into a PCollection
>> and try to count the number of records in the PCollection, we get the same
>> error.
>>
>> The strange thing is that this occurs rarely (once in 3-4 times), even when
>> we try it on the same data multiple times.
>>
>> The versions being used:
>> Avro - 1.7.5
>> Crunch - 0.8.2-hadoop2
>>
>> Thanks and Regards,
>> Suraj Sheth
>>
>> From: Josh Wills [mailto:[email protected]]
>> Sent: Wednesday, May 28, 2014 7:56 PM
>> To: [email protected]
>> Subject: Re: Issue with AvroPathPerKeyTarget in crunch while writing data
>> to multiple files for each of the keys of the PTable
>>
>> That sounds super annoying. Which version are you using? There was this
>> issue that is fixed in master, but not in any release yet. (I'm trying to
>> get one out this week if at all possible.)
>>
>> https://issues.apache.org/jira/browse/CRUNCH-316
>>
>> Can you check your logs for that in-memory buffer error?
>>
>> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth
>> <[email protected]> wrote:
>>
>> Hi,
>>
>> We have a use case where we have a PTable which consists of 30 keys and
>> millions of values per key. We want to write the values for each of the
>> keys into separate files.
>>
>> Although creating 30 different PTables using filter and then writing each
>> of them to HDFS works for us, it is highly inefficient.
>>
>> I have been trying to write data from a PTable into multiple files
>> corresponding to the values of the keys using AvroPathPerKeyTarget.
>>
>> So, the usage is something like this:
>>
>> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
>>
>> where finalRecords is a PTable whose keys are Strings and values are Avro
>> records.
>>
>> It is verified that the data contains exactly 30 unique keys. The amount of
>> data is a few million records for some of the keys and a few thousand for
>> the others.
>>
>> Expectation: It will divide the data into 30 parts and write them to the
>> specified place in HDFS, creating a directory for each key. We will be able
>> to read the data as a PCollection<Avro> later for our next job.
>>
>> Issue: It is able to create 30 different directories for the keys, and all
>> the directories have data of non-zero size.
>> But, occasionally, a few files get corrupted. When we try to read one into
>> a PCollection<Avro> and use it, it throws an error:
>> Caused by: java.io.IOException: Invalid sync!
>>
>> Symptoms: The issue occurs intermittently. It occurs once in 3-4 runs, and
>> only one or two files among the 30 get corrupted in that run.
>> The file size of the corrupted Avro file is either much larger or much
>> smaller than expected. E.g. if we are expecting a file of 100MB, we will
>> get a file of 30MB or 250MB if it is corrupted by AvroPathPerKeyTarget.
>>
>> We increased the number of reducers to 500 so that no two keys (among the
>> 30 keys) go to the same reducer. In spite of this change, we still saw the
>> error.
>>
>> Any ideas/suggestions to fix this issue, or an explanation of it, would be
>> helpful.
>>
>> Thanks and Regards,
>> Suraj Sheth
>>
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills
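P.S. For the read-back check described above (reading the AvroPathPerKeyTarget output into a PCollection and counting the records), a rough sketch, assuming generic Avro records and reading a single key's sub-directory; the job class, schema, outPath, and the "some_key" sub-directory are just placeholders:

    // Read one per-key sub-directory back and count its records.
    Pipeline pipeline = new MRPipeline(ReadBackCheck.class, new Configuration());
    PCollection<GenericData.Record> records = pipeline.read(
        From.avroFile(new Path(outPath, "some_key"), Avros.generics(schema)));
    // length() returns a PObject<Long>; getValue() runs the pipeline and fetches it.
    System.out.println("record count: " + records.length().getValue());
    pipeline.done();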
