Hi Gabriel,

Thanks for the quick response. Sure, let me try out the current head of the
0.8 branch to see if CRUNCH-316 fixes it.
Thanks,
Som

On Thu, May 29, 2014 at 11:43 AM, Gabriel Reid <[email protected]> wrote:
> Hey Som,
>
> No, no need for a custom partitioner or special GroupByOptions when
> you're using the AvroPathPerKeyTarget. As you probably know, it's
> definitely a good idea to have all values under the same key next to
> each other in the PTable that is being output.
>
> Any chance you could try this with a build from the current head of
> the 0.8 branch? It's named apache-crunch-0.8 in git. This really
> sounds like it's related to CRUNCH-316, so it would be good if we
> could check whether that fix corrects this issue or not.
>
> - Gabriel
>
>
> On Thu, May 29, 2014 at 7:46 PM, Som Satpathy <[email protected]> wrote:
> > Hi Josh/Gabriel,
> >
> > This problem has been confounding us for a while. Do we need to pass a
> > custom Partitioner or specific GroupByOptions into the groupBy to make
> > it work with the AvroPathPerKeyTarget? I assume there is no need for
> > that.
> >
> > Thanks,
> > Som
> >
> >
> > On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth
> > <[email protected]> wrote:
> >> Hi Josh,
> >>
> >> Thanks for the quick response.
> >>
> >> Here are the logs:
> >>
> >> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
> >>   at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
> >>   at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
> >>   at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
> >>   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
> >>   at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
> >>   at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
> >>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
> >>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
> >>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> >>   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> >>   at java.security.AccessController.doPrivileged(Native Method)
> >>   at javax.security.auth.Subject.doAs(Subject.java:415)
> >>   at org.apache.hadoop.security.UserGroupInformation.d
> >>
> >> Even when we read the output of AvroPathPerKeyTarget into a PCollection
> >> and try to count the number of records in the PCollection, we get the
> >> same error.
> >>
> >> The strange thing is that this occurs rarely (once in 3-4 times), even
> >> when we try it on the same data multiple times.
> >>
> >> The versions being used:
> >> Avro - 1.7.5
> >> Crunch - 0.8.2-hadoop2
> >>
> >> Thanks and Regards,
> >> Suraj Sheth
> >>
> >> From: Josh Wills [mailto:[email protected]]
> >> Sent: Wednesday, May 28, 2014 7:56 PM
> >> To: [email protected]
> >> Subject: Re: Issue with AvroPathPerKeyTarget in Crunch while writing
> >> data to multiple files for each of the keys of the PTable
> >>
> >> That sounds super annoying. Which version are you using? There was this
> >> issue that is fixed in master, but not in any release yet. (I'm trying
> >> to get one out this week if at all possible.)
> >>
> >> https://issues.apache.org/jira/browse/CRUNCH-316
> >>
> >> Can you check your logs for that in-memory buffer error?
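A minimal sketch of the read-back check Suraj describes above, assuming the
Crunch 0.8 API. MyRecord is a placeholder for the generated Avro specific
record class, and args[0] points at one key's output directory under the
AvroPathPerKeyTarget root:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.avro.Avros;
    import org.apache.hadoop.fs.Path;

    public class CountPerKeyOutput {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(CountPerKeyOutput.class);
        // MyRecord is a placeholder for the generated Avro specific record
        // class that the per-key files contain.
        PCollection<MyRecord> records = pipeline.read(
            From.avroFile(new Path(args[0]), Avros.records(MyRecord.class)));
        // length() triggers a counting job; on a corrupted part file this is
        // where "java.io.IOException: Invalid sync!" surfaces.
        System.out.println("record count: " + records.length().getValue());
        pipeline.done();
      }
    }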
> >> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth
> >> <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> We have a use case where we have a PTable consisting of 30 keys and
> >> millions of values per key. We want to write the values for each key
> >> into separate files.
> >>
> >> Although creating 30 different PTables using filter and then writing
> >> each of them to HDFS works for us, it is highly inefficient.
> >>
> >> So I have been trying to write data from a PTable into multiple files,
> >> one per key, using AvroPathPerKeyTarget.
> >>
> >> The usage is something like this:
> >>
> >> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
> >>
> >> where finalRecords is a PTable whose keys are Strings and whose values
> >> are Avro records.
> >>
> >> It is verified that the data contains exactly 30 unique keys. A few
> >> keys have a few million records each, while a few others have only a
> >> few thousand.
> >>
> >> Expectation: it divides the data into 30 parts and writes them to the
> >> specified place in HDFS, creating a directory for each key. We can
> >> then read the data back as a PCollection<Avro> for our next job.
> >>
> >> Issue: it does create 30 different directories for the keys, and all
> >> the directories contain data of non-zero size. But occasionally a few
> >> files get corrupted. When we try to read such a file into a
> >> PCollection<Avro> and use it, it throws an error:
> >>
> >> Caused by: java.io.IOException: Invalid sync!
> >>
> >> Symptoms: the issue occurs intermittently, about once in 3-4 runs, and
> >> only one or two files among the 30 get corrupted in that run. The size
> >> of a corrupted Avro file is either much larger or much smaller than
> >> expected, e.g. where we expect a file of 100MB, a corrupted run
> >> produces a file of 30MB or 250MB.
> >>
> >> We increased the number of reducers to 500 so that no two of the 30
> >> keys go to the same reducer. In spite of this change, we still see the
> >> error.
> >>
> >> Any ideas/suggestions to fix this issue, or an explanation of it,
> >> would be helpful.
> >>
> >> Thanks and Regards,
> >> Suraj Sheth
> >>
> >> --
> >> Director of Data Science
> >> Cloudera
> >> Twitter: @josh_wills
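Pulling the thread's pieces together, a minimal end-to-end sketch of the
per-key write itself, under the same assumptions as above: MyRecord and its
getCategory() key accessor are hypothetical, and an AvroPathPerKeyTarget
constructor taking a Path is assumed.

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.avro.AvroPathPerKeyTarget;
    import org.apache.crunch.types.avro.Avros;
    import org.apache.hadoop.fs.Path;

    public class WritePerKey {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(WritePerKey.class);
        // Key each record by a string field; the thread's data has exactly
        // 30 distinct keys. getCategory() is a hypothetical accessor.
        PTable<String, MyRecord> finalRecords = pipeline
            .read(From.avroFile(new Path(args[0]),
                Avros.records(MyRecord.class)))
            .by(new MapFn<MyRecord, String>() {
              @Override
              public String map(MyRecord rec) {
                return rec.getCategory().toString();
              }
            }, Avros.strings());
        // Group first so all values for a key arrive at the same reducer
        // together, then write one Avro directory per key: args[1]/<key>/.
        finalRecords.groupByKey()
            .write(new AvroPathPerKeyTarget(new Path(args[1])));
        pipeline.done();
      }
    }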

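For comparison, a sketch of the filter-per-key workaround Suraj describes as
working but inefficient: each key gets its own filter and write, so the full
table is rescanned once per key. Names (MyRecord, the method, its parameters)
are placeholders.

    import org.apache.crunch.FilterFn;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.io.To;

    public class FilterPerKey {
      // One filtered collection and one write per key; with 30 keys this
      // sets up 30 scans of the input, which is why it is inefficient
      // compared to a single AvroPathPerKeyTarget job.
      static void writePerKeyByFiltering(PTable<String, MyRecord> finalRecords,
                                         Iterable<String> keys,
                                         String outPath) {
        for (final String key : keys) {
          PCollection<MyRecord> oneKey = finalRecords.filter(
              new FilterFn<Pair<String, MyRecord>>() {
                @Override
                public boolean accept(Pair<String, MyRecord> input) {
                  return key.equals(input.first());
                }
              }).values();
          // Write just this key's values to its own directory.
          oneKey.write(To.avroFile(outPath + "/" + key));
        }
      }
    }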