Hi Gabriel,

Thanks for the quick response. Sure, let me try out the current head of the
0.8 branch to see if CRUNCH-316 fixes it.
Thanks,
Som

On Thu, May 29, 2014 at 11:43 AM, Gabriel Reid <[email protected]> wrote:
> Hey Som,
>
> No, no need for a custom partitioner or special GroupByOptions when
> you're using the AvroPathPerKeyTarget. As you probably know, it's
> definitely a good idea to have all values under the same key next to
> each other in the PTable that is being output.
>
> Any chance you could try this with a build from the current head of
> the 0.8 branch? It's named apache-crunch-0.8 in git. This really
> sounds like it's related to CRUNCH-316, so it would be good if we
> could check whether that fix corrects this issue or not.
>
> - Gabriel
>
>
> On Thu, May 29, 2014 at 7:46 PM, Som Satpathy <[email protected]> wrote:
> > Hi Josh/Gabriel,
> >
> > This problem has been confounding us for a while. Do we need to pass a
> > custom Partitioner or specific GroupByOptions into the groupBy to make
> > it work with the AvroPathPerKeyTarget? I assume there is no need for
> > that.
> >
> > Thanks,
> > Som
> >
> >
> > On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth
> > <[email protected]> wrote:
> >> Hi Josh,
> >>
> >> Thanks for the quick response.
> >>
> >> Here are the logs:
> >>
> >> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
> >>   at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
> >>   at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
> >>   at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
> >>   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
> >>   at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
> >>   at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
> >>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
> >>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
> >>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> >>   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> >>   at java.security.AccessController.doPrivileged(Native Method)
> >>   at javax.security.auth.Subject.doAs(Subject.java:415)
> >>   at org.apache.hadoop.security.UserGroupInformation.d
> >>
> >> Even when we read the output of AvroPathPerKeyTarget into a PCollection
> >> and try to count the number of records in the PCollection, we get the
> >> same error.
> >>
> >> The strange thing is that this occurs rarely (once in 3-4 times), even
> >> when we try it on the same data multiple times.
> >>
> >> The versions being used:
> >> Avro - 1.7.5
> >> Crunch - 0.8.2-hadoop2
> >>
> >> Thanks and Regards,
> >> Suraj Sheth
> >>
> >> From: Josh Wills [mailto:[email protected]]
> >> Sent: Wednesday, May 28, 2014 7:56 PM
> >> To: [email protected]
> >> Subject: Re: Issue with AvroPathPerKeyTarget in Crunch while writing
> >> data to multiple files for each of the keys of the PTable
> >>
> >> That sounds super annoying. Which version are you using? There was this
> >> issue that is fixed in master, but not in any release yet. (I'm trying
> >> to get one out this week if at all possible.)
> >>
> >> https://issues.apache.org/jira/browse/CRUNCH-316
> >>
> >> Can you check your logs for that in-memory buffer error?
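A minimal sketch of the read-back check Suraj describes above, assuming the
Crunch 0.8 API. MyRecord is a placeholder for the generated Avro specific
record class, and args[0] points at one key's output directory under the
AvroPathPerKeyTarget root:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.avro.Avros;
    import org.apache.hadoop.fs.Path;

    public class CountPerKeyOutput {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(CountPerKeyOutput.class);
        // MyRecord is a placeholder for the generated Avro specific record
        // class that the per-key files contain.
        PCollection<MyRecord> records = pipeline.read(
            From.avroFile(new Path(args[0]), Avros.records(MyRecord.class)));
        // length() triggers a counting job; on a corrupted part file this is
        // where "java.io.IOException: Invalid sync!" surfaces.
        System.out.println("record count: " + records.length().getValue());
        pipeline.done();
      }
    }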
> >> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth
> >> <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> We have a use case where we have a PTable consisting of 30 keys and
> >> millions of values per key. We want to write the values for each key
> >> into separate files.
> >>
> >> Although creating 30 different PTables using filter and then writing
> >> each of them to HDFS works for us, it is highly inefficient.
> >>
> >> So I have been trying to write data from a PTable into multiple files,
> >> one per key, using AvroPathPerKeyTarget.
> >>
> >> The usage is something like this:
> >>
> >> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
> >>
> >> where finalRecords is a PTable whose keys are Strings and whose values
> >> are Avro records.
> >>
> >> It is verified that the data contains exactly 30 unique keys. A few
> >> keys have a few million records each, while a few others have only a
> >> few thousand.
> >>
> >> Expectation: it divides the data into 30 parts and writes them to the
> >> specified place in HDFS, creating a directory for each key. We can
> >> then read the data back as a PCollection<Avro> for our next job.
> >>
> >> Issue: it does create 30 different directories for the keys, and all
> >> the directories contain data of non-zero size. But occasionally a few
> >> files get corrupted. When we try to read such a file into a
> >> PCollection<Avro> and use it, it throws an error:
> >>
> >> Caused by: java.io.IOException: Invalid sync!
> >>
> >> Symptoms: the issue occurs intermittently, about once in 3-4 runs, and
> >> only one or two files among the 30 get corrupted in that run. The size
> >> of a corrupted Avro file is either much larger or much smaller than
> >> expected, e.g. where we expect a file of 100MB, a corrupted run
> >> produces a file of 30MB or 250MB.
> >>
> >> We increased the number of reducers to 500 so that no two of the 30
> >> keys go to the same reducer. In spite of this change, we still see the
> >> error.
> >>
> >> Any ideas/suggestions to fix this issue, or an explanation of it,
> >> would be helpful.
> >>
> >> Thanks and Regards,
> >> Suraj Sheth
> >>
> >> --
> >> Director of Data Science
> >> Cloudera
> >> Twitter: @josh_wills
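Pulling the thread's pieces together, a minimal end-to-end sketch of the
per-key write itself, under the same assumptions as above: MyRecord and its
getCategory() key accessor are hypothetical, and an AvroPathPerKeyTarget
constructor taking a Path is assumed.

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.avro.AvroPathPerKeyTarget;
    import org.apache.crunch.types.avro.Avros;
    import org.apache.hadoop.fs.Path;

    public class WritePerKey {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(WritePerKey.class);
        // Key each record by a string field; the thread's data has exactly
        // 30 distinct keys. getCategory() is a hypothetical accessor.
        PTable<String, MyRecord> finalRecords = pipeline
            .read(From.avroFile(new Path(args[0]),
                Avros.records(MyRecord.class)))
            .by(new MapFn<MyRecord, String>() {
              @Override
              public String map(MyRecord rec) {
                return rec.getCategory().toString();
              }
            }, Avros.strings());
        // Group first so all values for a key arrive at the same reducer
        // together, then write one Avro directory per key: args[1]/<key>/.
        finalRecords.groupByKey()
            .write(new AvroPathPerKeyTarget(new Path(args[1])));
        pipeline.done();
      }
    }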

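For comparison, a sketch of the filter-per-key workaround Suraj describes as
working but inefficient: each key gets its own filter and write, so the full
table is rescanned once per key. Names (MyRecord, the method, its parameters)
are placeholders.

    import org.apache.crunch.FilterFn;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.io.To;

    public class FilterPerKey {
      // One filtered collection and one write per key; with 30 keys this
      // sets up 30 scans of the input, which is why it is inefficient
      // compared to a single AvroPathPerKeyTarget job.
      static void writePerKeyByFiltering(PTable<String, MyRecord> finalRecords,
                                         Iterable<String> keys,
                                         String outPath) {
        for (final String key : keys) {
          PCollection<MyRecord> oneKey = finalRecords.filter(
              new FilterFn<Pair<String, MyRecord>>() {
                @Override
                public boolean accept(Pair<String, MyRecord> input) {
                  return key.equals(input.first());
                }
              }).values();
          // Write just this key's values to its own directory.
          oneKey.write(To.avroFile(outPath + "/" + key));
        }
      }
    }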