Out of curiosity, any reason you went with multiple reads as opposed to just performing multiple operations on the same PTable? parallelDo returns a new object rather than modifying the initial one, so a single collection can start multiple execution flows.
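For illustration, something like the rough sketch below reads each file once and then fans out into two independent flows. It assumes the ByteOffsetInputFormat from your mail; the DoFns (LineTextFn, OffsetFn), class name, and output paths are just placeholders, not anything from your job:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class SingleReadTwoFlows {

  // Placeholder transformation: keep only the line text.
  static class LineTextFn extends DoFn<Pair<LongWritable, Text>, String> {
    @Override
    public void process(Pair<LongWritable, Text> input, Emitter<String> emitter) {
      emitter.emit(input.second().toString());
    }
  }

  // Placeholder transformation: keep only the byte offset.
  static class OffsetFn extends DoFn<Pair<LongWritable, Text>, String> {
    @Override
    public void process(Pair<LongWritable, Text> input, Emitter<String> emitter) {
      emitter.emit(Long.toString(input.first().get()));
    }
  }

  public static void main(String[] args) throws Exception {
    String inputPath = args[0];
    Pipeline pipeline = new MRPipeline(SingleReadTwoFlows.class, new Configuration());

    // Read the file once, using the custom input format from the original mail.
    PTable<LongWritable, Text> lines = pipeline.read(
        From.formattedFile(inputPath, ByteOffsetInputFormat.class,
            LongWritable.class, Text.class));

    // parallelDo returns a new PCollection each time and leaves `lines`
    // untouched, so both flows hang off the same single read.
    PCollection<String> textFlow = lines.parallelDo(new LineTextFn(), Writables.strings());
    PCollection<String> offsetFlow = lines.parallelDo(new OffsetFn(), Writables.strings());

    pipeline.writeTextFile(textFlow, args[1]);
    pipeline.writeTextFile(offsetFlow, args[2]);
    pipeline.done();
  }
}

The planner should then be able to serve both flows from a single scan of each input rather than reading the file twice.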
On Mon, Jul 27, 2015, 8:11 PM Jeff Quinn <[email protected]> wrote:

> Hello,
>
> We have observed and replicated strange behavior with our Crunch
> application while running on MapReduce via the AWS Elastic MapReduce
> service. Running a very simple job that is mostly map-only, we see that
> an undetermined subset of records is getting dropped. Specifically, we
> expect 30,136,686 output records, but have seen the following counts on
> different trials (running over the same data with the same binary):
>
> 22,177,119 records
> 26,435,670 records
> 22,362,986 records
> 29,798,528 records
>
> These are all the things about our application that might be unusual
> and relevant:
>
> - We use a custom file input format, via From.formattedFile. It looks
> like this (basically a carbon copy of
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat):
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
>
> import java.io.IOException;
>
> public class ByteOffsetInputFormat
>     extends FileInputFormat<LongWritable, Text> {
>
>   @Override
>   public RecordReader<LongWritable, Text> createRecordReader(
>       InputSplit split, TaskAttemptContext context)
>       throws IOException, InterruptedException {
>     return new LineRecordReader();
>   }
> }
>
> - We call org.apache.crunch.Pipeline#read using this InputFormat many
> times; for the job in question it is called ~160 times, as the input is
> ~100 different files. Each file ranges in size from 100 MB to 8 GB. Our
> job uses only this input format for all input files.
>
> - For some files, org.apache.crunch.Pipeline#read is called twice on the
> same file, and the resulting PTables are processed in different ways.
>
> - Only the files on which org.apache.crunch.Pipeline#read has been
> called more than once during a job have dropped records; all other
> files consistently have no dropped records.
>
> Curious if any Crunch users have experienced similar behavior before,
> or if any of these details about my job raise any red flags.
>
> Thanks!
>
> Jeff Quinn
>
> Data Engineer
>
> Nuna
>
> *DISCLAIMER:* The contents of this email, including any attachments,
> may contain information that is confidential, proprietary in nature,
> protected health information (PHI), or otherwise protected by law from
> disclosure, and is solely for the use of the intended recipient(s). If
> you are not the intended recipient, you are hereby notified that any
> use, disclosure or copying of this email, including any attachments, is
> unauthorized and strictly prohibited. If you have received this email
> in error, please notify the sender of this email. Please delete this
> and all copies of this email from your system. Any opinions either
> expressed or implied in this email and all attachments are those of its
> author only, and do not necessarily reflect those of Nuna Health, Inc.
