Out of curiosity, any reason you went with multiple reads as opposed to just performing multiple operations on the same PTable? parallelDo returns a new object rather than modifying the initial one, so a single collection can start multiple execution flows.
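For illustration, something like the rough sketch below reads each file once and then fans out into two independent flows. It assumes the ByteOffsetInputFormat from your mail; the DoFns (LineTextFn, OffsetFn), class name, and output paths are just placeholders, not anything from your job:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class SingleReadTwoFlows {

  // Placeholder transformation: keep only the line text.
  static class LineTextFn extends DoFn<Pair<LongWritable, Text>, String> {
    @Override
    public void process(Pair<LongWritable, Text> input, Emitter<String> emitter) {
      emitter.emit(input.second().toString());
    }
  }

  // Placeholder transformation: keep only the byte offset.
  static class OffsetFn extends DoFn<Pair<LongWritable, Text>, String> {
    @Override
    public void process(Pair<LongWritable, Text> input, Emitter<String> emitter) {
      emitter.emit(Long.toString(input.first().get()));
    }
  }

  public static void main(String[] args) throws Exception {
    String inputPath = args[0];
    Pipeline pipeline = new MRPipeline(SingleReadTwoFlows.class, new Configuration());

    // Read the file once, using the custom input format from the original mail.
    PTable<LongWritable, Text> lines = pipeline.read(
        From.formattedFile(inputPath, ByteOffsetInputFormat.class,
            LongWritable.class, Text.class));

    // parallelDo returns a new PCollection each time and leaves `lines`
    // untouched, so both flows hang off the same single read.
    PCollection<String> textFlow = lines.parallelDo(new LineTextFn(), Writables.strings());
    PCollection<String> offsetFlow = lines.parallelDo(new OffsetFn(), Writables.strings());

    pipeline.writeTextFile(textFlow, args[1]);
    pipeline.writeTextFile(offsetFlow, args[2]);
    pipeline.done();
  }
}

The planner should then be able to serve both flows from a single scan of each input rather than reading the file twice.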
On Mon, Jul 27, 2015, 8:11 PM Jeff Quinn <[email protected]> wrote:

> Hello,
>
> We have observed and replicated strange behavior with our Crunch
> application while running on MapReduce via the AWS Elastic MapReduce
> service. Running a very simple job that is mostly map-only, we see that
> an undetermined subset of records is getting dropped. Specifically, we
> expect 30,136,686 output records, but have seen the following counts on
> different trials (running over the same data with the same binary):
>
> 22,177,119 records
> 26,435,670 records
> 22,362,986 records
> 29,798,528 records
>
> These are all the things about our application that might be unusual
> and relevant:
>
> - We use a custom file input format, via From.formattedFile. It looks
> like this (basically a carbon copy of
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat):
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
>
> import java.io.IOException;
>
> public class ByteOffsetInputFormat
>     extends FileInputFormat<LongWritable, Text> {
>
>   @Override
>   public RecordReader<LongWritable, Text> createRecordReader(
>       InputSplit split, TaskAttemptContext context)
>       throws IOException, InterruptedException {
>     return new LineRecordReader();
>   }
> }
>
> - We call org.apache.crunch.Pipeline#read using this InputFormat many
> times; for the job in question it is called ~160 times, as the input is
> ~100 different files. Each file ranges in size from 100 MB to 8 GB. Our
> job uses only this input format for all input files.
>
> - For some files, org.apache.crunch.Pipeline#read is called twice on the
> same file, and the resulting PTables are processed in different ways.
>
> - Only the files on which org.apache.crunch.Pipeline#read has been
> called more than once during a job have dropped records; all other
> files consistently have no dropped records.
>
> Curious if any Crunch users have experienced similar behavior before,
> or if any of these details about my job raise any red flags.
>
> Thanks!
>
> Jeff Quinn
>
> Data Engineer
>
> Nuna
>
> *DISCLAIMER:* The contents of this email, including any attachments,
> may contain information that is confidential, proprietary in nature,
> protected health information (PHI), or otherwise protected by law from
> disclosure, and is solely for the use of the intended recipient(s). If
> you are not the intended recipient, you are hereby notified that any
> use, disclosure or copying of this email, including any attachments, is
> unauthorized and strictly prohibited. If you have received this email
> in error, please notify the sender of this email. Please delete this
> and all copies of this email from your system. Any opinions either
> expressed or implied in this email and all attachments are those of its
> author only, and do not necessarily reflect those of Nuna Health, Inc.
