Hey Jeff,

Are the counts determined by Counters? Or is it the length of the output
files? Or both?
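(Just to be concrete about the Counters side of the question: something like
the untested sketch below is what I have in mind, a pass-through DoFn whose
counter total can be checked against the record count of the output files. The
class and counter names here are only placeholders, not anything from your
job.)

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Hypothetical pass-through fn: every record it emits bumps a counter, so the
// "debug / records_emitted" total reported by the job can be compared with the
// number of records actually written to the output files.
public class CountingPassthroughFn extends DoFn<String, String> {
  @Override
  public void process(String input, Emitter<String> emitter) {
    increment("debug", "records_emitted");
    emitter.emit(input);
  }
}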
J

On Mon, Jul 27, 2015 at 5:29 PM, David Ortiz <[email protected]> wrote:

> Out of curiosity, any reason you went with multiple reads as opposed to
> just performing multiple operations on the same PTable? parallelDo returns
> a new object rather than modifying the initial one, so a single collection
> can start multiple execution flows.
>
> On Mon, Jul 27, 2015, 8:11 PM Jeff Quinn <[email protected]> wrote:
>
>> Hello,
>>
>> We have observed and replicated strange behavior with our Crunch
>> application while running on MapReduce via the AWS ElasticMapReduce
>> service. Running a very simple job that is mostly map-only, we see that
>> an undetermined subset of records is getting dropped. Specifically, we
>> expect 30,136,686 output records and have seen the following outputs on
>> different trials (running over the same data with the same binary):
>>
>> 22,177,119 records
>> 26,435,670 records
>> 22,362,986 records
>> 29,798,528 records
>>
>> These are all the things about our application that might be unusual and
>> relevant:
>>
>> - We use a custom file input format, via From.formattedFile. It looks
>> like this (basically a carbon copy of
>> org.apache.hadoop.mapreduce.lib.input.TextInputFormat):
>>
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.InputSplit;
>> import org.apache.hadoop.mapreduce.RecordReader;
>> import org.apache.hadoop.mapreduce.TaskAttemptContext;
>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>> import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
>>
>> import java.io.IOException;
>>
>> public class ByteOffsetInputFormat extends FileInputFormat<LongWritable, Text> {
>>
>>   @Override
>>   public RecordReader<LongWritable, Text> createRecordReader(
>>       InputSplit split, TaskAttemptContext context)
>>       throws IOException, InterruptedException {
>>     return new LineRecordReader();
>>   }
>> }
>>
>> - We call org.apache.crunch.Pipeline#read using this InputFormat many
>> times; for the job in question it is called ~160 times, as the input is
>> ~100 different files. Each file ranges in size from 100MB to 8GB. Our job
>> uses only this input format for all input files.
>>
>> - For some files, org.apache.crunch.Pipeline#read is called twice on the
>> same file, and the resulting PTables are processed in different ways.
>>
>> - It is only the files on which org.apache.crunch.Pipeline#read has been
>> called more than once during a job that have dropped records; all other
>> files consistently do not have dropped records.
>>
>> Curious if any Crunch users have experienced similar behavior before, or
>> if any of these details about my job raise any red flags.
>>
>> Thanks!
>>
>> Jeff Quinn
>>
>> Data Engineer
>>
>> Nuna
>>
>> *DISCLAIMER:* The contents of this email, including any attachments, may
>> contain information that is confidential, proprietary in nature, protected
>> health information (PHI), or otherwise protected by law from disclosure,
>> and is solely for the use of the intended recipient(s). If you are not the
>> intended recipient, you are hereby notified that any use, disclosure or
>> copying of this email, including any attachments, is unauthorized and
>> strictly prohibited. If you have received this email in error, please
>> notify the sender of this email. Please delete this and all copies of this
>> email from your system. Any opinions either expressed or implied in this
>> email and all attachments are those of its author only, and do not
>> necessarily reflect those of Nuna Health, Inc.
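To make David's suggestion concrete, this is roughly the shape I would expect
(an untested sketch; the driver class, input path, and the two transform DoFns
are placeholders rather than anything from Jeff's actual job): read each file
once via From.formattedFile, then fork the resulting PTable with parallelDo
instead of calling read twice on the same path.

import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class SingleReadBranchingSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(SingleReadBranchingSketch.class, new Configuration());

    // Read the file once with the custom input format...
    PTable<LongWritable, Text> lines = pipeline.read(
        From.formattedFile("/path/to/input", ByteOffsetInputFormat.class,
            LongWritable.class, Text.class));

    // ...then branch: parallelDo returns a new PCollection and leaves `lines`
    // untouched, so both flows hang off the same single read.
    // FirstTransformFn and SecondTransformFn are hypothetical
    // DoFn<Pair<LongWritable, Text>, String> implementations.
    PCollection<String> flowA = lines.parallelDo(new FirstTransformFn(), Writables.strings());
    PCollection<String> flowB = lines.parallelDo(new SecondTransformFn(), Writables.strings());

    // ... write flowA and flowB to their respective targets, then:
    pipeline.done();
  }
}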
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
