Hi, I'm trying to understand a little about how Crunch and Hadoop handle multiple file inputs to a map phase, and whether there are multiplicative I/O effects that might overload HDFS.
We have an extract/transform pipeline in Crunch that we run with the Hadoop MapReduce pipeline implementation on AWS EMR 4.1 (Hadoop 2.6.0). In our situation, a given file may contain many record types, one per line, and we have DoFns and FilterFns that detect and separate out each record type.

Recently, as we've gotten more data, we've started running into what appear to be HDFS DataNode problems: we seem to be overloading the DataNodes, they then fail to replicate blocks, and the jobs fail. One failing case has 4 input files of about 50 GB each. Our Crunch dotfile looks like this:

Not shown in the dotfile is that in the middle there's a Crunch union of the 4 PTables, and the record-specific extractors (W1-W9) run on that union. (A simplified sketch of this structure is in the P.S. below.)

In this situation, is the unit of work / shard going into a map a single input split from any one of the 4 files? Would Crunch or Hadoop re-read any of the files multiple times? And do you see any situation in which more total I/O would be performed than just the sum of the input file sizes plus the sum of the outputs of W1-W9?

Thanks!

- Everett
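P.S. For concreteness, here is a rough sketch, in plain Crunch Java, of the shape of the pipeline described above. The class names, paths, record-type parsing, and output handling are simplified placeholders rather than our real code; it's only meant to make the structure of the question clearer.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.FilterFn;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class ExtractPipelineSketch {

  // Tags each input line with its record type so the per-type FilterFns
  // can split it out downstream.
  static class TagByRecordType extends DoFn<String, Pair<String, String>> {
    @Override
    public void process(String line, Emitter<Pair<String, String>> emitter) {
      // Placeholder parsing: pretend the first two characters name the record type.
      String recordType = line.length() >= 2 ? line.substring(0, 2) : "??";
      emitter.emit(Pair.of(recordType, line));
    }
  }

  // Keeps only the lines of one record type (one of W1..W9).
  static class RecordTypeFilter extends FilterFn<Pair<String, String>> {
    private final String recordType;

    RecordTypeFilter(String recordType) {
      this.recordType = recordType;
    }

    @Override
    public boolean accept(Pair<String, String> input) {
      return recordType.equals(input.first());
    }
  }

  private static PTable<String, String> readAndTag(Pipeline pipeline, String path) {
    return pipeline.readTextFile(path)
        .parallelDo(new TagByRecordType(),
            Writables.tableOf(Writables.strings(), Writables.strings()));
  }

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(ExtractPipelineSketch.class);

    // Four ~50 GB mixed-record-type inputs (paths are placeholders).
    PTable<String, String> t1 = readAndTag(pipeline, "hdfs:///input/file1");
    PTable<String, String> t2 = readAndTag(pipeline, "hdfs:///input/file2");
    PTable<String, String> t3 = readAndTag(pipeline, "hdfs:///input/file3");
    PTable<String, String> t4 = readAndTag(pipeline, "hdfs:///input/file4");

    // The union of the four PTables; the record-specific extractors
    // (W1-W9) all run on this union.
    PTable<String, String> all = t1.union(t2, t3, t4);

    // One extractor per record type, each writing its own output.
    for (String recordType : new String[] {
        "W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8", "W9"}) {
      PTable<String, String> extracted = all.filter(new RecordTypeFilter(recordType));
      pipeline.writeTextFile(extracted.values(), "hdfs:///output/" + recordType);
    }

    pipeline.done();
  }
}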
