Hello everyone. I’m currently doing some research involving Hadoop and Pig, 
evaluating the cost of data replication vs. the penalty of node failure with 
respect to job completion time. I’m currently modifying a MapReduce simulator 
called “MRPerf” to accommodate a sequence of MapReduce jobs such as those 
generated by Pig. I have some questions about how failures are handled in a Pig 
job, so that I can reflect that in my implementation. I’m fairly new to Pig, so 
if some of these questions reflect a misunderstanding, please let me know:

1) How does Pig/Hadoop handle failures that would require re-execution of part 
or all of some MapReduce job that was already completed earlier in the 
sequence? The only source of information I could find on this is the paper 
“Making cloud intermediate data fault-tolerant” by Ko et al. 
(http://portal.acm.org/citation.cfm?id=1807160), but I’m not sure whether it 
is accurate.

As an example:

Suppose a sequence of three MapReduce jobs has been generated, J1, J2, and J3, 
with a replication factor of 1 for the output of J1 and J2 (i.e. one instance 
of each HDFS block). Suppose jobs J1 and J2 complete, and J3 is almost done. 
Then a failure occurs on some node. Map and Reduce task outputs are lost from 
all jobs. It seems to me it would be necessary to re-execute some tasks, but 
not all, from all three jobs to complete the overall job. How does Pig/Hadoop 
handle such backtracking in this situation?

(This is based on my understanding that when Pig produces a series of MapReduce 
jobs, the replication factor for the intermediate data produced by each Reduce 
phase can be set individually. Correct me if I’m wrong here.)

2) On a related note, is the intermediate data produced by the Map tasks in an 
individual MapReduce job deleted after that job completes and the next in the 
sequence starts, or is it preserved until the whole Pig job finishes?

Any information or guidance would be greatly appreciated.

Thanks,
Drew
