Hi,
I am very new to Pig/Hadoop; I started writing my first Pig script a
couple of days ago and ran into this problem.
My cluster has 9 nodes. I have to join two data sets, big and small, each
collected over 4 weeks. I first took two subsets of my data (the first
week only); let's call them B1 and S1, for the big and small data sets of
the first week. The full 4-week data sets are B4 and S4.
I ran my script on the cluster to join B1 and S1 and everything was fine;
I got my joined data. However, when I ran the script to join B4 and S4,
it failed. B4 is 39GB and S4 is 300MB. B4 is skewed: some ids appear much
more frequently than others. I tried both 'using skewed' and 'using
replicated' modes for the join (by appending them to the end of the join
clause below), and both fail.
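For reference, these are the two variants I tried (a sketch; the aliases and join key are the same as in my script below):

```
-- attempt 1: skewed join, meant to handle hot ids on the left side
J = JOIN big by id LEFT OUTER, small by id USING 'skewed';

-- attempt 2: replicated (fragment-replicate) join; small must be the
-- right-hand, replicated relation, which it is here
J = JOIN big by id LEFT OUTER, small by id USING 'replicated';
```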
Here is my script; I think it is very simple:
big = load 'bigdir/' using PigStorage(',') as (id:chararray, data:chararray);
small = load 'smalldir/' using PigStorage(',') as (t1:double, t2:double, data:chararray, id:chararray);
J = JOIN big by id LEFT OUTER, small by id;
store J into 'outputdir' using PigStorage(',');
On the job tracker's web UI, I see that the job has 40 reducers (I guess
this is normal: the total data is about 40GB, and with the default
Pig/Hadoop settings each reducer handles roughly 1GB). If I add 'parallel
80' to the join above, I see 80 reducers, but the join still fails.
I checked the file mapred-default.xml and found this:
<name>mapred.reduce.tasks</name>
<value>1</value>
If I specify PARALLEL on the join, it should override this value, right?
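My understanding (please correct me if wrong) is that PARALLEL on an operator overrides mapred.reduce.tasks, and that a script-wide default can be set as well. A sketch of what I mean:

```
-- script-wide default reducer count (overridden by PARALLEL on an operator)
SET default_parallel 80;

-- explicit reducer count for this join only
J = JOIN big by id LEFT OUTER, small by id PARALLEL 80;
```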
On the job tracker GUI, I see that across different runs the number of
completed reducers varies from 4 to 10 (out of 40 total). The GUI shows
the reason for the failed reducers: "Task
attempt_201304081613_0046_r_000006_0 failed to report status for 600
seconds. Killing!"
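In case it is relevant: I believe the 600 seconds matches the Hadoop task timeout (mapred.task.timeout, in milliseconds), which I understand can be raised from within a Pig script. A sketch (the 30-minute value is an arbitrary choice on my part):

```
-- raise the task timeout from the 10-minute default to 30 minutes (in ms)
SET mapred.task.timeout 1800000;
```

I am not sure whether raising the timeout would fix the underlying problem or just postpone the kill, though.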
Could you please help?
Thank you very much,
-Mua
--------------------------------------------------------------------------------------------------------------
Here is the error report from the console screen where I ran this script:
job_201304081613_0032 616 0 230 12 32 0 0 0 big MAP_ONLY
job_201304081613_0033 705 1 21 6 6 234 234 234 SAMPLER
Failed Jobs:
JobId Alias Feature Message Outputs
job_201304081613_0034 small SKEWED_JOIN Message: Job failed!
Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1.
LastFailedTask: task_201304081613_0034_r_000012
Input(s):
Successfully read 364285458 records (39528533645 bytes) from:
"hdfs://d0521b01:24990/user/abc/big/"
Failed to read data from "hdfs://d0521b01:24990/user/abc/small/"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201304081613_0032 -> job_201304081613_0033,
job_201304081613_0033 -> job_201304081613_0034,
job_201304081613_0034 -> null,
null
2013-04-10 20:11:23,815 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning REDUCER_COUNT_LOW 1 time(s).
2013-04-10 20:11:23,815 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2013-04-10 20:11:23,815 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Encountered IOException. java.io.IOException: Error Recovery for block blk_312487981794332936_26563 failed because recovery from primary datanode 10.6.25.31:54563 failed 6 times. Pipeline was 10.6.25.31:54563. Aborting...
Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log
2013-04-10 20:11:23,818 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message
Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log