Hi,
I'm running into spilling issues with a "simple" Pig script. All map tasks and
reducers finish quickly, except the last reducer!
The last reducer always starts spilling after ~10 minutes and, after being
retried on several datanodes, it eventually fails.
Do you have any idea how I could optimize the GROUP BY so that I don't run into
spilling issues?
Thanks in advance!
The Pig script is below:
###
dataImport = LOAD <some data>;
generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;
groupedData = GROUP generatedData BY (Field_B, Field_C);
result = FOREACH groupedData {
    counter_1 = FILTER generatedData BY <some fields>;
    counter_2 = FILTER generatedData BY <some fields>;
    GENERATE
        group.Field_B,
        group.Field_C,
        COUNT(counter_1),
        COUNT(counter_2);
}
STORE result INTO <some path> USING PigStorage();
###
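One rewrite that might be worth testing is sketched below; it has not been run
against this data. The idea is to turn each FILTER condition into a 0/1 flag
before the GROUP and then SUM the flags. A nested FILTER generally keeps Pig
from using the combiner, whereas a plain FOREACH with only algebraic functions
(such as SUM) on the grouped bag allows it, so a heavily skewed
(Field_B, Field_C) key would send far less data to a single reducer. The names
flaggedData, flag_1 and flag_2 are made up for illustration, and the
<some fields> placeholders stand for the same conditions as above. Note that
SUM returns null rather than 0 if every flag in a group is null.
###
dataImport = LOAD <some data>;
-- Evaluate each filter condition per row as a 0/1 flag (hypothetical names).
flaggedData = FOREACH dataImport GENERATE
    Field_B,
    Field_C,
    ((<some fields>) ? 1 : 0) AS flag_1,
    ((<some fields>) ? 1 : 0) AS flag_2;
groupedData = GROUP flaggedData BY (Field_B, Field_C);
-- Only algebraic functions on bag projections, so the combiner can be used.
result = FOREACH groupedData GENERATE
    group.Field_B,
    group.Field_C,
    SUM(flaggedData.flag_1),
    SUM(flaggedData.flag_2);
STORE result INTO <some path> USING PigStorage();
###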
Regards,
Nebo