Hi,
I'm running into spilling issues with a "simple" Pig script. All map tasks and 
reducers finish quickly except the last reducer!
The last reducer always starts spilling after ~10 minutes and, after being 
retried on several datanodes, it eventually fails.

Do you have any idea how I could optimize the GROUP BY so that I don't run into 
spilling issues?

Thanks in advance!

Below is the Pig script:
###
dataImport = LOAD <some data>;
generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;
groupedData = GROUP generatedData BY (Field_B, Field_C);

result = FOREACH groupedData {
    counter_1 = FILTER generatedData BY <some fields>;
    counter_2 = FILTER generatedData BY <some fields>;
    GENERATE
        group.Field_B,
        group.Field_C,
        COUNT(counter_1),
        COUNT(counter_2);
    }

STORE result INTO <some path> USING PigStorage();
###

Regards,
Nebo
