Hi there,

I have a big Pig script which first generates an expensive intermediate result, on which I then run multiple GROUP BY statements and multiple STOREs. Something like this:

-- Register UDFs etc.
A = LOAD....
B = LOAD....
C = LOAD....

-- do lots of transformations with A, B and C to get the intermediate result INTER_RES

result1 = FOREACH (GROUP INTER_RES BY (...
STORE result1 INTO '....

result2 = FOREACH (GROUP INTER_RES BY (...
STORE result2 INTO '....

result3 = FOREACH (GROUP INTER_RES BY (...
STORE result3 INTO '....

result4 = FOREACH (GROUP INTER_RES BY (...
STORE result4 INTO '....
...
...

Note that the stored results are independent of each other: they are not used as input for anything further down, and they do not alter INTER_RES.

Am I correct that Pig would only need to LOAD A, B and C once? From what I can see in the command-line output, it looks like the expensive intermediate result is recomputed for each STORE.

I've done a quick test: if I STORE the intermediate result and then LOAD it again, the script seems to be faster. Is there a way to avoid storing the expensive intermediate result?

Cheers,
-Marco