Hi!
I have a problem with the following Pig script:
DESCRIBE A;
A: {id: int, season: int, count: long}
foo = FOREACH (GROUP A BY season) {
    sorted = ORDER A BY count DESC;
    quantiles = FOREACH sorted GENERATE BagSplit(10, sorted);
    GENERATE <???>;
};
DESCRIBE foo::quantiles;
foo::quantiles: {datafu.pig.bags.bagsplit_sorted_14586: {(data: {(id: int, season: int, count: long)},index: int)}}
What I'd like to do is order A by "count" within each season and then
use DataFu's BagSplit UDF to create equal splits (deciles). I'm very
new to Pig, and I suspect the problem comes down to my
misunderstanding of bags and FOREACH, especially the nested variant.
I'd like my output to be:
{season, {(id, count), (id, count), ...}}
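To make the intent concrete, here is a small Python sketch of the result I'm after. It is not Pig and says nothing about how BagSplit itself behaves; the function name deciles_by_season and the sample rows are made up purely for illustration, and I'm assuming "equal splits" means ten chunks whose sizes differ by at most one.

```python
from collections import defaultdict


def deciles_by_season(rows, parts=10):
    """Group (id, season, count) rows by season, sort each group by
    count descending, and cut it into `parts` near-equal chunks.

    Hypothetical helper: it only mirrors the output I want from Pig.
    """
    by_season = defaultdict(list)
    for id_, season, count in rows:
        by_season[season].append((id_, count))

    result = {}
    for season, items in by_season.items():
        # Same idea as ORDER A BY count DESC inside the group.
        items.sort(key=lambda pair: pair[1], reverse=True)
        n = len(items)
        # Cut points chosen so chunk sizes differ by at most one.
        bounds = [i * n // parts for i in range(parts + 1)]
        result[season] = [items[bounds[i]:bounds[i + 1]] for i in range(parts)]
    return result


# Made-up sample data shaped like relation A: (id, season, count).
rows = [(i, 2012, 100 - i) for i in range(20)]
rows += [(i, 2013, i) for i in range(5)]
out = deciles_by_season(rows)
```

With 20 rows in a season, each of the ten chunks holds two (id, count) pairs; with fewer than ten rows, some chunks are simply empty.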
GENERATE quantiles: This is accepted but leads to "ERROR 2015: Invalid
physical operators in the physical plan" on execution.
GENERATE quantiles.$0: Same as above. In fact, I can append as many
".$0" as I want and the script is always accepted, but it generates
an error when dumping the data.
I'll reread the Pig Latin Basics tonight, but if anyone has an idea
what I'm doing wrong or how I can achieve my goal, I'd be very
grateful.
Thanks,
Lars