Hi,
I'm looking to perform a sum normalization (divide each score by the sum of
all the scores in my data) with Pig.
1) My first problem is that I can't find a clean way to do it.
Any suggestions?
I have an answer but I'm not really proud of it...
------------------------------------------------------------------------------
-- Load (word, score) pairs.
score_list = LOAD 'scores' USING PigStorage(';')
    AS (word: chararray, score: double);
-- Add a constant key so every row can be joined with the total.
score_list_ = FOREACH score_list GENERATE
    word,
    score,
    0 AS joinField;
-- Compute the sum of all scores as a single row carrying the same key.
group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
    0 AS joinField,
    SUM(score_list.score) AS scoreTotal;
-- Attach the total to every row, then divide.
score_with_sum = JOIN score_list_ BY joinField, sum_score BY joinField;
out = FOREACH score_with_sum GENERATE word, (score / scoreTotal);
DUMP out;
------------------------------------------------------------------------------
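A possible variant, as a sketch only: since the total is a single row, CROSS
could attach it to every record without the constant joinField. I'm assuming
the same 'scores' input as above; normScore is just a name picked for
illustration, and I don't know whether this performs any better than the join.
------------------------------------------------------------------------------
-- Same input as above (assumed).
score_list = LOAD 'scores' USING PigStorage(';')
    AS (word: chararray, score: double);
-- One-row relation holding the sum of all scores.
group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
    SUM(score_list.score) AS scoreTotal;
-- CROSS pairs every record with that single row, so no join key is needed.
score_with_sum = CROSS score_list, sum_score;
out = FOREACH score_with_sum GENERATE word, (score / scoreTotal) AS normScore;
DUMP out;
------------------------------------------------------------------------------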
2) Secondly, I think I have hit a strange bug.
With the code above, if the final FOREACH only does "GENERATE word" (without
the scores), the job seems to go into some kind of infinite loop, repeating
"Spilling map output: record full = true"... in the log.
Thanks,
Tristan