Hi,

I'm looking to perform a sum normalization (divide each score by the sum
of all scores in my data) with Pig.

1) My first problem is that I can't find a good way to do that.
Any suggestions?

I have a solution, but I'm not really proud of it:
------------------------------------------------------------------------------
score_list = LOAD 'scores' USING PigStorage(';')
  AS (word: chararray, score: double);

score_list_ = FOREACH score_list GENERATE
  word,
  score,
  0 AS joinField;

group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
  0 AS joinField,
  SUM(score_list.score) AS scoreTotal;

score_with_sum = JOIN score_list_ BY joinField, sum_score BY joinField;
out = FOREACH score_with_sum GENERATE word, (score / scoreTotal);
DUMP out;
------------------------------------------------------------------------------
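
A possibly simpler variant, as an untested sketch: drop the dummy joinField
and use CROSS to attach the single-row total to every record. This assumes
CROSS is available in the Pig version in use; the aliases "crossed" and
"normScore" are names made up for this example.
------------------------------------------------------------------------------
-- Same input as above: one (word, score) pair per line, ';'-separated.
score_list = LOAD 'scores' USING PigStorage(';')
  AS (word: chararray, score: double);

-- Collapse everything into one group and compute the grand total.
group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
  SUM(score_list.score) AS scoreTotal;

-- sum_score has exactly one row, so the cross product pairs each record
-- with the total without multiplying the number of rows.
crossed = CROSS score_list, sum_score;

out = FOREACH crossed GENERATE
  score_list::word AS word,
  (score_list::score / sum_score::scoreTotal) AS normScore;
DUMP out;
------------------------------------------------------------------------------
Whether this is actually cheaper than the JOIN on a constant key is not
something I have measured; it mainly avoids the artificial join field.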

2) Secondly, I think there is a strange bug.
With the code above, if the final FOREACH contains only "GENERATE word"
(without the scores), the job seems to go into some kind of infinite loop
(repeating "Spilling map output: record full = true"... in the log).


Thanks,

tristan
