Take a look at Pig scalars, which let you use a single-row relation as a scalar value inside an expression: http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars

Try this query:
score_list = LOAD  'scores' USING PigStorage(';')
  AS (word: chararray, score: double);

score_list_ = FOREACH score_list GENERATE
  word,
  score,
  0 AS joinField;

group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
  0 AS joinField,
  SUM(score_list.score) as scoreTotal;

out = FOREACH score_list_ GENERATE word, (score / sum_score.scoreTotal);
dump out;
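
Since the scalar takes the place of the join, the joinField columns are not strictly needed any more. A trimmed sketch of the same query follows; the alias "normalized" is my own naming, and it assumes sum_score ends up with exactly one row, which GROUP ... ALL should give for non-empty input:

```pig
score_list = LOAD 'scores' USING PigStorage(';')
  AS (word: chararray, score: double);

-- GROUP ... ALL puts every record into one group, so sum_score has a single row
group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
  SUM(score_list.score) AS scoreTotal;

-- sum_score.scoreTotal is read as a scalar; Pig raises a runtime error
-- if sum_score holds more than one row
out = FOREACH score_list GENERATE
  word,
  (score / sum_score.scoreTotal) AS normalized;
DUMP out;
```

For a 'scores' file containing the two lines a;1.0 and b;3.0, the total is 4.0, so the output should be (a,0.25) and (b,0.75).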

As for the bug you found, would you mind opening a Jira ticket?

Thanks,
Daniel

On 06/14/2011 06:58 AM, Tristan Croiset wrote:
Hi,

I'm looking to perform a sum normalization (divide each score by the sum of
all the scores in my data) with Pig.

1) My first problem is that I can't find a good way to do that.
Any suggestion?

I have an answer but I'm not really proud of it...
------------------------------------------------------------------------------
score_list = LOAD  'scores' USING PigStorage(';')
   AS (word: chararray, score: double);

score_list_ = FOREACH score_list GENERATE
   word,
   score,
   0 AS joinField;

group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
   0 AS joinField,
   SUM(score_list.score) as scoreTotal;

score_with_sum = JOIN score_list_ BY joinField, sum_score BY joinField;
out = FOREACH score_with_sum GENERATE word, (score / scoreTotal);
DUMP out;
------------------------------------------------------------------------------

2) Secondly, I think there is a strange bug.
With the code above, if the final FOREACH contains only "GENERATE word"
(without the scores), the job goes into some kind of infinite loop, repeating
"Spilling map output: record full = true"... in the log.


thanks,

tristan
