I'm trying to count N-gram occurrences as a percentage of total tuples, and I'm running into a problem that I assume has a simple solution I'm not thinking of. My script basically looks like:

log = LOAD blah AS (session_id:chararray, text:chararray...);
ngramed = FOREACH log GENERATE
    flatten( org.apache.pig.tutorial.NGramGenerator(text) ) AS ngram;
grpd = GROUP ngramed BY ngram;
freq = FOREACH grpd GENERATE
    group AS ngram,
    COUNT(ngramed) AS count,
    COUNT(ngramed) / X AS percent;
STORE freq INTO 'ngrams';

I'm trying to figure out how to calculate X so that it represents the total number of tuples in log. I could "GROUP log ALL" and get a count of that, but how do I reference that count in my FOREACH statement?

Thanks for any help anyone can provide.

-Mark
