I was afraid I'd have to do a join on a constant (using Pig 0.6 at the moment). That works wonderfully. Thanks!
On Fri, Oct 8, 2010 at 6:15 PM, Dmitriy Ryaboy <[email protected]> wrote:
> In Pig 0.8, you can generate a one-line relation and later refer to it as a
> scalar:
>
> counts = foreach (group ngramed all) generate COUNT(ngramed) as total;
> percents = foreach grpd generate group as ngram, COUNT(ngramed) as count,
>     COUNT(ngramed) / (double) counts.total as percent;
>
> In earlier versions, the solution is to do a replicated join on a constant
> (ugly, I know):
>
> counts = foreach (group ngramed all) generate COUNT(ngramed) as total;
> grpd = join grpd by 1, counts by 1 using "replicated";
> percents = foreach grpd generate grpd::group as ngram, COUNT(grpd::ngramed)
>     as count, COUNT(grpd::ngramed) / (double) counts::total as percent;
>
> Untested, may break :)
>
>
> On Fri, Oct 8, 2010 at 12:47 PM, Mark Stetzer <[email protected]> wrote:
>
>> I'm trying to count N-gram occurrences as a percentage of total
>> tuples, and I'm running into a problem that I assume has a simple
>> solution I'm not thinking of. My script basically looks like:
>>
>> log = LOAD blah AS (session_id:chararray, text:chararray...);
>> ngramed = FOREACH log GENERATE flatten(
>>     org.apache.pig.tutorial.NGramGenerator(text) ) AS ngram;
>> grpd = GROUP ngramed BY ngram;
>> freq = FOREACH grpd GENERATE group AS ngram, COUNT(ngramed) AS count,
>>     COUNT(ngramed) / X AS percent;
>> STORE freq INTO 'ngrams';
>>
>> I'm trying to figure out how I can calculate X so that it represents
>> the total number of tuples in log. I could "GROUP ALL log" and get a
>> count of that, but how do I reference it in my FOREACH statement?
>>
>> Thanks for any help anyone can provide.
>>
>> -Mark
>>
>
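For anyone landing on this thread later, here is a rough end-to-end sketch of the pre-0.8 approach (join on a constant), pieced together from the two messages above. It is untested and rests on a few assumptions: 'input_path' is a placeholder for the real input, the schema is cut down to the two fields shown in the original script, and the aliases all_grp, counts_k, joined, cnt and k are made up for illustration. It also adds the constant key as an explicit field before joining, which is a small variation on "join ... by 1" rather than what was posted.

```pig
-- Sketch only, untested. 'input_path' is a hypothetical placeholder.
log = LOAD 'input_path' AS (session_id:chararray, text:chararray);
ngramed = FOREACH log GENERATE
    FLATTEN(org.apache.pig.tutorial.NGramGenerator(text)) AS ngram;

-- One-row relation with the total number of n-gram tuples.
-- (Group log instead of ngramed to count input rows rather than n-grams.)
all_grp = GROUP ngramed ALL;
counts = FOREACH all_grp GENERATE COUNT(ngramed) AS total;

-- Per-n-gram counts, each tagged with a constant join key k.
grpd = GROUP ngramed BY ngram;
freq = FOREACH grpd GENERATE group AS ngram, COUNT(ngramed) AS cnt, 1 AS k;
counts_k = FOREACH counts GENERATE total, 1 AS k;

-- Replicated join: the small one-row relation goes last.
-- Older releases document "replicated" in double quotes; newer ones use single quotes.
joined = JOIN freq BY k, counts_k BY k USING "replicated";

-- Cast to double so the ratio is not truncated by integer division.
percents = FOREACH joined GENERATE freq::ngram AS ngram, freq::cnt AS cnt,
    (double) freq::cnt / (double) counts_k::total AS percent;

STORE percents INTO 'ngrams';
```

Computing the per-n-gram count before the join means the join only moves small (ngram, cnt, k) tuples rather than whole bags, and the one-row counts relation is the one held in memory.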
