The nesting and the non-nested are indeed the same. To limit, you could do:
b = group a by (country, search_term); c = foreach b generate flatten(group) as (country, search_term), COUNT(a) as ct; d = foreach (group c by country) generate ORDER_UDF(TOP(b, 2, 10)); --I forget the order of commands, check documentation the ORDER_UDF doesn't exist. It's important to remember that bags have no order, so if you want a TUPLE in term order, you need to take in a bag, and then get the elements, sort them, and put them in a Tuple. If you need help with that, less us know. 2012/5/11 Mark <[email protected]> > Also, using your example, how could I limit the number of terms per > country? > > > On 5/11/12 9:47 AM, Mark wrote: > >> Thank you so much, that's pretty much what I was going for but with a >> slightly different output. >> >> Just to be clear... are these equivalent? >> >> b = foreach (group a by (country, search_term)) generate flatten(group) as >> (country, search_term), COUNT(a) as ct; >> >> >> b = group a by (country, search_term); >> c = foreach b generate flatten(group) as (country, search_term), COUNT(a) >> as ct; >> >> I'm guessing so... I didn't know you could combine/nest these statements. >> >> >> After experimenting with your example I'm pretty sure I understand >> everything that's going on. I can work with this format but I was wondering >> how would I massage this into something like: >> >> (country1, top term1, topterm2, topterm3, ...) >> (country2, top term1, topterm2, topterm3, ...) >> (country3, top term1, topterm2, topterm3, ...) >> >> Maybe it has to be something like this: >> >> (country1, (top term1, topterm2, topterm3, ...)) >> >> So one row per country with the first value being the country and the >> following values the top terms in order? Is this even possible with Pig? >> >> Thanks for the clarification. >> >> >> On 5/10/12 5:32 PM, Jonathan Coveney wrote: >> >>> a = load 'log' as (country:chararray, search_term:chararray); >>> b = foreach (group a by (country, search_term)) generate flatten(group) >>> as >>> (country, search_term), COUNT(a) as ct; >>> c = order b by country asc, ct desc; >>> >>> It sort of depends what format you want the output in, though. Note: if >>> you >>> know that the number of search terms is low you could do this in memory >>> and >>> do it in one m/r job, but this version will be scalable. >>> >>> If this solution doesn't make sense, I can help explain it. It's >>> important >>> to know what format you want the output in. This would give you every >>> country (in ascending alphabetical order), and then the search term and >>> count starting with the highest. >>> >>> 2012/5/10 Mark<[email protected]**> >>> >>> We have logs in the following format >>>> >>>> us, foo >>>> us, foo >>>> fr, fizz >>>> us, bar >>>> fr, baz >>>> fr, fizz >>>> us, foo >>>> fr, fizz >>>> >>>> Where the first column is a country and the second column is a search >>>> term. >>>> >>>> How in the world can I output the country followed by the top terms in >>>> order of occurrence... ie: >>>> >>>> us, (foo, bar) # Top term for 'us' is foo then bar then ... >>>> fr, (fizz, baz) # Top term for 'fr' is fizz then baz then ... >>>> >>>> Thanks >>>> >>>> >>>> >>>>
