Re: Please help with grouped count

Jonathan Coveney Fri, 11 May 2012 10:49:53 -0700

The nesting and the non-nested are indeed the same. To limit, you could do:


b = group a by (country, search_term);
c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
as ct;
d = foreach (group c by country) generate ORDER_UDF(TOP(b, 2, 10)); --I
forget the order of commands, check documentation

the ORDER_UDF doesn't exist. It's important to remember that bags have no
order, so if you want a TUPLE in term order, you need to take in a bag, and
then get the elements, sort them, and put them in a Tuple. If you need help
with that, less us know.

2012/5/11 Mark <[email protected]>

> Also, using your example, how could I limit the number of terms per
> country?
>
>
> On 5/11/12 9:47 AM, Mark wrote:
>
>> Thank you so much, that's pretty much what I was going for but with a
>> slightly different output.
>>
>> Just to be clear... are these equivalent?
>>
>> b = foreach (group a by (country, search_term)) generate flatten(group) as
>> (country, search_term), COUNT(a) as ct;
>>
>>
>> b = group a by (country, search_term);
>> c = foreach b generate flatten(group) as (country, search_term), COUNT(a)
>> as ct;
>>
>> I'm guessing so... I didn't know you could combine/nest these statements.
>>
>>
>> After experimenting with your example I'm pretty sure I understand
>> everything that's going on. I can work with this format but I was wondering
>> how would I massage this into something like:
>>
>> (country1, top term1, topterm2, topterm3, ...)
>> (country2, top term1, topterm2, topterm3, ...)
>> (country3, top term1, topterm2, topterm3, ...)
>>
>> Maybe it has to be something like this:
>>
>> (country1, (top term1, topterm2, topterm3, ...))
>>
>> So one row per country with the first value being the country and the
>> following values the top terms in order? Is this even possible with Pig?
>>
>> Thanks for the clarification.
>>
>>
>> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>>
>>> a = load 'log' as (country:chararray, search_term:chararray);
>>> b = foreach (group a by (country, search_term)) generate flatten(group)
>>> as
>>> (country, search_term), COUNT(a) as ct;
>>> c = order b by country asc, ct desc;
>>>
>>> It sort of depends what format you want the output in, though. Note: if
>>> you
>>> know that the number of search terms is low you could do this in memory
>>> and
>>> do it in one m/r job, but this version will be scalable.
>>>
>>> If this solution doesn't make sense, I can help explain it. It's
>>> important
>>> to know what format you want the output in. This would give you every
>>> country (in ascending alphabetical order), and then the search term and
>>> count starting with the highest.
>>>
>>> 2012/5/10 Mark<[email protected]**>
>>>
>>>  We have logs in the following format
>>>>
>>>> us, foo
>>>> us, foo
>>>> fr, fizz
>>>> us, bar
>>>> fr, baz
>>>> fr, fizz
>>>> us, foo
>>>> fr, fizz
>>>>
>>>> Where the first column is a country and the second column is a search
>>>> term.
>>>>
>>>> How in the world can I output the country followed by the top terms in
>>>> order of occurrence... ie:
>>>>
>>>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>>>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>

Re: Please help with grouped count

Reply via email to