Re: Please help with grouped count

Jonathan Coveney Fri, 11 May 2012 14:19:19 -0700

You would have to create one, yeah.
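
For reference, here is a minimal sketch of what such a UDF could look like, written as a Python (Jython) UDF. Nothing like this ships with Pig. The name order_terms, the field positions (search term at index 1, count at index 2, matching relation c in the quoted script), and the output schema string are all assumptions for illustration, so adjust them to your data.

```python
# Sketch of an "ORDER_UDF": takes an unordered bag, returns an ordered tuple.
# Pig hands a bag to a Jython UDF as a list of tuples.

try:
    outputSchema  # normally made available by Pig when registered with jython
except NameError:
    # Fallback so the file can also be run and tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


# Assumed schema string; it may need adjusting for your Pig version.
@outputSchema("top_terms:tuple()")
def order_terms(bag):
    """Return the search terms of `bag` as a tuple, highest count first.

    Each row is assumed to be (country, search_term, ct).
    """
    if bag is None:
        return None
    # Sort by count descending; break ties on the term so output is stable.
    rows = sorted(bag, key=lambda row: (-row[2], row[1]))
    return tuple([row[1] for row in rows])
```

Assuming the file is saved as order_udf.py (a hypothetical name), it could be wired in with something like: register 'order_udf.py' using jython as udfs; and then d = foreach (group c by country) generate group as country, udfs.order_terms(TOP(10, 2, c));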

2012/5/11 Mark <[email protected]>


> Not entirely following you on the Order UDF. Does one exist, or are you
> proposing I create this myself?
>
> Thanks.
>
>
> On 5/11/12 10:49 AM, Jonathan Coveney wrote:
>
>> The nested and non-nested forms are indeed the same. To limit, you could
>> do:
>>
>> b = group a by (country, search_term);
>> c = foreach b generate
>>     flatten(group) as (country, search_term), COUNT(a) as ct;
>> -- I forget the order of TOP's arguments, check the documentation
>> d = foreach (group c by country) generate ORDER_UDF(TOP(10, 2, c));
>>
>> ORDER_UDF doesn't exist. It's important to remember that bags have no
>> order, so if you want a tuple in term order, you need to take in a bag,
>> get the elements, sort them, and put them in a tuple. If you need help
>> with that, let us know.
>>
>> 2012/5/11 Mark <[email protected]>
>>
>>> Also, using your example, how could I limit the number of terms per
>>> country?
>>>
>>>
>>> On 5/11/12 9:47 AM, Mark wrote:
>>>
>>>> Thank you so much, that's pretty much what I was going for but with a
>>>> slightly different output.
>>>>
>>>> Just to be clear... are these equivalent?
>>>>
>>>> b = foreach (group a by (country, search_term)) generate
>>>>     flatten(group) as (country, search_term), COUNT(a) as ct;
>>>>
>>>> b = group a by (country, search_term);
>>>> c = foreach b generate
>>>>     flatten(group) as (country, search_term), COUNT(a) as ct;
>>>>
>>>> I'm guessing so... I didn't know you could combine/nest these
>>>> statements.
>>>>
>>>>
>>>> After experimenting with your example I'm pretty sure I understand
>>>> everything that's going on. I can work with this format, but I was
>>>> wondering how I would massage this into something like:
>>>>
>>>> (country1, top term1, topterm2, topterm3, ...)
>>>> (country2, top term1, topterm2, topterm3, ...)
>>>> (country3, top term1, topterm2, topterm3, ...)
>>>>
>>>> Maybe it has to be something like this:
>>>>
>>>> (country1, (top term1, topterm2, topterm3, ...))
>>>>
>>>> So one row per country with the first value being the country and the
>>>> following values the top terms in order? Is this even possible with Pig?
>>>>
>>>> Thanks for the clarification.
>>>>
>>>>
>>>> On 5/10/12 5:32 PM, Jonathan Coveney wrote:
>>>>
>>>>> a = load 'log' as (country:chararray, search_term:chararray);
>>>>> b = foreach (group a by (country, search_term)) generate
>>>>>     flatten(group) as (country, search_term), COUNT(a) as ct;
>>>>> c = order b by country asc, ct desc;
>>>>>
>>>>> It sort of depends on what format you want the output in, though. Note:
>>>>> if you know that the number of search terms is low, you could do this in
>>>>> memory in one m/r job, but this version will be scalable.
>>>>>
>>>>> If this solution doesn't make sense, I can help explain it. It's
>>>>> important to know what format you want the output in. This would give
>>>>> you every country (in ascending alphabetical order), and then each
>>>>> search term and its count, starting with the highest count.
>>>>>
>>>>> 2012/5/10 Mark <[email protected]>
>>>>>
>>>>>> We have logs in the following format:
>>>>>>
>>>>>> us, foo
>>>>>> us, foo
>>>>>> fr, fizz
>>>>>> us, bar
>>>>>> fr, baz
>>>>>> fr, fizz
>>>>>> us, foo
>>>>>> fr, fizz
>>>>>>
>>>>>> Where the first column is a country and the second column is a search
>>>>>> term.
>>>>>>
>>>>>> How in the world can I output the country followed by the top terms in
>>>>>> order of occurrence... ie:
>>>>>>
>>>>>> us, (foo, bar)      # Top term for 'us' is foo then bar then ...
>>>>>> fr, (fizz, baz)      # Top term for 'fr' is fizz then baz then ...
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
