You would have to create one, yeah. 2012/5/11 Mark <[email protected]>
> Not entirely following you on the Order UDF.. doesn one exist or are you > proposing I create this myself? > > Thanks. > > > On 5/11/12 10:49 AM, Jonathan Coveney wrote: > >> The nesting and the non-nested are indeed the same. To limit, you could >> do: >> >> b = group a by (country, search_term); >> c = foreach b generate flatten(group) as (country, search_term), COUNT(a) >> as ct; >> d = foreach (group c by country) generate ORDER_UDF(TOP(b, 2, 10)); --I >> forget the order of commands, check documentation >> >> the ORDER_UDF doesn't exist. It's important to remember that bags have no >> order, so if you want a TUPLE in term order, you need to take in a bag, >> and >> then get the elements, sort them, and put them in a Tuple. If you need >> help >> with that, less us know. >> >> 2012/5/11 Mark<[email protected]**> >> >> Also, using your example, how could I limit the number of terms per >>> country? >>> >>> >>> On 5/11/12 9:47 AM, Mark wrote: >>> >>> Thank you so much, that's pretty much what I was going for but with a >>>> slightly different output. >>>> >>>> Just to be clear... are these equivalent? >>>> >>>> b = foreach (group a by (country, search_term)) generate flatten(group) >>>> as >>>> (country, search_term), COUNT(a) as ct; >>>> >>>> >>>> b = group a by (country, search_term); >>>> c = foreach b generate flatten(group) as (country, search_term), >>>> COUNT(a) >>>> as ct; >>>> >>>> I'm guessing so... I didn't know you could combine/nest these >>>> statements. >>>> >>>> >>>> After experimenting with your example I'm pretty sure I understand >>>> everything that's going on. I can work with this format but I was >>>> wondering >>>> how would I massage this into something like: >>>> >>>> (country1, top term1, topterm2, topterm3, ...) >>>> (country2, top term1, topterm2, topterm3, ...) >>>> (country3, top term1, topterm2, topterm3, ...) >>>> >>>> Maybe it has to be something like this: >>>> >>>> (country1, (top term1, topterm2, topterm3, ...)) >>>> >>>> So one row per country with the first value being the country and the >>>> following values the top terms in order? Is this even possible with Pig? >>>> >>>> Thanks for the clarification. >>>> >>>> >>>> On 5/10/12 5:32 PM, Jonathan Coveney wrote: >>>> >>>> a = load 'log' as (country:chararray, search_term:chararray); >>>>> b = foreach (group a by (country, search_term)) generate flatten(group) >>>>> as >>>>> (country, search_term), COUNT(a) as ct; >>>>> c = order b by country asc, ct desc; >>>>> >>>>> It sort of depends what format you want the output in, though. Note: if >>>>> you >>>>> know that the number of search terms is low you could do this in memory >>>>> and >>>>> do it in one m/r job, but this version will be scalable. >>>>> >>>>> If this solution doesn't make sense, I can help explain it. It's >>>>> important >>>>> to know what format you want the output in. This would give you every >>>>> country (in ascending alphabetical order), and then the search term and >>>>> count starting with the highest. >>>>> >>>>> 2012/5/10 Mark<[email protected]****> >>>>> >>>>> We have logs in the following format >>>>> >>>>>> us, foo >>>>>> us, foo >>>>>> fr, fizz >>>>>> us, bar >>>>>> fr, baz >>>>>> fr, fizz >>>>>> us, foo >>>>>> fr, fizz >>>>>> >>>>>> Where the first column is a country and the second column is a search >>>>>> term. >>>>>> >>>>>> How in the world can I output the country followed by the top terms in >>>>>> order of occurrence... ie: >>>>>> >>>>>> us, (foo, bar) # Top term for 'us' is foo then bar then ... >>>>>> fr, (fizz, baz) # Top term for 'fr' is fizz then baz then ... >>>>>> >>>>>> Thanks >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>
