Identifying the top 20% of a sorted relation

Erik Onnen Wed, 19 Jan 2011 10:32:01 -0800

I have a sorted relation that looks similar to the following, where each
user has a consumption count:


(user1, 1000)
(user99, 999)
(user2, 998)
(user3, 22)
(user4, 10)
...

I'd like to identify the top 20% of users based on the second field. I can
get the aggregate sum of the second field easily enough, but I can't get my
head around a mechanism to pick out just the users who are in the top 20%.
As best I can tell, I'd need something like an accumulator that increments
for each tuple and stops when it reaches 20% of the total sum, but that
doesn't seem possible.
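
To make the accumulator idea concrete, here is a rough sketch of the logic in
plain Python rather than Pig Latin. It is only an illustration under some
assumptions: the names top_share and share are hypothetical, the rows are
already sorted by count in descending order, and the tuple that pushes the
running sum past the threshold is kept in the result.

```python
def top_share(rows, share=0.20):
    """Return the leading (user, count) tuples that make up `share` of the total.

    Assumes `rows` is already sorted by count, descending.
    """
    # First pass: the aggregate sum of the second field.
    total = sum(count for _, count in rows)
    threshold = share * total

    selected = []
    running = 0
    # Second pass: accumulate counts until the threshold has been reached.
    for user, count in rows:
        if running >= threshold:
            break
        # Assumption: the tuple that crosses the threshold is included.
        selected.append((user, count))
        running += count
    return selected


rows = [("user1", 1000), ("user99", 999), ("user2", 998),
        ("user3", 22), ("user4", 10)]
print(top_share(rows))
```

With the five sample tuples the total is 3029, so the 20% threshold is 605.8
and only user1 is selected. The open question is how to express the same
running sum within Pig itself.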

Has anybody done anything similar from within Pig?
