I have a sorted relation that looks like the following, where each user has a consumption count:
(user1, 1000) (user99, 999) (user2, 998) (user3, 22) (user4, 10) ...

I'd like to identify the top 20% of users based on the second field. I can get the aggregate sum of the second field easily enough, but I can't work out a mechanism to pick out just the users who fall in the top 20%. As far as I can tell, I'd need something like an accumulator that adds up the count for each tuple in turn and stops when it reaches 20% of the total sum, but that doesn't seem possible. Has anybody done anything similar from within Pig?
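
To make the accumulator idea concrete, here is a minimal sketch of the logic I have in mind, written in plain Python rather than Pig. The function name `top_share`, the `share` parameter, and the choice to include the user whose count crosses the 20% mark are my own assumptions for illustration. The sample data is just the five tuples shown above, and the input is assumed to be sorted by count in descending order already.

```python
def top_share(rows, share=0.20):
    """Return users, in order, until their running total reaches
    `share` of the overall sum.

    `rows` is assumed to be (user, count) pairs sorted by count, descending.
    """
    total = sum(count for _, count in rows)
    threshold = share * total

    running = 0
    selected = []
    for user, count in rows:
        # Stop once the running total has already reached the threshold.
        if running >= threshold:
            break
        running += count
        # Assumption: the user who crosses the threshold is included.
        selected.append(user)
    return selected


# Only the five tuples shown in the question, used as sample input.
rows = [
    ("user1", 1000),
    ("user99", 999),
    ("user2", 998),
    ("user3", 22),
    ("user4", 10),
]

print(top_share(rows))  # ['user1'] (total is 3029, so the threshold is 605.8)
```

What I can't figure out is how to express this kind of running total with an early stop in Pig itself.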
