I will implement this if I need to, but it seems to me that SOMEBODY has to have run into this. I don't know if it's possible, but it's worth asking...
Basically, I have a Hadoop cluster of X servers, and one thing I know is that for any key k, all of the values associated with that key live on the same server. I've been told that the way to take advantage of this is to write a custom loader that extends CollectibleLoader (I think; it may be called something else), which then lets group operations be done on the map side. I know that Zebra implements this, but the cluster at hand is all flat files, and getting away from that is not an option.

Without a special file format, is there a reasonable way to implement this? Has anyone done something like this? I think having this in the piggybank or pigloader, if it's possible, would be super useful for datasets like this.

Thanks for the help,
Jon
