Same server or same split?

I don't see how you can guarantee that all the data for a key lives on the
same server, given that you are working with HDFS.

If you mean same split, then you can do the following:

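import java.io.IOException;
import org.apache.pig.CollectableLoadFunc;
import org.apache.pig.builtin.PigStorage;

// Implementing CollectableLoadFunc lets GROUP ... USING 'collected' run map-side.
public class MyStorage extends PigStorage implements CollectableLoadFunc {

  // Mirror the PigStorage constructors and just call super(args).
  // Don't forget the no-arg constructor.
  public MyStorage() { super(); }
  public MyStorage(String delimiter) { super(delimiter); }

  // Nothing to do here: the data is assumed to be laid out already so that
  // all instances of a key are in the same split.
  @Override
  public void ensureAllKeyInstancesInSameSplit() throws IOException {
  }
}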

As a side note -- I wish this method signature returned a boolean, and
allowed a LoadFunc to decline, indicating that it can't ensure this
condition, in which case Pig could either display a meaningful message to
the user or default to a regular group-by. Ashutosh, thoughts?
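
To make the suggestion concrete, here is a minimal sketch of what a
boolean-returning variant could look like. Everything in it is hypothetical:
DecliningLoader, GroupPlanSketch and chooseGroupStrategy are made-up names for
illustration, not part of Pig's API, and the real planner logic would
certainly be more involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch only: none of these types exist in Pig.
class GroupPlanSketch {

  // Pick the map-side ("collected") group only when the loader confirms
  // it can keep all instances of a key in one split; otherwise fall back
  // to a regular group-by.
  static String chooseGroupStrategy(DecliningLoader loader) {
    try {
      return loader.ensureAllKeyInstancesInSameSplit() ? "collected" : "regular";
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    DecliningLoader presorted = () -> true;   // loader that can guarantee it
    DecliningLoader flatFiles = () -> false;  // loader that declines
    System.out.println(chooseGroupStrategy(presorted)); // prints "collected"
    System.out.println(chooseGroupStrategy(flatFiles)); // prints "regular"
  }
}

// Hypothetical variant of the loader contract: same method name as today,
// but returning a boolean so that a loader can decline.
interface DecliningLoader {
  boolean ensureAllKeyInstancesInSameSplit() throws IOException;
}
```

Instead of silently falling back, the same branch could just as well raise an
error with a meaningful message for the user.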

D

On Fri, Jan 7, 2011 at 2:34 PM, Jonathan Coveney <[email protected]> wrote:

> I will implement this if I need to, but it seems to me that SOMEBODY has to
> have run into this. I don't know if it's possible, but it's worth asking...
>
> Basically I have a Hadoop cluster of X servers, and one thing that I know is
> that for anything with key k, all of the values associated with that key
> will live on the same server. I've been told that the way to take advantage
> of this is to make a custom loader which extends CollectibleLoader (I think,
> it may be called something else), which then lets group operations be done
> on the map side.
>
> I know that Zebra implements this, but the cluster at hand is all flat
> files, and getting away from that is not an option. Without a special file
> format, is there a reasonable way to implement this? Has anyone done
> something like this? I think having this in the piggybank or pigloader, if
> it's possible, would be super useful for datasets like this.
>
> Thanks for the help
> Jon
>
