I will admit my ignorance on this... What does it mean to be on the same
split? This is an area I am still getting up to speed on.

As always, I appreciate the help.

2011/1/8 Dmitriy Ryaboy <[email protected]>

> Same server or same split?
>
> I don't know how you can guarantee anything about all the data being on the
> same server given that you are working with HDFS.
>
> If you mean same split, then you can do the following:
>
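> import java.io.IOException;
>
> import org.apache.pig.CollectableLoadFunc;
> import org.apache.pig.builtin.PigStorage;
>
> public class MyStorage extends PigStorage implements CollectableLoadFunc {
>
>   // Add constructors here that mimic the PigStorage constructors and just
>   // call into super(args). Don't forget the no-arg constructor.
>
>   // No-op: this assumes the data is already laid out so that all
>   // instances of a key are in the same split.
>   @Override
>   public void ensureAllKeyInstancesInSameSplit() throws IOException {
>     return;
>   }
> }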
>
> As a side note -- I wish this method signature returned a boolean, and
> allowed a LoadFunc to decline, indicating that it can't ensure this
> condition, in which case Pig could either display a meaningful message to
> the user or default to a regular group-by. Ashutosh, thoughts?
>
> D
>
> On Fri, Jan 7, 2011 at 2:34 PM, Jonathan Coveney <[email protected]>
> wrote:
>
> > I will implement this if I need to, but it seems to me that SOMEBODY has
> > to have run into this. I don't know if it's possible, but it's worth
> > asking...
> >
> > Basically I have a Hadoop cluster of X servers, and one thing that I know
> > is that for anything with key k, all of the values associated with that
> > key will live on the same server. I've been told that the way to take
> > advantage of this is to make a custom loader which extends
> > CollectibleLoader (I think, it may be called something else), which then
> > lets group operations be done on the map side.
> >
> > I know that Zebra implements this, but the cluster at hand is all flat
> > files, and getting away from that is not an option. Without a special
> > file format, is there a reasonable way to implement this? Has anyone done
> > something like this? I think having this in the piggybank or pigloader,
> > if it's possible, would be super useful for datasets like this.
> >
> > Thanks for the help
> > Jon
> >
>
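
For anyone following the thread, here is a rough plain-Java sketch of why
the same-split condition matters for a map-side group. This is not Pig's
implementation, and the class and method names (CollectedGroupSketch,
groupSplit) are made up for illustration. It also assumes, as I understand
the collected group to require, that the records for a key are contiguous
within the split. Under those assumptions a single mapper can build each
group in one pass over its own split, with no shuffle to a reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of this is Pig code.
class CollectedGroupSketch {

    // Groups the records of ONE split in a single pass. Each record is a
    // {key, value} pair. A new group is started whenever the key changes,
    // so the result is only correct if every instance of a key is in this
    // split and the instances are next to each other.
    static List<Map.Entry<String, List<String>>> groupSplit(List<String[]> split) {
        List<Map.Entry<String, List<String>>> groups = new ArrayList<>();
        String currentKey = null;
        List<String> currentValues = null;
        for (String[] record : split) {
            String key = record[0];
            if (currentValues == null || !key.equals(currentKey)) {
                // Key changed: open a new group for it.
                currentKey = key;
                currentValues = new ArrayList<>();
                groups.add(new SimpleEntry<>(key, currentValues));
            }
            currentValues.add(record[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        // One split in which all rows for "a" and "b" sit together.
        List<String[]> split = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"a", "2"},
                new String[] {"b", "3"});
        for (Map.Entry<String, List<String>> group : groupSplit(split)) {
            System.out.println(group.getKey() + " -> " + group.getValue());
        }
    }
}
```

If rows for the same key were spread over two splits, each mapper would
emit its own partial group for that key, which is why the loader has to
make the guarantee before the group can stay on the map side.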
