I will admit my ignorance on this...what does it mean to be on the same split? This is an area I am still getting up to speed on.
As always, I appreciate the help.

2011/1/8 Dmitriy Ryaboy <[email protected]>

> Same server or same split?
>
> I don't know how you can guarantee anything about all the data being on the
> same server given that you are working with HDFS.
>
> If you mean same split, then you can do the following:
>
> import java.io.IOException;
> import org.apache.pig.CollectableLoadFunc;
> import org.apache.pig.builtin.PigStorage;
>
> public class MyStorage extends PigStorage implements CollectableLoadFunc {
>
>   // add constructors here that mimic PigStorage constructors and just
>   // call into super(args)
>   // don't forget the no-arg constructor
>
>   public void ensureAllKeyInstancesInSameSplit() throws IOException {
>     return;
>   }
> }
>
> As a side note -- I wish this method signature returned a boolean, and
> allowed a LoadFunc to decline, indicating that it can't ensure this
> condition, in which case Pig could either display a meaningful message to
> the user or default to a regular group-by. Ashutosh, thoughts?
>
> D
>
> On Fri, Jan 7, 2011 at 2:34 PM, Jonathan Coveney <[email protected]> wrote:
>
> > I will implement this if I need to, but it seems to me that SOMEBODY has
> > to have run into this. I don't know if it's possible, but it's worth
> > asking...
> >
> > Basically I have a hadoop cluster of X servers, and one thing that I know
> > is that for anything with key k, all of the values associated with that
> > key will live on the same server. I've been told that the way to take
> > advantage of this is to make a custom loader which extends
> > CollectibleLoader (I think, it may be called something else), which then
> > lets group operations be done on the map side.
> >
> > I know that Zebra implements this, but the cluster at hand is all flat
> > files, and getting away from that is not an option. Without a special
> > file format, is there a reasonable way to implement this? Has anyone done
> > something like this? I think having this in the piggybank or pigloader,
> > if it's possible, would be super useful for datasets like this.
> >
> > Thanks for the help
> > Jon
