That clarified things perfectly, Dmitriy. As always, super helpful. I think I just need to look deeper into how the Zebra people did what they did, and dig into how we're actually storing these things. I imagine that if it were trivial to implement partitions (and indexes, for that matter) in Pig, it would have been done...
2011/1/8 Dmitriy Ryaboy <[email protected]>

> What do you mean by data being on the same server?
>
> You said "I have a hadoop cluster of X servers, and one thing that I know
> is that for anything with key k, all of the values associated with that
> key will live on the same server."
>
> I am not sure how you achieve this.
>
> Assuming your data is in a file or collection of files in HDFS, the file
> is split up into blocks by the file system. Each block is semi-arbitrarily
> assigned to some node in your cluster (3 nodes, usually, for replication).
> You don't really control how these blocks are set up -- they are based on
> a configured block size setting in HDFS; it's usually 64 or 128 megabytes.
>
> When Pig reads your file, it creates a separate map task for each of these
> blocks. If you can somehow guarantee that all the keys wind up in the same
> block, great, you can do what I recommended earlier; however, this is
> tricky -- even if your input file is sorted by the group key, so that all
> of your keys are next to each other, it is possible that a sequence of
> records with the same key happens to span the semi-arbitrary block
> boundary, and those records will wind up in separate map tasks.
>
> As I just explained it, I equated HDFS blocks with map tasks. That's an
> oversimplification. In fact, map tasks are created per InputSplit. What
> happens by default is that your InputFormat looks up the HDFS blocks and
> creates a split per HDFS block; but it could do something different,
> including guaranteeing that all keys are in the same split. Zebra does
> something with its InputFormat such that when you call
> ensureAllInstances...() on the Zebra loader, the InputFormat is adjusted
> in a way that, well, ensures all keys are in the same split.
>
> Did that clarify things or confuse even more?
> :)
>
> D
>
> On Sat, Jan 8, 2011 at 3:31 PM, Jonathan Coveney <[email protected]> wrote:
>
> > I will admit my ignorance on this... what does it mean to be on the same
> > split? This is an area I am still getting up to speed on.
> >
> > As always, I appreciate the help.
> >
> > 2011/1/8 Dmitriy Ryaboy <[email protected]>
> >
> > > Same server or same split?
> > >
> > > I don't know how you can guarantee anything about all the data being
> > > on the same server given that you are working with HDFS.
> > >
> > > If you mean same split, then you can do the following:
> > >
> > >     public class MyStorage extends PigStorage implements CollectableLoadFunc {
> > >
> > >         // add constructors here that mimic PigStorage constructors
> > >         // and just call into super(args)
> > >         // don't forget the no-arg constructor
> > >
> > >         public void ensureAllKeyInstancesInSameSplit() throws IOException {
> > >             return;
> > >         }
> > >     }
> > >
> > > As a side note -- I wish this method signature returned a boolean and
> > > allowed a LoadFunc to decline, indicating that it can't ensure this
> > > condition, in which case Pig could either display a meaningful message
> > > to the user or default to a regular group-by. Ashutosh, thoughts?
> > >
> > > D
> > >
> > > On Fri, Jan 7, 2011 at 2:34 PM, Jonathan Coveney <[email protected]> wrote:
> > >
> > > > I will implement this if I need to, but it seems to me that SOMEBODY
> > > > has to have run into this. I don't know if it's possible, but it's
> > > > worth asking...
> > > >
> > > > Basically I have a hadoop cluster of X servers, and one thing that I
> > > > know is that for anything with key k, all of the values associated
> > > > with that key will live on the same server. I've been told that the
> > > > way to take advantage of this is to make a custom loader which
> > > > extends CollectibleLoader (I think, it may be called something
> > > > else), which then lets group operations be done on the map side.
> > > >
> > > > I know that Zebra implements this, but the cluster at hand is all
> > > > flat files, and getting away from that is not an option. Without a
> > > > special file format, is there a reasonable way to implement this?
> > > > Has anyone done something like this? I think having this in the
> > > > piggybank or pigloader, if it's possible, would be super useful for
> > > > datasets like this.
> > > >
> > > > Thanks for the help
> > > > Jon
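Dmitriy's point that a run of equal keys can straddle an HDFS block boundary, even in a key-sorted file, is easy to demonstrate in miniature. The sketch below is plain Java with a made-up two-record "block size" (no Hadoop involved, and the record format is hypothetical): it cuts a sorted record list into fixed-size blocks the way HDFS cuts a file into blocks, then reports which blocks each key lands in.

```java
import java.util.*;

public class BlockBoundaryDemo {
    // Partition a sorted record list into fixed-size "blocks", mimicking how
    // HDFS cuts a file into blocks without regard for record keys.
    static List<List<String>> toBlocks(List<String> sortedRecords, int blockSize) {
        List<List<String>> blocks = new ArrayList<>();
        for (int i = 0; i < sortedRecords.size(); i += blockSize) {
            blocks.add(sortedRecords.subList(i, Math.min(i + blockSize, sortedRecords.size())));
        }
        return blocks;
    }

    // Collect which block indexes each key appears in.
    static Map<String, Set<Integer>> keyToBlocks(List<List<String>> blocks) {
        Map<String, Set<Integer>> where = new HashMap<>();
        for (int b = 0; b < blocks.size(); b++) {
            for (String rec : blocks.get(b)) {
                String key = rec.split(",")[0];  // records are "key,value" here
                where.computeIfAbsent(key, k -> new TreeSet<>()).add(b);
            }
        }
        return where;
    }

    public static void main(String[] args) {
        // Sorted by key -- all of key "b"'s records are adjacent...
        List<String> records = Arrays.asList("a,1", "b,1", "b,2", "c,1");
        // ...but with a block size of 2 records, "b" straddles the boundary,
        // so its records would go to two different map tasks.
        Map<String, Set<Integer>> where = keyToBlocks(toBlocks(records, 2));
        System.out.println(where.get("b"));  // prints [0, 1]
    }
}
```

Sorting alone is not enough, in other words; whoever cuts the file into splits has to know where the key runs end.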

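The adjustment Dmitriy describes Zebra making to its InputFormat -- moving split boundaries so that no key is cut in half -- can also be sketched without Hadoop. The class below is a miniature under loudly stated assumptions: the class and method names are hypothetical, it works on record indexes rather than byte offsets, and a real implementation would live inside an InputFormat's split-computation logic. It slides each nominal split end forward to the next key change.

```java
import java.util.*;

public class KeyAlignedSplits {
    // Given record keys in file order (sorted so equal keys are adjacent) and
    // a nominal records-per-split, return split end offsets moved forward so
    // that no run of equal keys is cut in half. This is, in miniature, the
    // guarantee a CollectableLoadFunc-friendly loader has to provide.
    static List<Integer> splitEnds(List<String> keys, int recordsPerSplit) {
        List<Integer> ends = new ArrayList<>();
        int end = 0;
        while (end < keys.size()) {
            end = Math.min(end + recordsPerSplit, keys.size());
            // Slide the boundary forward while it would separate equal keys.
            while (end < keys.size() && keys.get(end).equals(keys.get(end - 1))) {
                end++;
            }
            ends.add(end);
        }
        return ends;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "b", "b", "c", "c");
        // A naive 2-record split would cut the "b" run at offset 2; the
        // adjusted boundaries land on key changes instead.
        System.out.println(splitEnds(keys, 2));  // prints [4, 6]
    }
}
```

Once a loader can make that guarantee and implements CollectableLoadFunc, a script opts into the map-side group with Pig's `GROUP data BY k USING 'collected';` clause, which is what makes the whole exercise worthwhile.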