What do you mean by data being on the same server?

You said "I have a hadoop cluster of X servers, and one thing that I know is
that for anything with key k, all of the values associated with that key
will live on the same server."

I am not sure how you achieve this.

Assuming your data is in a file or collection of files in HDFS, each file is
split up into blocks by the file system. Each block is semi-arbitrarily
assigned to a few nodes in your cluster (usually 3, for replication). You
don't really control how these blocks are laid out -- they are cut according
to a configured HDFS block size, usually 64 or 128 megabytes.
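
If you want to see this for yourself, here's a quick throwaway sketch (not
part of Pig; the path is just a placeholder) that asks HDFS where the blocks
of a file live -- you'll see the semi-arbitrary offsets and hosts I mean:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/user/jon/mydata.txt");   // placeholder path
    FileStatus status = fs.getFileStatus(p);
    // one BlockLocation per HDFS block, with the datanodes holding replicas
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + java.util.Arrays.toString(b.getHosts()));
    }
  }
}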

When Pig reads your file, it creates a separate map task for each of these
blocks. If you can somehow guarantee that all instances of a key wind up in
the same block, great, you can do what I recommended earlier; however, this
is tricky -- even if your input file is sorted by the group key, so that all
records for a key sit next to each other, a run of records with the same key
can still straddle one of those semi-arbitrary block boundaries, and those
records will wind up in separate map tasks.
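
To make the boundary problem concrete: with a 128MB block size, a record that
ends just before the 128MB mark and the next record with the same key just
after it land in different default splits. The offsets below are made up,
just to show the arithmetic:

public class SplitBoundaryDemo {
  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;   // assumed 128MB block size
    long offsetA = blockSize - 50;         // last record of key k, still in block 0
    long offsetB = blockSize + 10;         // next record of key k, already in block 1
    System.out.println("record A -> default split " + (offsetA / blockSize)); // 0
    System.out.println("record B -> default split " + (offsetB / blockSize)); // 1
  }
}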

As I just explained it, I equated HDFS blocks with map tasks. That's an
oversimplification: map tasks are actually created per InputSplit. By
default, your InputFormat looks up the HDFS blocks and creates one split per
block, but it is free to do something different, including guaranteeing that
all instances of a key land in the same split. That's what Zebra does with
its input format: when you call ensureAllKeyInstancesInSameSplit() on the
Zebra loader, the InputFormat is adjusted in a way that, well, ensures all
instances of a key are in the same split.
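
For flat files, one way to get the same effect (this is just a sketch of the
idea, not what Zebra actually does) is an InputFormat that refuses to split
files at all, so each file becomes exactly one split. If your data is laid
out so that a given key never spans two files, that's enough to back a
truthful ensureAllKeyInstancesInSameSplit():

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // never cut a file into multiple splits
  }
}

Your custom loader would then hand this back from getInputFormat(). The
obvious cost is that you lose parallelism within a file, so it only makes
sense if you have many files of reasonable size.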

Did that clarify things or confuse even more? :)

D

On Sat, Jan 8, 2011 at 3:31 PM, Jonathan Coveney <[email protected]> wrote:

> I will admit my ignorance on this...what does it mean to be on the same
> split? This is an area I am still getting up to speed on.
>
> As always, I appreciate the help.
>
> 2011/1/8 Dmitriy Ryaboy <[email protected]>
>
> > Same server or same split?
> >
> > I don't know how you can guarantee anything about all the data being on
> > the same server given that you are working with HDFS.
> >
> > If you mean same split, then you can do the following:
> >
> > public class MyStorage extends PigStorage implements CollectableLoadFunc {
> >
> >   // add constructors here that mimic PigStorage constructors
> >   // and just call into super(args)
> >   // don't forget the no-arg constructor
> >
> >   public void ensureAllKeyInstancesInSameSplit() throws IOException {
> >     return;
> >   }
> > }
> >
> > As a side note -- I wish this method signature returned a boolean, and
> > allowed a LoadFunc to decline, indicating that it can't ensure this
> > condition, in which case Pig could either display a meaningful message to
> > the user or default to a regular group-by. Ashutosh, thoughts?
> >
> > D
> >
> > On Fri, Jan 7, 2011 at 2:34 PM, Jonathan Coveney <[email protected]> wrote:
> >
> > > I will implement this if I need to, but it seems to me that SOMEBODY
> > > has to have run into this. I don't know if it's possible, but it's
> > > worth asking...
> > >
> > > Basically I have a Hadoop cluster of X servers, and one thing that I
> > > know is that for anything with key k, all of the values associated
> > > with that key will live on the same server. I've been told that the
> > > way to take advantage of this is to make a custom loader which extends
> > > CollectibleLoader (I think, it may be called something else), which
> > > then lets group operations be done on the map side.
> > >
> > > I know that Zebra implements this, but the cluster at hand is all flat
> > > files, and getting away from that is not an option. Without a special
> > > file format, is there a reasonable way to implement this? Has anyone
> > > done something like this? I think having this in the piggybank or
> > > pigloader, if it's possible, would be super useful for datasets like
> > > this.
> > >
> > > Thanks for the help
> > > Jon
> > >
> >
>
