On Fri, Oct 22, 2010 at 4:19 PM, Leo Alekseyev <[email protected]> wrote:

> What about the opposite problem?.. Suppose we are bulk-populating a
> blank table from scratch, then we have a bunch of data going into one
> region through one reducer.  One workaround is to import some data,
> then split the region into however many regions we want, then import
> the rest.  This sounds kludgy.  Is there a better approach?
>
>
For an entirely fresh table, you can set partition points manually for your
reducers (assuming you know your key space a priori) and then use
loadtable.rb to complete the load. We should really collapse this script
into the Java tool and get rid of the Ruby version. Additionally, we should
document this alternate method in the bulk loads doc. Patches welcome ;-)
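A minimal sketch of picking those partition points by hand, assuming a fixed-width hex key space divided evenly. The class and method names here are illustrative, not part of the HBase API; the resulting keys would be handed to your partitioner (or to a pre-split table creation) before running the bulk-load job:

```java
// Sketch: compute N-1 evenly spaced split keys over a fixed-width
// hex-encoded key space, so each of N reducers produces the HFiles
// for exactly one region. Assumes keys are 8-char lowercase hex.
public class SplitPoints {
  static byte[][] evenSplits(long keySpace, int numRegions) {
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      long boundary = keySpace / numRegions * i;
      splits[i - 1] = String.format("%08x", boundary).getBytes();
    }
    return splits;
  }

  public static void main(String[] args) {
    // prints 40000000, 80000000, c0000000
    for (byte[] s : evenSplits(1L << 32, 4))
      System.out.println(new String(s));
  }
}
```

If your keys are not uniformly distributed, you would instead sample the input to choose boundaries, which is what TotalOrderPartitioner's sampler does in stock Hadoop.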



> --Leo
>
> On Wed, Oct 13, 2010 at 5:39 AM, Todd Lipcon <[email protected]> wrote:
> > On Mon, Oct 11, 2010 at 9:33 PM, Sean Bigdatafun
> > <[email protected]> wrote:
> >> Another potential "problem" of the incremental bulk loader is that the
> >> number of reducers (for the bulk loading process) needs to equal the
> >> number of existing regions -- this seems infeasible for a very large
> >> table, say one with 2000 regions.
> >>
> >> Any comment on this? Thanks.
> >
> > Yes, this is currently problematic if you have a very large table
> > (2000 regions) and a small MR cluster (where 2000 reducers is too
> > many).
> >
> > It wouldn't be too difficult to amend the code so that each reducer is
> > responsible for a contiguous range of regions, and knows to split the
> > HFiles at region boundaries. Patches welcome :)
> >
> > -Todd
> >
>
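The amendment described in the quoted reply boils down to: a single reducer covering several contiguous regions rolls to a new output file whenever its sorted key stream crosses a region boundary. A small sketch of just that rolling logic, with plain strings standing in for row keys (real code would write HFiles via HFileOutputFormat; everything here is illustrative, not the actual patch):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: split one reducer's sorted key stream into per-region
// groups, rolling to a new "file" each time a region boundary
// (start key of the next region) is reached.
public class BoundaryRoller {
  static List<List<String>> partition(List<String> sortedKeys,
                                      List<String> boundaries) {
    List<List<String>> files = new ArrayList<>();
    List<String> current = new ArrayList<>();
    int b = 0;
    for (String key : sortedKeys) {
      // Roll once the key reaches the next region's start key.
      while (b < boundaries.size() && key.compareTo(boundaries.get(b)) >= 0) {
        files.add(current);
        current = new ArrayList<>();
        b++;
      }
      current.add(key);
    }
    files.add(current);
    return files;
  }

  public static void main(String[] args) {
    // Keys a,c fall before boundary "e"; f,k before "m"; p after.
    // prints [[a, c], [f, k], [p]]
    System.out.println(partition(
        Arrays.asList("a", "c", "f", "k", "p"),
        Arrays.asList("e", "m")));
  }
}
```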



-- 
Todd Lipcon
Software Engineer, Cloudera
