Re: Bulkload into empty table with configureIncrementalLoad()
You need to create the table with pre-splits, see http://hbase.apache.org/book.html#perf.writing

J-D

On Thu, Sep 19, 2013 at 9:52 AM, Dolan Antenucci antenucc...@gmail.com wrote:

I have about 1 billion values I am trying to load into a new HBase table (with just one column and column family), but am running into some issues. Currently I am trying to use MapReduce to import these by first converting them to HFiles and then using LoadIncrementalHFiles.doBulkLoad(). I also use HFileOutputFormat.configureIncrementalLoad() as part of my MR job. My code is essentially the same as this example: https://github.com/Paschalis/HBase-Bulk-Load-Example/blob/master/src/cy/ac/ucy/paschalis/hbase/bulkimport/Driver.java

The problem I'm running into is that only one reducer is created by configureIncrementalLoad(), and there is not enough space on that node to handle all this data. configureIncrementalLoad() starts one reducer for every region the table has, so apparently the table only has one region -- maybe because it is empty and brand new (my understanding of how regions work is not crystal clear)? The cluster has 5 region servers, so I'd like at least that many reducers to handle this load.

On a side note, I also tried the command-line tool completebulkload, but am running into other issues with it (timeouts, possible heap issues) -- probably due to only one server being assigned the task of inserting all the records (i.e., looking at the region servers' logs, only one of the servers has log entries; the rest are idle).

Any help is appreciated.

-Dolan Antenucci
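For reference, here is a minimal sketch of the pre-splitting J-D is pointing at, using the 0.94-era Java client API (the class name, table name, column family, and split points below are hypothetical placeholders, not taken from this thread). Creating the table with explicit split points gives it splitKeys.length + 1 regions up front, so configureIncrementalLoad() launches that many reducers instead of one:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical name
          desc.addFamily(new HColumnDescriptor("f"));              // the single column family

          // Explicit split points: the table starts out with
          // splitKeys.length + 1 regions, so configureIncrementalLoad()
          // creates one reducer per region and the HFile writing is
          // spread across the cluster instead of landing on one node.
          byte[][] splitKeys = new byte[][] {
              Bytes.toBytes("d"), Bytes.toBytes("h"),
              Bytes.toBytes("m"), Bytes.toBytes("r"),
          };
          admin.createTable(desc, splitKeys);
        } finally {
          admin.close();
        }
      }
    }

The rest of the job setup is unchanged: configureIncrementalLoad(job, table) reads the region boundaries from the (now pre-split) table and configures the partitioner accordingly.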
Re: Bulkload into empty table with configureIncrementalLoad()
Thanks J-D. Any recommendations on how to determine what splits to use? For the keys I'm using strings, so I wasn't sure what to put for my startKey and endKey. For the number of regions, I have a table pre-populated with the same data (not using bulk load), so I can see that it has 68 regions.
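On picking splits for string keys: one common technique (a sketch, not something prescribed in this thread) is to sort a representative sample of the row keys and use evenly spaced quantiles as the split points, then pass the result to createTable(desc, splitKeys). The helper below assumes the sample fits in memory and has no large runs of duplicate keys (duplicate split points would make createTable() fail):

    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SampleSplits {
      // Returns numRegions - 1 split points that divide the sorted sample
      // into numRegions roughly equal buckets. If the sample tracks the real
      // key distribution, each region receives a similar share of the load.
      public static byte[][] quantileSplits(List<String> sampleKeys, int numRegions) {
        Collections.sort(sampleKeys);
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
          int idx = (int) ((long) i * sampleKeys.size() / numRegions);
          splits[i - 1] = Bytes.toBytes(sampleKeys.get(idx));
        }
        return splits;
      }
    }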
Re: Bulkload into empty table with configureIncrementalLoad()
To follow up on my previous question about how best to do the pre-splits, I ended up using the following when creating my table:

    admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);

This was somewhat of a stab in the dark, but I based it on RegionSplitter.MD5StringSplit's documentation, which said: "Row are long values in the range '00000000' => '7FFFFFFF'". (Reminder: I'm using strings, probably not uniformly distributed, as my row IDs.)

It looks like about 80 of the regions received very few keys (many received none), and the other 20 received between 35 million and 70 million each. Glancing at the nodes responsible for the 20 popular regions, the distribution looks fairly even across my cluster, so overall I'm optimistic about the result (performance at first glance seems fine too).

Question: is there something I can do to achieve an even better distribution across my regions? As mentioned before, I have a table that I populated via puts, so maybe this can be used to guide my pre-splits? I did try passing the result of this table's HTable.getStartKeys() (as well as getEndKeys()) in as the splits, but got an error along the lines of "key cannot be empty".

Thanks again for your help.

-Dolan Antenucci
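Regarding the "key cannot be empty" error: HTable.getStartKeys() returns one start key per region, and the first region's start key is the empty byte array (likewise, the last entry of getEndKeys() is empty), which createTable() rejects when passed as a split key. Assuming the goal is to give the new table the same 68 boundaries as the existing one, a sketch that simply drops the empty entry (table and family names here are hypothetical):

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;

    public class CopySplits {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable existing = new HTable(conf, "populated_table"); // hypothetical name
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          // getStartKeys() returns one entry per region; the first region's
          // start key is the empty byte[], which createTable() rejects as a
          // split key, so skip it. The remaining 67 keys are exactly the
          // internal boundaries of the 68-region table.
          byte[][] startKeys = existing.getStartKeys();
          byte[][] splitKeys = Arrays.copyOfRange(startKeys, 1, startKeys.length);

          HTableDescriptor desc = new HTableDescriptor("bulkload_table"); // hypothetical
          desc.addFamily(new HColumnDescriptor("f"));
          admin.createTable(desc, splitKeys);
        } finally {
          admin.close();
          existing.close();
        }
      }
    }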