Re: Bulkload into empty table with configureIncrementalLoad()

2013-09-19 Thread Jean-Daniel Cryans
You need to create the table with pre-splits, see
http://hbase.apache.org/book.html#perf.writing
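
For example, something along these lines at table-creation time (just a
sketch -- the table name, column family, and split points are placeholders
to replace with your own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("mytable");   // placeholder name
desc.addFamily(new HColumnDescriptor("f"));                // placeholder family
// Explicit split points: the table starts with splits.length + 1 regions,
// so configureIncrementalLoad() will create one reducer per region.
byte[][] splits = new byte[][] {
    Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")  // placeholders
};
admin.createTable(desc, splits);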

J-D


On Thu, Sep 19, 2013 at 9:52 AM, Dolan Antenucci antenucc...@gmail.com wrote:

 I have about 1 billion values I am trying to load into a new HBase table
 (with just one column and column family), but am running into some issues.
  Currently I am trying to use MapReduce to import these by first converting
 them to HFiles and then using LoadIncrementalHFiles.doBulkLoad().  I also
 use HFileOutputFormat.configureIncrementalLoad() as part of my MR job.  My
 code is essentially the same as this example:

 https://github.com/Paschalis/HBase-Bulk-Load-Example/blob/master/src/cy/ac/ucy/paschalis/hbase/bulkimport/Driver.java

 The problem I'm running into is that only 1 reducer is created
 by configureIncrementalLoad(), and there is not enough space on the node
 running that reducer to handle all this data.  configureIncrementalLoad()
 should start one reducer for every region the table has, so apparently the
 table only has 1 region -- maybe because it is empty and brand new (my
 understanding of how regions work is not crystal clear)?  The cluster has 5
 region servers, so I'd like at least that many reducers to handle this
 loading.
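
 For reference, the relevant part of my driver looks roughly like this
 (as in the linked example -- the mapper class, table name, and output path
 are placeholders):

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.HTable;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
 import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 Job job = new Job(HBaseConfiguration.create(), "hfile-prepare");
 job.setJarByClass(Driver.class);
 job.setMapperClass(ImportMapper.class);   // hypothetical mapper emitting Puts
 job.setMapOutputKeyClass(ImmutableBytesWritable.class);
 job.setMapOutputValueClass(Put.class);
 HTable table = new HTable(job.getConfiguration(), "mytable");  // placeholder
 // This sets the TotalOrderPartitioner boundaries and the reducer count to
 // the table's current number of regions -- 1 for a brand-new, unsplit table.
 HFileOutputFormat.configureIncrementalLoad(job, table);
 FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));  // placeholder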

 On a side note, I also tried the command-line tool, completebulkload, but
 am running into other issues with it (timeouts, possible heap issues) --
 probably because only one server is assigned the task of inserting all the
 records (i.e., looking at the region servers' logs, only one of the servers
 has log entries; the rest are idle).

 Any help is appreciated

 -Dolan Antenucci



Re: Bulkload into empty table with configureIncrementalLoad()

2013-09-19 Thread Dolan Antenucci
Thanks J-D.  Any recommendations on how to determine what splits to use?
 For the keys I'm using strings, so wasn't sure what to put for my startKey
and endKey. For number of regions, I have a table pre-populated with the
same data (not using bulk load), so I can see that it has 68 regions.


On Thu, Sep 19, 2013 at 12:55 PM, Jean-Daniel Cryans jdcry...@apache.org wrote:

 You need to create the table with pre-splits, see
 http://hbase.apache.org/book.html#perf.writing

 J-D




Re: Bulkload into empty table with configureIncrementalLoad()

2013-09-19 Thread Dolan Antenucci
To follow up on my previous question about how best to do the pre-splits, I
ended up using the following when creating my table:

admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);

This was somewhat of a stab in the dark, but I based it
on RegionSplitter.MD5StringSplit's documentation, which says: Row are long
values in the range "00000000" => "7FFFFFFF".  (Reminder: I'm using strings,
probably not uniformly distributed, as my row IDs.)
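
(In hindsight the skew below makes sense: split boundaries computed over a
numeric range only spread load evenly if the row keys are uniformly
distributed over that range, which plain strings generally aren't.  For
anyone searching later, a sketch of what the split algorithm assumes --
hex-encoded row keys, e.g. MD5 hashes of the natural key; and if I'm reading
RegionSplitter right, the same algorithm is called HexStringSplit in current
releases:)

import org.apache.hadoop.hbase.util.RegionSplitter;

// Assumes row keys are uniformly distributed hex strings (e.g. MD5 hashes).
RegionSplitter.HexStringSplit algo = new RegionSplitter.HexStringSplit();
byte[][] splits = algo.split(100);  // split points for 100 regions
admin.createTable(desc, splits);    // admin and desc as in the call above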

It looks like about 80 of the regions received very few keys (many
received 0), and the other 20 received between 35 and 70 million each.
Glancing at the nodes responsible for the 20 popular regions, the load looks
fairly evenly distributed across my cluster, so overall I'm optimistic about
the result (performance at first glance seems fine too).

Question: is there something I can do to achieve an even better
distribution across my regions?  As mentioned before, I have a table that I
populated via puts, so maybe this can be used to guide my pre-splits?  I
did try passing the result of this table's HTable.getStartKeys() (as well
as getEndKeys()) in as the splits, but got an error along the lines of "key
cannot be empty".
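
(My guess at the cause: the first region's start key is the empty byte
array, which is not a valid split key.  Dropping it and passing the
remaining boundaries seems like it should work -- a sketch, with the table
name as a placeholder:)

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

Configuration conf = HBaseConfiguration.create();
HTable existing = new HTable(conf, "populated-table");  // placeholder name
byte[][] startKeys = existing.getStartKeys();
// getStartKeys() includes the first region's empty start key, which
// createTable() rejects as a split key; skip it and use the other 67.
byte[][] splits = Arrays.copyOfRange(startKeys, 1, startKeys.length);
admin.createTable(desc, splits);  // admin/desc as when creating the table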

Thanks again for your help.

-Dolan Antenucci


On Thu, Sep 19, 2013 at 2:53 PM, Dolan Antenucci antenucc...@gmail.com wrote:

 Thanks J-D.  Any recommendations on how to determine what splits to use?
  For the keys I'm using strings, so wasn't sure what to put for my startKey
 and endKey. For number of regions, I have a table pre-populated with the
 same data (not using bulk load), so I can see that it has 68 regions.

