Re: Bulkload into empty table with configureIncrementalLoad()
You need to create the table with pre-splits, see http://hbase.apache.org/book.html#perf.writing

J-D

On Thu, Sep 19, 2013 at 9:52 AM, Dolan Antenucci antenucc...@gmail.com wrote:

I have about 1 billion values I am trying to load into a new HBase table (with just one column and column family), but am running into some issues. Currently I am trying to use MapReduce to import these by first converting them to HFiles and then using LoadIncrementalHFiles.doBulkLoad(). I also use HFileOutputFormat.configureIncrementalLoad() as part of my MR job. My code is essentially the same as this example: https://github.com/Paschalis/HBase-Bulk-Load-Example/blob/master/src/cy/ac/ucy/paschalis/hbase/bulkimport/Driver.java

The problem I'm running into is that only one reducer is created by configureIncrementalLoad(), and there is not enough space on that node to handle all this data. configureIncrementalLoad() starts one reducer for every region the table has, so apparently the table only has one region -- maybe because it is empty and brand new (my understanding of how regions work is not crystal clear)? The cluster has 5 region servers, so I'd like at least that many reducers to handle this load.

On a side note, I also tried the command-line tool completebulkload, but am running into other issues with it (timeouts, possible heap issues) -- probably due to only one server being assigned the task of inserting all the records (i.e., looking at the region servers' logs, only one of the servers has log entries; the rest are idle).

Any help is appreciated.

-Dolan Antenucci
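For reference, here is a minimal sketch of the pre-splitting J-D is pointing at, using the 0.94-era Java client API (the class name, table name, column family, and split points below are hypothetical placeholders, not taken from this thread). Creating the table with explicit split points gives it splitKeys.length + 1 regions up front, so configureIncrementalLoad() launches that many reducers instead of one:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical name
          desc.addFamily(new HColumnDescriptor("f"));              // the single column family

          // Explicit split points: the table starts out with
          // splitKeys.length + 1 regions, so configureIncrementalLoad()
          // creates one reducer per region and the HFile writing is
          // spread across the cluster instead of landing on one node.
          byte[][] splitKeys = new byte[][] {
              Bytes.toBytes("d"), Bytes.toBytes("h"),
              Bytes.toBytes("m"), Bytes.toBytes("r"),
          };
          admin.createTable(desc, splitKeys);
        } finally {
          admin.close();
        }
      }
    }

The rest of the job setup is unchanged: configureIncrementalLoad(job, table) reads the region boundaries from the (now pre-split) table and configures the partitioner accordingly.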
Re: Bulkload into empty table with configureIncrementalLoad()
Thanks J-D. Any recommendations on how to determine what splits to use? For the keys I'm using strings, so I wasn't sure what to put for my startKey and endKey. For the number of regions, I have a table pre-populated with the same data (not using bulk load), so I can see that it has 68 regions.
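On picking splits for string keys: one common technique (a sketch, not something prescribed in this thread) is to sort a representative sample of the row keys and use evenly spaced quantiles as the split points, then pass the result to createTable(desc, splitKeys). The helper below assumes the sample fits in memory and has no large runs of duplicate keys (duplicate split points would make createTable() fail):

    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SampleSplits {
      // Returns numRegions - 1 split points that divide the sorted sample
      // into numRegions roughly equal buckets. If the sample tracks the real
      // key distribution, each region receives a similar share of the load.
      public static byte[][] quantileSplits(List<String> sampleKeys, int numRegions) {
        Collections.sort(sampleKeys);
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
          int idx = (int) ((long) i * sampleKeys.size() / numRegions);
          splits[i - 1] = Bytes.toBytes(sampleKeys.get(idx));
        }
        return splits;
      }
    }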
Re: Bulkload into empty table with configureIncrementalLoad()
To follow up on my previous question about how best to do the pre-splits, I ended up using the following when creating my table:

    admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);

This was somewhat of a stab in the dark, but I based it on RegionSplitter.MD5StringSplit's documentation, which said: "Row are long values in the range '00000000' => '7FFFFFFF'". (Reminder: I'm using strings, probably not uniformly distributed, as my row IDs.)

It looks like about 80 of the regions received very few keys (many received none), and the other 20 received between 35 million and 70 million each. Glancing at the nodes responsible for the 20 popular regions, the distribution looks fairly even across my cluster, so overall I'm optimistic about the result (performance at first glance seems fine too).

Question: is there something I can do to achieve an even better distribution across my regions? As mentioned before, I have a table that I populated via puts, so maybe this can be used to guide my pre-splits? I did try passing the result of this table's HTable.getStartKeys() (as well as getEndKeys()) in as the splits, but got an error along the lines of "key cannot be empty".

Thanks again for your help.

-Dolan Antenucci
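Regarding the "key cannot be empty" error: HTable.getStartKeys() returns one start key per region, and the first region's start key is the empty byte array (likewise, the last entry of getEndKeys() is empty), which createTable() rejects when passed as a split key. Assuming the goal is to give the new table the same 68 boundaries as the existing one, a sketch that simply drops the empty entry (table and family names here are hypothetical):

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;

    public class CopySplits {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable existing = new HTable(conf, "populated_table"); // hypothetical name
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          // getStartKeys() returns one entry per region; the first region's
          // start key is the empty byte[], which createTable() rejects as a
          // split key, so skip it. The remaining 67 keys are exactly the
          // internal boundaries of the 68-region table.
          byte[][] startKeys = existing.getStartKeys();
          byte[][] splitKeys = Arrays.copyOfRange(startKeys, 1, startKeys.length);

          HTableDescriptor desc = new HTableDescriptor("bulkload_table"); // hypothetical
          desc.addFamily(new HColumnDescriptor("f"));
          admin.createTable(desc, splitKeys);
        } finally {
          admin.close();
          existing.close();
        }
      }
    }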