RE: getSplits question
Ryan,

Just to point out the obvious... On smaller tables where you don't get enough parallelism, you can manually force the table's regions to be split. My understanding is that if/when the table grows, it will then go back to splitting normally. This way, if you have a 'small' lookup table that is relatively static, you manually split it to the 'right' size for your cloud. If you are seeding a system, you can do the splits to get good parallelism and not overload a single region with inserts, then let it go back to its normal growth pattern and splits. This would solve the OP's issue and, as you point out, avoid worrying about getSplits(). Does this make sense, or am I missing something?

-Mike

On Wed, 9 Feb 2011 23:54:19 -0800, Ryan Rawson ryano...@gmail.com wrote: [snip]
Re: RE: getSplits question
Yep, you're right on there.

On Feb 10, 2011 8:15 AM, Michael Segel michael_se...@hotmail.com wrote: [snip]
RE: getSplits question
I hunted around for some info on how to force a table to split, but I didn't find what I was looking for. Is there a command I can issue from the HBase shell that would force every existing region to divide in half? That would be quite useful. If not, what's the next best way to force splits? thanks!

-g

On Thursday, February 10, 2011 8:15 AM, Michael Segel michael_se...@hotmail.com wrote: [snip]
Re: getSplits question
There's the split command in the shell. HBaseAdmin has that same method. On the table's page in the master's web UI, there's a split button. Finally, when creating a table, you can pre-specify all the split keys with this method: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])

J-D

On Thu, Feb 10, 2011 at 8:48 AM, Geoff Hendrey ghend...@decarta.com wrote: [snip]
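J-D's last suggestion, pre-specifying all the split keys at table-creation time, can be sketched as below. Only the key computation is plain, runnable Java; the HBaseAdmin call is left in a comment because it needs a live cluster, and the table name is made up for illustration, not taken from the thread.

```java
// Sketch: compute evenly spaced split keys for a pre-split table.
// Assumes row keys are roughly uniform over the full 0x00-0xFF range
// of their first byte; for real keys you would pick boundaries that
// match your key distribution.
public class PreSplitSketch {
    // N-1 single-byte boundaries, giving numRegions regions.
    static byte[][] evenSplitKeys(int numRegions) {
        byte[][] keys = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            keys[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return keys;
    }

    public static void main(String[] args) {
        byte[][] keys = evenSplitKeys(4);
        // 3 boundaries -> 4 regions:
        // (-inf,0x40), [0x40,0x80), [0x80,0xC0), [0xC0,+inf)
        System.out.println(keys.length);
        // With a running cluster you would then do something like:
        // HBaseAdmin admin = new HBaseAdmin(conf);
        // admin.createTable(new HTableDescriptor("mytable"), evenSplitKeys(4));
    }
}
```

The same HBaseAdmin also has the split method J-D mentions, for forcing splits on an existing table.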
Re: getSplits question
You shouldn't need to write your own getSplits() method to run a map reduce, I never did at least...

-ryan

On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey ghend...@decarta.com wrote:

Are endrows inclusive or exclusive? The docs say exclusive, but then the question arises as to how to form the last split for getSplits(). The code below runs fine, but I believe it is omitting some rows, perhaps b/c of the exclusive end row. For the final split, should the endrow be null? I tried that, and got what appeared to be a final split without an endrow at all. Would appreciate a pointer to the correct implementation of getSplits in which I desire to provide a startrow, endrow, and splitsize. Apparently this isn't it :) :

    int splitSize = context.getConfiguration().getInt("splitsize", 1000);
    byte[] splitStop = null;
    String hostname = null;
    while ((results = resultScanner.next(splitSize)).length > 0) {
        // System.out.println("results :-- " + results);
        byte[] splitStart = results[0].getRow();
        splitStop = results[results.length - 1].getRow();
        // I think this is a problem... we don't actually include this row
        // in the split since it's exclusive... revisit this and correct
        HRegionLocation location = table.getRegionLocation(splitStart);
        hostname = location.getServerAddress().getHostname();
        InputSplit split = new TableSplit(table.getTableName(), splitStart,
            splitStop, hostname);
        splits.add(split);
        System.out.println("initializing splits: " + split.toString());
    }
    resultScanner.close();

-g
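A hedged sketch of the fix that Geoff's inline comment hints at: since a split's end row is exclusive, the smallest row key strictly greater than a key R is R with a single 0x00 byte appended, so using that as the stop key keeps the batch's last row inside the split without overlapping the next batch. This is illustrative, not HBase's actual implementation.

```java
// Not HBase's actual code: a helper for the exclusive-end-row problem.
public class InclusiveStop {
    // Smallest key strictly greater than 'row': row + 0x00. Scanning
    // with this as an exclusive stop key still includes 'row' itself.
    static byte[] inclusiveStopKey(byte[] row) {
        byte[] stop = new byte[row.length + 1];
        System.arraycopy(row, 0, stop, 0, row.length);
        stop[row.length] = 0x00;  // append a zero byte
        return stop;
    }

    public static void main(String[] args) {
        byte[] stop = inclusiveStopKey(new byte[] { 'r', 'o', 'w' });
        System.out.println(stop.length);  // original 3 bytes plus the 0x00
    }
}
```

In the loop above this would mean constructing the split as new TableSplit(table.getTableName(), splitStart, inclusiveStopKey(splitStop), hostname), so each batch's last row is no longer dropped.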
RE: getSplits question
Oh, I definitely don't *need* my own to run mapreduce. However, if I want to control the number of records handled by each mapper (splitsize) and the startrow and endrow, then I thought I had to write my own getSplits(). Is there another way to accomplish this? Because I do need the combination of controlled splitsize and start/endrow.

-geoff

On Wednesday, February 09, 2011 11:43 PM, Ryan Rawson ryano...@gmail.com wrote: [snip]
Re: getSplits question
By default each map gets the contents of 1 region. A region is by default a maximum of 256MB. There is no trivial way to generally bisect a region in half, in terms of row count, by just knowing what we know (start, end key). For very large tables that have 100 regions, this algorithm works really well and you get some good parallelism. If you want to see a lot of parallelism out of 1 region, you might have to work a lot harder. Or reduce your region size and have more regions. Be warned though, that more regions have performance hits in other areas (specifically server startup/shutdown/assignment times). So you probably don't want 50,000 32MB regions.

-ryan

On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey ghend...@decarta.com wrote: [snip]
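Ryan's sizing point is easy to quantify: under the one-map-per-region default, the map-task count is just table size divided by region size, so shrinking regions trades mapper parallelism against region count. The numbers below are illustrative, not from the thread.

```java
// Back-of-envelope math for map parallelism under the
// one-map-per-region default.
public class RegionMath {
    // Ceiling division: how many regions a table of tableBytes needs
    // at a given maximum region size.
    static long numRegions(long tableBytes, long regionBytes) {
        return (tableBytes + regionBytes - 1) / regionBytes;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        // 100 GB table, default 256 MB regions -> 400 regions = 400 maps
        System.out.println(numRegions(100 * gb, 256L * 1024 * 1024));
        // Same table at 32 MB regions -> 3200 regions; at larger table
        // sizes this is how you end up in the region-count range Ryan
        // warns hurts startup/shutdown/assignment times.
        System.out.println(numRegions(100 * gb, 32L * 1024 * 1024));
    }
}
```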