RE: getSplits question

2011-02-10 Thread Michael Segel

Ryan,

Just to point out the obvious...

On smaller tables where you don't get enough parallelism, you can manually
force the table's regions to be split.
My understanding is that if/when the table grows, it will then go back to
splitting normally.

This way, if you have a 'small' lookup table that is relatively static, you
manually split it to the 'right' size for your cloud.
If you are seeding a system, you can do the splits to get good parallelism and
not overload a single region with inserts, then let the table go back to its
normal growth pattern and splits.
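
A minimal sketch of forcing that split from client code, assuming the
HBaseAdmin API of this era (the table name "lookup" is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ForceSplits {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // With no explicit split point, each region of the table is
            // asked to split at its midpoint; call again to keep halving.
            admin.split("lookup");
        }
    }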

This would solve the OP's issue and, as you point out, avoid worrying about
getSplits().

Does this make sense, or am I missing something?

-Mike


Re: RE: getSplits question

2011-02-10 Thread Ryan Rawson
Yep, you're right on there.

RE: getSplits question

2011-02-10 Thread Geoff Hendrey
I hunted around for some info on how to force a table to split, but I
didn't find what I was looking for. Is there a command I can issue from
the HBase shell that would force every existing region to divide in
half? That would be quite useful. If not, what's the next best way to
force splits?

thanks!
-g


Re: getSplits question

2011-02-10 Thread Jean-Daniel Cryans
There's the split command in the shell.

HBaseAdmin has that same method.

On the table's page in the master's web UI, there's a split button.

Finally, when creating a table, you can pre-specify all the split keys
with this method:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
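
A minimal sketch of that last option (table name, family, and split keys
are all hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            HTableDescriptor desc = new HTableDescriptor("mytable");
            desc.addFamily(new HColumnDescriptor("cf"));
            // N split keys yield N+1 regions; here 4 regions:
            // (start,"g"), ["g","n"), ["n","t"), ["t",end)
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }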

J-D


Re: getSplits question

2011-02-09 Thread Ryan Rawson
You shouldn't need to write your own getSplits() method to run a
map-reduce; I never did, at least...

-ryan

On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey ghend...@decarta.com wrote:
 Are endrows inclusive or exclusive? The docs say exclusive, but then the
 question arises as to how to form the last split for getSplits(). The
 code below runs fine, but I believe it is omitting some rows, perhaps
 b/c of the exclusive end row. For the final split, should the endrow be
 null? I tried that, and got what appeared to be a final split without an
 endrow at all. I'd appreciate a pointer to a correct implementation of
 getSplits in which I can provide a startrow, endrow, and splitsize.
 Apparently this isn't it :)



 int splitSize = context.getConfiguration().getInt("splitsize", 1000);

                byte[] splitStop = null;
                String hostname = null;
                Result[] results;

                while ((results = resultScanner.next(splitSize)).length > 0) {

                    // System.out.println("results :-- " + results);

                    byte[] splitStart = results[0].getRow();

                    splitStop = results[results.length - 1].getRow();
                    // I think this is a problem... we don't actually include
                    // this row in the split since it's exclusive.. revisit
                    // this and correct

                    HRegionLocation location =
                        table.getRegionLocation(splitStart);

                    hostname =
                        location.getServerAddress().getHostname();

                    InputSplit split = new TableSplit(table.getTableName(),
                        splitStart, splitStop, hostname);

                    splits.add(split);

                    System.out.println("initializing splits: " +
                        split.toString());

                }

                resultScanner.close();





 -g
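
One way to fix the dropped-row problem flagged in the comment above,
sketched under the assumption that TableSplit end rows are exclusive: end
each split where the next batch begins, and for the final split append a
0x00 byte to the last row, which is the smallest key strictly greater than
it. The makeSplit() helper is hypothetical, wrapping the same TableSplit
construction as the original code.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.mapreduce.InputSplit;

    public class SplitBuilder {
        // Sketch only: scanner and splitSize as in the snippet above.
        static List<InputSplit> buildSplits(HTable table,
                ResultScanner scanner, int splitSize) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            Result[] results;
            byte[] prevStart = null, lastRow = null;
            while ((results = scanner.next(splitSize)).length > 0) {
                byte[] batchStart = results[0].getRow();
                if (prevStart != null) {
                    // The previous split ends (exclusively) where this batch
                    // begins, so the previous batch's last row stays covered.
                    splits.add(makeSplit(table, prevStart, batchStart));
                }
                prevStart = batchStart;
                lastRow = results[results.length - 1].getRow();
            }
            if (prevStart != null) {
                // Appending 0x00 gives the immediate successor of lastRow,
                // so an exclusive end key still includes lastRow itself.
                byte[] stop = Arrays.copyOf(lastRow, lastRow.length + 1);
                splits.add(makeSplit(table, prevStart, stop));
            }
            scanner.close();
            return splits;
        }

        // Hypothetical helper: builds one TableSplit as the original does.
        static InputSplit makeSplit(HTable table, byte[] start, byte[] stop)
                throws IOException {
            HRegionLocation loc = table.getRegionLocation(start);
            return new TableSplit(table.getTableName(), start, stop,
                loc.getServerAddress().getHostname());
        }
    }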




RE: getSplits question

2011-02-09 Thread Geoff Hendrey
Oh, I definitely don't *need* my own getSplits() to run mapreduce. However,
if I want to control the number of records handled by each mapper (splitsize)
and the startrow and endrow, then I thought I had to write my own
getSplits(). Is there another way to accomplish this? I do need the
combination of controlled splitsize and start/endrow.

-geoff
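
For the startrow/endrow half of this, a bounded Scan handed to the stock
table input format works without a custom getSplits(). A sketch follows
(the mapper class, table name, and row keys are hypothetical), though it
does not cap records per mapper:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class BoundedScanJob {
        // Hypothetical identity mapper.
        static class MyMapper
                extends TableMapper<ImmutableBytesWritable, Result> {
            protected void map(ImmutableBytesWritable key, Result value,
                    Context context) throws IOException, InterruptedException {
                context.write(key, value);
            }
        }

        static void configure(Job job) throws IOException {
            // Bound the scanned key range; splits still follow region
            // boundaries, one map per region that intersects the range.
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("startrow"));  // inclusive
            scan.setStopRow(Bytes.toBytes("endrow"));     // exclusive
            TableMapReduceUtil.initTableMapperJob("mytable", scan,
                MyMapper.class, ImmutableBytesWritable.class, Result.class,
                job);
        }
    }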





Re: getSplits question

2011-02-09 Thread Ryan Rawson
By default each map gets the contents of 1 region. A region is by
default a maximum of 256MB. There is no trivial way to generally
bisect a region in half, in terms of row count, just by knowing what
we know (start, end key).

For very large tables that have > 100 regions, this algorithm works
really well and you get some good parallelism.  If you want to see a
lot of parallelism out of 1 region, you might have to work a lot
harder.  Or reduce your region size and have more regions.  Be warned,
though, that more regions have performance hits in other areas
(specifically server startup/shutdown/assignment times).  So you
probably don't want 50,000 32MB regions.
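
A hedged sketch of the reduce-your-region-size option for a single table
(the 64MB cap and table layout are illustrative; the cluster-wide
equivalent is the hbase.hregion.max.filesize setting):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class SmallRegionTable {
        public static void main(String[] args) throws Exception {
            HTableDescriptor desc = new HTableDescriptor("mytable");
            desc.addFamily(new HColumnDescriptor("cf"));
            // Illustrative: split this table's regions at 64MB instead of
            // the 256MB default, giving roughly 4x the regions (and maps).
            desc.setMaxFileSize(64L * 1024 * 1024);
            new HBaseAdmin(HBaseConfiguration.create()).createTable(desc);
        }
    }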

-ryan
