several doubts about region split?

2013-07-17 Thread yonghu
Dear all,

The HBase reference book mentions that when a RegionServer splits a
region, it offlines the split region, adds the daughter regions to META,
opens the daughters on the parent's hosting RegionServer, and then reports
the split to the Master.

I have several questions:

1. What does "offline" mean? Does it mean the region being split is no
longer available? What happens to read and write requests to that region?

2. If I understand the description correctly, the same RegionServer now
hosts two regions (one RegionServer for both the daughter and parent
regions) instead of one RegionServer for the daughter and one for the
parent. If so, what are the benefits of this approach? The hot-spot problem
is still there. Moreover, this approach will be a big problem if we use the
default HBase split approach. Suppose we bulk load data into an HBase
cluster: initially every write request will be accepted by only one
RegionServer. After some write requests, the RegionServer cannot respond to
any more write requests because it has reached its disk volume threshold.
Hence, some data must be moved from one RegionServer to another. My
question is: why don't we do that at region split time?

Thanks!

Yong


Re: several doubts about region split?

2013-07-17 Thread Ted Yu
bq. Does it mean the region which will be split is not available anymore?

Right.

bq. What happens to the read and write requests to that region?

The hosting region server will not serve those requests until the daughter
regions come online.

I will try to dig up an answer to question #2.
In short, the load balancer is supposed to offload one of the daughter
regions to another server if the write load continues.

Cheers




Re: several doubts about region split?

2013-07-17 Thread yonghu
Thanks for your quick response!

For question one, what is the latency? How long do we need to wait until
the daughter regions are online again?

regards!

Yong



 



Re: several doubts about region split?

2013-07-17 Thread Jean-Daniel Cryans
Inline.

J-D

On Wed, Jul 17, 2013 at 7:10 AM, yonghu yongyong...@gmail.com wrote:
 Thanks for your quick response!

 For question one, what is the latency? How long do we need to wait until
 the daughter regions are online again?

Usually a matter of 1-2 seconds.


  2. If I understand the description correctly, the same RegionServer now
  hosts two regions (one RegionServer for both the daughter and parent
  regions) instead of one RegionServer for the daughter and one for the
  parent. If so, what are the benefits of this approach? The hot-spot
  problem is still there.

It's not a load problem, it's a data problem. We split a region when it
holds enough data. HBase then relies on the master to do some balancing
across the cluster.
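
The two steps described above (split on data size, then let the master
balance) can be sketched as a toy model. This is only an illustration under
stated assumptions: the function names, the size units, and the
balance-by-region-count rule are hypothetical and are not HBase's actual
split policy or balancer code.

```python
def split_large_regions(regions: list[int], threshold: int) -> list[int]:
    """Split every region whose size reaches the threshold into two daughters.

    Both daughters stay in the returned list, i.e. on the same server,
    mirroring the "daughters open on the parent's server" behaviour.
    """
    result = []
    for size in regions:
        if size >= threshold:
            half = size // 2
            result.extend([half, size - half])
        else:
            result.append(size)
    return result


def balance(cluster: dict[str, list[int]]) -> None:
    """Move regions from the fullest to the emptiest server (by region count)
    until the counts differ by at most one. Hypothetical rule for illustration.
    """
    while True:
        busiest = max(cluster, key=lambda s: len(cluster[s]))
        idlest = min(cluster, key=lambda s: len(cluster[s]))
        if len(cluster[busiest]) - len(cluster[idlest]) <= 1:
            return
        cluster[idlest].append(cluster[busiest].pop())


if __name__ == "__main__":
    # One hot region on rs1, nothing on rs2 (sizes are arbitrary units).
    cluster = {"rs1": [10], "rs2": []}
    cluster["rs1"] = split_large_regions(cluster["rs1"], threshold=10)
    print("after split:  ", cluster)
    balance(cluster)
    print("after balance:", cluster)
```

In this model the split itself moves no data between servers. Only the
later balancing step relocates a daughter, which is the separation of
concerns the reply describes.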

  Moreover, this approach will be a big problem if we use the default HBase
  split approach. Suppose we bulk load data into an HBase cluster:
  initially every write request will be accepted by only one RegionServer.
  After some write requests, the RegionServer cannot respond to any more
  write requests because it has reached its disk volume threshold. Hence,
  some data must be moved from one RegionServer to another. My question is:
  why don't we do that at region split time?

Since you read the reference book, you will also find in there that we
recommend never bulk loading data into a table with only 1 region. You
should always create your tables with pre-defined splits if you plan
on importing a lot of data.

J-D
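
As an illustration of the pre-defined splits recommended above, here is a
minimal sketch that computes evenly spaced split keys. It assumes row keys
start with a fixed-width lowercase hex prefix (for example, a hash of the
natural key). That key layout and the helper names `hex_split_points` and
`region_index` are assumptions made for this example, not something stated
in the thread.

```python
from bisect import bisect_right


def hex_split_points(num_regions: int, width: int = 8) -> list[str]:
    """Return num_regions - 1 evenly spaced boundaries over a hex key space.

    Keys are assumed to be lowercase hex strings of exactly `width` digits.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    keyspace = 16 ** width
    return [
        format(i * keyspace // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]


def region_index(row_key: str, splits: list[str]) -> int:
    """Return the index of the region that would hold row_key.

    Each region covers [start_key, end_key), so a key equal to a split
    point belongs to the region that starts at that point.
    """
    return bisect_right(splits, row_key)


if __name__ == "__main__":
    splits = hex_split_points(4)
    print(splits)
    for key in ("00000001", "40000000", "ffffffff"):
        print(key, "-> region", region_index(key, splits))
```

The resulting list is what you would supply as the split keys at table
creation time (for instance, through the SPLITS option of the HBase shell's
create command), so that a bulk import is spread over several regions from
the start instead of landing on a single one.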