several doubts about region split?
Dear all,

The HBase reference book says that when a RegionServer splits a region, it offlines the region being split, adds the daughter regions to META, opens the daughters on the parent's hosting RegionServer, and then reports the split to the Master. I have several questions:

1. What does "offline" mean? Does it mean the region being split is no longer available? What happens to the read and write requests to that region?

2. If I understand the description correctly, the same RegionServer will now hold two regions (one RegionServer for both the daughter and parent regions) instead of one RegionServer for the daughter and another for the parent. If so, what are the benefits of this approach? The hot-spot problem is still there. Moreover, this approach becomes a big problem if we use the default HBase split approach. Suppose we bulk load data into an HBase cluster: initially every write request is accepted by only one RegionServer. After some number of writes, that RegionServer can no longer respond to write requests because it has reached its disk volume threshold. Hence, some data must be moved from that RegionServer to another RegionServer. My question is: why don't we do this at region split time?

Thanks!
Yong
Re: several doubts about region split?
On Wed, Jul 17, 2013 at 6:53 AM, yonghu yongyong...@gmail.com wrote:

bq. Does it mean the region being split is no longer available?

Right.

bq. What happens to the read and write requests to that region?

The requests won't be served by the hosting region server until the daughter regions come online.

I will try to dig up an answer to question #2. In short, the load balancer is supposed to move one of the daughter regions off that server if the write load continues.

Cheers
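The sequence described above (daughters open on the parent's server first, and a balancer moves one away later) can be illustrated with a toy model. This is only a sketch: the `Region` and `Server` classes, the midpoint split, and the count-based `balance` function are all hypothetical simplifications, not HBase's actual split or balancer logic.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    start: int  # inclusive start of the key range
    end: int    # exclusive end of the key range
    size: int   # stored data, in arbitrary units


@dataclass
class Server:
    name: str
    regions: list = field(default_factory=list)


def split(server, region):
    """Replace a region with two daughters, both opened on the same server."""
    mid = (region.start + region.end) // 2
    half = region.size // 2
    daughters = [
        Region(region.start, mid, half),
        Region(mid, region.end, region.size - half),
    ]
    idx = server.regions.index(region)
    server.regions[idx:idx + 1] = daughters
    return daughters


def balance(servers):
    """Toy balancer: move regions until region counts differ by at most one."""
    while True:
        busiest = max(servers, key=lambda s: len(s.regions))
        idlest = min(servers, key=lambda s: len(s.regions))
        if len(busiest.regions) - len(idlest.regions) <= 1:
            return
        idlest.regions.append(busiest.regions.pop())


rs1, rs2 = Server("rs1"), Server("rs2")
parent = Region(0, 100, 10)
rs1.regions.append(parent)

split(rs1, parent)
after_split = (len(rs1.regions), len(rs2.regions))    # both daughters still on rs1

balance([rs1, rs2])
after_balance = (len(rs1.regions), len(rs2.regions))  # one daughter moved to rs2
```

In this model the split itself never moves data between servers; spreading the daughters out is a separate, later step, which mirrors the division of work described in the reply.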
Re: several doubts about region split?
Thanks for your quick response! For question one, what will the latency be? How long do we need to wait until the daughter regions are online again?

Regards,
Yong

On Wed, Jul 17, 2013 at 4:05 PM, Ted Yu yuzhih...@gmail.com wrote:

bq. The requests wouldn't be served by the hosting region server until daughter regions become online.
Re: several doubts about region split?
Inline.

On Wed, Jul 17, 2013 at 7:10 AM, yonghu yongyong...@gmail.com wrote:

bq. For question one, what will the latency be? How long do we need to wait until the daughter regions are online again?

Usually a matter of 1-2 seconds.

bq. If I understand the description correctly, the same RegionServer will now hold two regions [...] If so, what are the benefits of this approach? The hot-spot problem is still there.

It's not a load problem, it's a data problem. We split when a region has enough data. HBase then relies on the master to balance regions across the cluster.

bq. Suppose we bulk load data into an HBase cluster: initially every write request is accepted by only one RegionServer. After some number of writes, that RegionServer can no longer respond to write requests because it has reached its disk volume threshold. Hence, some data must be moved from that RegionServer to another RegionServer. Why don't we do this at region split time?

Since you read the reference book, you will also find in there that we recommend never bulk loading data into a table with only 1 region. You should always create your tables with pre-defined splits if you plan on importing a lot of data.

J-D
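As an illustration of the pre-defined splits recommended above, here is a minimal sketch that computes evenly spaced split keys and renders them as an HBase shell `create ... SPLITS => [...]` command. It assumes row keys start with a uniformly distributed lowercase hex prefix (for example, a hash of the natural key); the helper names `hex_split_keys` and `shell_create`, the table name `mytable`, and the column family `cf` are hypothetical.

```python
def hex_split_keys(num_regions, width=2):
    """Return num_regions - 1 evenly spaced split keys over a hex prefix space."""
    if num_regions < 2:
        return []  # a single region needs no split keys
    space = 16 ** width  # number of distinct prefixes of this width
    if num_regions > space:
        raise ValueError("more regions than distinct prefixes")
    return [
        format(i * space // num_regions, "0{}x".format(width))
        for i in range(1, num_regions)
    ]


def shell_create(table, family, keys):
    """Render an HBase shell command that creates a pre-split table."""
    quoted = ", ".join("'{}'".format(k) for k in keys)
    return "create '{}', '{}', SPLITS => [{}]".format(table, family, quoted)


keys = hex_split_keys(4)  # 4 regions need 3 split keys
command = shell_create("mytable", "cf", keys)
print(command)  # create 'mytable', 'cf', SPLITS => ['40', '80', 'c0']
```

With the table created this way, a bulk import writes to four regions from the start rather than funneling every write through one. Split keys only help if they match the real key distribution, so the uniform-prefix assumption should be checked against the actual data.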