Thanks Asaf, I made my responses inline.

On Thu, Jun 20, 2013 at 9:32 AM, Asaf Mesika <[email protected]> wrote:

> On Thu, Jun 20, 2013 at 12:59 AM, yun peng <[email protected]> wrote:
>
> > Thanks for the reply. The idea is interesting, but in practice, our
> > client doesn't know in advance how much data should be put to one RS.
> > The data write is redirected to the next RS only when the current RS
> > initiates a flush() and begins to block the stream.
>
> Can a single RS handle the load for the duration until HBase splits the
> region and load balancing kicks in and moves the region to another
> server?
>
Right. Currently the timeseries data (i.e., with sequential rowkeys) is
metadata in our system and is not that heavyweight... it can be handled
by a single RS.

> > The real problem is not about splitting an existing region, but about
> > adding a new region (or a new key range).
> > In the original example, before node n3 overflows, the system looks
> > like
> >   n1 [0,4],
> >   n2 [5,9],
> >   n3 [10,14]
> > then n3 starts to flush() (say Memstore.size = 5), which may block
> > the write stream to n3. We want the subsequent write stream to be
> > redirected back to, say, n1, so that n1 now accepts 15, 16... for the
> > range [15,19].
>
> Flush does not block HTable.put() or HTable.batch(), unless your system
> is not tuned and your flushes are slow.
>
If I understand it right, flush() needs to sort data, build an index,
and write sequentially to disk... which I think should, if not block, at
least interfere a lot with the thread doing in-memory writes (plus the
WAL). A drop in write throughput can be expected.
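As a side note, in case our flushes are simply too slow, one thing we may
try first is raising the per-table flush threshold, and possibly deferred
WAL flushing if losing a few recent edits is acceptable. A rough,
untested sketch; the table name, column family, and 256MB figure are
made up:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateStreamTable {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = new HTableDescriptor("stream"); // hypothetical name
    desc.addFamily(new HColumnDescriptor("v"));             // hypothetical family
    // Flush less often, so the sort/index/disk cost is paid less frequently.
    desc.setMemStoreFlushSize(256L * 1024 * 1024);          // placeholder value
    // Optional: trade a little WAL durability for write throughput.
    desc.setDeferredLogFlush(true);
    admin.createTable(desc);
    admin.close();
  }
}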
> > As I understand it, the behaviour above would change HBase's normal
> > way of managing the region-key mapping. We want to know how much
> > effort it would take to change HBase for this.
>
> Well, as I understand it - you write to n3, to a specific region (say
> [10,inf)). Once you pass the max size, it splits into [10,14] and
> [15,inf). If n3 now has more than the average number of regions per RS,
> one region will move to another RS. It may be [10,14] or [15,inf).
>
For example, is it possible to set the "max size" that triggers a split
equal to Memstore.size, so that flush and split (actually just updating
the range from [10,inf) to [10,14] in the .META. table, without an actual
data split) co-occur? And given that is possible, is it also possible to
mandate that the new interval [15,inf) be mapped to the next RS (i.e.,
not chosen based on the number of regions on RS n3)?
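To make that concrete, the closest per-table knobs I could find are the
memstore flush size and the max file size; setting them equal should
make a region a split candidate after roughly one flush, though as far
as I can tell this controls when a split happens, not where the daughter
regions land. A rough, untested sketch; names and the 128MB figure are
made up:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

public class SplitAfterFlushDescriptor {
  public static HTableDescriptor build() {
    HTableDescriptor desc = new HTableDescriptor("stream"); // hypothetical name
    desc.addFamily(new HColumnDescriptor("v"));             // hypothetical family
    long flushSize = 128L * 1024 * 1024;                    // placeholder value
    desc.setMemStoreFlushSize(flushSize);
    // Store files above this size make the region a split candidate, so
    // with maxFileSize == flushSize a split should follow roughly every
    // flush. Where the daughters end up is still the balancer's call.
    desc.setMaxFileSize(flushSize);
    return desc;
  }
}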
> > Besides, I found that Chapter 9, Advanced Usage, in the Definitive
> > Guide talks a bit about this issue, based on the idea of adding a
> > prefix or a hash. In their terminology, we need the "sequential key"
> > approach, but with a managed region mapping.
>
> Why do you need the sequential key approach? Let's say you have a group
> of data correlated in some way but scattered across 2-3 RS. You can
> always write a coprocessor to run some logic close to the data, and
> then run it again on the merged data on the client side, right?
>
I agree with you on this general idea. Let me think a bit...

> > Yun
> >
> > On Wed, Jun 19, 2013 at 5:26 PM, Asaf Mesika <[email protected]>
> > wrote:
> >
> > > You can use the prefix split policy. Put the same prefix on the
> > > data you need in the same region; that way you get locality for
> > > this data as well as a good spread of the load, and you avoid the
> > > default split policy.
> > > I'm not sure you really need the requirement you described below,
> > > unless I didn't follow your business requirements very well.
> > >
> > > On Thursday, June 20, 2013, yun peng wrote:
> > >
> > > > It is our requirement that one batch of data writes (say, of
> > > > Memstore size) should be on one RS, and a salting prefix, while
> > > > it evens out the load, may not have this property.
> > > >
> > > > Our problem is really how to manipulate/customise the mapping of
> > > > row keys (or row key ranges) to the region servers, so that after
> > > > one region overflows and starts to flush, the write stream can be
> > > > automatically redirected to the next region server, as in a
> > > > round-robin way.
> > > >
> > > > Is it possible to customize such a policy on the HMaster? Or is
> > > > there a way similar to what a coprocessor does on the region
> > > > servers...
> > > >
> > > > On Wed, Jun 19, 2013 at 4:58 PM, Asaf Mesika <[email protected]>
> > > > wrote:
> > > >
> > > > > The newly split region might be moved due to load balancing.
> > > > > Aren't you experiencing the classic hotspotting - only one RS
> > > > > getting all the write traffic? Just place a preceding byte
> > > > > before the timestamp and round-robin each put over the values
> > > > > 1 to the number of region servers.
> > > > >
> > > > > On Wednesday, June 19, 2013, yun peng wrote:
> > > > >
> > > > > > Hi All,
> > > > > > Our use case requires persisting a stream into a system like
> > > > > > HBase. The stream data is in the format <timestamp, value>;
> > > > > > in other words, the timestamp is used as the rowkey. We want
> > > > > > to explore whether HBase is suitable for this kind of data.
> > > > > >
> > > > > > The problem is that the domain of the row key (the timestamp)
> > > > > > grows constantly. For example, given 3 nodes n1, n2, n3
> > > > > > respectively hosting the row key partitions [0,4], [5,9], and
> > > > > > [10,12], it is currently the last node, n3, that is busy
> > > > > > receiving the upcoming writes (of row keys 13 and 14). This
> > > > > > continues until the region reaches the max size of 5 (that
> > > > > > is, the partition grows to [10,14]) and potentially splits.
> > > > > >
> > > > > > I am not an expert on HBase splits, but I am wondering: after
> > > > > > a split, will new writes still go to node n3 (for [10,14]),
> > > > > > or can the write stream be intelligently redirected to
> > > > > > another, less busy node, like n1?
> > > > > >
> > > > > > In case HBase can't do things like this, how easy would it be
> > > > > > to extend HBase for such functionality? Thanks...
> > > > > > Yun
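P.S. To make the salting idea from earlier in the thread concrete, here
is a rough, untested sketch of what the round-robin write and a
client-side merged read might look like. The bucket count, table name,
and column/qualifier names are all made up:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedTimeseries {
  static final int NUM_BUCKETS = 3; // assumed: roughly one bucket per RS
  private final AtomicLong counter = new AtomicLong();
  private final HTable table;

  public SaltedTimeseries(HTable table) {
    this.table = table;
  }

  // Write: prepend one round-robin byte so consecutive puts land in
  // different regions instead of hammering the "last" region.
  public void write(long timestamp, byte[] value) throws Exception {
    byte bucket = (byte) (counter.getAndIncrement() % NUM_BUCKETS);
    Put put = new Put(Bytes.add(new byte[] { bucket },
        Bytes.toBytes(timestamp)));
    put.add(Bytes.toBytes("v"), Bytes.toBytes("val"), value);
    table.put(put);
  }

  // Read: fan out one scan per bucket over the same time range and merge
  // on the client; a real implementation would merge-sort by timestamp.
  public List<Result> readRange(long from, long to) throws Exception {
    List<Result> results = new ArrayList<Result>();
    for (byte b = 0; b < NUM_BUCKETS; b++) {
      Scan scan = new Scan(
          Bytes.add(new byte[] { b }, Bytes.toBytes(from)),
          Bytes.add(new byte[] { b }, Bytes.toBytes(to)));
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          results.add(r);
        }
      } finally {
        scanner.close();
      }
    }
    return results;
  }
}

The trade-off is the one already discussed: a time-range read now costs
NUM_BUCKETS scans plus a client-side merge instead of one sequential
scan, and one logical batch no longer sits on a single RS.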
