I am planning to create a "scheduled task list" table in our hbase cluster. 
Essentially we will define a table keyed by timestamp, and then the row contents 
will be all the tasks that need to be processed within that second (or whatever 
time period). I am trying to do the "reasonably wide rows" design mentioned in 
the hbasecon opentsdb talk. A couple of questions:

1. Should we use append or put to create tasks? Since these rows will not live 
forever, storage space is not a concern; read/write performance is more 
important. As concurrency increases I would guess the row lock may become an 
issue in append? Can appends be batched by the client or do they execute 
immediately?
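
To make the two options concrete, here is a rough sketch of the row layouts I 
have in mind. It is plain Python with no HBase client involved: the dicts 
stand in for a single row, and the qualifier names, task payloads, and the 
length-prefix encoding are all made up for illustration. With put, each task 
would get its own column; with append, every task would be concatenated into 
one cell, so each record needs some framing to be split apart again on read.

```python
import struct


def encode_task(payload: bytes) -> bytes:
    """Length-prefix a task so appended records can be split apart later."""
    return struct.pack(">I", len(payload)) + payload


def decode_tasks(cell: bytes) -> list:
    """Split a cell built by repeated appends back into task payloads."""
    tasks, pos = [], 0
    while pos < len(cell):
        (size,) = struct.unpack_from(">I", cell, pos)
        pos += 4
        tasks.append(cell[pos:pos + size])
        pos += size
    return tasks


# Layout A (put): one column per task, qualifier = hypothetical task id.
row_put = {b"task-1": b"send-email", b"task-2": b"rebuild-index"}

# Layout B (append): a single column whose value grows with each task.
cell = b""
for payload in (b"send-email", b"rebuild-index"):
    cell += encode_task(payload)  # stands in for one append to the cell
row_append = {b"tasks": cell}

print(decode_tasks(row_append[b"tasks"]))
```

Either layout yields the same set of tasks for a given timestamp row; the 
difference is whether the reader gets many small columns or one cell it has 
to parse.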

2. I am a little worried about hotspots. This basic design may cause issues in 
terms of the table's performance. Many tasks will execute and reschedule 
themselves using the same interval, t + 1 hour for example. So many of the 
writes may all go to the same block. Also, we have a lot of other data so I am 
worried it may impact performance of unrelated data if the region server gets 
too busy servicing the task list table. I can think of 2 strategies to avoid 
this. One would be to create N different tables and read/write tasks to them 
randomly. This may spread load across servers, but there is no guarantee hbase 
will place the tables on different region servers, correct? The other would be 
to prefix the timestamp row key with a random leading byte. Then when reading 
from the task list table, consumers could scan from any/all possible values of 
the random byte + current timestamp to obtain tasks. Both strategies seem like 
they could spread out load, but at the cost of more work/complexity to read 
tasks from the table. Does either of those approaches make sense? 
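
Here is a minimal sketch of the prefixed-key idea, again in plain Python with 
no HBase calls. The assumptions are mine: 16 salt values, an 8-byte big-endian 
timestamp in seconds, and a salt derived from a hash of a hypothetical task id 
rather than a truly random byte (so the same task always lands in the same 
bucket). A consumer would need one scan per salt value to collect everything 
due in a time window.

```python
import struct
import zlib

NUM_BUCKETS = 16  # assumed number of salt values, not an HBase setting


def salted_key(ts_seconds: int, task_id: str) -> bytes:
    """Row key = 1 salt byte followed by an 8-byte big-endian timestamp."""
    salt = zlib.crc32(task_id.encode("utf-8")) % NUM_BUCKETS
    return bytes([salt]) + struct.pack(">Q", ts_seconds)


def scan_ranges(start_ts: int, stop_ts: int) -> list:
    """One (start_row, stop_row) pair per salt bucket; stop_row is exclusive."""
    return [
        (bytes([b]) + struct.pack(">Q", start_ts),
         bytes([b]) + struct.pack(">Q", stop_ts))
        for b in range(NUM_BUCKETS)
    ]


key = salted_key(1700000000, "task-42")
ranges = scan_ranges(1700000000, 1700000001)
# The key falls inside exactly one of the per-bucket ranges.
matches = [r for r in ranges if r[0] <= key < r[1]]
print(len(ranges), len(matches))
```

The big-endian timestamp keeps rows within a bucket sorted by time, so each 
of the per-bucket scans is still a contiguous range.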

On the read side, it seems like a similar problem exists in that all consumers 
will be reading rows based on the current timestamp. Is this good because the 
block will very likely be cached or bad because the region server may become 
overloaded? I have a feeling the answer is going to be "it depends". :)

I did see the previous posts on queues and the tips there - use zookeeper for 
coordination, schedule major compactions, etc. Sorry if these questions are 
basic, I am pretty new to hbase. Thanks!
