On May 29, 2012, at 1:24 PM, Em wrote:

>> But you're trading time & space at write time for extremely fast
>> speeds at write time.
> You meant "extremely fast speeds at read time", didn't you?

Ha, yes, thanks. That's what I meant.

> However, this means that Sheldon has to do at least two requests to fetch
> his latest tweets.
> First, get the latest columns (aka tweets) of his row, and second, do a
> multiget to fetch their content. Okay, that's more than one request, but
> you get what I mean. Am I right?

Yup, unless you denormalize the tweet bodies as well--then you just read the 
current user's record and you have everything you need (with the downside of 
massive data duplication).
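
To make that concrete, here's a rough sketch of the two-request read with the
HBase Java client. The table, family, and qualifier names ("timeline",
"tweets", "t") are made up for illustration; the second batch of Gets goes
away entirely if you denormalize the bodies into the timeline row:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table timeline = conn.getTable(TableName.valueOf("timeline"));
             Table tweets = conn.getTable(TableName.valueOf("tweets"))) {

            // Request 1: read Sheldon's timeline row; each column qualifier is a
            // (reverse timestamp : tweet id) pointer written at fan-out time,
            // and the cell value is the row key of the tweet body.
            Result row = timeline.get(
                new Get(Bytes.toBytes("sheldon")).addFamily(Bytes.toBytes("t")));

            // Request 2: multiget the tweet bodies those pointers refer to
            // (assumes the timeline row exists and has at least one column).
            List<Get> bodies = new ArrayList<>();
            for (Map.Entry<byte[], byte[]> col :
                     row.getFamilyMap(Bytes.toBytes("t")).entrySet()) {
                bodies.add(new Get(col.getValue()));
            }
            Result[] latestTweets = tweets.get(bodies);

            // If the tweet bodies were denormalized into the timeline row too,
            // the first Get alone would return everything (at the cost of
            // duplicating every tweet once per follower).
            System.out.println("fetched " + latestTweets.length + " tweets");
        }
    }
}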

> Although there's this little overhead here, does the read perform at low
> latency in a large-scale environment? I think that an RDBMS has to do
> something similar, doesn't it?

Yes, but the relational DB has it all on one node, whereas in a distributed 
database, it's as many RPC calls as you have nodes, or something on the order 
of that (see Nicolas's explanation, which is better than mine). 

The key difference is that the read access time for HBase is still constant 
when you have PB of data (whereas with a PB of data, the relational DB has long 
since fallen over). The underlying reason for that is that HBase basically took 
the top level of the seek operation (which is a fast memory access into the top 
node of a B+ tree, if you're talking about an RDBMS on one machine) and made it 
a cross machine lookup ("which server would have this data?"). So it's 
fundamentally a lot slower than an RDBMS, but still constant w/r/t your overall 
data size. If you remember your algorithms class, constant factors fall out 
when N gets larger, and you only care about the big O. :)
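
For what it's worth, you can poke at that "which server would have this data?"
step directly with the client API. A tiny sketch (the table name is just an
example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class WhereIsMyRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator =
                 conn.getRegionLocator(TableName.valueOf("timeline"))) {
            // One (client-cached) lookup replaces the top of the B+ tree seek:
            // it stays constant w/r/t total data size, it's just a network hop
            // instead of a memory access.
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("sheldon"));
            System.out.println("row 'sheldon' is served by " + loc.getServerName());
        }
    }
}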

> Maybe it makes sense to design the key so that all the relevant
> tweets for a user are placed in the same region?

Again, that works in a denormalized sense, where you lead each row key with the 
current user (the tweet recipient) and copy the tweets there. If you are saying 
that you'd somehow find a magic formula so that people who follow each other 
happen to be on the same region server, you can forget it. Facebook and Twitter 
have tried partitioning schemes like that for their social graph data, and 
found there are no magic boundaries, not even countries. (I know, citation needed, 
and I can't find it just now, but I'm sure I read that somewhere. :) But it 
makes sense; I can follow anybody, anybody can follow me, so on average my 
followers and followees would be randomly sprayed across all region servers. I 
either denormalize stuff (so my row contains everything I need) or I commit to 
doing GETs to all other region servers when I need that data. And if I really 
need it to be low latency, it's not a winning bet to have every read spray 
across a lot of servers like that. (That's just my opinion, perhaps I'm wrong 
about that.)
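
Here's roughly what that recipient-leading, denormalized layout looks like at
write time; the table/family names and the follower list are assumptions just
for illustration:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FanOutWrite {
    static void fanOut(Table timeline, List<String> followers,
                       String tweetId, String body, long ts) throws Exception {
        byte[] family = Bytes.toBytes("t");
        // Qualifiers sort newest-first because they lead with the reverse
        // timestamp (fixed-width for realistic values of ts).
        byte[] qualifier = Bytes.toBytes((Long.MAX_VALUE - ts) + ":" + tweetId);

        List<Put> puts = new ArrayList<>();
        for (String follower : followers) {
            // Row key is the *recipient*, so each follower's home-timeline
            // read hits a single row, no matter who they follow.
            Put p = new Put(Bytes.toBytes(follower));
            p.addColumn(family, qualifier, Bytes.toBytes(body)); // copy the body
            puts.add(p);
        }
        timeline.put(puts); // the client batches these by region server
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table timeline = conn.getTable(TableName.valueOf("timeline"))) {
            fanOut(timeline, Arrays.asList("leonard", "penny", "howard"),
                   "tweet42", "Bazinga!", System.currentTimeMillis());
        }
    }
}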

> Maybe a tall schema would be better, since row locking can impact
> performance a lot for people who follow a lot of other people, but
> that's not the topic here.

Yes, very true. We also haven't broached the subject of how long writes would 
take if you denormalize, and whether they should be asynchronous (for everyone, 
or just for the Wil Wheatons). If you put all of my incoming tweets into a 
single row for me, that's potentially lock-wait city. Twitter has a follow 
limit, maybe that's why. :)
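
For completeness, here's a sketch of the tall-schema variant you mention:
every incoming tweet becomes its own narrow row, so concurrent fan-out writes
never contend on one fat row. The key layout and names are, again, just
assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TallTimeline {
    static byte[] rowKey(String recipient, long ts, String tweetId) {
        // Rows for one recipient stay contiguous and sort newest-first
        // (the reverse timestamp is fixed-width for realistic values of ts).
        return Bytes.toBytes(recipient + ":" + (Long.MAX_VALUE - ts) + ":" + tweetId);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table timeline = conn.getTable(TableName.valueOf("timeline_tall"))) {

            // Write: each tweet is its own row, so there is no shared row lock.
            Put p = new Put(rowKey("sheldon", System.currentTimeMillis(), "tweet42"));
            p.addColumn(Bytes.toBytes("t"), Bytes.toBytes("body"),
                        Bytes.toBytes("Bazinga!"));
            timeline.put(p);

            // Read: a prefix scan over "sheldon:" returns the newest tweets first.
            Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("sheldon:"));
            int shown = 0;
            try (ResultScanner scanner = timeline.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(
                        r.getValue(Bytes.toBytes("t"), Bytes.toBytes("body"))));
                    if (++shown == 20) break; // only the latest 20
                }
            }
        }
    }
}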

Ian
