Hi Ian,

answers inline below:

On 29.05.2012 21:26, Ian Varley wrote:
>> However this means that Sheldon has to do at least two requests to fetch
>> his latest tweets.
>> First: Get the latest columns (aka tweets) of his row and second do a
>> multiget to fetch their content. Okay, that's more than one request but
>> you get what I mean. Am I right?
> 
> Yup, unless you denormalize the tweet bodies as well--then you just read the 
> current user's record and you have everything you need (with the downside of 
> massive data duplication).
Well, I think this would be bad practice for editable content like
tweets: they can be deleted, updated and so on. Furthermore, at some
point the data duplication will take its toll - big data or not.
Do you agree?
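
Just to be sure we mean the same thing, here is roughly what those two
requests could look like with the Java client API (the table names, the
"s" column family and the "sheldon" row key are just made up for the
example):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TwoStepReadSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // index: one row per user, column qualifiers = tweet ids
      HTable streams = new HTable(conf, "streams");
      // one row per tweet, holding the tweet body
      HTable tweets = new HTable(conf, "tweets");

      // request 1: read Sheldon's index row
      Result index = streams.get(new Get(Bytes.toBytes("sheldon")));

      // request 2: multiget the referenced tweet bodies
      List<Get> gets = new ArrayList<Get>();
      for (byte[] tweetId : index.getFamilyMap(Bytes.toBytes("s")).keySet()) {
        gets.add(new Get(tweetId));
      }
      Result[] bodies = tweets.get(gets);
      System.out.println(bodies.length + " tweets fetched");
    }
  }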

>> Although I got this little overhead here, do reads perform at low
>> latency in a large-scale environment? I think that an RDBMS has to do
>> something similar, doesn't it?
> 
> Yes, but the relational DB has it all on one node, whereas in a distributed 
> database, it's as many RPC calls as you have nodes, or something on the order 
> of that (see Nicolas's explanation, which is better than mine). 

Nicolas..? :)

> The key difference is that that read access time for HBase is still constant 
> when you have PB of data (whereas with a PB of data, the relational DB has 
> long since fallen over). The underlying reason for that is that HBase 
> basically took the top level of the seek operation (which is a fast memory 
> access into the top node of a B+ tree, if you're talking about an RDBMS on 
> one machine) and made it a cross machine lookup ("which server would have 
> this data?"). So it's fundamentally a lot slower than an RDBMS, but still 
> constant w/r/t your overall data size. If you remember your algorithms class, 
> constant factors fall out when N gets larger, and you only care about the big 
> O. :)

Yes :) I am interested in the underlying lookup algorithms of HBase. I
think it's an interesting approach.
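
As far as I understand, the lookup does not really hash at all: regions
are ranges of sorted row keys, so "which server has this row?" is
basically a floor lookup over the region start keys (which the real
client finds and caches via the META catalog table). A purely
conceptual sketch - not HBase's actual client code - of why this stays
constant with respect to the overall data size:

  import java.util.TreeMap;

  public class RegionLookupSketch {
    // region start key -> server hosting that region (toy data)
    private final TreeMap<String, String> regionsByStartKey =
        new TreeMap<String, String>();

    public String serverFor(String rowKey) {
      // cost depends on the number of regions, not the number of rows
      return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
      RegionLookupSketch lookup = new RegionLookupSketch();
      lookup.regionsByStartKey.put("", "rs1.example.com");
      lookup.regionsByStartKey.put("m", "rs2.example.com");
      System.out.println(lookup.serverFor("sheldon")); // -> rs2.example.com
    }
  }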

>> Maybe it makes sense to have a design of a key so that all the relevant
>> tweets for a user are placed on the same region?
> 
> Again, that works in a denormalized sense, where you lead each row key with 
> the current user (the tweet recipient) and copy the tweets there. If you are 
> saying that you'd somehow find a magic formula so that people who follow each 
> other happen to be on the same region server, you can forget it. Facebook and 
> Twitter have tried partitioning schemes like that for their social graph 
> data, and found there's no magic boundaries, even countries. (I know, 
> citation needed, and I can't find it just now, but I'm sure I read that 
> somewhere. :) But it makes sense; I can follow anybody, anybody can follow 
> me, so on average my followers and followees would be randomly sprayed across 
> all region servers. I either denormalize stuff (so my row contains everything 
> I need) or I commit to doing GETs to all other region servers when I need 
> that data. And if I really need it to be low latency, it's not a winning bet 
> to have every read spray across a lot of servers like that. (That's just my 
> opinion, perhaps I'm wrong about that.)

I think we misunderstood each other.
If we go down the road of data duplication, why not generate the keys
in a way that all tweets of a user's stream end up on the same region
server?
That way a multiget only has to call one or two region servers to fetch
all of your tweets.
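
For example, a key layout like this (completely made up, just to
illustrate the idea) would keep a whole stream contiguous, so it lives
in one region - or two, if it happens to span a split boundary:

  import org.apache.hadoop.hbase.util.Bytes;

  public class StreamKeySketch {
    public static void main(String[] args) {
      // hypothetical layout: <streamOwner><Long.MAX_VALUE - ts><tweetId>
      byte[] rowKey = Bytes.add(
          Bytes.toBytes("sheldon"),        // stream owner keeps the stream together
          Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis()), // newest first
          Bytes.toBytes("tweet4711"));     // made-up tweet id for uniqueness
      System.out.println(Bytes.toStringBinary(rowKey));
    }
  }

Since all keys of a stream share the "sheldon" prefix, a single scan or
multiget over that prefix normally talks to just one region server.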

>> Maybe a tall schema is the better choice, since row locking can impact
>> performance a lot for people who follow a lot of other people, but
>> that's not the topic here.
> 
> Yes, very true. We also haven't broached the subject of how long writes would 
> take if you denormalize, and whether they should be asynchronous (for 
> everyone, or just for the Wil Wheatons). If you put all of my incoming tweets 
> into a single row for me, that's potentially lock-wait city. Twitter has a 
> follow limit, maybe that's why. :)
Oh, do they?
However, as far as I know they are using MySQL for their tweets.


So, to generalize our results:
There are two major approaches to designing data streams that are
produced by n users and consumed by m users (with m being equal to or
larger than n).

One approach is to create a per-user index of the interesting data,
where either each column holds the key of the information of interest
(wide schema), or each row, tied to the user by a key prefix, points to
the data (tall schema).
To access that data one has to do the join on the HBase side.

The second approach is to make use of massive data duplication. That
means writing the same data into every user's row so that every user
can access it immediately without the need for multigets (given a
wide-schema design). This saves requests and latency at the cost of
writes, throughput and network traffic.
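
For the duplication approach, the write side could look roughly like
this (the follower lookup, the table name and the "s" family are made
up; in practice this would probably happen asynchronously, as Ian
mentioned):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class FanOutWriteSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable streams = new HTable(conf, "streams"); // wide schema: one row per user

      byte[] tweetId = Bytes.toBytes("tweet4711");  // made-up tweet id
      byte[] body = Bytes.toBytes("Bazinga!");

      // copy the tweet into the stream row of every follower
      List<Put> puts = new ArrayList<Put>();
      for (String follower : loadFollowers("sheldon")) { // hypothetical lookup
        Put put = new Put(Bytes.toBytes(follower));
        put.add(Bytes.toBytes("s"), tweetId, body);
        puts.add(put);
      }
      streams.put(puts);
    }

    private static List<String> loadFollowers(String user) {
      return new ArrayList<String>(); // placeholder for the real follower index
    }
  }

So one tweet turns into m puts - that is the write amplification and
network traffic mentioned above.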

Are there other approaches for designing data streams in HBase?

Do you agree with that?

Using HBase seems like over-engineering if you do not need that scale.

Regards,
Em
