On May 29, 2012, at 3:25 PM, Em wrote:

Yup, unless you denormalize the tweet bodies as well--then you just read the 
current user's record and you have everything you need (with the downside of 
massive data duplication).
Well, I think this would be bad practice for editable stuff like tweets.
They can be deleted, updated, etc. Furthermore, at some point the data
duplication will take its toll, big data or not.
Do you agree?

You're probably right, yes. For the real Twitter, denormalizing every tweet to 
every follower would likely be ruinously expensive. For something like a 
message-based app (like, say, chat), it's much more plausible because there's a 
fixed and limited number of participants who'd see any given message.

The important point here, though, is the "YMMV" part ("Your Mileage May Vary", 
for those not familiar with the idiom). This kind of stuff is exceedingly 
difficult to give easy, general solutions to because every case is different. 
How many people does the average user follow? How big are the message bodies? 
What's the ratio of CPUs to disk spindles? Etc.

Yes, but the relational DB has it all on one node, whereas in a distributed 
database, it's as many RPC calls as you have nodes, or something on the order 
of that (see Nicolas's explanation, which is better than mine).

Nicolas..? :)

Ah, sorry, I mean "N Keywal" who replied upthread. His name is Nicolas. :) His 
response was about the situations in which the GET requests would be batched or 
sent separately.

The key difference is that the read access time for HBase is still constant 
when you have PB of data (whereas with a PB of data, the relational DB has long 
since fallen over). The underlying reason for that is that HBase basically took 
the top level of the seek operation (which is a fast memory access into the top 
node of a B+ tree, if you're talking about an RDBMS on one machine) and made it 
a cross machine lookup ("which server would have this data?"). So it's 
fundamentally a lot slower than an RDBMS, but still constant w/r/t your overall 
data size. If you remember your algorithms class, constant factors fall out 
when N gets larger, and you only care about the big O. :)
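
If you want to see that cross-machine lookup from the client side, here's a 
minimal sketch. It uses the newer RegionLocator API (which post-dates this 
thread; older clients did the same thing via HTable) and a hypothetical "tweet" 
table. The same location lookup sits under every Get and Scan, and the client 
caches the answer after the first call.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class WhereIsMyRow {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         RegionLocator locator = conn.getRegionLocator(TableName.valueOf("tweet"))) {
      // The "top level of the seek": which region (and hence which region server)
      // holds this row key? The client resolves this against the meta table and caches it.
      HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("123"));
      System.out.println("row 123 -> region "
          + loc.getRegionInfo().getRegionNameAsString()
          + " on " + loc.getHostname());
    }
  }
}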

Yes :) I am interested in the underlying hash algorithms of HBase. I think
it's an interesting approach.

It has no hash-join algorithms built in, if that's what you mean. But you can 
read the client access code path, and see the anatomy of a scan--how it finds 
out which regions to hit, creates requests to all of them, combines the 
results, etc.
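
A small sketch of the piece that decides "which regions to hit" (again using 
the newer RegionLocator API and the hypothetical "tweet" table): it dumps the 
start/end key of every region and the server hosting it, which is exactly the 
map a scan gets fanned out against.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanAnatomy {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         RegionLocator locator = conn.getRegionLocator(TableName.valueOf("tweet"))) {
      // A scan over [startRow, stopRow) contacts every region whose
      // [startKey, endKey) range overlaps it; this is the map the client consults.
      for (HRegionLocation loc : locator.getAllRegionLocations()) {
        System.out.println(Bytes.toStringBinary(loc.getRegionInfo().getStartKey())
            + " .. " + Bytes.toStringBinary(loc.getRegionInfo().getEndKey())
            + " -> " + loc.getHostname());
      }
    }
  }
}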

If you mean the overall approach of HBase, as contrasted with the B+ trees used 
by relational databases: this is based on the ideas of "Log Structured Merge 
Trees"; the original paper on it is here:

http://scholar.google.com/scholar?cluster=5832040552580693098&hl=en&as_sdt=0,5&as_vis=1


If we go down the road of data duplication, why not generate keys in a
way that all of the tweets in a user's tweet stream end up on the same
region server?
That way, a multiget would only have to call one or two region servers for
all your tweets.

What would be an example of this? Say we have tweets from 3 users. The 
normalized "tweet" table might have data like:

TweetID   Time User Body
--------- ---- ---- -----------------
123       T1   A    Hello world
234       T2   B    OMG BIEBER!
345       T3   C    RT @B OMG BIEBER!
456       T4   A    I hate twitter
567       T5   B    ZOMG MORE BIEBER!
etc.

If we denorm just the IDs by follower (let's say everyone follows everyone 
here, and you don't see your own tweets), we'd get:

Follower Time Followee  TweetID
-------- ---- --------- -----------
A        T2   B         234
A        T3   C         345
A        T5   B         567
B        T1   A         123
B        T3   C         345
B        T4   A         456
C        T1   A         123
C        T2   B         234
C        T4   A         456
C        T5   B         567

Here, if I know who I am (the follower) I can easily scan a time range of all 
the tweets I should see, and it'll almost always be in one region (with the 
rare exception where the range happens to straddle a region boundary, which 
the client would handle transparently, of course). So if I'm user C, and 
I scan from T1 to T5, I'd get (A,123), (B,234), (A,456), and (B,567) back.
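
In client code that scan is about this much. This is only a sketch, not 
anything canonical: I'm assuming a "timeline" table whose row key is 
<follower><timestamp>, a column family "f" with "followee" and "tweet_id" 
qualifiers, and a current client API.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineScan {
  public static void main(String[] args) throws Exception {
    long t1 = 1L, t5 = 5L; // stand-ins for the T1..T5 timestamps above
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table timeline = conn.getTable(TableName.valueOf("timeline"))) {
      // Row key = <follower><timestamp>, so user C's index rows sort together.
      Scan scan = new Scan()
          .withStartRow(Bytes.add(Bytes.toBytes("C"), Bytes.toBytes(t1)))
          .withStopRow(Bytes.add(Bytes.toBytes("C"), Bytes.toBytes(t5 + 1)));
      try (ResultScanner rs = timeline.getScanner(scan)) {
        for (Result r : rs) {
          String followee = Bytes.toString(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("followee")));
          String tweetId  = Bytes.toString(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("tweet_id")));
          System.out.println(followee + "," + tweetId); // A,123  B,234  A,456  B,567
        }
      }
    }
  }
}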

Of course, I then want to do a GET for the body of each tweet. You're asking, 
why not organize the tweet table in such a way as to make it more likely that 
these 4 tweets will be co-located?

The answer to that in the generic case is that you can't; the IDs of the tweets 
are completely orthogonal to the followers & followees, so you can't count on 
that kind of colocation. It looks easy with 3 users, but 
imagine there are hundreds of millions of users; what are the chances I follow 
people whose tweets happen to be co-located?
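
So in practice you just issue the follow-up reads as one batched multiget, and 
the client splits the list up by region server rather than sending one RPC per 
row, which is the batching behavior Nicolas described upthread. A sketch, 
assuming a hypothetical "tweet" table holding the bodies and IDs coming from 
the timeline scan above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TweetBodies {
  // tweetTable is the (hypothetical) "tweet" table; tweetIds came from the timeline scan.
  static Result[] fetchBodies(Table tweetTable, List<String> tweetIds) throws IOException {
    List<Get> gets = new ArrayList<>();
    for (String id : tweetIds) {
      gets.add(new Get(Bytes.toBytes(id)));
    }
    // One call; the client groups the Gets by the region server that owns each row.
    return tweetTable.get(gets);
  }
}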

In this case, we could do a little better by making the tweet ID into a UUID 
with a time component, such that tweets around the same time would be grouped 
together in the table. The problem with this, of course, is that now all insert 
activity is going to "hot spot" one area of the table (i.e. one region server). 
So what you really want is sort of the opposite; you want all the incoming 
traffic to be nicely distributed so the write load is spread out. Thus, you 
actually probably don't want the tweets you'd see in a given page to be 
co-located on the same server, unless it's specifically YOUR copy of the tweets 
(which gets back to the denormalization question above).
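
One common way to get that spread, sketched below with made-up names and a 
made-up bucket count, is to prefix the time-ordered key with a small 
hash-derived salt so sequential inserts land on different pre-split regions; 
the price is that a time-range read now has to fan out to every bucket and 
merge the results.

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedTweetKey {
  static final int BUCKETS = 16; // assumes the table was pre-split into 16 regions

  static byte[] rowKey(String tweetId, long timestamp) {
    // Salt derived from the id: spreads the write load across regions,
    // but a time-range scan must now cover all BUCKETS key ranges.
    byte salt = (byte) ((tweetId.hashCode() & 0x7fffffff) % BUCKETS);
    return Bytes.add(new byte[] { salt }, Bytes.toBytes(timestamp), Bytes.toBytes(tweetId));
  }
}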

As you can see, there's no "easy" answer here either. This is why you get paid 
the big bucks. :)


Yes, very true. We also haven't broached the subject of how long writes would 
take if you denormalize, and whether they should be asynchronous (for everyone, 
or just for the Wil Wheatons). If you put all of my incoming tweets into a 
single row for me, that's potentially lock-wait city. Twitter has a follow 
limit, maybe that's why. :)
Oh, do they?
However, as far as I know they are using MySQL for their tweets.

Indeed. But MySQL doesn't have secret sauce there, either; when you shard it, 
it's a lot like HBase, except without the auto-sharding, and with DDL commands 
that claim to work but don't if you have a lot of data. ;)


So, to generalize our results:
There are two major approaches to designing data streams that come from n
users and are consumed by m users (with m equal to or larger than n).

One approach is to create a per-user index of the interesting data, where
either each column holds the key to the information of interest
(wide schema), or each row, tied to the user by a key prefix, points to the
data (tall schema).
To access the data itself, you then have to do a join on the HBase side
(e.g. a multiget for the referenced rows).

The second approach is to make use of massive data duplication. That means
writing the same data to every user, so that every user can access it
immediately without needing multigets (given a wide-schema design). This
saves requests and latency at the cost of writes, throughput, and network
traffic.
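
In HBase terms the two index shapes would look roughly like this (a sketch
with made-up family/qualifier names, nothing from this thread):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexShapes {
  static Put tallSchemaIndex(String userId, long timestamp, String tweetId) {
    // Tall: one narrow row per (user, time); the user is a row-key prefix,
    // so one user's index entries are a single contiguous scan.
    return new Put(Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(timestamp)))
        .addColumn(Bytes.toBytes("f"), Bytes.toBytes("tweet_id"), Bytes.toBytes(tweetId));
  }

  static Put wideSchemaIndex(String userId, String tweetId, byte[] valueOrBody) {
    // Wide: one row per user; every new item becomes another column in that row
    // (store the tweet id as the value, or the whole body if you denormalize).
    return new Put(Bytes.toBytes(userId))
        .addColumn(Bytes.toBytes("f"), Bytes.toBytes(tweetId), valueOrBody);
  }
}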

Are there other approaches for designing data streams in HBase?

Do you agree with that?

Yeah, I think you've said it pretty well. Others may have better ways to put 
it, or things I'm not thinking about. Anybody else want to chime in?

HBase seems like over-engineering if you do not need that scale.

Yup! Which I think I alluded to earlier in the thread.

Ian
