On May 29, 2012, at 3:25 PM, Em wrote:
Yup, unless you denormalize the tweet bodies as well--then you just read the
current user's record and you have everything you need (with the downside of
massive data duplication).
Well, I think this would be bad practice for editable stuff like tweets.
They can be deleted, updated, etc. Furthermore, at some point the data
duplication will come at a cost, big data or not.
Do you agree?
You're probably right, yes. For the real Twitter, denormalizing every tweet to
every follower would likely be ruinously expensive. For something like a
message-based app (like, say, chat), it's much more plausible because there's a
fixed and limited number of participants who'd see any given message.
The important point here, though, is the "YMMV" part ("Your Mileage May Vary",
for those not familiar with the idiom). This kind of stuff is exceedingly
difficult to give easy, general solutions to because every case is different.
How many people does the average user follow? How big are the message bodies?
What's the ratio of CPUs to disk spindles? Etc.
Yes, but the relational DB has it all on one node, whereas in a distributed
database, it's as many RPC calls as you have nodes, or something on the order
of that (see Nicolas's explanation, which is better than mine).
Nicolas..? :)
Ah, sorry, I mean "N Keywal" who replied upthread. His name is Nicolas. :) His
response was about under what situations the GET requests would be batched or
sent separately.
The key difference is that the read access time for HBase is still constant
when you have PB of data (whereas with a PB of data, the relational DB has long
since fallen over). The underlying reason for that is that HBase basically took
the top level of the seek operation (which is a fast memory access into the top
node of a B+ tree, if you're talking about an RDBMS on one machine) and made it
a cross machine lookup ("which server would have this data?"). So it's
fundamentally a lot slower than an RDBMS, but still constant w/r/t your overall
data size. If you remember your algorithms class, constant factors fall out
when N gets larger, and you only care about the big O. :)
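To make that concrete, here's a minimal sketch using the classic Java client API (the "tweets" table and its "t:body" column are just assumptions for illustration); the comments mark where the cross-machine lookup happens:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetTweet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table and column names for a tweet-body table.
    HTable table = new HTable(conf, "tweets");

    Get get = new Get(Bytes.toBytes("some-tweet-id"));
    get.addColumn(Bytes.toBytes("t"), Bytes.toBytes("body"));

    // Under the hood, the client first resolves "which region (and thus which
    // region server) owns this row?" via the META table, caches that location,
    // and then issues one RPC to that server -- the "cross machine" top level
    // of the seek described above. The per-read cost does not grow with the
    // total size of the table.
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("t"), Bytes.toBytes("body"))));

    table.close();
  }
}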
Yes :) I am interested in the underlying algorithms of HBase's hashing. I
think it's an interesting approach.
It has no hash-join algorithms built in, if that's what you mean. But you can
read the client access code path, and see the anatomy of a scan--how it finds
out which regions to hit, creates requests to all of them, combines the
results, etc.
If you mean the overall approach of HBase, as contrasted with the B+ trees used
by relational databases: this is based on the ideas of "Log Structured Merge
Trees"; there original paper on it is here:
http://scholar.google.com/scholar?cluster=5832040552580693098&hl=en&as_sdt=0,5&as_vis=1
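For intuition only, here's a toy sketch of the LSM idea (not HBase's actual code): writes go into a sorted in-memory buffer that gets flushed to immutable sorted files, and reads check the buffer first, then the files from newest to oldest:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy LSM store: a sorted in-memory buffer plus immutable flushed "files".
public class ToyLsm {
  private TreeMap<String, String> memtable = new TreeMap<String, String>();
  private final List<TreeMap<String, String>> flushed =
      new ArrayList<TreeMap<String, String>>();
  private static final int FLUSH_THRESHOLD = 4;

  public void put(String key, String value) {
    memtable.put(key, value);              // writes are cheap, in-memory
    if (memtable.size() >= FLUSH_THRESHOLD) {
      flushed.add(0, memtable);            // "flush" to an immutable sorted file
      memtable = new TreeMap<String, String>();
    }
  }

  public String get(String key) {
    if (memtable.containsKey(key)) {       // newest data first
      return memtable.get(key);
    }
    for (TreeMap<String, String> file : flushed) {
      if (file.containsKey(key)) {         // then each flushed file, newest to oldest
        return file.get(key);
      }
    }
    return null;
  }
}

The real thing adds write-ahead logging, compaction of the flushed files, and bloom filters, but that's the core contrast with an in-place B+ tree.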
If we go down the path of data duplication, why not generate keys in such a
way that all of the tweets in a user's tweet stream end up at the same region
server?
That way, when doing a multiget, you only call one or two region servers for
all your tweets.
What would be an example of this? Say we have tweets from 3 users. The
normalized "tweet" table might have data like:
TweetID    Time   User   Body
---------  ----   ----   -----------------
123        T1     A      Hello world
234        T2     B      OMG BIEBER!
345        T3     C      RT @B OMG BIEBER!
456        T4     A      I hate twitter
567        T5     B      ZOMG MORE BIEBER!
etc.
If we denorm just the IDs by follower (let's say everyone follows everyone
here, and you don't see your own tweets), we'd get:
Follower   Time   Followee   TweetID
--------   ----   --------   -------
A          T2     B          234
A          T3     C          345
A          T5     B          567
B          T1     A          123
B          T3     C          345
B          T4     A          456
C          T1     A          123
C          T2     B          234
C          T4     A          456
C          T5     B          567
Here, if I know who I am (the follower) I can easily scan a time range of all
the tweets I should see, and it'll almost always be in one region (with the
rare exception where the range happens to straddle a region boundary, which
would be handled transparently by the client, of course). So if I'm user C, and
I scan from T1 to T5, I'd get (A,123), (B,234), (A,456), and (B,567) back.
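In client code, that scan might look roughly like this (a sketch; the "follower_index" table, its "i" column family, and the composite "follower + timestamp" row key are assumptions made up for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineScan {
  public static void main(String[] args) throws Exception {
    HTable index = new HTable(HBaseConfiguration.create(), "follower_index");

    // Row key = follower + timestamp, so one user's timeline is a contiguous
    // range of rows, almost always served by a single region.
    // The stop row is exclusive, so the trailing '~' keeps "C_T5" in range.
    Scan scan = new Scan(Bytes.toBytes("C_T1"), Bytes.toBytes("C_T5~"));
    scan.addFamily(Bytes.toBytes("i"));

    ResultScanner scanner = index.getScanner(scan);
    try {
      for (Result row : scanner) {
        // Each row carries the followee and the tweet ID to fetch afterwards.
        String followee = Bytes.toString(
            row.getValue(Bytes.toBytes("i"), Bytes.toBytes("followee")));
        String tweetId = Bytes.toString(
            row.getValue(Bytes.toBytes("i"), Bytes.toBytes("tweet_id")));
        System.out.println(followee + " -> " + tweetId);
      }
    } finally {
      scanner.close();
      index.close();
    }
  }
}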
Of course, I then want to do a GET for the body of each tweet. You're asking,
why not organize the tweet table in such a way as to make it more likely that
these 4 tweets will be co-located?
The answer to that in the generic case is that you can't; the IDs of the tweets
are completely orthogonal to the followers & followees, so you can't make any
expectations about that kind of colocation. It looks easy with 3 users, but
imagine there are hundreds of millions of users; what are the chances I follow
people whose tweets happen to be co-located?
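What you can do, though, is batch those GETs into a single client call and let the library group them by region server (this is the batching Nicolas described upthread). A sketch, again with made-up table and column names:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FetchBodies {
  public static void main(String[] args) throws Exception {
    HTable tweets = new HTable(HBaseConfiguration.create(), "tweets");

    // Tweet IDs found by scanning the follower index (user C, T1..T5).
    String[] ids = { "123", "234", "456", "567" };
    List<Get> gets = new ArrayList<Get>();
    for (String id : ids) {
      gets.add(new Get(Bytes.toBytes(id))
          .addColumn(Bytes.toBytes("t"), Bytes.toBytes("body")));
    }

    // One call from the application's point of view; the client splits the
    // batch into one request per region server that actually holds some of
    // these rows.
    Result[] bodies = tweets.get(gets);
    for (Result r : bodies) {
      System.out.println(Bytes.toString(
          r.getValue(Bytes.toBytes("t"), Bytes.toBytes("body"))));
    }
    tweets.close();
  }
}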
In this case, we could do a little better by making the tweet ID into a UUID
with a time component, such that tweets around the same time would be grouped
together in the table. The problem with this, of course, is that now all insert
activity is going to "hot spot" one area of the table (i.e. one region server).
So what you really want is sort of the opposite; you want all the incoming
traffic to be nicely distributed so the write load is spread out. Thus, you
actually probably don't want the tweets you'd see in a given page to be
co-located on the same server, unless it's specifically YOUR copy of the tweets
(which gets back to the denormalization question above).
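The usual way to get that spread is some variant of salting or hashing the row key; here's a rough sketch of the trade-off (nothing HBase-specific beyond the Bytes helper, just key construction):

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeys {
  // Time-prefixed key: globally time-sorted, but every write lands at the
  // "current time" end of the table -- one hot region.
  static byte[] hotKey(long timestamp, String tweetId) {
    return Bytes.add(Bytes.toBytes(timestamp), Bytes.toBytes(tweetId));
  }

  // Salted key: a small hash-derived prefix spreads writes over N buckets
  // (and thus over N regions), at the cost of needing N parallel scans to
  // read back a global time range.
  static byte[] saltedKey(long timestamp, String tweetId, int buckets) {
    byte bucket = (byte) ((tweetId.hashCode() & 0x7fffffff) % buckets);
    return Bytes.add(new byte[] { bucket },
        Bytes.add(Bytes.toBytes(timestamp), Bytes.toBytes(tweetId)));
  }
}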
As you can see, there's no "easy" answer here either. This is why you get paid
the big bucks. :)
Yes, very true. We also haven't broached the subject of how long writes would
take if you denormalize, and whether they should be asynchronous (for everyone,
or just for the Wil Wheatons). If you put all of my incoming tweets into a
single row for me, that's potentially lock-wait city. Twitter has a follow
limit, maybe that's why. :)
Oh, do they?
However, as far as I know they are using MySQL for their tweets.
Indeed. But MySQL doesn't have secret sauce there, either; when you shard it,
it's a lot like HBase, except without the auto-sharding, and with DDL commands
that claim to work but don't if you have a lot of data. ;)
So, to generalize our results:
There are two major approaches to designing data streams that come from n
users and are read by m users (with m being equal to or larger than n).
One approach is to create a per-user index of the interesting data, where
either each column holds the key to the information of interest (wide schema),
or each row, tied to the user by a key prefix, points to the data (tall
schema).
If you want to access that data, you have to implement the join yourself on
top of HBase.
The second approach, on the other hand, is to make use of massive data
duplication. That means writing the same data to every user, so that every
user can access it immediately, without the need for multigets (given a
wide-schema design). This saves requests and latency at the cost of writes,
throughput, and network traffic.
Are there other approaches for designing data streams in HBase?
Do you agree with that?
Yeah, I think you've said it pretty well. Others may have better ways to put
it, or things I'm not thinking about. Anybody else want to chime in?
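For concreteness, here's roughly what the write path of the duplication (fan-out-on-write) approach might look like; the "timelines" table, its column names, and the hard-coded follower list are all made up for illustration:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FanOutWrite {
  public static void main(String[] args) throws Exception {
    HTable timelines = new HTable(HBaseConfiguration.create(), "timelines");

    String author = "B";
    String tweetId = "567";
    String body = "ZOMG MORE BIEBER!";
    long ts = System.currentTimeMillis();

    // In a real system this comes from the follower graph; hard-coded here.
    String[] followers = { "A", "C" };

    List<Put> puts = new ArrayList<Put>();
    for (String follower : followers) {
      // Row key = follower + timestamp: reading a timeline is a single scan
      // over the reader's own rows, with no join and no multiget.
      Put put = new Put(Bytes.toBytes(follower + "_" + ts));
      put.add(Bytes.toBytes("t"), Bytes.toBytes("author"), Bytes.toBytes(author));
      put.add(Bytes.toBytes("t"), Bytes.toBytes("tweet_id"), Bytes.toBytes(tweetId));
      put.add(Bytes.toBytes("t"), Bytes.toBytes("body"), Bytes.toBytes(body));
      puts.add(put);
    }

    // One write per follower: this is the extra write and network cost that
    // buys the cheap reads, and why a very popular author makes it expensive.
    timelines.put(puts);
    timelines.close();
  }
}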
HBase seems to be kind of over-engineered if you do not need that scale.
Yup! Which I think I alluded to earlier in the thread.
Ian