A few more responses:

On May 29, 2012, at 10:54 AM, Em wrote:

> In fact, everything you model with a Key-Value-storage like HBase,
> Cassandra etc. can be modeled as an RDMBS-scheme.
> Since a lot of people, like me, are coming from that edge, we must
> re-learn several basic things.
> It starts with understanding that you model a K-V-storage the way you
> want to access the data, not as the data relates to eachother (in
> general terms) and ends with translating the connections of data into a
> K-V-schema as good as possible.

Yes, that's a good way of putting it. I did a talk at HBaseCon this year that 
deals with some of these questions. The video isn't up yet, but the slides are 
here:

http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012

>> 3. You could also let a higher client layer worry about this. For
>> example, your data layer query just returns a student with a list of
>> their course IDs, and then another process in your client code looks
>> up each course by ID to get the name. You can then put an external
>> caching layer (like memcached) in the middle and make things a lot
>> faster (though that does put the burden on you to have the code path
>> for changing course info also flush the relevant cache entries). 
> Hm, in what way does this give me an advantage over using HBase -
> assuming that the number of courses is small enough to fit in RAM - ?
> I know that Memcached is optimized for this purpose and might have much
> faster response times - no doubts.
> However, from a conceptual point of view: Why does Memcached handles the
> K-V-distribution more efficiently than a HBase with warmed caches?
> Hopefully this question isn't that hard :).

The only architectural advantage I can think of is that reads in HBase still 
have to check the memstore and all relevant file blocks. So even if that's warm 
in the HBase cache, it's not quite as lightweight. That said, I guess I was 
also sort of assuming that table you're looking up into would be the smaller 
one. The details of which is better here would depend on many things; YMMV. I'd 
personally go with #2 as a default and them optimize based on real workloads, 
rather than over engineering it. 

> Without ever doing it, you never get a real feeling of when to use the
> right tool.
> Using a good tool for the wrong problem can be an interesting
> experience, since you learn some of the do's and don'ts of the software
> you use.

Very good points. Definitely not trying to discourage learning! :) I just 
always feel compelled to make that caveat early on, while people are making 
technology evaluations. You can learn to crochet with a rail gun, but I 
wouldn't recommend it, unless what you really want to learn is about rail guns, 
not crocheting. :) Sounds like you really want to learn about rail guns. 

> Since I am a reader of the MEAP-edition of HBase in Action, I am aware
> of the TwitBase-example application presented in that book.
> I am very interested in seeing the author presenting a solution for
> efficiently accessing the Tweets of the persons I follow.
> This is an n:m-relation.
> You got n users with m tweets and each user is seeing his own tweets as
> well as the tweets of followed persons in descending order by timestamp.
> This must be done with a join within an RDMBs (and maybe in HBase also),
> since I can not think of another scalable way of doing so.

Don't forget about denormalization! You can put copies of the tweets, or at 
least copies the unique ids of the tweets, into each follower's stream. Yes, 
that means when Wil Wheaton tweets for the 1000th time about comic con, you get 
a million copies of the tweet (or ID). But you're trading time & space at write 
time for extremely fast speeds at write time. Whether this makes sense depends 
on a zillion other factors, there's no hard & fast rule.

> However, if you do this by a Join, this means that a person with 40.000
> followers needs a batch-request consisting of 40.000 GET-objects. That's
> huge and I bet that this is everything but not fast nor scalable. It
> sounds like broken by design when designing for Big Data.

I guess you could say it's "broken by design" (though I'd argue for 
"unavailable by design" ;). Full join use cases (like you'd do for, say, 
analytics) don't work well with big data, taking a naive approach. 

But at least for the twitter app, that's not actually what you're doing. 
Instead, you've typically got a pattern like *pagination*: you'd never be 
requesting 40K objects, at most you'd be requesting 20 or 40 objects (a page 
worth), along some dimension like time. You're optimizing for getting those 
really fast, with the knowledge that the stream itself is so big, you'd never 
offer the user a way to act on it in any kind of bulk way. If you want to do 
processing like that, you do it asynchronously (e.g. via map/reduce).

That's actually a really interesting way to see the difference between 
relational DBs and big data non-relational ones: relational databases promise 
you easy whole-set operations, and big data nosql databases don't, because they 
assume the "whole set" will be too big for that.

Let's say your product manager for this twitter-like thing said "You know what 
would be awesome? A widget that shows the average lengths of all tweets in the 
system, in real time!". And if this product manager was really slick, they 
might even say "It's easy, look, I even wrote the SQL for it: 'SELECT 
avg(length(body)) FROM tweets'. I'm a genius."

In a SQL database, this query is going to do a full table scan to get this 
average, every time you ask for it. It would be in HBase, too; the difference 
is just that HBase makes you be more explicit about it: no simple declarative 
"SELECT avg ..." syntax, you have to whip out your iterators and see that 
you're scanning a billion rows to get your average.

Would it be nice for HBase to ALSO offer declarative and simple ways to do 
things like joins, averages, etc? Maybe; but as soon as you dip a toe into 
this, you kind of have to jump in with both feet. Would you add indexes to make 
joins faster? Would those indexes be written transactionally with the records 
in the base tables, even across region server boundaries? Would you build a 
query optimizer to know how to execute the joins? Would you allow really 
complex N-way joins? The list goes on. (I'm not suggesting you're asking for 
any of this, I'm just giving some context for why HBase doesn't have any of it, 
yet.)

> Therefore I am interested in general best practices for such problems.

I find the best practice for all such designs to be a simple game of make 
believe. Pretend you have an INFINITE (not just large) number of records in 
various dimensions, and then think about what kinds of data storage and access 
would let you still serve things with really low latency.

That said, it's just a "good practice", not a "best practice", because this is 
not a simple design space. (Any fans of the Cynefin framework out there?).

Ian

Reply via email to