Hi,
It may be common sense to say that datamodels depend on your application
and search patterns...
I just want to take the opportunity to point out that even for a very
standard application such as email, there is no single datamodel:
- The Facebook messaging platform, which uses one column per distinct
word (if we can trust [1] on Quora). Quite original :)
- The IBM BlueRunner (built on Cassandra, but the datamodel idea can be
applied to HBase) - see slide 13
- 'HBase: The Definitive Guide', where various approaches are compared,
especially regarding attachments.
- The Apache James implementation (under development), where we have to
deal with the IMAP protocol, which requires very specific queries on
email content to be honored. We have a first basic implementation there,
but we must move to a more evolved datamodel to ensure search performance.
Back to your question: I would simply design the datamodel so that your
queries perform well, even if data must be duplicated (for example, to
maintain secondary index tables), and avoid Scans as much as possible.
Also, if you have flat-wide tables, there is more risk that your scans
will have to span multiple regions, which can be located on different
servers, slowing the process.
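As a rough illustration of the duplication idea (a sketch only: the
function names, the "/" separator and the key layout are all hypothetical,
not the schema of any of the systems listed above, and no HBase client is
used), here is how a tall-narrow message table and a secondary index table
could be keyed so that a lookup becomes a narrow, bounded key range rather
than a full-table Scan. It assumes user ids and words never contain "/".

```python
from typing import List, Tuple

SEP = "/"  # assumed separator; ids and words must not contain it


def message_row_key(user_id: str, message_id: str) -> bytes:
    """Tall-narrow layout: one row per (user, message)."""
    return (user_id + SEP + message_id).encode("utf-8")


def index_row_key(user_id: str, word: str, message_id: str) -> bytes:
    """Secondary index row: the message id is duplicated under each word."""
    return (user_id + SEP + word.lower() + SEP + message_id).encode("utf-8")


def index_rows_for(user_id: str, message_id: str, subject: str) -> List[bytes]:
    """All index row keys to write for one message subject."""
    words = sorted(set(subject.lower().split()))
    return [index_row_key(user_id, w, message_id) for w in words]


def prefix_range(prefix: bytes) -> Tuple[bytes, bytes]:
    """Start/stop keys bounding every row key that begins with prefix.

    The stop key is exclusive. An empty stop key means "no upper bound".
    """
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return prefix, b""
    # Smallest key greater than every key sharing the prefix.
    stop = stripped[:-1] + bytes([stripped[-1] + 1])
    return prefix, stop


if __name__ == "__main__":
    for key in index_rows_for("u1", "m42", "Hello hello World"):
        print(key)
    print(prefix_range(b"u1/hello/"))
```

The start/stop pair returned by prefix_range is what you would hand to a
bounded range read on the index table. The price is one extra write per
indexed word.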
Thx.
Eric
[1]
http://www.quora.com/How-does-Facebook-handle-generic-search-functionality-on-HBase-for-their-new-messaging-platform
On 01/09/11 13:38, Buttler, David wrote:
The "HBase: The Definitive Guide" answer seems pretty, um, definitive to me.
The only reason I would even consider going against that advice is if I had solid
knowledge that it was impossible for a user to have more than 100,000 emails. But even
then it seems like a difficult design decision to justify. How does that design help you
do something?
Dave
-----Original Message-----
From: Srikanth P. Shreenivas [mailto:[email protected]]
Sent: Thursday, September 01, 2011 11:53 AM
To: [email protected]
Subject: Tall-Narrow vs. Flat-Wide Tables
Hi,
Chapter 9 of the book "HBase: The Definitive Guide" talks about Tall-Narrow vs.
Flat-Wide tables. (http://ofps.oreilly.com/titles/9781449396107/advanced.html)
It seems to propose that Tall-Narrow tables (more rows, fewer columns) are the better design.
One of the issues it raises with Flat-Wide tables (fewer rows, more columns) is:
...
In addition, HBase can only split at row boundaries, which also enforces the
recommendation to go with tall-narrow tables. Imagine you have all emails of a
user in a single row. This will work for the majority of users, but there will
be outliers that will have magnitudes of emails more in their inbox. So much so
that a single row could outgrow the maximum file/region size and work against
the region split facility.
...
So, my question is: is it a bad idea to have a table like the one in the example
above, where emails are stored by adding columns? I seem to have a similar table
in my application, where I have a region size of 1GB and a cell value of 10KB.
So, will I run into the region-split issue mentioned above after 100000 (1GB / 10KB
= 100000) columns?
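The estimate above can be checked with a few lines of arithmetic. This is
only a back-of-the-envelope sketch: the helper name and the 50-byte
overhead are assumptions for illustration, compression is ignored, and the
only HBase fact relied on is that each cell is stored together with its
row key, column and timestamp, so the real count is lower than value size
alone suggests.

```python
def cells_per_region(region_bytes: int, cell_bytes: int,
                     per_cell_overhead: int = 0) -> int:
    """How many cells of a given size fit in one region (integer division)."""
    return region_bytes // (cell_bytes + per_cell_overhead)


# Decimal units, as in the question: 1GB = 10**9 bytes, 10KB = 10**4 bytes.
decimal_estimate = cells_per_region(10**9, 10**4)

# Binary units: 1GB read as 1024**3 bytes, 10KB read as 10 * 1024 bytes.
binary_estimate = cells_per_region(1024**3, 10 * 1024)

# Same, with a purely assumed 50 bytes of key overhead per cell.
with_overhead = cells_per_region(1024**3, 10 * 1024, per_cell_overhead=50)

if __name__ == "__main__":
    print(decimal_estimate, binary_estimate, with_overhead)
```

Either way the figure is on the order of 100,000 columns for a single row.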
Regards,
Srikanth
--
Eric
http://about.echarles.net