Hi,

It may be common sense to say that datamodels depend on your application and search patterns...

I just want to take the opportunity to point out that even for a very standard application such as email, there is no single datamodel:

- The Facebook messaging platform, which uses one column per distinct word (if we can trust [1] on Quora). Quite original :)
- The IBM BlueRunner (built on Cassandra, but the datamodel idea can be applied to HBase) - see slide 13
- The 'HBase: The Definitive Guide' book, where various approaches are compared, especially regarding attachments.
- The (under development) Apache James implementation, where we have to deal with the IMAP protocol, which requires very specific queries on email content to be honored. We have a first basic implementation there, but we must move to a more evolved datamodel to ensure search performance.

Back to your question, I would simply design the datamodel around the queries that must perform well, even if that means duplicating data (for example to maintain secondary index tables), and avoid scans as much as possible.
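
To illustrate the secondary index idea, here is a minimal sketch using plain Python dicts as stand-ins for two HBase tables. It is not the HBase client API, and the table names, key layout and helper functions are all hypothetical, chosen only for this example. The point is that each write also stores a small duplicated entry in an index table, so a "find by sender" query becomes a short prefix lookup on sorted keys rather than a scan over every email.

```python
# Hypothetical in-memory stand-ins for two HBase tables (plain dicts, not the HBase API).
from bisect import bisect_left

emails = {}        # main table:  row key "user/msg_id"        -> email fields
sender_index = {}  # index table: row key "user/sender/msg_id" -> msg_id (duplicated data)


def put_email(user, msg_id, sender, subject):
    """Write the email and its index entry together."""
    emails[f"{user}/{msg_id}"] = {"sender": sender, "subject": subject}
    sender_index[f"{user}/{sender}/{msg_id}"] = msg_id


def find_by_sender(user, sender):
    """Prefix lookup on the index table instead of scanning every email."""
    prefix = f"{user}/{sender}/"
    keys = sorted(sender_index)  # a real HBase table already keeps row keys sorted
    start = bisect_left(keys, prefix)
    subjects = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # past the end of the prefix range
        subjects.append(emails[f"{user}/{sender_index[key]}"]["subject"])
    return subjects


put_email("eric", "0001", "dave", "Tall-narrow?")
put_email("eric", "0002", "srikanth", "Region size")
put_email("eric", "0003", "dave", "Re: Tall-narrow?")
print(find_by_sender("eric", "dave"))
```

The cost is an extra write per email and the need to keep both tables consistent, which is the duplication trade-off mentioned above.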

Also, with flat-wide tables there is a greater risk that your scans will span multiple regions, which can be located on different servers, slowing things down.
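
To make the row-boundary point from the quoted question concrete, here is a rough sketch of the arithmetic and of the two key layouts. The function names and the "body" qualifier are hypothetical, and the estimate ignores the per-cell overhead of row key, qualifier and timestamp, so treat the number as an upper bound rather than a precise limit.

```python
REGION_SIZE = 1 * 1024**3  # 1 GB region size from the question (binary units assumed)
CELL_SIZE = 10 * 1024      # 10 KB per cell value from the question


def cells_before_row_exceeds_region(region_size=REGION_SIZE, cell_size=CELL_SIZE):
    """Rough upper bound on cells in one row before it outgrows a region."""
    return region_size // cell_size


def flat_wide_key(user, msg_id):
    # One row per user, one column per email: every email shares the same row key,
    # so HBase has no row boundary at which to split this user's data.
    return (user, msg_id)  # (row key, column qualifier)


def tall_narrow_key(user, msg_id):
    # One row per email: the user prefix keeps a user's emails adjacent,
    # and every email is a possible split point.
    return (f"{user}-{msg_id}", "body")  # (row key, column qualifier)


msg_ids = ["0001", "0002", "0003"]
flat_rows = {flat_wide_key("srikanth", m)[0] for m in msg_ids}
tall_rows = {tall_narrow_key("srikanth", m)[0] for m in msg_ids}
print(cells_before_row_exceeds_region())  # binary units
print(cells_before_row_exceeds_region(10**9, 10**4))  # decimal units
print(len(flat_rows), len(tall_rows))
```

With decimal units the estimate matches the 100,000 figure in the question; with binary units it is slightly higher. Either way, the flat-wide layout yields a single row for the user, while the tall-narrow layout yields one row per email.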

Thx.

Eric

[1] http://www.quora.com/How-does-Facebook-handle-generic-search-functionality-on-HBase-for-their-new-messaging-platform


On 01/09/11 13:38, Buttler, David wrote:
The "HBase: The Definitive Guide" answer seems pretty, um, definitive to me.  
The only reason I would even consider going against that advice is if I had solid 
knowledge that it was impossible for a user to have more than 100,000 emails.  But even 
then it seems like a difficult design decision to justify.  How does that design help you 
do something?

Dave

-----Original Message-----
From: Srikanth P. Shreenivas [mailto:[email protected]]
Sent: Thursday, September 01, 2011 11:53 AM
To: [email protected]
Subject: Tall-Narrow vs. Flat-Wide Tables

Hi,

HBase: The Definitive Guide book's chapter 9 talks about Tall-Narrow vs 
Flat-wide tables. (http://ofps.oreilly.com/titles/9781449396107/advanced.html)

It seems to propose that tall-narrow tables (more rows, fewer columns) are the better design.
One of the issues it raises with flat-wide tables (fewer rows, more columns) is
...
In addition, HBase can only split at row boundaries, which also enforces the 
recommendation to go with tall-narrow tables. Imagine you have all emails of a 
user in a single row. This will work for the majority of users, but there will 
be outliers that will have magnitudes of emails more in their inbox. So much so 
that a single row could outgrow the maximum file/region size and work against 
the region split facility.
...

So my question is: is it a bad idea to have a table like the one in the example above,
where emails are stored by adding columns? I have a similar table in my application,
with a region size of 1GB and cell values of 10KB. Will I run into the region-split
issue mentioned above after 100,000 columns (1GB / 10KB = 100,000)?

Regards,
Srikanth


--
Eric
http://about.echarles.net
