I'm sorry if this has already been answered, but I'll share my $0.02 anyway...
First, you and everyone else have to stop thinking of HBase in terms of a relational model. Because HBase doesn't have the concept of joins, you cannot think in terms of relationships.

If you have two tables where the primary key of both tables is the same (primary key == foreign key), then you can put the two relational tables within the same HBase table but in different column families. Having said that... usually in this case you would store them in the same column family; however, there may be a valid reason to separate them. (Row size and access patterns may make it a good idea to separate them out for performance reasons.)

In your example, which is a classic example... don't think in terms of relational, but think in terms of a hierarchical database. Think in terms of Dick Pick. (See: http://en.wikipedia.org/wiki/Pick_operating_system ) Pick and Pick-like systems such as Revelation (now I'm showing my age...) and U2 aka Universe (VMark was acquired by Informix, then IBM, then spun off recently... I think) are a good example of how to model within HBase.

So if we look at your example... how do you plan on accessing the information? Since everyone likes to talk 'agile'... think about your story lines... "A grasshopper walks into a bar..." (Sorry, bad joke.)

In your case... A customer logs in to your website and starts to place an order. When the customer logs in, you know his customer id. So you can keep the customer information in a separate table, since you're not looking up the data immediately. You can then use the customer id as part of the key for your order table. In fact, I'd make it the first part of the composite key, since the customer may not always remember their order numbers and you will want to search the order table. (Yes, it's clear that you want a table for your orders too.) So your key could be customer id + order num. Then when a customer wants to find his orders, you can do a scan with a start key and end key based on customer id.

Beyond this...
there are strategies on how you create your customer id to get better utilization of your cloud.

NOTE THE FOLLOWING: This example and design does not use indexing. If you want 'real time' performance, you'll need to incorporate indexing into your design; however, that's another story...

HTH

-Mike

> Date: Mon, 29 Nov 2010 13:40:58 -0800
> Subject: Schema design, one-to-many question
> From: [email protected]
> To: [email protected]
>
> I have read comments on modeling one-to-many relationships in HBase and
> wanted to get some feedback. I have millions of customers, and each customer
> can make zero to thousands of orders. I want to store all of this data in
> HBase. The data is always accessed by customer.
>
> It seems there are a few schema design approaches.
>
> Approach 1: Orders table. One row per order. Customer data is either
> denormalized, or the customer ID is stored for lookup in a customer data
> cache. Table will have billions of rows of a few columns each.
>
> key: customer ID + order ID
> family 1: customer (customer:id)
> family 2: order (order:id, order:amount, order:date, etc.)
>
> Approach 2: Customer table. One row per customer. All orders are stored in a
> column family with order ID in the column name. Millions of rows with
> potentially thousands of columns each.
>
> key: customer ID
> family 1: customer (customer:id, customer:name, customer:city, etc.)
> family 2: order (order:id_<id of order>, order:amount_<id of order>,
> order:date_<id of order>)
>
> Approach 3: Same as #2, but store the order data as a serialized blob
> instead of in separate columns:
>
> key: customer ID
> family 1: customer (customer:id, customer:name, customer:city, etc.)
> family 2: order (order:<id of order>)
>
> Approach 4: Not sure if this is viable, but same as #2 but use versions in
> the order family to store multiple orders.
>
> key: customer ID
> family 1: customer (customer:id, customer:name, customer:city, etc.)
> family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions
>
> I am thinking approach #1 is probably the correct approach, but #2 and #3
> (and #4?) would be more efficient from an application standpoint, as
> everything is processed by customer and I won't need a customer data cache
> or worry about updating denormalized data. Does anyone have feedback as to
> what approaches work for them for data sets like this, and why?
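
To make the "customer id + order num" key from the reply above concrete, here is a minimal sketch in plain Python. It is not the HBase client API. It only imitates the one property the design relies on, namely that rows are kept sorted by row key, so all orders for one customer sit next to each other and can be read with a start key and an end key. The names `row_key`, `prefix_range` and `SortedTable` are made up for illustration, and the zero-padded decimal encoding of the ids is an assumption, not something the thread specifies.

```python
import bisect

ID_WIDTH = 10  # assumed fixed width so byte order matches numeric order


def row_key(customer_id, order_num):
    """Composite key: customer id first, then order number (hypothetical encoding)."""
    return f"{customer_id:0{ID_WIDTH}d}{order_num:0{ID_WIDTH}d}".encode("ascii")


def prefix_range(customer_id):
    """Return (start, stop) covering every key that begins with this customer id.

    The stop key is exclusive: it is the prefix with its last byte bumped by one.
    The prefix is ASCII digits here, so the last byte can never overflow.
    """
    start = f"{customer_id:0{ID_WIDTH}d}".encode("ascii")
    stop = start[:-1] + bytes([start[-1] + 1])
    return start, stop


class SortedTable:
    """Toy in-memory stand-in for a table whose rows are sorted by key bytes."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> column values

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


orders = SortedTable()
orders.put(row_key(42, 7), {"amount": "19.99"})
orders.put(row_key(42, 3), {"amount": "5.00"})
orders.put(row_key(43, 1), {"amount": "100.00"})
orders.put(row_key(7, 9), {"amount": "1.25"})

# "Find my orders": one range scan bounded by the customer id prefix.
start, stop = prefix_range(42)
customer_42_orders = orders.scan(start, stop)
```

Putting the customer id first is what makes the range scan possible. With the order number first, one customer's orders would be scattered across the key space and could only be found by reading the whole table or by keeping a separate index.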
