I am currently using 0.89; does it include those optimizations slated for
0.90? If so, great news, as the wide-table approach is the one I preferred.

On Nov 29, 2010, at 4:14 PM, Jonathan Gray wrote:

> Hey Bryan,
> 
> All of these approaches could work and seem sane.
> 
> My preference these days would be the wide-table approach (#2, 3, 4) rather 
> than the tall table (#1).  Previously #1 was more efficient, but in 0.90 and 
> beyond the same optimizations exist for both tall and wide tables.
> 
> For #2, I would probably structure the qualifier as <id_of_order>_fieldname 
> (rather than the other way around).  Then the fields for a given order are 
> contiguous (rather than grouped by fieldname).
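> 
> As a rough sketch of what that looks like with the Java client (the table 
> name, row key, and field names below are just placeholders):
> 
>   import java.io.IOException;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.client.HTable;
>   import org.apache.hadoop.hbase.client.Put;
>   import org.apache.hadoop.hbase.util.Bytes;
> 
>   public class PutOrderWide {
>     public static void main(String[] args) throws IOException {
>       HTable table = new HTable(HBaseConfiguration.create(), "customer");
>       // Qualifiers lead with the order ID so all cells for one order
>       // sort next to each other within the family.
>       Put put = new Put(Bytes.toBytes("customer123")); // row key: customer ID
>       byte[] fam = Bytes.toBytes("order");
>       put.add(fam, Bytes.toBytes("o456_amount"), Bytes.toBytes("99.95"));
>       put.add(fam, Bytes.toBytes("o456_date"), Bytes.toBytes("2010-11-29"));
>       table.put(put);
>       table.close();
>     }
>   }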
> 
> If you have some existing serialization method you are using in your 
> application, #3 would make sense.
> 
> #4 wouldn't be ideal because HBase sorts on column qualifier before version, 
> so the fields for a given order would not be contiguous and reads would be 
> inefficient.  This is similar to the issue with the ordering of id/field in 
> #2.
> 
> The most important thing is to design this so you have efficient reads.  I 
> imagine one of the important queries is something like "get me all the info 
> for this order".  If so, it would be important that all fields for an order 
> are together.
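> 
> For example, with the <id_of_order>_fieldname layout above, "all the info 
> for this order" can be a single Get plus a column prefix filter (assuming 
> your version ships ColumnPrefixFilter; names are placeholders again):
> 
>   import java.io.IOException;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.client.Get;
>   import org.apache.hadoop.hbase.client.HTable;
>   import org.apache.hadoop.hbase.client.Result;
>   import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
>   import org.apache.hadoop.hbase.util.Bytes;
> 
>   public class GetOneOrder {
>     public static void main(String[] args) throws IOException {
>       HTable table = new HTable(HBaseConfiguration.create(), "customer");
>       Get get = new Get(Bytes.toBytes("customer123")); // row key: customer ID
>       get.addFamily(Bytes.toBytes("order"));
>       // Return only cells whose qualifier starts with this order's ID.
>       get.setFilter(new ColumnPrefixFilter(Bytes.toBytes("o456_")));
>       Result result = table.get(get);
>       System.out.println(result);
>     }
>   }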
> 
> JG
> 
>> -----Original Message-----
>> From: Bryan Keller [mailto:[email protected]]
>> Sent: Monday, November 29, 2010 1:41 PM
>> To: [email protected]
>> Subject: Schema design, one-to-many question
>> 
>> I have read comments on modeling one-to-many relationships in HBase and
>> wanted to get some feedback. I have millions of customers, and each
>> customer can place anywhere from zero to thousands of orders. I want to
>> store all of this data in HBase. The data is always accessed by customer.
>> 
>> It seems there are a few schema design approaches.
>> 
>> Approach 1: Orders table. One row per order. Customer data is either
>> denormalized, or the customer ID is stored for lookup in a customer data
>> cache. The table will have billions of rows with a few columns each.
>> 
>> key: customer ID + order ID
>> family 1: customer (customer:id)
>> family 2: order (order:id, order:amount, order:date, etc.)
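>> 
>> For example, reading all of a customer's orders would be one scan over the
>> customer-ID row prefix (a rough Java-client sketch; table and key names
>> are made up):
>> 
>>   import java.io.IOException;
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.client.HTable;
>>   import org.apache.hadoop.hbase.client.Result;
>>   import org.apache.hadoop.hbase.client.ResultScanner;
>>   import org.apache.hadoop.hbase.client.Scan;
>>   import org.apache.hadoop.hbase.filter.PrefixFilter;
>>   import org.apache.hadoop.hbase.util.Bytes;
>> 
>>   public class ScanCustomerOrders {
>>     public static void main(String[] args) throws IOException {
>>       HTable table = new HTable(HBaseConfiguration.create(), "orders");
>>       byte[] prefix = Bytes.toBytes("customer123");
>>       Scan scan = new Scan(prefix);             // start at the customer ID
>>       scan.setFilter(new PrefixFilter(prefix)); // drop rows past the prefix
>>       ResultScanner scanner = table.getScanner(scan);
>>       for (Result row : scanner) {
>>         System.out.println(row); // each Result is one order row
>>       }
>>       scanner.close();
>>     }
>>   }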
>> 
>> Approach 2: Customer table. One row per customer. All orders are stored in
>> a column family, with the order ID embedded in the column name. The table
>> will have millions of rows, each with potentially thousands of columns.
>> 
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:id_<id of order>, order:amount_<id of order>,
>> order:date_<id of order>)
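>> 
>> Here the whole customer, orders included, comes back in a single Get
>> (rough sketch, made-up names):
>> 
>>   import java.io.IOException;
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.KeyValue;
>>   import org.apache.hadoop.hbase.client.Get;
>>   import org.apache.hadoop.hbase.client.HTable;
>>   import org.apache.hadoop.hbase.client.Result;
>>   import org.apache.hadoop.hbase.util.Bytes;
>> 
>>   public class GetCustomerRow {
>>     public static void main(String[] args) throws IOException {
>>       HTable table = new HTable(HBaseConfiguration.create(), "customer");
>>       Result result = table.get(new Get(Bytes.toBytes("customer123")));
>>       byte[] name = result.getValue(Bytes.toBytes("customer"),
>>           Bytes.toBytes("name"));
>>       System.out.println(Bytes.toString(name));
>>       for (KeyValue kv : result.raw()) {
>>         // every order:* cell for this customer is in this one Result
>>         System.out.println(Bytes.toString(kv.getQualifier()));
>>       }
>>     }
>>   }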
>> 
>> Approach 3: Same as #2, but store the order data as a serialized blob
>> instead of in separate columns:
>> 
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:<id of order>)
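>> 
>> The write would be one cell per order (sketch; the delimited string is a
>> stand-in for whatever serialization the application already uses):
>> 
>>   import java.io.IOException;
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.client.HTable;
>>   import org.apache.hadoop.hbase.client.Put;
>>   import org.apache.hadoop.hbase.util.Bytes;
>> 
>>   public class PutOrderBlob {
>>     public static void main(String[] args) throws IOException {
>>       HTable table = new HTable(HBaseConfiguration.create(), "customer");
>>       // Qualifier is just the order ID; value is the serialized order.
>>       String blob = "amount=99.95|date=2010-11-29"; // stand-in serialization
>>       Put put = new Put(Bytes.toBytes("customer123"));
>>       put.add(Bytes.toBytes("order"), Bytes.toBytes("o456"),
>>           Bytes.toBytes(blob));
>>       table.put(put);
>>     }
>>   }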
>> 
>> Approach 4: Not sure if this is viable, but the same as #2, using versions
>> in the order family to store multiple orders.
>> 
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions
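>> 
>> If it is viable, the mechanics would be a high max-versions setting on the
>> family and reads that ask for all versions (rough sketch, made-up names):
>> 
>>   import java.io.IOException;
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.HColumnDescriptor;
>>   import org.apache.hadoop.hbase.HTableDescriptor;
>>   import org.apache.hadoop.hbase.client.Get;
>>   import org.apache.hadoop.hbase.client.HBaseAdmin;
>>   import org.apache.hadoop.hbase.client.HTable;
>>   import org.apache.hadoop.hbase.client.Result;
>>   import org.apache.hadoop.hbase.util.Bytes;
>> 
>>   public class VersionedOrders {
>>     public static void main(String[] args) throws IOException {
>>       Configuration conf = HBaseConfiguration.create();
>>       // Create the table keeping up to 1000 versions per order cell.
>>       HTableDescriptor desc = new HTableDescriptor("customer");
>>       desc.addFamily(new HColumnDescriptor("customer"));
>>       HColumnDescriptor orders = new HColumnDescriptor("order");
>>       orders.setMaxVersions(1000);
>>       desc.addFamily(orders);
>>       new HBaseAdmin(conf).createTable(desc);
>> 
>>       // Reads have to explicitly request all versions.
>>       HTable table = new HTable(conf, "customer");
>>       Get get = new Get(Bytes.toBytes("customer123"));
>>       get.setMaxVersions(1000);
>>       Result result = table.get(get);
>>       // result.getColumn(family, qualifier) yields one KeyValue per version
>>     }
>>   }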
>> 
>> I am thinking approach #1 is probably the correct one, but #2 and #3 (and
>> #4?) would be more efficient from an application standpoint, as everything
>> is processed by customer and I won't need a customer data cache or have to
>> worry about updating denormalized data. Does anyone have feedback as to
>> what approaches work for them on data sets like this, and why?
