I am currently using 0.89; does it include those optimizations slated for 0.90? If so, that's great news, as the wide-table approach is the one I preferred.
On Nov 29, 2010, at 4:14 PM, Jonathan Gray wrote:

> Hey Bryan,
>
> All of these approaches could work and seem sane.
>
> My preference these days would be the wide-table approach (#2, 3, 4) rather
> than the tall table. Previously #1 was more efficient, but in 0.90 and beyond
> the same optimizations exist for both tall and wide tables.
>
> For #2, I would probably structure the qualifier as <id_of_order>_fieldname
> (rather than the other way around). Then the fields for a given order are
> contiguous (rather than grouped by fieldname).
>
> If you have some existing serialization method you are using in your
> application, #3 would make sense.
>
> #4 wouldn't be ideal because HBase sorts on column before version, so fields
> for a given order would not be contiguous, and reads would be inefficient.
> This is similar to the issue with the ordering of id/field in #2.
>
> The most important thing is to design this so you have efficient reads. I
> imagine one of the important queries is something like "get me all the info
> for this order". If so, it would be important that all fields for an order
> are together.
>
> JG
>
>> -----Original Message-----
>> From: Bryan Keller [mailto:[email protected]]
>> Sent: Monday, November 29, 2010 1:41 PM
>> To: [email protected]
>> Subject: Schema design, one-to-many question
>>
>> I have read comments on modeling one-to-many relationships in HBase and
>> wanted to get some feedback. I have millions of customers, and each
>> customer can make zero to thousands of orders. I want to store all of this
>> data in HBase. The data is always accessed by customer.
>>
>> It seems there are a few schema design approaches.
>>
>> Approach 1: Orders table. One row per order. Customer data is either
>> denormalized, or the customer ID is stored for lookup in a customer data
>> cache. The table will have billions of rows of a few columns each.
>>
>> key: customer ID + order ID
>> family 1: customer (customer:id)
>> family 2: order (order:id, order:amount, order:date, etc.)
>>
>> Approach 2: Customer table. One row per customer. All orders are stored
>> in a column family with the order ID in the column name. Millions of rows
>> with potentially thousands of columns each.
>>
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:id_<id of order>, order:amount_<id of order>,
>> order:date_<id of order>)
>>
>> Approach 3: Same as #2, but store the order data as a serialized blob
>> instead of in separate columns:
>>
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:<id of order>)
>>
>> Approach 4: Not sure if this is viable, but same as #2 except versions in
>> the order family are used to store multiple orders.
>>
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions
>>
>> I am thinking approach #1 is probably the correct approach, but #2 and #3
>> (and #4?) would be more efficient from an application standpoint, as
>> everything is processed by customer and I won't need a customer data
>> cache or worry about updating denormalized data. Does anyone have
>> feedback as to what approaches work for them for data sets like this, and
>> why?
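For anyone following along, here is a minimal sketch of approach #2 with the qualifier ordered the way JG suggests (<id_of_order>_fieldname), written against the 0.90-era Java client. The table name ("customer"), the "order" family, and the ids and field values are hypothetical, not taken from the thread:

import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class WideOrderTable {
    private static final byte[] ORDER = Bytes.toBytes("order");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "customer"); // hypothetical table name

        // Write one order. Every qualifier shares the orderId prefix, so the
        // fields of a single order sort together within the row.
        String customerId = "cust42";      // hypothetical ids and values
        String orderId = "20101129-0001";
        Put put = new Put(Bytes.toBytes(customerId));
        put.add(ORDER, Bytes.toBytes(orderId + "_amount"), Bytes.toBytes("99.95"));
        put.add(ORDER, Bytes.toBytes(orderId + "_date"), Bytes.toBytes("2010-11-29"));
        table.put(put);

        // "Get me all the info for this customer's orders": one Get on the
        // row, then walk the family map, which is sorted by qualifier bytes,
        // so each order's fields come back contiguously.
        Get get = new Get(Bytes.toBytes(customerId));
        get.addFamily(ORDER);
        Result result = table.get(get);
        NavigableMap<byte[], byte[]> orders = result.getFamilyMap(ORDER);
        for (Map.Entry<byte[], byte[]> e : orders.entrySet()) {
            System.out.println(Bytes.toString(e.getKey()) + " = "
                    + Bytes.toString(e.getValue()));
        }

        table.close();
    }
}

Because qualifiers within a family are stored in sorted order, putting the order ID first keeps a single order's fields adjacent, so the "get me all the info for this order" query reads one contiguous slice of the row instead of fields scattered by name.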
