but they are other writers insert new transactions into the table when customers do new transactions.
On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <[email protected]> wrote: > Hi Xiaobo - > > For HBase, this is doable; you could have a single table in HBase where each > row is a customer (with the customerid as the rowkey), and columns for each > of the 300 attributes that are directly part of the customer entity. This is > sparse, so you'd only take up space for the attributes that actually exist > for each customer. > > You could then have (possibly in another column family, but not necessarily) > an additional column for each transaction, where the column name is composed > of a date concatenated with the transaction id, in which you store the 30 > attributes as serialized into a single byte array in the cell value. (Or, you > could alternately do each attribute as its own column but there's no > advantage to doing so, since presumably a transaction is roughly like an > immutable event that you wouldn't typically change just a single attribute > of.) A schema for this (if spelled out in an xml representation) could be: > > <table name="customer"> > <key> > <column name="customerid"> > </key> > <columnfamily name="1"> > <column name="customer_attribute_1" /> > <column name="customer_attribute_2" /> > ... > <column name="customer_attribute_300" /> > </columnFamily> > <columnFamily name="2"> > <entity name="transaction" values="serialized"> > <key> > <column name="transaction_date" type="date"> > <column name="transaction_id" /> > </key> > <column name="transaction_attribute_1" /> > <column name="transaction_attribute_2" /> > ... > <column name="transaction_attribute_30" /> > </entity> > </columnFamily> > </table> > > (This isn't real HBase syntax, it's just an abstract way to show you the > structure.) In practice, HBase isn't doing anything "special" with the entity > that lives nested inside your table; it's just a matter of convention, that > you could "see" it that way. The customer-level attributes (like, say, > "customer_name" and "customer_address") would be literal column names (aka > column qualifiers) embedded in your code, whereas the transaction-oriented > columns would be created at runtime with column names like "2012-07-11 > 12:34:56_TXN12345", and values that are simply collection objects (containing > the 30 attributes) serialized into a byte array. > > In this scenario, you get fast access to any customer by ID, and further to a > range of transactions by date (using, say, a column pagination filter). This > would perform roughly equivalently regardless of how many customers are in > the table, or how many transactions exist for each customer. What you'd lose > on this design would be the ability to get a single transaction for a single > customer by ID (since you're storing them by date). But if you need that, you > could actually store it both ways. You also might be introducing some extra > contention on concurrent transaction PUT requests for a single client, > because they'd have to fight over a lock for the row (but that's probably not > a big deal, since it's only contentious within each customer). > > You might find my presentation on designing HBase schemas (from this year's > HBaseCon) useful: > > http://www.hbasecon.com/sessions/hbase-schema-design-2/ > > Ian > > On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote: > > Hi, > > I have technical problem, and wander whether HBase or Cassandra > support Embedded table data model, or can somebody show me a way to do > this: > > 1.We have a very large customer entity table which have 100 milliion > rows, each customer row has about 300 attributes(columns). > 2.Each customer do about 1000 transactions per year, each transaction > has about 30 attributes(columns), and we just save one year > transactions for each customer > > We want a data model that we can get the customer entity with all the > transactions which he did for a single client call within a fixed time > window, according to the customer id (which is the primary key of the > customer table). We do the following in RDBMS, > A customer table with customerid as the primary key, A transaction > table with customer id as a secondary index, and join them , or we > must do two separate calls, and because we have so many concurrent > readers and these two tables are became so large, the RDBMS system > performs poor. > > > Can we embedded the transactions inside the customer table in HBase or > Cassandra? > > > Regards, > > Xiaobo Gu >
