Re: Embedded table data model

Ian Varley Wed, 11 Jul 2012 22:14:03 -0700

Hi Xiaobo -

For HBase, this is doable; you could have a single table in HBase where each 
row is a customer (with the customerid as the rowkey), and columns for each of 
the 300 attributes that are directly part of the customer entity. This is 
sparse, so you'd only take up space for the attributes that actually exist for 
each customer.


You could then have (possibly in another column family, but not necessarily) an 
additional column for each transaction, where the column name is composed of a 
date concatenated with the transaction id, in which you store the 30 attributes 
as serialized into a single byte array in the cell value. (Or, you could 
alternately do each attribute as its own column but there's no advantage to 
doing so, since presumably a transaction is roughly like an immutable event 
that you wouldn't typically change just a single attribute of.) A schema for 
this (if spelled out in an xml representation) could be:

<table name="customer">
  <key>
    <column name="customerid">
  </key>
  <columnfamily name="1">
    <column name="customer_attribute_1" />
    <column name="customer_attribute_2" />
    ...
    <column name="customer_attribute_300" />
  </columnFamily>
  <columnFamily name="2">
    <entity name="transaction" values="serialized">
      <key>
        <column name="transaction_date" type="date">
        <column name="transaction_id" />
      </key>
      <column name="transaction_attribute_1" />
      <column name="transaction_attribute_2" />
      ...
      <column name="transaction_attribute_30" />
    </entity>
  </columnFamily>
</table>

(This isn't real HBase syntax, it's just an abstract way to show you the 
structure.) In practice, HBase isn't doing anything "special" with the entity 
that lives nested inside your table; it's just a matter of convention, that you 
could "see" it that way. The customer-level attributes (like, say, 
"customer_name" and "customer_address") would be literal column names (aka 
column qualifiers) embedded in your code, whereas the transaction-oriented 
columns would be created at runtime with column names like "2012-07-11 
12:34:56_TXN12345", and values that are simply collection objects (containing 
the 30 attributes) serialized into a byte array.

In this scenario, you get fast access to any customer by ID, and further to a 
range of transactions by date (using, say, a column pagination filter). This 
would perform roughly equivalently regardless of how many customers are in the 
table, or how many transactions exist for each customer. What you'd lose on 
this design would be the ability to get a single transaction for a single 
customer by ID (since you're storing them by date). But if you need that, you 
could actually store it both ways. You also might be introducing some extra 
contention on concurrent transaction PUT requests for a single client, because 
they'd have to fight over a lock for the row (but that's probably not a big 
deal, since it's only contentious within each customer).

You might find my presentation on designing HBase schemas (from this year's 
HBaseCon) useful:

http://www.hbasecon.com/sessions/hbase-schema-design-2/

Ian

On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:

Hi,

I have technical problem, and wander whether HBase or Cassandra
support Embedded table data model, or can somebody show me a way to do
this:

1.We have a very large customer entity table which have 100 milliion
rows, each customer row has about 300 attributes(columns).
2.Each customer do about 1000 transactions per year, each transaction
has about 30 attributes(columns), and we just save one year
transactions for each customer

We want a data model that  we can get the customer entity with all the
transactions which he did for a single client call within a fixed time
window, according to the customer id (which is the primary key of the
customer table). We do the following in RDBMS,
A customer table with customerid as the primary key, A transaction
table with customer id as a secondary index, and join them , or we
must do two separate  calls, and because we have so many concurrent
readers and these two tables are became so large, the RDBMS system
performs poor.


Can we embedded the transactions inside the customer table in HBase or
Cassandra?


Regards,

Xiaobo Gu

Re: Embedded table data model

Reply via email to