but they are other writers insert new transactions into the table when
customers do new transactions.

On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <[email protected]> wrote:
> Hi Xiaobo -
>
> For HBase, this is doable; you could have a single table in HBase where each 
> row is a customer (with the customerid as the rowkey), and columns for each 
> of the 300 attributes that are directly part of the customer entity. This is 
> sparse, so you'd only take up space for the attributes that actually exist 
> for each customer.
>
> You could then have (possibly in another column family, but not necessarily) 
> an additional column for each transaction, where the column name is composed 
> of a date concatenated with the transaction id, in which you store the 30 
> attributes as serialized into a single byte array in the cell value. (Or, you 
> could alternately do each attribute as its own column but there's no 
> advantage to doing so, since presumably a transaction is roughly like an 
> immutable event that you wouldn't typically change just a single attribute 
> of.) A schema for this (if spelled out in an xml representation) could be:
>
> <table name="customer">
>   <key>
>     <column name="customerid">
>   </key>
>   <columnfamily name="1">
>     <column name="customer_attribute_1" />
>     <column name="customer_attribute_2" />
>     ...
>     <column name="customer_attribute_300" />
>   </columnFamily>
>   <columnFamily name="2">
>     <entity name="transaction" values="serialized">
>       <key>
>         <column name="transaction_date" type="date">
>         <column name="transaction_id" />
>       </key>
>       <column name="transaction_attribute_1" />
>       <column name="transaction_attribute_2" />
>       ...
>       <column name="transaction_attribute_30" />
>     </entity>
>   </columnFamily>
> </table>
>
> (This isn't real HBase syntax, it's just an abstract way to show you the 
> structure.) In practice, HBase isn't doing anything "special" with the entity 
> that lives nested inside your table; it's just a matter of convention, that 
> you could "see" it that way. The customer-level attributes (like, say, 
> "customer_name" and "customer_address") would be literal column names (aka 
> column qualifiers) embedded in your code, whereas the transaction-oriented 
> columns would be created at runtime with column names like "2012-07-11 
> 12:34:56_TXN12345", and values that are simply collection objects (containing 
> the 30 attributes) serialized into a byte array.
>
> In this scenario, you get fast access to any customer by ID, and further to a 
> range of transactions by date (using, say, a column pagination filter). This 
> would perform roughly equivalently regardless of how many customers are in 
> the table, or how many transactions exist for each customer. What you'd lose 
> on this design would be the ability to get a single transaction for a single 
> customer by ID (since you're storing them by date). But if you need that, you 
> could actually store it both ways. You also might be introducing some extra 
> contention on concurrent transaction PUT requests for a single client, 
> because they'd have to fight over a lock for the row (but that's probably not 
> a big deal, since it's only contentious within each customer).
>
> You might find my presentation on designing HBase schemas (from this year's 
> HBaseCon) useful:
>
> http://www.hbasecon.com/sessions/hbase-schema-design-2/
>
> Ian
>
> On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:
>
> Hi,
>
> I have technical problem, and wander whether HBase or Cassandra
> support Embedded table data model, or can somebody show me a way to do
> this:
>
> 1.We have a very large customer entity table which have 100 milliion
> rows, each customer row has about 300 attributes(columns).
> 2.Each customer do about 1000 transactions per year, each transaction
> has about 30 attributes(columns), and we just save one year
> transactions for each customer
>
> We want a data model that  we can get the customer entity with all the
> transactions which he did for a single client call within a fixed time
> window, according to the customer id (which is the primary key of the
> customer table). We do the following in RDBMS,
> A customer table with customerid as the primary key, A transaction
> table with customer id as a secondary index, and join them , or we
> must do two separate  calls, and because we have so many concurrent
> readers and these two tables are became so large, the RDBMS system
> performs poor.
>
>
> Can we embedded the transactions inside the customer table in HBase or
> Cassandra?
>
>
> Regards,
>
> Xiaobo Gu
>

Reply via email to