Re: Embedded table data model

Guxiaobo Fri, 13 Jul 2012 05:02:54 -0700

Hi Ian,
What is your suggestion then?

Sent from my iPad


On 2012-7-13, at 12:55 PM, Ian Varley <[email protected]> wrote:

> Yes, that's what I mean.
> 
> It is not the only way to model this, but your question was, "Can we embed 
> the transactions inside the customer table in HBase?"
> 
> 
> 
> On Jul 12, 2012, at 8:21 PM, "Xiaobo Gu" <[email protected]> wrote:
> 
> Hi Ian,
> 
> Do you mean each transaction will be created as a column inside the cf
> for transactions, and these columns are created dynamically as
> transactions occur?
> 
> Regards,
> 
> Xiaobo Gu
> 
> On Fri, Jul 13, 2012 at 11:08 AM, Ian Varley <[email protected]> wrote:
> Column families are not the same thing as columns. You should indeed have a 
> small number of column families, as that article points out. Columns (aka 
> column qualifiers) are run-time defined key/value pairs that contain the data 
> for every row, and having large numbers of these is fine.
> 
> 
> 
> On Jul 12, 2012, at 7:27 PM, "Cole" <[email protected]> wrote:
> 
> I think this design is questionable; please refer to
> http://hbase.apache.org/book/number.of.cfs.html
> 
> 2012/7/12 Ian Varley <[email protected]>
> 
> Yes, that's fine; you can always do a single column PUT into an existing
> row, in a concurrency-safe way, and the lock on the row is only held as
> long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
> architecture, that's efficient because the PUT only goes to memory, and is
> merged with on-disk records at read time (until a regular flush or
> compaction happens).
> 
> So even though you already have, say, 10K transactions in the table, it's
> still efficient to PUT a single new transaction in (whether that's in the
> middle of the sorted list of columns, at the end, etc.)
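The write path described above can be sketched as a toy model. This is only an illustration of the log-structured idea (writes land in memory, reads merge memory with flushed files), not HBase code: the class `ToyLsmRow` and its methods are hypothetical names, and the sketch deliberately ignores the write-ahead log, cell versions, and compaction.

```python
class ToyLsmRow:
    """Toy model of one row's columns in a log-structured store (hypothetical)."""

    def __init__(self):
        self.memstore = {}      # recent writes, held in memory
        self.store_files = []   # immutable flushed snapshots, oldest first

    def put(self, qualifier: bytes, value: bytes) -> None:
        # A write only touches memory, however many columns already exist.
        self.memstore[qualifier] = value

    def flush(self) -> None:
        # Persist the in-memory writes as a new immutable "file".
        if self.memstore:
            self.store_files.append(dict(self.memstore))
            self.memstore = {}

    def read(self) -> dict:
        # Merge at read time: newer data overrides older data.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        merged.update(self.memstore)
        return dict(sorted(merged.items()))


row = ToyLsmRow()
for q in (b"2012-01-05_TXN1", b"2012-03-09_TXN2", b"2012-07-11_TXN4"):
    row.put(q, b"serialized attributes")
row.flush()                                    # three transactions now "on disk"
row.put(b"2012-05-20_TXN3", b"late arrival")   # lands in memory only
print(list(row.read()))
```

The late transaction sorts into the middle of the column list on read, yet the write itself never rewrote the flushed data.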
> 
> Ian
> 
> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
> 
> But there are other writers inserting new transactions into the table as
> customers make new transactions.
> 
> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <[email protected]> wrote:
> Hi Xiaobo -
> 
> For HBase, this is doable; you could have a single table in HBase where
> each row is a customer (with the customerid as the rowkey), and columns for
> each of the 300 attributes that are directly part of the customer entity.
> This is sparse, so you'd only take up space for the attributes that
> actually exist for each customer.
> 
> You could then have (possibly in another column family, but not
> necessarily) an additional column for each transaction, where the column
> name is composed of a date concatenated with the transaction id, in which
> you store the 30 attributes as serialized into a single byte array in the
> cell value. (Or, you could alternately do each attribute as its own column
> but there's no advantage to doing so, since presumably a transaction is
> roughly like an immutable event that you wouldn't typically change just a
> single attribute of.) A schema for this (if spelled out in an xml
> representation) could be:
> 
> <table name="customer">
> <key>
> <column name="customerid" />
> </key>
> <columnFamily name="1">
> <column name="customer_attribute_1" />
> <column name="customer_attribute_2" />
> ...
> <column name="customer_attribute_300" />
> </columnFamily>
> <columnFamily name="2">
> <entity name="transaction" values="serialized">
>   <key>
>     <column name="transaction_date" type="date" />
>     <column name="transaction_id" />
>   </key>
>   <column name="transaction_attribute_1" />
>   <column name="transaction_attribute_2" />
>   ...
>   <column name="transaction_attribute_30" />
> </entity>
> </columnFamily>
> </table>
> 
> (This isn't real HBase syntax, it's just an abstract way to show you the
> structure.) In practice, HBase isn't doing anything "special" with the
> entity that lives nested inside your table; it's just a matter of
> convention, that you could "see" it that way. The customer-level attributes
> (like, say, "customer_name" and "customer_address") would be literal column
> names (aka column qualifiers) embedded in your code, whereas the
> transaction-oriented columns would be created at runtime with column names
> like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
> objects (containing the 30 attributes) serialized into a byte array.
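A minimal sketch of the naming convention just described, assuming JSON as the serialization format (the thread does not name one) and modelling a row as a plain nested dict rather than using a real HBase client. The helper names `txn_qualifier` and `txn_value` are hypothetical.

```python
import json
from datetime import datetime


def txn_qualifier(ts: datetime, txn_id: str) -> bytes:
    # Fixed-width timestamp first, so byte order matches chronological order.
    return f"{ts:%Y-%m-%d %H:%M:%S}_{txn_id}".encode("utf-8")


def txn_value(attributes: dict) -> bytes:
    # All transaction attributes go into one cell value (JSON is an assumption).
    return json.dumps(attributes, sort_keys=True).encode("utf-8")


# One customer row, modelled as {column family: {qualifier: value}}.
row = {
    b"1": {b"customer_name": b"Alice"},   # literal, code-defined qualifiers
    b"2": {},                             # qualifiers created at runtime
}

q = txn_qualifier(datetime(2012, 7, 11, 12, 34, 56), "TXN12345")
row[b"2"][q] = txn_value({"amount": 42, "currency": "USD"})
print(q)
```

Putting the date before the transaction id is what makes a later scan by date range possible, since qualifiers within a row are kept in sorted byte order.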
> 
> In this scenario, you get fast access to any customer by ID, and further
> to a range of transactions by date (using, say, a column pagination
> filter). This would perform roughly equivalently regardless of how many
> customers are in the table, or how many transactions exist for each
> customer. What you'd lose on this design would be the ability to get a
> single transaction for a single customer by ID (since you're storing them
> by date). But if you need that, you could actually store it both ways. You
> also might be introducing some extra contention on concurrent transaction
> PUT requests for a single customer, because they'd have to fight over a lock
> for the row (but that's probably not a big deal, since it's only
> contentious within each customer).
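The date-range access mentioned above can be illustrated with a small in-memory sketch. It only shows why date-prefixed qualifiers make a time window a contiguous slice of the sorted columns. In a real client this selection would be done server-side with a filter on the qualifier range, and the function name `txns_in_window` and the sample data here are hypothetical.

```python
from bisect import bisect_left


def txns_in_window(columns: dict, start: str, stop: str) -> list:
    """Return (qualifier, value) pairs with start <= qualifier < stop."""
    qualifiers = sorted(columns)   # the store would already keep these sorted
    lo = bisect_left(qualifiers, start.encode("utf-8"))
    hi = bisect_left(qualifiers, stop.encode("utf-8"))
    return [(q, columns[q]) for q in qualifiers[lo:hi]]


# Transaction columns of one customer row (values elided for brevity).
columns = {
    b"2012-06-30 09:00:00_TXN1": b"...",
    b"2012-07-01 10:00:00_TXN2": b"...",
    b"2012-07-11 12:34:56_TXN3": b"...",
    b"2012-08-02 08:00:00_TXN4": b"...",
}

july = txns_in_window(columns, "2012-07-01", "2012-08-01")
print([q for q, _ in july])
```

Looking up a single transaction by id is not possible with this layout without scanning, which is the trade-off noted above; storing a second set of columns keyed by transaction id would cover that case.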
> 
> You might find my presentation on designing HBase schemas (from this
> year's HBaseCon) useful:
> 
> http://www.hbasecon.com/sessions/hbase-schema-design-2/
> 
> Ian
> 
> On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:
> 
> Hi,
> 
> I have a technical problem, and wonder whether HBase or Cassandra
> supports an embedded table data model, or whether somebody can show me a
> way to do this:
> 
> 1. We have a very large customer entity table which has 100 million
> rows; each customer row has about 300 attributes (columns).
> 2. Each customer does about 1000 transactions per year; each transaction
> has about 30 attributes (columns), and we keep only one year of
> transactions for each customer.
> 
> We want a data model in which we can get the customer entity, together
> with all the transactions the customer made, in a single client call
> within a fixed time window, according to the customer id (which is the
> primary key of the customer table). In our RDBMS we have a customer table
> with customerid as the primary key and a transaction table with customer
> id as a secondary index, and we either join them or make two separate
> calls. Because we have so many concurrent readers and these two tables
> have become so large, the RDBMS performs poorly.
> 
> 
> Can we embed the transactions inside the customer table in HBase or
> Cassandra?
> 
> 
> Regards,
> 
> Xiaobo Gu
> 
> 
>
