Hi Ian,

Do you mean each transaction will be created as a column inside the cf for
transactions, and that these columns are created dynamically as transactions
occur?
Regards,
Xiaobo Gu

On Fri, Jul 13, 2012 at 11:08 AM, Ian Varley <[email protected]> wrote:
> Column families are not the same thing as columns. You should indeed have a
> small number of column families, as that article points out. Columns (aka
> column qualifiers) are run-time defined key/value pairs that contain the
> data for every row, and having large numbers of these is fine.
>
> On Jul 12, 2012, at 7:27 PM, "Cole" <[email protected]> wrote:
>
>> I think this design has some issues; please refer to
>> http://hbase.apache.org/book/number.of.cfs.html
>>
>> 2012/7/12 Ian Varley <[email protected]>
>>
>>> Yes, that's fine; you can always do a single-column PUT into an existing
>>> row, in a concurrency-safe way, and the lock on the row is only held as
>>> long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
>>> architecture, that's efficient: the PUT only goes to memory, and is
>>> merged with on-disk records at read time (until a regular flush or
>>> compaction happens).
>>>
>>> So even though you already have, say, 10K transactions in the table, it's
>>> still efficient to PUT a single new transaction in (whether that lands in
>>> the middle of the sorted list of columns, at the end, etc.).
>>>
>>> Ian
>>>
>>> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
>>>
>>> But there are other writers inserting new transactions into the table
>>> when customers do new transactions.
>>>
>>> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <[email protected]> wrote:
>>>
>>> Hi Xiaobo -
>>>
>>> For HBase, this is doable; you could have a single table in HBase where
>>> each row is a customer (with the customerid as the rowkey), and columns
>>> for each of the 300 attributes that are directly part of the customer
>>> entity. This is sparse, so you'd only take up space for the attributes
>>> that actually exist for each customer.
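The Log-Structured Merge-Tree behavior described above (a single-column PUT
only touches memory, and is merged with on-disk data at read time) can be
sketched in a few lines. This is illustrative Python only; `SketchStore`,
`memstore`, and `flush` are toy stand-ins for the idea, not HBase internals:

```python
class SketchStore:
    """Toy model of an LSM-style store: writes go to memory, reads merge."""

    def __init__(self):
        self.memstore = {}   # recent writes: (row, column) -> value
        self.disk = {}       # "flushed" data: (row, column) -> value

    def put(self, row, column, value):
        # A single-column PUT only touches memory, so it stays cheap even
        # when the row already holds thousands of columns on disk.
        self.memstore[(row, column)] = value

    def flush(self):
        # A periodic flush/compaction moves memstore contents to "disk".
        self.disk.update(self.memstore)
        self.memstore.clear()

    def get_row(self, row):
        # Read-time merge: start from disk, then newer in-memory values win.
        merged = {c: v for (r, c), v in self.disk.items() if r == row}
        merged.update({c: v for (r, c), v in self.memstore.items() if r == row})
        return merged
```

For example, a value PUT after a flush shadows the flushed copy of the same
column at read time, without rewriting anything on disk.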
>>> You could then have (possibly in another column family, but not
>>> necessarily) an additional column for each transaction, where the column
>>> name is composed of a date concatenated with the transaction id, and in
>>> which you store the 30 attributes serialized into a single byte array as
>>> the cell value. (Alternatively, you could make each attribute its own
>>> column, but there's no advantage to doing so, since presumably a
>>> transaction is roughly an immutable event for which you wouldn't
>>> typically change just a single attribute.) A schema for this, spelled
>>> out in an XML representation, could be:
>>>
>>> <table name="customer">
>>>   <key>
>>>     <column name="customerid" />
>>>   </key>
>>>   <columnFamily name="1">
>>>     <column name="customer_attribute_1" />
>>>     <column name="customer_attribute_2" />
>>>     ...
>>>     <column name="customer_attribute_300" />
>>>   </columnFamily>
>>>   <columnFamily name="2">
>>>     <entity name="transaction" values="serialized">
>>>       <key>
>>>         <column name="transaction_date" type="date" />
>>>         <column name="transaction_id" />
>>>       </key>
>>>       <column name="transaction_attribute_1" />
>>>       <column name="transaction_attribute_2" />
>>>       ...
>>>       <column name="transaction_attribute_30" />
>>>     </entity>
>>>   </columnFamily>
>>> </table>
>>>
>>> (This isn't real HBase syntax; it's just an abstract way to show you the
>>> structure.) In practice, HBase isn't doing anything "special" with the
>>> entity that lives nested inside your table; it's just a matter of
>>> convention that you could "see" it that way. The customer-level
>>> attributes (like, say, "customer_name" and "customer_address") would be
>>> literal column names (aka column qualifiers) embedded in your code,
>>> whereas the transaction-oriented columns would be created at runtime with
>>> column names like "2012-07-11 12:34:56_TXN12345", and values that are
>>> simply collection objects (containing the 30 attributes) serialized into
>>> a byte array.
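The runtime column naming and serialization described above might look like
this in outline. This is a Python sketch, not HBase client code; the exact
qualifier format and the use of JSON for the cell value are assumptions for
illustration (in practice you might use Avro, protobuf, or a hand-rolled
byte layout):

```python
import json
from datetime import datetime

def txn_qualifier(txn_time, txn_id):
    # Column name = date concatenated with the transaction id, so the
    # columns sort chronologically within the customer's row.
    return txn_time.strftime("%Y-%m-%d %H:%M:%S") + "_" + txn_id

def serialize_txn(attributes):
    # All ~30 transaction attributes packed into one cell value as a
    # single byte array (JSON is just a stand-in serialization here).
    return json.dumps(attributes, sort_keys=True).encode("utf-8")

qualifier = txn_qualifier(datetime(2012, 7, 11, 12, 34, 56), "TXN12345")
value = serialize_txn({"amount": "19.99", "currency": "USD"})
```

The qualifier produced here has the same shape as the
"2012-07-11 12:34:56_TXN12345" example in the message above.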
>>> In this scenario, you get fast access to any customer by ID, and further
>>> to a range of transactions by date (using, say, a column pagination
>>> filter). This would perform roughly equivalently regardless of how many
>>> customers are in the table, or how many transactions exist for each
>>> customer. What you'd lose with this design is the ability to get a
>>> single transaction for a single customer by ID (since you're storing
>>> them by date). But if you need that, you could actually store it both
>>> ways. You might also introduce some extra contention on concurrent
>>> transaction PUT requests for a single customer, because they'd have to
>>> fight over a lock for the row (but that's probably not a big deal, since
>>> it's only contentious within each customer).
>>>
>>> You might find my presentation on designing HBase schemas (from this
>>> year's HBaseCon) useful:
>>>
>>> http://www.hbasecon.com/sessions/hbase-schema-design-2/
>>>
>>> Ian
>>>
>>> On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:
>>>
>>> Hi,
>>>
>>> I have a technical problem, and wonder whether HBase or Cassandra
>>> support an embedded table data model, or whether somebody can show me a
>>> way to do this:
>>>
>>> 1. We have a very large customer entity table with 100 million rows;
>>>    each customer row has about 300 attributes (columns).
>>> 2. Each customer does about 1000 transactions per year; each transaction
>>>    has about 30 attributes (columns), and we keep only one year of
>>>    transactions for each customer.
>>>
>>> We want a data model where we can get the customer entity, with all the
>>> transactions he did within a fixed time window, in a single client call,
>>> according to the customer id (which is the primary key of the customer
>>> table).
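The date-range read mentioned above (a column pagination/range filter over
the sorted qualifiers) can be sketched as follows. `txns_in_window` is a
hypothetical helper in plain Python, not an HBase API; it just shows why
date-prefixed column names turn a lexicographic range scan into a
time-window query:

```python
def txns_in_window(row_columns, start_date, end_date):
    # row_columns: {qualifier: serialized_value} for one customer row,
    # where each qualifier starts with "YYYY-MM-DD HH:MM:SS".
    # Because the qualifiers begin with the date, a lexicographic range
    # over column names selects exactly the transactions in the window.
    # '~' (0x7E) sorts after digits and space, making end_date inclusive.
    lo, hi = start_date, end_date + "~"
    return {q: v for q, v in sorted(row_columns.items()) if lo <= q < hi}
```

For example, with qualifiers built as `date + "_" + txn_id`, asking for
`start_date="2012-07-11", end_date="2012-07-12"` returns only the columns
whose embedded dates fall on those two days.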
>>> We do the following in an RDBMS: a customer table with customerid as
>>> the primary key, and a transaction table with customer id as a secondary
>>> index; we either join them or make two separate calls. Because we have
>>> so many concurrent readers and these two tables have become so large,
>>> the RDBMS performs poorly.
>>>
>>> Can we embed the transactions inside the customer table in HBase or
>>> Cassandra?
>>>
>>> Regards,
>>>
>>> Xiaobo Gu
