Hi all, I have a schema design question that I would like to run by the group.
Table "A" has a composite row key containing a user identifier and an item identifier. I need to query this table by time, to determine the latest X new or updated rows on a per-user basis. (This is a "live" query, not part of a MapReduce workflow.) To support this query, I have designed an additional table (table "B") with the following schema:

- row key: a composite key containing a user identifier (fixed width) and a reverse-order timestamp (fixed width), both parts stored in their binary form
- a single column family
- each column qualifier is the actual item identifier; this is used to form part of the row key for table A
- cell values are always empty

There are a few additional options regarding table B:

1) use the exact timestamp that the event occurred at for the row key, or
2) set a "max timespan" to control the maximum time delta that can be stored per row, and quantize the timestamp of each row to start/end along these "max timespan" boundaries.

The number of rows added or modified in table A will vary on a per-user basis, ranging anywhere from 50 to 5000 modifications per user per day. The size of the user base is measured in millions, so there will be significant write traffic. With design 1), almost every addition/modification will result in a new row. With design 2), the number of rows depends on the timespan; if it is set to one day, for example, there will be one row per user per day with roughly 50 to 5000 columns.

To satisfy the requisite query, given design 1): perform a scan, specifying a start key that is simply the user's identifier, and limit the number of rows to X. Given design 2), the query is less straightforward, because we do not know how many columns each row contains before issuing the scan. However, I can estimate roughly how many rows I will need, add a buffer, and limit the scan to that number of rows. Then, given the response, I can sort the results client-side by timestamp and truncate to X.
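For concreteness, here is a minimal Java sketch of how I picture the table "B" row key being composed; the 8-byte widths and the one-day span are illustrative assumptions on my part, not fixed parts of the design:

```java
import java.nio.ByteBuffer;

// A sketch of the table "B" row-key layout: a fixed-width user
// identifier followed by a reverse-order timestamp, both in binary
// form. The 8-byte widths and the one-day span are assumed values.
public final class RowKeys {
    static final long SPAN_MS = 86_400_000L; // design 2: one-day buckets (assumed)

    // Design 1: exact event timestamp, reversed so newer events sort first.
    static byte[] exactKey(long userId, long tsMillis) {
        return ByteBuffer.allocate(16)
                .putLong(userId)                     // fixed-width user id
                .putLong(Long.MAX_VALUE - tsMillis)  // reverse-order timestamp
                .array();
    }

    // Design 2: quantize the timestamp down to its span boundary first,
    // so every event within one span lands in the same row.
    static byte[] quantizedKey(long userId, long tsMillis) {
        return exactKey(userId, tsMillis - (tsMillis % SPAN_MS));
    }

    // Unsigned lexicographic comparison, matching how HBase orders row keys.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

Because the timestamp is stored reversed, a scan starting at the bare user id returns the newest rows first, which is what makes the "limit to X rows" query in design 1) work.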
I took a look at the OpenTSDB source code after hearing about it at HadoopWorld, and its implementation is similar to the second design (quantizing event timestamps and storing multiple events per row), although the use cases and queries are different.

I currently have option 2) mostly implemented, but I wanted to run this by the community to make sure that I am on the right track. Is design 2) the better choice? Its query is not as clean as that of design 1), and it requires extra bandwidth per query and client-side CPU for sorting, but it very significantly reduces the number of required rows (by several orders of magnitude). Alternatively, is there an even better implementation path that I have not considered?

Thanks all for any help and advice.

Happy New Year,
Michael
