Hi all, I have a schema design question that I would like to run by the group.
Table "A" has a composite row key containing a user identifier and an item identifier. I need to query this table by time, to determine the latest X new or updated rows on a per-user basis. (This is a "live" query, not part of a MapReduce workflow.) To support this query, I have designed an additional table (table "B") with the following schema:

- row key: a composite key containing a user identifier (fixed width) and a reverse-order timestamp (fixed width), both parts stored in their binary form
- a single column family
- each column qualifier is the actual item identifier; this is used to form part of the row key for table A
- cell values are always empty

There are a few additional options regarding table B:

1) use the exact timestamp that the event occurred at for the row key, or
2) set a "max timespan" to control the maximum time delta that can be stored per row, and quantize the timestamp of each row to start/end along these "max timespan" boundaries.

The number of rows added or modified in table A will vary on a per-user basis, ranging anywhere from 50 to 5000 modifications per user per day. The size of the user base is measured in millions, so there will be significant write traffic. With design 1), almost every addition/modification will result in a new row. With design 2), the number of rows depends on the timespan; if it is set to one day, for example, there will be one row per user per day with roughly 50 to 5000 columns.

To satisfy the requisite query, given design 1): perform a scan, specifying a start key that is simply the user's identifier, and limit the number of rows to X. Given design 2), the query is less straightforward, because we do not know how many columns each row contains before issuing the scan. However, I can estimate roughly how many rows I will need, add a buffer, and limit the scan to that number of rows. Then, given the response, I can sort the results client-side by timestamp and truncate to X.
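For concreteness, here is a minimal Java sketch of how I picture the table "B" row key being composed; the 8-byte widths and the one-day span are illustrative assumptions on my part, not fixed parts of the design:

```java
import java.nio.ByteBuffer;

// A sketch of the table "B" row-key layout: a fixed-width user
// identifier followed by a reverse-order timestamp, both in binary
// form. The 8-byte widths and the one-day span are assumed values.
public final class RowKeys {
    static final long SPAN_MS = 86_400_000L; // design 2: one-day buckets (assumed)

    // Design 1: exact event timestamp, reversed so newer events sort first.
    static byte[] exactKey(long userId, long tsMillis) {
        return ByteBuffer.allocate(16)
                .putLong(userId)                     // fixed-width user id
                .putLong(Long.MAX_VALUE - tsMillis)  // reverse-order timestamp
                .array();
    }

    // Design 2: quantize the timestamp down to its span boundary first,
    // so every event within one span lands in the same row.
    static byte[] quantizedKey(long userId, long tsMillis) {
        return exactKey(userId, tsMillis - (tsMillis % SPAN_MS));
    }

    // Unsigned lexicographic comparison, matching how HBase orders row keys.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

Because the timestamp is stored reversed, a scan starting at the bare user id returns the newest rows first, which is what makes the "limit to X rows" query in design 1) work.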
I took a look at the OpenTSDB source code after hearing about it at HadoopWorld, and its implementation is similar to the second design (quantizing event timestamps and storing multiple events per row), although the use cases and queries are different.

I currently have option 2) mostly implemented, but I wanted to run this by the community to make sure that I am on the right track. Is design 2) the better choice? Its query is not as clean as that of design 1), and it requires extra bandwidth per query and client-side CPU for sorting, but it very significantly reduces the number of required rows (by several orders of magnitude). Alternatively, is there an even better implementation path that I have not considered?

Thanks all for any help and advice.

Happy New Year,
Michael
