One concern I see with this schema is what happens when one of the shows becomes hot. Since you are keeping your bookings at the column level, all bookings for a show live in a single row, and a row is always served by a single region, so a hot row cannot be partitioned across regions. HBase is atomic at the row level. Therefore, different clients updating the same SHOW_ID will compete with each other, and the write throughput on that one row is limited because its updates are serialized.
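
To make the contention concrete, here is a minimal sketch in Python of a toy in-memory store with one lock per row key. It is not the HBase client API: `ToyRowStore`, `book`, and `run_demo` are hypothetical names, and the per-row `threading.Lock` only stands in for row-level atomicity, under the assumption that every mutation of a row holds that row's lock for its whole duration.

```python
import threading
from collections import defaultdict


class ToyRowStore:
    """Toy in-memory model of row-level atomicity (NOT the HBase API)."""

    def __init__(self):
        self._rows = defaultdict(dict)             # row key -> {column: value}
        self._locks = defaultdict(threading.Lock)  # one lock per row key
        self._guard = threading.Lock()             # protects the two maps above

    def _row_and_lock(self, row_key):
        # Look up (or create) the row and its lock under a short global guard.
        with self._guard:
            return self._rows[row_key], self._locks[row_key]

    def put_seats(self, show_id, seats):
        row, lock = self._row_and_lock(show_id)
        with lock:
            row["SEATS_AVAILABLE"] = seats

    def book(self, show_id, booking_id, seats):
        """Add a BOOKING_<id> column and decrement SEATS_AVAILABLE together."""
        row, lock = self._row_and_lock(show_id)
        with lock:  # every writer to this show waits here, one at a time
            if row.get("SEATS_AVAILABLE", 0) < seats:
                return False  # not enough seats left; row is unchanged
            row["SEATS_AVAILABLE"] -= seats
            row[f"BOOKING_{booking_id}"] = seats
            return True

    def get_row(self, show_id):
        row, lock = self._row_and_lock(show_id)
        with lock:
            return dict(row)  # copy, so callers never see a half-updated row


def run_demo(clients=50, seats_each=2, initial=1000):
    """Have many client threads book seats on the same (hot) show row."""
    store = ToyRowStore()
    store.put_seats("SHOW_1", initial)
    threads = [
        threading.Thread(target=store.book, args=("SHOW_1", i, seats_each))
        for i in range(clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.get_row("SHOW_1")


if __name__ == "__main__":
    row = run_demo()
    bookings = [c for c in row if c.startswith("BOOKING_")]
    print(row["SEATS_AVAILABLE"], len(bookings))
```

Every booking for SHOW_1 has to pass through the same lock, so adding clients adds waiting rather than throughput. Bookings for a different show key take a different lock and do not wait on it.
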
See this discussion on Quora:
http://www.quora.com/Is-there-a-limit-to-the-number-of-columns-in-an-HBase-row

I will let the experts comment further.

On Fri, Nov 18, 2011 at 9:33 AM, Suraj Varma <[email protected]> wrote:
> I have an HBase schema design question that I wanted to discuss with the list.
>
> Let's say we have a "wide" table design that has a table with one
> column family containing "show bookings", say.
>
> RowKey: SHOW_ID
> Columns: SEATS_AVAILABLE, BOOKING_<#1>, BOOKING_<#2>, BOOKING_<#3>, etc
> Values: <remaining available seats>, <seats booked>, <seats booked>,
> <seats booked>, etc
>
> Each "SHOW_ID" will have a variable number of columns.
>
> Usage Pattern:
> 1) Multiple clients / threads are constantly
> creating/updating/deleting "bookings" and this results in a column
> being added/updated/deleted in the row.
> 2) The SEATS_AVAILABLE column needs to be atomically updated whenever
> a corresponding BOOKING_<#> is added, updated or deleted.
> 3) Clients update their own unique BOOKING columns (i.e. clients
> update their own mutually exclusive BOOKING_<#> columns).
> 4) Clients can concurrently update the SEATS_AVAILABLE column.
> 5) Some SHOW_IDs will be harder hit than other SHOW_IDs.
> 6) A TTL on the BOOKING columns will be set to expire them after some set
> time.
> 7) We want to leverage the atomic update at "row level" that HBase
> provides for atomically updating the related columns.
>
> So - we are visualizing this as sort of an "equalizer" graphic on a
> stereo where each row is constantly varying in terms of columns added
> & removed. The SEATS_AVAILABLE value goes up & down correspondingly.
>
> Questions / Notes:
> 1) Could this lead to a hot key / hot row scenario? The columns being
> updated are mutually exclusive except for the SEATS_AVAILABLE. Or
> would this be very low overhead given that only one column is really
> being "updated" by multiple client threads?
>
> 2) The alternative we had explored was a tall table where each BOOKING
> is a separate row (SHOW_ID-BOOKING-<#> would be the key) ... however,
> in this case, we won't be able to atomically update the
> SEATS_AVAILABLE column at the same time.
>
> 3) In terms of "row locking", what is the granularity? i.e. when is
> the row level lock engaged to make it atomic (i.e. are the column
> updates made on the side and "swapped" in with the row level lock?) or
> is the row level lock held for the full duration of the update?
>
> 4) I think the concern is whether this design is scalable as the
> number of clients keeps increasing over time ...
>
> 5) Any other suggestions on how the hot row key scenario (if real) can be
> sidestepped?
>
> Thanks,
> --Suraj
>
