Hi all,
I was hoping you could provide some assistance with a data modeling
question (my apologies if a similar question has already been posed). I have
time-based data that I need to store on a per-customer (aka app id) basis so
that I can easily return it in sorted order by event time. The data in
question is being written at high volume (~50K writes/sec) and I am concerned about
the cardinality of using either app id or event time as the row key, as either
will likely result in hot spots. Here is the table definition I am
considering:
create table organic_events (
  event_id UUID,
  app_id INT,
  event_time TIMESTAMP,
  user_id INT,
  ….
  PRIMARY KEY (app_id, event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC);
That way I can query as follows, and the results will naturally come back
sorted by time descending:
select * from organic_events where app_id = 1234
  and event_time >= '2012-01-01' and event_time < '2012-01-02';
Anyone have an idea of the best way to accomplish this? I was considering the
following:
* Making the row key a concatenation of app id and a bucket number (0-99, via
a mod of the event id). When reading data I would just fetch all of the bucketed
keys (app_id in ('1234_0', '1234_1', '1234_2', etc.)). This would alleviate the
"hot" key issue but still seems expensive and a little hacky (see the sketch
after this list).
* I tried removing app_id from the primary key altogether (using a primary
key of user_id, event_time, event_id) and making app_id a secondary index
(also sketched after the list). I would then need to sort by time on the
client. The above query is valid, however running it is VERY slow, as I
believe it needs to fetch every row key that matches the index, which is quite
expensive (I get a timeout in cqlsh).
* Create a different column family for each app id (e.g.
1234_organic_events). Note that we could easily have 1000s of application ids.
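For reference, here is a rough sketch of what the first option might look like.
The table name, the "bucket" column, and the bucket count of 4 are made up for
illustration (in practice you would pick a count high enough to spread the write
load), and I have sketched it with a composite partition key rather than a
concatenated string key, which amounts to the same thing in CQL3. The bucket
value would be computed client side at write time, e.g. as a hash or mod of the
event id:

create table organic_events_bucketed (
  app_id INT,
  bucket INT,            -- computed client side, e.g. event id mod 4
  event_id UUID,
  event_time TIMESTAMP,
  user_id INT,
  PRIMARY KEY ((app_id, bucket), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- read side: hit every bucket for the app and merge client side
select * from organic_events_bucketed
  where app_id = 1234 and bucket in (0, 1, 2, 3)
  and event_time >= '2012-01-01' and event_time < '2012-01-02';

Each partition comes back sorted by event_time, but since one app's rows are now
spread across several partitions, the per-bucket results would still need to be
merged by event_time on the client. Depending on the Cassandra version it may
also be cheaper to issue one query per bucket in parallel instead of a single
IN query.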
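And this is roughly the shape of the second option, in case it helps frame the
problem. The names are again just for illustration, and I have written it as a
separate table so the two definitions can be compared side by side:

create table organic_events_by_user (
  event_id UUID,
  app_id INT,
  event_time TIMESTAMP,
  user_id INT,
  PRIMARY KEY (user_id, event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC);

create index organic_events_app_idx on organic_events_by_user (app_id);

-- valid, but very slow / times out, since the index lookup has to touch
-- every partition containing rows for that app:
select * from organic_events_by_user where app_id = 1234;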
Thanks!