We are launching a data-intensive application that will store upwards of 50 
million 150-byte records per day per user. We have chosen Cassandra as our 
database and plan to use Flume to load the data from log files into it.
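
For a rough sense of scale, here is a back-of-the-envelope sketch of what those 
numbers mean per user. It counts raw payload only and assumes writes are spread 
evenly over the day; replication, compression, and storage overhead are ignored, 
and the function names are only illustrative.

```python
RECORDS_PER_DAY = 50_000_000   # per user, from the figures above
BYTES_PER_RECORD = 150         # per record, from the figures above
SECONDS_PER_DAY = 86_400


def daily_raw_bytes(records=RECORDS_PER_DAY, record_size=BYTES_PER_RECORD):
    """Raw payload written per user per day, before any overhead."""
    return records * record_size


def average_writes_per_second(records=RECORDS_PER_DAY):
    """Average write rate per user, assuming an even spread over the day."""
    return records / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{daily_raw_bytes() / 1e9:.1f} GB of raw payload per user per day")
    print(f"{average_writes_per_second():.0f} writes per second on average")
```

Real traffic will be burstier than the average, so the peak write rate is likely 
to be noticeably higher.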

Each user is given their own server instance, but the schema of the data for 
each user will be the same.

We will be performing real-time analysis on this data as part of our 
application, and we are weighing the advantages and disadvantages of having all 
users share the same keyspace. All data will use the same replication factor; 
the only difference is that we won't display one user's data to another user. 
Users will be compartmentalized, and one user's data will never affect or be 
compared against another user's.

Think of it this way: each user has their own Apache server, that server 
produces 50 million records per day, and each user analyzes only the data from 
their own server, not anyone else's. The log formats are exactly the same.

My experience is with relational databases, not key-value stores like 
Cassandra. In the MySQL world, we would put each user in their own database to 
avoid lock contention and to make queries faster.

If we don't write the data into separate keyspaces, I assume we will have to add 
an additional field to each record to identify the user who owns it. How does a 
single large keyspace affect query speed and performance in general?
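
To make the shared-keyspace option concrete, here is a minimal sketch of what I 
mean by an owner field. Everything in it is hypothetical: the `LogRecord` type, 
the `user_id` field, and the idea of prefixing the row key with the user and 
the day are assumptions for illustration, not an established Cassandra schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogRecord:
    user_id: str         # hypothetical extra field identifying the owner
    timestamp: datetime  # when the log line was produced
    payload: bytes       # the log line itself, roughly 150 bytes


def row_key(record: LogRecord) -> str:
    """Build a key that starts with the owner, then the day of the record.

    Keys for different users never collide, so a read for one user can be
    addressed by key without scanning another user's rows.
    """
    day = record.timestamp.strftime("%Y%m%d")
    return f"{record.user_id}:{day}"


if __name__ == "__main__":
    ts = datetime(2012, 5, 1, 12, 0, tzinfo=timezone.utc)
    a = LogRecord("user-a", ts, b"x" * 150)
    b = LogRecord("user-b", ts, b"x" * 150)
    print(row_key(a), row_key(b))
```

With separate keyspaces the `user_id` prefix would be unnecessary, since the 
keyspace itself would identify the owner.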



Trevor Francis

