We are launching a data-intensive application that will store upwards of 50 million 150-byte records per day per user. We have chosen Cassandra as our database and Flume to load the data from log files into it.
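To put those figures in perspective, here is a rough back-of-the-envelope sketch of the raw volume per user. It uses only the two numbers above; the function names are made up for illustration, and it ignores replication factor, compression, and any per-row storage overhead in Cassandra, none of which we have measured.

```python
# Back-of-the-envelope sizing from the figures quoted above.
# Assumption: raw payload only (no replication, compression, or storage overhead).
RECORDS_PER_DAY = 50_000_000   # records per user per day
RECORD_BYTES = 150             # bytes per record
SECONDS_PER_DAY = 24 * 60 * 60


def daily_raw_bytes(records=RECORDS_PER_DAY, size=RECORD_BYTES):
    """Raw bytes written per user per day."""
    return records * size


def average_writes_per_second(records=RECORDS_PER_DAY):
    """Average write rate per user if the load were spread evenly over the day."""
    return records / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{daily_raw_bytes() / 1e9:.1f} GB/day per user (raw)")
    print(f"{average_writes_per_second():.0f} writes/s per user (average)")
```

That works out to about 7.5 GB of raw data and an average of roughly 579 writes per second per user; real log traffic is bursty, so peak rates would be higher.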
Each user is given their own server instance, but the schema of the data is the same for every user. We will be performing real-time analysis on this data as part of our application, and we are weighing the advantages and disadvantages of having all users share the same keyspace.

All data will be treated the same as far as replication factor is concerned; the only difference is that we won't display one user's data to another user. Users are compartmentalized: one user's data will never affect, or be compared against, another user's. Think of it this way: each user has their own Apache server, that server produces 50 million records per day, and each user analyzes only the data for their own server, not anyone else's. The log formats are exactly the same.

My experience is with relational databases, not key-value stores like Cassandra. In the MySQL world we would put each user in their own database to avoid lock contention and to make queries faster. If we don't write each user's data to a separate keyspace, I assume we will have to add a field to every record to identify the user who owns it.

How does a single large keyspace affect query speed and overall performance, compared with one keyspace per user?

Trevor Francis