Hi all,


I am trying to store some data (user actions in our application) for future
analysis (probably using Spark). I understand the best practice is to store it
in denormalized form, and that will definitely make some of our future
queries much easier. But I am running into a problem actually denormalizing the data.



For example, let’s say one of my queries is “the number of reports
generated by user type”. In the part of the application where the user
generates reports, we only have access to the user id. In a
traditional RDBMS this is fine, because at query time you join the user id
onto the users table and get all the user data associated with that user.
But in a denormalized store, how do I populate extra fields like user type
on the fly?
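
To make the example concrete, this is roughly the denormalized row I would
like to end up with. The keyspace, table and column names below are made up
purely for illustration:

import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster

# Hypothetical keyspace/table/column names, purely to show the target shape.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('analytics')

insert_action = session.prepare(
    "INSERT INTO user_actions (user_id, action_time, action_type, user_type) "
    "VALUES (?, ?, ?, ?)")

user_id = uuid.uuid4()                       # all the write path actually knows
action_time = datetime.now(timezone.utc)
user_type = 'premium'                        # <-- the field I can't fill in cheaply

session.execute(insert_action, (user_id, action_time, 'generate_report', user_type))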



My ideas so far:

1.       I try to maintain an in-memory cache of data such as “user”, look
up each user action against that cache, and store the user data alongside
the action (rough sketch after this list). #PROS: fast. #CONS: not scalable;
we will run out of memory as the data sets grow.

2.       For each user action, I make a call to the RDBMS, look up the data
for the user in question, and store the user action plus the user data as
a single row. #PROS: easy to scale. #CONS: slow.

3.       I write only the user id and the action straight away, and have a
separate batch process that periodically goes through my table looking for
rows without user data, looks the user data up in the RDBMS, and populates
it.
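
To make option 1 (with option 2 as the fallback on a cache miss) concrete,
here is a rough sketch. The Postgres connection string and the users
table/column names are assumptions, and the bounded LRU cache is just one
way of doing the in-memory lookup:

from functools import lru_cache

import psycopg2

# Hypothetical connection string and schema, purely for illustration.
conn = psycopg2.connect("dbname=app user=app")

@lru_cache(maxsize=100_000)                  # option 1: bounded in-memory cache
def get_user_type(user_id):
    # Option 2 as the fallback: hit the RDBMS on a cache miss.
    with conn.cursor() as cur:
        cur.execute("SELECT user_type FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
    return row[0] if row else None

def record_action(session, insert_action, user_id, action_time, action_type):
    # Enrich the event with user data before writing the denormalized row.
    session.execute(
        insert_action,
        (user_id, action_time, action_type, get_user_type(user_id)))

The obvious catch is that the cached user data can go stale, and the LRU
bound only caps memory; it does nothing for the lookup cost on misses.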





None of these solutions seems ideal to me. Does Cassandra have something
like ‘triggers’, where I can set up a table to automatically populate some
columns based on a lookup from another table? Or perhaps Spark or some other
library has built-in functionality that solves exactly this problem?
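
To make the Spark part of that question concrete, the kind of thing I am
imagining is skipping denormalization at write time entirely and joining at
analysis time, roughly like this. The table names and connection settings
are placeholders, and I am assuming the Spark Cassandra Connector and a
Postgres JDBC driver are on the classpath:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reports-by-user-type").getOrCreate()

# Raw events exactly as written: just user_id plus the action, no user data.
actions = (spark.read
           .format("org.apache.spark.sql.cassandra")   # Spark Cassandra Connector
           .options(keyspace="analytics", table="user_actions")
           .load())

# The user dimension pulled straight from the RDBMS.
users = spark.read.jdbc(
    url="jdbc:postgresql://dbhost/app",
    table="users",
    properties={"user": "app", "password": "secret"})

# Do the join at analysis time instead of at write time.
report_counts = (actions
                 .filter(F.col("action_type") == "generate_report")
                 .join(users, actions.user_id == users.id)
                 .groupBy("user_type")
                 .count())

report_counts.show()

That would sidestep the write-time problem completely, but I don't know
whether re-doing the join on every analysis job is sensible as the data
grows.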



Any suggestions much appreciated.



Thanks,

Matthew
