Hi,

I am trying to figure out a system that can offer low latency in 
generating reports, low latency between data being collected and it 
becoming available in the reporting system, and freedom from glitches and 
errors.

In our system, users collect responses from their students.  We want to 
generate reports showing student scores over time, and the scores should 
typically be available within a minute of being collected.

I am looking to store the data tables as Parquet on S3 and query them using 
Drill.
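
Concretely, the writer side could be something like the sketch below, using 
pyarrow and its S3 filesystem: every so often the collector flushes whatever 
responses have arrived into one small Parquet file.  The bucket, prefix, 
region, and column names here are just placeholders for our actual schema:

    # Sketch of the frequent-writer process: on each flush, write the
    # responses collected so far into one small Parquet file on S3.
    # Assumes a recent pyarrow; bucket/prefix/column names are placeholders.
    import time
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow import fs

    s3 = fs.S3FileSystem(region="us-east-1")

    def flush_batch(rows):
        # rows: list of dicts like
        # {"student_id": ..., "assessment_id": ..., "score": ..., "collected_at": ...}
        if not rows:
            return
        table = pa.Table.from_pylist(rows)
        # One small file per flush, named by timestamp so files sort by age.
        path = f"my-bucket/scores/incoming/batch-{int(time.time())}.parquet"
        pq.write_table(table, path, filesystem=s3)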

However, updating Parquet files can be a bit troublesome.  The files cannot 
easily be appended to, so some process has to periodically re-write them.  
Also, we don't want to have hundreds or thousands of separate files, as this 
can slow down query execution.  So we don't want to end up with a new file 
every 10 seconds.

What I have been thinking is to have one process that writes changes fairly 
frequently to small new files, and another process that rolls those small 
files up into progressively larger ones as they age.
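
The roll-up side might look roughly like this: list the small files under the 
incoming prefix, read and concatenate the ones older than some threshold, 
write one larger file, and only then delete the originals.  Again pyarrow, and 
again the paths and the age threshold are placeholders:

    # Sketch of the roll-up process: combine small "incoming" files into one
    # larger file once they are older than some threshold.
    import time
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow import fs

    s3 = fs.S3FileSystem(region="us-east-1")

    def roll_up(prefix="my-bucket/scores/incoming", min_age_secs=300):
        infos = s3.get_file_info(fs.FileSelector(prefix))
        now = time.time()
        old_files = [i.path for i in infos
                     if i.is_file and now - i.mtime.timestamp() > min_age_secs]
        if not old_files:
            return
        # Read and concatenate; the small files must share a schema.
        combined = pa.concat_tables(
            [pq.read_table(p, filesystem=s3) for p in old_files])
        pq.write_table(combined,
                       f"my-bucket/scores/rolled/rollup-{int(now)}.parquet",
                       filesystem=s3)
        # Delete the small files only after the big one is written; the
        # de-duplicating query should tolerate the brief overlap anyway.
        for p in old_files:
            s3.delete_file(p)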

When querying the data, I will have to de-duplicate and keep only the most 
recent version of each record, which I think is possible using window 
functions.  That way the file-aggregation process would not have to worry 
about the same row temporarily appearing in two files.
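
For the de-duplication, something along these lines is what I had in mind: a 
ROW_NUMBER() window over each record's key, ordered by the collection 
timestamp, keeping only the newest row.  Here it is submitted through Drill's 
REST API (assuming the default port 8047); the dfs workspace path and column 
names are again placeholders:

    # Sketch of the de-duplicating query: keep only the newest version of
    # each record via ROW_NUMBER() in a derived table.  The dfs path, key
    # columns, timestamp column, and Drillbit host are placeholders.
    import requests

    DEDUP_SQL = """
    SELECT student_id, assessment_id, score, collected_at
    FROM (
      SELECT student_id, assessment_id, score, collected_at,
             ROW_NUMBER() OVER (
               PARTITION BY student_id, assessment_id
               ORDER BY collected_at DESC
             ) AS rn
      FROM dfs.`/scores`
    ) d
    WHERE rn = 1
    """

    resp = requests.post(
        "http://localhost:8047/query.json",
        json={"queryType": "SQL", "query": DEDUP_SQL},
    )
    rows = resp.json()["rows"]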

I'm wondering if anyone has gone down this road before and has insights to 
share about it.
