Hi, I am trying to figure out a system that offers low latency when generating reports, low latency between data being collected and becoming available in the reporting system, and freedom from glitches and errors.
In our system, users collect responses from their students. We want to generate reports showing student scores over time, and the scores should typically be available within a minute of being collected. I am looking to store the data tables as Parquet on S3 and query them using Drill.

However, updating Parquet files is a bit troublesome: the files cannot easily be appended to, so some process has to periodically rewrite them. We also don't want to end up with hundreds or thousands of separate files, since that slows down query execution, so producing a new file every 10 seconds is out.

What I have been thinking is to have one process that writes changes fairly frequently to small new files, and another process that rolls those small files up into progressively larger ones as they age. When querying the data I will have to de-duplicate and keep only the most recent version of each record, which I think is possible using window functions (rough sketch below). That way the file aggregation process might not have to worry about temporarily having the exact same row in two files. I'm wondering if anyone has gone down this road before and has insights to share about it.
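To make the idea concrete, this is roughly the kind of Drill query I have in mind for the de-duplication step. The storage plugin name (s3), the path, and the columns (response_id, version, student_id, score, collected_at) are placeholders for our actual schema:

    -- keep only the most recent version of each response
    SELECT response_id, student_id, score, collected_at
    FROM (
      SELECT response_id, student_id, score, collected_at,
             ROW_NUMBER() OVER (
               PARTITION BY response_id
               ORDER BY version DESC
             ) AS rn
      FROM s3.`/responses`
    ) t
    WHERE rn = 1;

I'm also wondering whether the roll-up step could be done with Drill itself, by periodically running a CTAS from the directory of small files into a writable workspace (again, all names here are placeholders):

    -- periodic roll-up: rewrite a batch of small files as one larger Parquet table
    ALTER SESSION SET `store.format` = 'parquet';
    CREATE TABLE dfs.tmp.`responses_compacted` AS
    SELECT * FROM s3.`/responses/incoming`;

In practice the CTAS target would be a writable workspace pointing back at S3, with some housekeeping to swap the compacted output in and delete the small files it replaced.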
