Hi Dobes,
Also, if Ted is still lurking on this list, he's an expert at this stuff. Here are some patterns I've seen.

What you describe is a pretty standard pattern. Substitute anything for "scores" (logs, sales, clicks, GPS tracking locations) and you find that many folks have solved the same issue. The general approach is either a single system (some kind of DB) or a two-part system (the lambda architecture [1]) such as you described earlier.

As you've discovered, Parquet is designed to be an archive format, not a live-update format. As you've seen, considerable work goes in at write time to organize data into columns, row groups, footers, etc. to speed up read access. But, as a result, Parquet cannot be updated incrementally.

A common pattern is to write data into Kafka as it arrives. Then, run your ETL step every so often (once a day, say). In the meantime, stream data to a short-term store. For scores, each (student, exercise) pair is unique, so you probably don't have to accumulate multiple events for the same exercise or test? Depending on your constraints and budget, you can accumulate current activity into a DB, an in-memory store, etc. Or, if you can find a good partitioning scheme that keeps each Kafka partition relatively small, you can even query Kafka directly (Drill has a Kafka plugin). In this model, Kafka would hold, say, today's data and Parquet would hold past data (partitioned in some way so you scan only one Parquet file per day per school, say). The Kafka component is essential for many reasons, including those above. It also gives you a rock-solid guarantee that you never lose data, even if your ETL fails, and so on.

So:

  External source --> Kafka --> ETL --> Parquet

And ...

  Kafka --> "Current Data" Cache

Both can then be queried by Drill to combine results as you described in an earlier message.
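To make that concrete, here is a rough sketch of the ingest side in Python, using the kafka-python client. The broker address, topic name ("scores"), and field names are just placeholders for whatever your system actually uses:

    from kafka import KafkaProducer   # pip install kafka-python
    import json, time

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # One event per (student, exercise) score as it arrives from the app.
    event = {
        "student_id": 123,
        "exercise_id": 456,
        "school_id": "school-1",
        "score": 0.9,
        "updated_at": int(time.time() * 1000),   # ms since epoch
    }

    # Keying by school keeps a school's events together in one partition,
    # which helps if you later query "today's data" for one school from Kafka.
    producer.send("scores", key=event["school_id"].encode("utf-8"), value=event)
    producer.flush()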
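The periodic ETL step could then drain the topic and write one Parquet file per school per day. This is only a sketch, assuming kafka-python and a recent pyarrow, JSON events shaped as above, and S3 credentials already configured for pyarrow; the bucket and paths are made up:

    import json, uuid
    from collections import defaultdict
    from datetime import datetime, timezone

    from kafka import KafkaConsumer
    import pyarrow as pa
    import pyarrow.parquet as pq

    consumer = KafkaConsumer(
        "scores",
        bootstrap_servers="localhost:9092",
        group_id="scores-etl",
        auto_offset_reset="earliest",
        enable_auto_commit=False,
        consumer_timeout_ms=10_000,   # stop iterating once the topic is drained
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    # Bucket this run's events by (school, day) so each file covers one partition.
    batches = defaultdict(list)
    for msg in consumer:
        event = msg.value
        day = datetime.fromtimestamp(event["updated_at"] / 1000,
                                     tz=timezone.utc).strftime("%Y-%m-%d")
        batches[(event["school_id"], day)].append(event)

    for (school, day), events in batches.items():
        table = pa.Table.from_pylist(events)
        # One directory per school per day; Drill exposes these as dir0/dir1.
        path = f"s3://my-bucket/scores/{school}/{day}/part-{uuid.uuid4().hex}.parquet"
        pq.write_table(table, path)

    # Advance the Kafka offsets only after the Parquet files are safely written,
    # so a failed run is simply retried rather than losing data.
    consumer.commit()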
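On the query side, Drill can union the Parquet history with the Kafka topic (via the Kafka storage plugin) and de-duplicate with a window function, much as you described in your earlier message. The storage plugin names ("s3", "kafka"), the directory layout, and the use of Drill's REST API from Python are all assumptions you'd adjust to your own setup:

    import requests

    sql = """
    WITH history AS (
      SELECT student_id, exercise_id, score, updated_at
      FROM s3.`scores`                     -- Parquet files written by the ETL
      WHERE dir0 = 'school-1'              -- prune by school directory
    ),
    current_data AS (
      SELECT student_id, exercise_id, score, updated_at
      FROM kafka.`scores`                  -- today's data, via the Kafka plugin
      WHERE school_id = 'school-1'
    )
    SELECT student_id, exercise_id, score, updated_at
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY student_id, exercise_id
                                ORDER BY updated_at DESC) AS rn
      FROM (SELECT * FROM history
            UNION ALL
            SELECT * FROM current_data) AS combined
    ) AS ranked
    WHERE rn = 1
    """

    # Drill's REST API: POST the SQL to a drillbit and read back JSON rows.
    resp = requests.post("http://localhost:8047/query.json",
                         json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    for row in resp.json().get("rows", []):
        print(row)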
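And for the roll-up you describe below (many small files compacted into fewer, larger ones as they age), a separate job can rewrite a partition in place. Another rough sketch with pyarrow, again with made-up bucket and path names; the ordering matters so that a concurrent query may briefly see duplicate rows (which the window function above tolerates) but never a gap:

    import uuid
    import pyarrow.parquet as pq
    import pyarrow.fs as pafs

    s3 = pafs.S3FileSystem()
    partition_dir = "my-bucket/scores/school-1/2020-02-26"

    # Find the small files written since the last roll-up.
    small_files = [f.path for f in s3.get_file_info(pafs.FileSelector(partition_dir))
                   if f.path.endswith(".parquet")]

    if len(small_files) > 1:
        # Read them back as one table and rewrite as a single larger file.
        table = pq.ParquetDataset(small_files, filesystem=s3).read()
        new_file = f"{partition_dir}/rolled-up-{uuid.uuid4().hex}.parquet"
        pq.write_table(table, new_file, filesystem=s3)

        # Delete the small files only after the new file exists; a query that
        # runs in between sees duplicates, not missing rows.
        for path in small_files:
            s3.delete_file(path)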
The best reference I've seen for this kind of thing is Designing Data-Intensive Applications by Martin Kleppmann [2] (avoid the knock-off with nearly the same title).

Thanks,
- Paul

[1] http://lambda-architecture.net/
[2] https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/

On Thursday, February 27, 2020, 11:37:45 AM PST, Dobes Vandermeer <[email protected]> wrote:

Hi,

I am trying to figure out a system that offers low latency in generating reports, low latency between data being collected and being available in the reporting system, and avoids glitches and errors. In our system, users are collecting responses from their students. We want to generate reports showing student scores over time, and the scores should typically be available within a minute of being collected.

I am looking to store the data tables in Parquet on S3 and query them using Drill. However, updating Parquet files can be a bit troublesome. The files cannot easily be appended to, so some process has to periodically re-write the Parquet files. Also, we don't want to have hundreds or thousands of separate files, as this can slow down query execution, so we don't want to end up with a new file every 10 seconds.

What I have been thinking is to have a process that writes changes fairly frequently to small new files, and another process that rolls up those small files into progressively larger ones as they get older. When querying the data I will have to de-duplicate and keep only the most recent version of each record, which I think is possible using window functions. Thus the file aggregation process might not have to worry about having the exact same row in two files temporarily.

I'm wondering if anyone has gone down this road before and has insights to share about it.
