Hi Dobes,

If Ted is still lurking on this list, he's an expert at this stuff. In any case, here 
are some patterns I've seen.


What you describe is a pretty standard pattern. Substitute anything for 
"scores" (logs, sales, clicks, GPS tracking locations) and you find that many 
folks have solved the same issue.

The general approach is either a single system (some kind of DB) or a two-part 
system (lambda architecture [1]) such as you described earlier.

As you've discovered, Parquet is designed to be an archive format, not a live-update 
format. Considerable work goes on at write time to organize data into columns, row 
groups, footers, etc. in order to speed up read access. But, as a result, Parquet 
cannot be updated incrementally.
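
For example, here is a minimal sketch of the write side using pyarrow (the schema, 
field names, and file name are placeholders, not anything from your system): the 
whole batch is handed to the writer at once, and once the footer is written the 
file is closed for good.

    # Sketch only: writing one batch of score events to Parquet with pyarrow.
    # Schema and file name are placeholders.
    import pyarrow as pa
    import pyarrow.parquet as pq

    batch = pa.table({
        "student_id":   [101, 102],
        "exercise_id":  [7, 7],
        "score":        [0.85, 0.92],
        "collected_at": ["2020-02-27T10:00:00Z", "2020-02-27T10:01:00Z"],
    })

    # write_table lays out the column chunks, row groups, and footer in one
    # pass; there is no append API, so an update means rewriting the file.
    pq.write_table(batch, "scores-2020-02-27.parquet")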

A common pattern is to write data into Kafka as it arrives. Then, run your ETL 
step every so often (once a day, say). In the meantime, stream data to a 
short-term store. For scores, since each (student, exercise) pair is unique, you 
probably don't have to accumulate multiple events for the same exercise or test?
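
The ingest side could look something like this sketch with kafka-python (the topic 
name, broker address, and event fields are assumptions, purely for illustration):

    # Sketch only: push each score event into Kafka as it arrives.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def record_score(school_id, student_id, exercise_id, score, collected_at):
        # Keying by (student, exercise) keeps all versions of one record in
        # the same partition, in order.
        producer.send(
            "scores",
            key=f"{student_id}:{exercise_id}",
            value={
                "school_id": school_id,
                "student_id": student_id,
                "exercise_id": exercise_id,
                "score": score,
                "collected_at": collected_at,
            },
        )

    # On shutdown, producer.flush() makes sure buffered events reach the broker.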

Depending on your constraints and budget, you can accumulate current activity 
into a DB, an in-memory store, etc. Or, if you can find a good partitioning 
scheme that keeps each Kafka partition relatively small, you can even query 
Kafka directly (Drill has a Kafka plugin). In this model, Kafka would hold, 
say, today's data and Parquet would hold past data, partitioned in some way so 
that you scan only one Parquet file per day per school.
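
The rollup job can be as simple as draining the topic once a day and writing the 
day's events under a day/school directory layout. Again, just a sketch; the topic, 
paths, and field names are placeholders:

    # Sketch only: daily rollup from Kafka into one Parquet file per day per
    # school. Topic, paths, and field names are assumptions.
    import collections, json, os
    import pyarrow as pa
    import pyarrow.parquet as pq
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "scores",
        bootstrap_servers="kafka:9092",
        group_id="daily-etl",
        auto_offset_reset="earliest",
        enable_auto_commit=False,
        consumer_timeout_ms=10000,   # stop once the backlog is drained
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    by_school = collections.defaultdict(list)
    for msg in consumer:
        by_school[msg.value["school_id"]].append(msg.value)

    day = "2020-02-27"   # placeholder; derive it from the event timestamps
    for school_id, events in by_school.items():
        path = f"scores/day={day}/school={school_id}/data.parquet"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        pq.write_table(pa.Table.from_pylist(events), path)

    # Only advance the offsets after the files are safely written; if the ETL
    # fails, the data is still in Kafka and the job can simply be re-run.
    consumer.commit()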

The Kafka component is essential for many reasons, including those above. It 
also gives you a rock-solid guarantee that you never lose data, even if your ETL 
step fails.


So:

External source --> Kafka --> ETL --> Parquet

And

... Kafka --> "Current Data" Cache

Both can then be queried by Drill to combine results as you described in an 
earlier message.
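
On the query side, one Drill statement can union the two stores and keep only the 
newest row per (student, exercise) with a window function. Here's a sketch that 
submits it through Drill's REST API; the storage plugin and workspace names 
(dfs.scores, kafka) are placeholders for whatever your configuration uses:

    # Sketch only: one Drill query over both stores, deduplicated with a
    # window function. The table names below are placeholders for your own
    # storage plugin / workspace setup.
    import requests

    QUERY = """
    SELECT student_id, exercise_id, score, collected_at
    FROM (
      SELECT u.*,
             ROW_NUMBER() OVER (PARTITION BY student_id, exercise_id
                                ORDER BY collected_at DESC) AS rn
      FROM (
        SELECT student_id, exercise_id, score, collected_at
        FROM dfs.scores.`archive`          -- past data in Parquet
        UNION ALL
        SELECT student_id, exercise_id, score, collected_at
        FROM kafka.`scores`                -- today's data, still in Kafka
      ) u
    ) ranked
    WHERE rn = 1
    """

    resp = requests.post(
        "http://drillbit:8047/query.json",
        json={"queryType": "SQL", "query": QUERY},
    )
    resp.raise_for_status()
    print(resp.json()["rows"])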


The best reference I've seen for this kind of thing is Designing Data-Intensive 
Applications by Martin Kleppmann [2] (avoid the knock-off with nearly the same 
title).

Thanks,
- Paul


[1] http://lambda-architecture.net/
[2] 
https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/
 

    On Thursday, February 27, 2020, 11:37:45 AM PST, Dobes Vandermeer 
<[email protected]> wrote:  
 
 Hi,

I am trying to figure out a system that offers low latency in 
generating reports, low latency between data being collected and being 
available in the reporting system, and freedom from glitches and errors.

In our system users are collecting responses from their students.  We want to 
generate reports showing student scores over time, and the scores should 
typically be available within a minute of being collected.

I am looking to store the data tables in Parquet on S3 and query them using 
Drill.

However, updating Parquet files can be a bit troublesome.  The files cannot 
easily be appended to, so some process has to periodically re-write the 
Parquet files.  Also, we don't want to have hundreds or thousands of separate 
files, as this can slow down query execution.  So we don't want to end up with 
a new file every 10 seconds.

What I have been thinking is to have a process that runs which writes changes 
fairly frequently to small new files and another process that rolls up those 
small files into progressively larger ones as they get older.

When querying the data I will have to de-duplicate and keep only the most 
recent version of each record, which I think is possible using window 
functions.  Thus the file aggregation process might not have to worry about 
having the exact same row in two files temporarily.

I'm wondering if anyone has gone down this road before and has insights to 
share about it.
  
