Hi Dobes,
Also, if Ted is still lurking on this list, he's an expert at this stuff. Here are some patterns I've seen.

What you describe is a pretty standard pattern. Substitute anything for "scores" (logs, sales, clicks, GPS tracking locations) and you find that many folks have solved the same issue. The general approach is either a single system (some kind of DB) or a two-part system (the lambda architecture [1]) such as you described earlier.

As you've discovered, Parquet is designed to be an archive format, not a live-update format. As you've seen, considerable work goes in at write time to organize data into columns, row groups, footers, etc. to speed up read access. But, as a result, Parquet cannot be updated incrementally.

A common pattern is to write data into Kafka as it arrives. Then, run your ETL step every so often (once a day, say). In the meantime, stream data to a short-term store. For scores, each (student, exercise) pair is unique, so you probably don't have to accumulate multiple events for the same exercise or test? Depending on your constraints and budget, you can accumulate current activity into a DB, an in-memory store, etc. Or, if you can find a good partitioning scheme that keeps each Kafka partition relatively small, you can even query Kafka directly (Drill has a Kafka plugin). In this model, Kafka would hold, say, today's data and Parquet would hold past data (partitioned in some way so you scan only one Parquet file per day per school, say). The Kafka component is essential for many reasons, including those above. It also gives you a rock-solid guarantee that you never lose data, even if your ETL fails, and so on.

So:

  External source --> Kafka --> ETL --> Parquet

And ...

  Kafka --> "Current Data" Cache

Both can then be queried by Drill to combine results as you described in an earlier message.
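To make that concrete, here is a rough sketch of the ingest side in Python, using the kafka-python client. The broker address, topic name ("scores"), and field names are just placeholders for whatever your system actually uses:

    from kafka import KafkaProducer   # pip install kafka-python
    import json, time

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # One event per (student, exercise) score as it arrives from the app.
    event = {
        "student_id": 123,
        "exercise_id": 456,
        "school_id": "school-1",
        "score": 0.9,
        "updated_at": int(time.time() * 1000),   # ms since epoch
    }

    # Keying by school keeps a school's events together in one partition,
    # which helps if you later query "today's data" for one school from Kafka.
    producer.send("scores", key=event["school_id"].encode("utf-8"), value=event)
    producer.flush()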
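The periodic ETL step could then drain the topic and write one Parquet file per school per day. This is only a sketch, assuming kafka-python and a recent pyarrow, JSON events shaped as above, and S3 credentials already configured for pyarrow; the bucket and paths are made up:

    import json, uuid
    from collections import defaultdict
    from datetime import datetime, timezone

    from kafka import KafkaConsumer
    import pyarrow as pa
    import pyarrow.parquet as pq

    consumer = KafkaConsumer(
        "scores",
        bootstrap_servers="localhost:9092",
        group_id="scores-etl",
        auto_offset_reset="earliest",
        enable_auto_commit=False,
        consumer_timeout_ms=10_000,   # stop iterating once the topic is drained
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    # Bucket this run's events by (school, day) so each file covers one partition.
    batches = defaultdict(list)
    for msg in consumer:
        event = msg.value
        day = datetime.fromtimestamp(event["updated_at"] / 1000,
                                     tz=timezone.utc).strftime("%Y-%m-%d")
        batches[(event["school_id"], day)].append(event)

    for (school, day), events in batches.items():
        table = pa.Table.from_pylist(events)
        # One directory per school per day; Drill exposes these as dir0/dir1.
        path = f"s3://my-bucket/scores/{school}/{day}/part-{uuid.uuid4().hex}.parquet"
        pq.write_table(table, path)

    # Advance the Kafka offsets only after the Parquet files are safely written,
    # so a failed run is simply retried rather than losing data.
    consumer.commit()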
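On the query side, Drill can union the Parquet history with the Kafka topic (via the Kafka storage plugin) and de-duplicate with a window function, much as you described in your earlier message. The storage plugin names ("s3", "kafka"), the directory layout, and the use of Drill's REST API from Python are all assumptions you'd adjust to your own setup:

    import requests

    sql = """
    WITH history AS (
      SELECT student_id, exercise_id, score, updated_at
      FROM s3.`scores`                     -- Parquet files written by the ETL
      WHERE dir0 = 'school-1'              -- prune by school directory
    ),
    current_data AS (
      SELECT student_id, exercise_id, score, updated_at
      FROM kafka.`scores`                  -- today's data, via the Kafka plugin
      WHERE school_id = 'school-1'
    )
    SELECT student_id, exercise_id, score, updated_at
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY student_id, exercise_id
                                ORDER BY updated_at DESC) AS rn
      FROM (SELECT * FROM history
            UNION ALL
            SELECT * FROM current_data) AS combined
    ) AS ranked
    WHERE rn = 1
    """

    # Drill's REST API: POST the SQL to a drillbit and read back JSON rows.
    resp = requests.post("http://localhost:8047/query.json",
                         json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    for row in resp.json().get("rows", []):
        print(row)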
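And for the roll-up you describe below (many small files compacted into fewer, larger ones as they age), a separate job can rewrite a partition in place. Another rough sketch with pyarrow, again with made-up bucket and path names; the ordering matters so that a concurrent query may briefly see duplicate rows (which the window function above tolerates) but never a gap:

    import uuid
    import pyarrow.parquet as pq
    import pyarrow.fs as pafs

    s3 = pafs.S3FileSystem()
    partition_dir = "my-bucket/scores/school-1/2020-02-26"

    # Find the small files written since the last roll-up.
    small_files = [f.path for f in s3.get_file_info(pafs.FileSelector(partition_dir))
                   if f.path.endswith(".parquet")]

    if len(small_files) > 1:
        # Read them back as one table and rewrite as a single larger file.
        table = pq.ParquetDataset(small_files, filesystem=s3).read()
        new_file = f"{partition_dir}/rolled-up-{uuid.uuid4().hex}.parquet"
        pq.write_table(table, new_file, filesystem=s3)

        # Delete the small files only after the new file exists; a query that
        # runs in between sees duplicates, not missing rows.
        for path in small_files:
            s3.delete_file(path)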
The best reference I've seen for this kind of thing is Designing Data-Intensive Applications by Martin Kleppmann [2] (avoid the knock-off with nearly the same title).

Thanks,
- Paul

[1] http://lambda-architecture.net/
[2] https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/

On Thursday, February 27, 2020, 11:37:45 AM PST, Dobes Vandermeer <[email protected]> wrote:

Hi,

I am trying to figure out a system that offers low latency in generating reports, low latency between data being collected and being available in the reporting system, and avoids glitches and errors. In our system, users are collecting responses from their students. We want to generate reports showing student scores over time, and the scores should typically be available within a minute of being collected.

I am looking to store the data tables in Parquet on S3 and query them using Drill. However, updating Parquet files can be a bit troublesome. The files cannot easily be appended to, so some process has to periodically re-write the Parquet files. Also, we don't want to have hundreds or thousands of separate files, as this can slow down query execution, so we don't want to end up with a new file every 10 seconds.

What I have been thinking is to have a process that writes changes fairly frequently to small new files, and another process that rolls up those small files into progressively larger ones as they get older. When querying the data I will have to de-duplicate and keep only the most recent version of each record, which I think is possible using window functions. Thus the file aggregation process might not have to worry about having the exact same row in two files temporarily.

I'm wondering if anyone has gone down this road before and has insights to share about it.
