> However, updating parquet files can be a bit troublesome.  The files
> cannot easily be appended to.  So some process has to periodically
> re-write the parquet files.  Also, we don't want to have hundreds or
> thousands of separate files, as this can slow down query execution.
> So we don't want to end up with a new file every 10 seconds.  What I
> have been thinking is to have a process that runs which writes changes
> fairly frequently to small new files and another process that rolls up
> those small files into progressively larger ones as they get older.
> When querying the data I will have to de-duplicate and keep only the
> most recent version of each record, which I think is possible using
> window functions.  Thus the file aggregation process might not have to
> worry about having the exact same row in two files temporarily.  I'm
> wondering if anyone has gone down this road before and has insights to
> share about it.
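
For the de-duplication you describe, a window-function query along these
lines should do it in Drill (just a sketch: the dfs path and the id,
updated_at and payload columns are placeholders for whatever your data
actually has):

    -- keep only the newest version of each record across all parquet files
    SELECT id, updated_at, payload
    FROM (
      SELECT id, updated_at, payload,
             ROW_NUMBER() OVER (PARTITION BY id
                                ORDER BY updated_at DESC) AS rn
      FROM dfs.`/data/mytable`
    ) d
    WHERE rn = 1;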

You might be interested in Delta Lake, which provides an implementation
of the SQL MERGE statement on top of Parquet files. Implementing a Drill
connector on top of it should be feasible. This could be used together
with the hybrid design described by Ted and Paul, and it makes Parquet
more than a static archive.

https://docs.delta.io/latest/delta-intro.html
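
For reference, a merge with Delta Lake looks roughly like this in Spark
SQL (a sketch only; the events and updates tables and the id column are
invented for illustration):

    -- upsert the latest changes into a Delta table backed by parquet
    MERGE INTO events AS target
    USING updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;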

--
nicolas paris
