> However, updating parquet files can be a bit troublesome. The files
> cannot easily be appended to. So some process has to periodically
> re-write the parquet files. Also, we don't want to have hundreds or
> thousands of separate files, as this can slow down query execution.
> So we don't want to end up with a new file every 10 seconds. What I
> have been thinking is to have a process that runs which writes changes
> fairly frequently to small new files and another process that rolls up
> those small files into progressively larger ones as they get older.
> When querying the data I will have to de-duplicate and keep only the
> most recent version of each record, which I think is possible using
> window functions. Thus the file aggregation process might not have to
> worry about having the exact same row in two files temporarily. I'm
> wondering if anyone has gone down this road before and has insights to
> share about it.
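The de-duplication part of that should indeed be doable with a window function. A rough sketch of what the read-side query could look like (record_id, payload, updated_at and the dfs path are placeholder names; adjust to your schema and storage plugin):

    -- keep only the newest version of each record across all parquet files
    SELECT record_id, payload, updated_at
    FROM (
      SELECT record_id, payload, updated_at,
             ROW_NUMBER() OVER (
               PARTITION BY record_id      -- logical key of a record (placeholder)
               ORDER BY updated_at DESC    -- newest version first (placeholder)
             ) AS rn
      FROM dfs.`/data/events`              -- directory of small + rolled-up parquet files
    ) versioned
    WHERE rn = 1;

Because the query keeps only the top-ranked row per key, the rollup process can tolerate the same row sitting in two files for a while, as you describe.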
You might be interested in Delta Lake, which provides an implementation of the SQL MERGE statement on top of parquet files. Implementing a Drill connector on this should be feasible. This could be used together with the hybrid design described by Ted and Paul, and makes parquet more than a static archive. https://docs.delta.io/latest/delta-intro.html
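With delta, the de-duplication happens at write time instead: the merge rewrites the affected parquet files, so the read side no longer needs the window-function filter. Per the delta docs' merge syntax, it looks roughly like this (events, events_updates and record_id are placeholder names):

    -- upsert changed/new rows instead of hand-rolling the small-file + rollup pipeline
    MERGE INTO events AS t
    USING events_updates AS u
      ON t.record_id = u.record_id          -- logical key of a record (placeholder)
    WHEN MATCHED THEN UPDATE SET *          -- replace the existing version of the row
    WHEN NOT MATCHED THEN INSERT *;         -- append rows seen for the first time

-- nicolas paris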
