Dear List,

I am trying to find out why querying Parquet files on S3 takes longer than I think it should.

I have a dummy data set with a couple of random columns, but also 'year', 'month' and 'city'.

This data set is stored on my disk, partitioned by the three columns mentioned above. A file path looks, for example, like this:

synth/year=2020/month=4/city=Karlsruhe/6a260e3e455f4b05b0cf7b4c361de6be.parquet
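(For context, a layout like this can be produced with pyarrow roughly as follows; the non-partition columns below are just placeholders for my random columns.)

###

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder frame; the real data set has many more random columns.
df = pd.DataFrame({
    "year": [2020, 2020],
    "month": [4, 4],
    "city": ["Karlsruhe", "Berlin"],
    "col_44": [0.1, 0.9],
    "col_55": [0.5, 0.2],
})

# Hive-style partitioning by year/month/city, one directory level per key.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="synth",
    partition_cols=["year", "month", "city"],
)

###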

I query this data set like this:

###

import pyarrow.dataset as ds
import duckdb

path = "synth"  # local root of the partitioned data set (see the example path above)

# Discover the hive-partitioned layout (year/month/city become columns).
data_set = ds.dataset(path, partitioning="hive")

# Register the Arrow dataset with DuckDB and run the query against it.
quack = duckdb.arrow(data_set)
res = quack.query("tbl", "SELECT avg(col_55) as avg, count(*) as num_records, city FROM tbl WHERE col_55 > col_44 AND city='Karlsruhe' GROUP BY city")

###

This query works and I think the result is correct.

However, I was wondering why, especially with remote files (using s3fs), it takes really long, even though the query should only need to touch about 10% of the files, since 'city' appears both in the partitioning and in the WHERE clause of the query.
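(For the remote case I open the data set roughly like this; the bucket name/prefix is a placeholder, and credentials come from the environment.)

###

import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()

# Same hive-partitioned layout, just rooted in an S3 bucket.
data_set = ds.dataset(
    "my-bucket/synth",
    filesystem=fs,
    partitioning="hive",
    format="parquet",
)

###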

So I ran my query with 'strace':

$ strace -f -t -e trace=file my-command

The result was interesting:

* I had 'stat' calls on all of the files (which I expected).
* But apart from that, I also saw 'openat' calls on more files than just the 'city=Karlsruhe' ones, though not on all of them. Of the 10 cities I have, only 'Berlin' and 'Braunschweig' were read (and only about half of their Parquet files), plus 'Karlsruhe', of course.

I would be interested to know what the expected behavior is, and whether there is a way to really load (remote) data on demand, where the demand is determined by DuckDB based on the query.
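(To illustrate what I mean by loading on demand: I can of course push the filter down at the Arrow level myself, roughly like the sketch below, but I would hope DuckDB can do this automatically based on the query. This version also materializes the filtered partitions as a table first.)

###

import pyarrow.dataset as ds
import duckdb

data_set = ds.dataset(path, partitioning="hive")

# Pre-filter at the Arrow level so only the 'city=Karlsruhe' fragments are scanned.
table = data_set.to_table(filter=ds.field("city") == "Karlsruhe")

quack = duckdb.arrow(table)
res = quack.query(
    "tbl",
    "SELECT avg(col_55) AS avg, count(*) AS num_records, city "
    "FROM tbl WHERE col_55 > col_44 GROUP BY city",
)

###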

Thanks,
Philipp
