Dear List,

I am trying to find out why querying Parquet files on S3 takes longer than I think it should.

I have a dummy data set with a couple of random columns, but also 'year', 'month' and 'city'.

This data set is stored on my disk, partitioned by the three columns mentioned above. A file path looks, for example, like this:

synth/year=2020/month=4/city=Karlsruhe/6a260e3e455f4b05b0cf7b4c361de6be.parquet
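(For context, a layout like this can be produced with pyarrow roughly as follows; the non-partition columns below are just placeholders for my random columns.)

###

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder frame; the real data set has many more random columns.
df = pd.DataFrame({
    "year": [2020, 2020],
    "month": [4, 4],
    "city": ["Karlsruhe", "Berlin"],
    "col_44": [0.1, 0.9],
    "col_55": [0.5, 0.2],
})

# Hive-style partitioning by year/month/city, one directory level per key.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="synth",
    partition_cols=["year", "month", "city"],
)

###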

I query this data set like this:

###

import pyarrow.dataset as ds
import duckdb

path = "synth"  # local root of the partitioned data set (see the example path above)

# Discover the hive-partitioned layout (year/month/city become columns).
data_set = ds.dataset(path, partitioning="hive")

# Register the Arrow dataset with DuckDB and run the query against it.
quack = duckdb.arrow(data_set)
res = quack.query("tbl", "SELECT avg(col_55) as avg, count(*) as num_records, city FROM tbl WHERE col_55 > col_44 AND city='Karlsruhe' GROUP BY city")

###

This query works and I think the result is correct.

However, I was wondering why, especially with remote files (using s3fs), it takes really long, even though the query should only need to touch about 10% of the files, since 'city' appears both in the partitioning and in the WHERE clause of the query.
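(For the remote case I open the data set roughly like this; the bucket name/prefix is a placeholder, and credentials come from the environment.)

###

import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()

# Same hive-partitioned layout, just rooted in an S3 bucket.
data_set = ds.dataset(
    "my-bucket/synth",
    filesystem=fs,
    partitioning="hive",
    format="parquet",
)

###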

So I ran my query with 'strace':

$ strace -f -t -e trace=file my-command

The result was interesting:

* I had 'stat' calls on all of the files (which I expected).
* But apart from that, I also saw 'openat' calls on more files than just the 'city=Karlsruhe' ones, though not on all of them. Of the 10 cities I have, only 'Berlin' and 'Braunschweig' were read (and only about half of their Parquet files), plus 'Karlsruhe', of course.

I would be interested to know what the expected behavior is, and whether there is a way to really load (remote) data on demand, where the demand is determined by DuckDB based on the query.
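(To illustrate what I mean by loading on demand: I can of course push the filter down at the Arrow level myself, roughly like the sketch below, but I would hope DuckDB can do this automatically based on the query. This version also materializes the filtered partitions as a table first.)

###

import pyarrow.dataset as ds
import duckdb

data_set = ds.dataset(path, partitioning="hive")

# Pre-filter at the Arrow level so only the 'city=Karlsruhe' fragments are scanned.
table = data_set.to_table(filter=ds.field("city") == "Karlsruhe")

quack = duckdb.arrow(table)
res = quack.query(
    "tbl",
    "SELECT avg(col_55) AS avg, count(*) AS num_records, city "
    "FROM tbl WHERE col_55 > col_44 GROUP BY city",
)

###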

Thanks,
Philipp
