Hi guys, I have started implementing a Parquet pushdown filtering optimizer rule and have made significant progress. Borrowing from the Mongo pushdown filtering code, I was able to quickly convert logical expressions into the corresponding Parquet filter2 API predicates.
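
To give a concrete idea of the mapping, here is a minimal sketch of the kind of filter2 predicate this produces (hypothetical columns "a" and "b" for a condition like a > 10 AND b = 'x'; the package names below are from the org.apache.parquet namespace, older parquet-mr releases use parquet.filter2.* instead):

    import org.apache.parquet.filter2.predicate.FilterApi;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.io.api.Binary;

    public class Filter2Sketch {
      // Builds the filter2 predicate for: a > 10 AND b = 'x'
      public static FilterPredicate build() {
        FilterPredicate aGt10 = FilterApi.gt(FilterApi.intColumn("a"), 10);
        FilterPredicate bEqX = FilterApi.eq(FilterApi.binaryColumn("b"),
            Binary.fromString("x"));
        return FilterApi.and(aGt10, bEqX);
      }
    }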
The issue is that because the "old" (flat) Parquet reader reads values out in batches, and we don't want to do per-value inspection at that stage, I am not actually removing the FilterPrel from the plan. The idea is that we should be able to quickly exclude row groups that don't match the filter condition, which could significantly reduce the number of rows being read and hence improve performance. Of course, this relies on the filter actually having a good chance of eliminating those row groups (which may or may not be the case, depending on the dataset and the filter).

My problem is that because I don't remove the FilterPrel, the plan with the filter pushed down has effectively the same cost as the plan without it, which means the optimizer simply picks the first plan. The cost calculation in ScanPrel is fairly simplistic in that it looks purely at the overall row count and column count. Moreover, pushing the filter down may not in fact be any better (it could even be slightly worse, given the extra CPU work of checking row group statistics), but it is most likely a fair bit better. The only real way to tell would be to read all the Parquet files first, get an idea of how many row groups would be excluded, and then update the cost, but that seems a bit crazy at the planning stage.

I'd like to get some input on how we think this could be managed. I don't want to hack the cost calculations to artificially make the pushed-down filter seem less expensive. Any thoughts would be great.
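
For reference, the row group elimination I'm describing would look roughly like this on the reader side. This is only a sketch against the parquet-mr filter2 row group filter (RowGroupFilter.filterRowGroups drops row groups whose footer min/max statistics cannot satisfy the predicate); exact packages and signatures depend on the parquet-mr version we are on:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.filter2.compat.FilterCompat;
    import org.apache.parquet.filter2.compat.RowGroupFilter;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class RowGroupPruneSketch {
      // Keeps only the row groups whose statistics could match the pushed-down predicate.
      public static List<BlockMetaData> prune(Configuration conf, Path file,
          FilterPredicate pred) throws IOException {
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
        return RowGroupFilter.filterRowGroups(
            FilterCompat.get(pred), footer.getBlocks(),
            footer.getFileMetaData().getSchema());
      }
    }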
