Hi guys, I have started implementing a Parquet pushdown filtering optimizer rule and have made significant progress. Borrowing from the Mongo pushdown filtering code, I was able to quickly convert logical expressions into the corresponding Parquet filter2 API predicates.
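
To give a concrete idea of the mapping, here is a minimal sketch of the kind of filter2 predicate this produces (hypothetical columns "a" and "b" for a condition like a > 10 AND b = 'x'; the package names below are from the org.apache.parquet namespace, older parquet-mr releases use parquet.filter2.* instead):

    import org.apache.parquet.filter2.predicate.FilterApi;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.io.api.Binary;

    public class Filter2Sketch {
      // Builds the filter2 predicate for: a > 10 AND b = 'x'
      public static FilterPredicate build() {
        FilterPredicate aGt10 = FilterApi.gt(FilterApi.intColumn("a"), 10);
        FilterPredicate bEqX = FilterApi.eq(FilterApi.binaryColumn("b"),
            Binary.fromString("x"));
        return FilterApi.and(aGt10, bEqX);
      }
    }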
The issue is that because the "old" (flat) Parquet reader reads values out in batches, and we don't want to do per-value inspection at that stage, I am not actually removing the FilterPrel from the plan. The idea is that we should be able to quickly exclude row groups that don't match the filter condition, which could significantly reduce the number of rows being read and hence improve performance. Of course, this relies on the filter actually having a good chance of eliminating those row groups (which may or may not be the case, depending on the dataset and the filter).

My problem is that because I don't remove the FilterPrel, the plan with the filter pushed down has effectively the same cost as the plan without it, which means the optimizer simply picks the first plan. The cost calculation in ScanPrel is fairly simplistic in that it looks purely at the overall row count and column count. Moreover, pushing the filter down may not in fact be any better (it could even be slightly worse, given the extra CPU work of checking row group statistics), but it is most likely a fair bit better. The only real way to tell would be to read all the Parquet files first, get an idea of how many row groups would be excluded, and then update the cost, but that seems a bit crazy at the planning stage.

I'd like to get some input on how we think this could be managed. I don't want to hack the cost calculations to artificially make the pushed-down filter seem less expensive. Any thoughts would be great.
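
For reference, the row group elimination I'm describing would look roughly like this on the reader side. This is only a sketch against the parquet-mr filter2 row group filter (RowGroupFilter.filterRowGroups drops row groups whose footer min/max statistics cannot satisfy the predicate); exact packages and signatures depend on the parquet-mr version we are on:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.filter2.compat.FilterCompat;
    import org.apache.parquet.filter2.compat.RowGroupFilter;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class RowGroupPruneSketch {
      // Keeps only the row groups whose statistics could match the pushed-down predicate.
      public static List<BlockMetaData> prune(Configuration conf, Path file,
          FilterPredicate pred) throws IOException {
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
        return RowGroupFilter.filterRowGroups(
            FilterCompat.get(pred), footer.getBlocks(),
            footer.getFileMetaData().getSchema());
      }
    }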
