Hi Arrow community, I'm new to the project and am trying to understand exactly what is happening under the hood when I run a filter-collect query on an Arrow Dataset (backed by Parquet).
Let's say I created a Parquet dataset with no file-level partitions. I just wrote a bunch of separate files to a dataset. Now I want to run a query that returns the rows corresponding to a specific range of datetimes in the dataset's dt column. My understanding is that the Dataset API will push this query down to the file level, checking the footer of each file for the min/max value of dt and determining whether this block of rows should be read. Assuming this is correct, a few questions: Will every query result in the reading all of the file footers? Is there any caching of these min/max values? Is there a way to profile query performance? A way to view a query plan before it is executed? I appreciate your time in helping me better understand. Andrew
