Hi Andrew,

Thanks for your questions! Some inline answers:

On Sun, 15 Nov 2020 at 23:57, Andrew Campbell <[email protected]> wrote:

> Hi Arrow community,
>
> I'm new to the project and am trying to understand exactly what is
> happening under the hood when I run a filter-collect query on an Arrow
> Dataset (backed by Parquet).
>
> Let's say I created a Parquet dataset with no file-level partitions. I
> just wrote a bunch of separate files to a dataset. Now I want to run a
> query that returns the rows corresponding to a specific range of datetimes
> in the dataset's dt column.
>
> My understanding is that the Dataset API will push this query down to the
> file level, checking the footer of each file for the min/max value of dt
> and determining whether this block of rows should be read.

Indeed, that understanding is correct. (A small example of such a query is
sketched at the end of this mail.)

> Assuming this is correct, a few questions:
>
> Will every query result in reading all of the file footers? Is there
> any caching of these min/max values?

If you are using the same dataset object to do multiple queries, then the
FileMetadata read from the file footers is indeed cached after it is read a
first time (see the P.S. below).

> Is there a way to profile query performance? A way to view a query plan
> before it is executed?

No, at least not yet. I suppose once there is more work on general query
execution (and not only reading/filtering), more tooling will grow up
around it. For now, you will have to make do with general-purpose
profiling tools (for Python I can recommend py-spy, which has a mode to
profile native code as well, not just Python calls; an example invocation
is at the end of this mail).

Best,
Joris

> I appreciate your time in helping me better understand.
>
> Andrew
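P.S. For concreteness, a minimal sketch of the kind of filter-collect
query discussed above, using pyarrow's dataset module. The directory name
"my_dataset/" and the date range are hypothetical placeholders:

    import datetime

    import pyarrow.dataset as ds

    # A directory of Parquet files, written without any partitioning
    dataset = ds.dataset("my_dataset/", format="parquet")

    # The filter expression is pushed down to the file / row group level:
    # row groups whose dt min/max statistics fall entirely outside the
    # requested range can be skipped without reading their data pages
    start = datetime.datetime(2020, 1, 1)
    end = datetime.datetime(2020, 2, 1)
    table = dataset.to_table(
        filter=(ds.field("dt") >= start) & (ds.field("dt") < end)
    )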
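To benefit from the FileMetadata caching mentioned above, keep reusing the
same dataset object for subsequent queries. Continuing the sketch:

    # Reuses the footers parsed during the first query; calling
    # ds.dataset(...) again would build a fresh object and re-read them
    table2 = dataset.to_table(filter=ds.field("dt") >= end)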
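And a sketch of profiling with py-spy, where "query.py" stands in for a
hypothetical script that runs your dataset query (py-spy is installed
separately, e.g. via pip):

    # Record a flame graph, sampling native (C++) frames as well as
    # Python ones
    py-spy record --native -o profile.svg -- python query.py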
