The Parquet C++ reader does a hardcoded 64 KiB (65536-byte) read [1]. The idea is that, instead of making two round trips to read the footer (one to read its size, one to read the data), the reader optimistically reads the tail of the file. If the footer fits within the last 64 KiB, that saves a round trip.
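
To make the trade-off concrete, here is a rough Python sketch of an optimistic tail read. It is not the Arrow implementation. The function name `read_footer` and the `optimistic_read` parameter are hypothetical, made up for illustration. The only format fact it relies on is that a Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`.

```python
import struct

FOOTER_TAIL = 8            # 4-byte little-endian metadata length + b"PAR1"
DEFAULT_READ = 64 * 1024   # the 65536-byte optimistic read size


def read_footer(f, file_size, optimistic_read=DEFAULT_READ):
    """Return (metadata_bytes, number_of_reads) for a seekable binary file.

    Hypothetical helper: `optimistic_read` stands in for the knob that is
    currently hardcoded in the reader.
    """
    if file_size < FOOTER_TAIL:
        raise ValueError("file too small to be Parquet")

    # First (optimistic) read: grab the tail of the file in one request.
    tail_len = min(file_size, max(optimistic_read, FOOTER_TAIL))
    f.seek(file_size - tail_len)
    tail = f.read(tail_len)
    reads = 1

    if tail[-4:] != b"PAR1":
        raise ValueError("missing PAR1 magic at end of file")
    (meta_len,) = struct.unpack("<I", tail[-8:-4])
    if meta_len + FOOTER_TAIL > file_size:
        raise ValueError("metadata length exceeds file size")

    # Lucky case: the whole footer was already inside the tail we read.
    if meta_len + FOOTER_TAIL <= tail_len:
        start = tail_len - FOOTER_TAIL - meta_len
        return tail[start:tail_len - FOOTER_TAIL], reads

    # Otherwise a second round trip is needed to fetch the metadata.
    f.seek(file_size - FOOTER_TAIL - meta_len)
    return f.read(meta_len), reads + 1
```

With a footer smaller than `optimistic_read` this does one read; with a larger footer it does two, which matches the "65536, then the footer size" pattern described in the quoted message.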
It is not configurable. I think it would be reasonable to let it be configurable. I believe I've seen systems where this was patched as a workaround.

[1]: https://github.com/apache/arrow/blob/1c9a3122c603e2766a793766a11ff4c706efb2aa/cpp/src/parquet/file_reader.cc#L87-L88

On Thu, Mar 7, 2024, at 09:45, Felipe Oliveira Carvalho wrote:
> 1. the first read is always 65536, then it is followed by read of the
> size of parquet.
>
> This might be a constant inside adlfs or the Azure SDK itself (?). I
> don't know from the top of my head if Parquet always reads 64k or
> that's an Azure SDK thing.
>
> 2. looks like parquet footer is read on almost every subsequent call
>
> It might be a good idea to post a sample of code so the meaning of
> "subsequent call" becomes more clear. Caching can be problematic
> because it's easy to use too much memory with data that doesn't get
> re-used and/or become outdated compared to the source.
>
> PS: Arrow 16 (the next release) is going to have almost-complete Azure
> Data Lake FS support built-in [1] which might allow us to tweak the
> way it interacts with Parquet reader more deeply.
>
> --
> Felipe
>
> [1] https://github.com/apache/arrow/issues/18014 (Python bindings and
> URI parsing are still work in progress)
>
> On Tue, Mar 5, 2024 at 2:44 PM Jacek Pliszka <[email protected]> wrote:
>>
>> Hi!
>>
>> I have noticed 2 things while using
>> pyarrow.dataset.dataset with ADLFS with parquet and I wonder if this is
>> something
>> worth opening a ticket for.
>>
>> 1. the first read is always 65536, then it is followed by read of the size
>> of parquet.
>> I wonder if there is a way to have the size of the first read defined and
>> have just 1 read.
>> I pretty much know how large is the footer in parquet files I am getting
>> and I would like to read it in one request.
>>
>> 2. looks like parquet footer is read on almost every subsequent call . Is
>> there a way to cache
>> parquet footer so it is not read every time?
>>
>> Thanks in advance for your insights,
>>
>> Jacek
