Thank you for your response. My footers are 110-132k, so I always get the two-call pattern: a 65k read followed by a ~120k read.
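
To make the pattern concrete, here is a minimal sketch in plain Python (no pyarrow involved) of what an optimistic tail read looks like. It assumes only the Parquet trailer layout: footer metadata, then a 4-byte little-endian footer length, then the `PAR1` magic. The names `read_footer`, `fake_file` and the `tail_size` parameter are hypothetical and are not pyarrow or Arrow C++ APIs, and the fake footer is filler bytes rather than real Thrift metadata.

```python
import io
import struct

MAGIC = b"PAR1"
TAIL_SIZE = 64 * 1024  # mirrors the 65k read seen in this thread (assumed value)


def read_footer(f, file_size, tail_size=TAIL_SIZE):
    """Return (footer_bytes, read_sizes) using an optimistic tail read."""
    reads = []
    n = min(tail_size, file_size)
    if n < 8:
        raise ValueError("file too small to hold a Parquet trailer")

    # First request: grab the tail and hope the whole footer is inside it.
    f.seek(file_size - n)
    tail = f.read(n)
    reads.append(n)

    if tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])

    if footer_len + 8 <= n:
        # Footer fits in the tail: one round trip was enough.
        return tail[n - 8 - footer_len : n - 8], reads

    # Footer is larger than the tail: a second request is needed.
    f.seek(file_size - 8 - footer_len)
    footer = f.read(footer_len)
    reads.append(footer_len)
    return footer, reads


def fake_file(footer_len, data_len=300_000):
    """Build bytes shaped like a Parquet file (filler data and filler footer)."""
    body = MAGIC + b"\x00" * data_len + b"\x01" * footer_len
    return body + struct.pack("<I", footer_len) + MAGIC


blob = fake_file(120_000)
_, reads = read_footer(io.BytesIO(blob), len(blob))
print(reads)  # [65536, 120000]
_, reads = read_footer(io.BytesIO(blob), len(blob), tail_size=128 * 1024)
print(reads)  # [131072]
```

With a 120k footer the 65k tail misses and a second request follows, which matches what I observe. A larger tail would turn this into a single request, though for my largest footers (around 132k) even 128k would not quite be enough, so a configurable value would need some headroom.
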
By subsequent calls I mean calls that read specific columns; I never read all columns at once. I usually read the PK column first, then make 1 or 2 calls for the FK columns, and a final call for the data columns. Each of these calls starts with the same two reads: the 65k one and then the 120k one.

BR,
J

On Thu, 7 Mar 2024 at 15:58, David Li <[email protected]> wrote:

> There is a hardcoded 65k read [1]. The idea being, instead of making two
> round trips to read the footer (one to read the size, one to read the
> data), you can optimistically read the tail of the file. If the footer is
> in the last 65k, then you save a round trip.
>
> It is not configurable. I think it would be reasonable to let it be
> configurable. I believe I've seen systems where this was patched as a
> workaround.
>
> [1]:
> https://github.com/apache/arrow/blob/1c9a3122c603e2766a793766a11ff4c706efb2aa/cpp/src/parquet/file_reader.cc#L87-L88
>
> On Thu, Mar 7, 2024, at 09:45, Felipe Oliveira Carvalho wrote:
> > 1. the first read is always 65536, then it is followed by read of the
> > size of parquet.
> >
> > This might be a constant inside adlfs or the Azure SDK itself (?). I
> > don't know from the top of my head if Parquet always reads 64k or
> > that's an Azure SDK thing.
> >
> > 2. looks like parquet footer is read on almost every subsequent call
> >
> > It might be a good idea to post a sample of code so the meaning of
> > "subsequent call" becomes more clear. Caching can be problematic
> > because it's easy to use too much memory with data that doesn't get
> > re-used and/or become outdated compared to the source.
> >
> > PS: Arrow 16 (the next release) is going to have almost-complete Azure
> > Data Lake FS support built-in [1] which might allow us to tweak the
> > way it interacts with Parquet reader more deeply.
> >
> > --
> > Felipe
> >
> > [1] https://github.com/apache/arrow/issues/18014 (Python bindings and
> > URI parsing are still work in progress)
> >
> > On Tue, Mar 5, 2024 at 2:44 PM Jacek Pliszka <[email protected]> wrote:
> >>
> >> Hi!
> >>
> >> I have noticed 2 things while using
> >> pyarrow.dataset.dataset with ADLFS with parquet and I wonder if this is something
> >> worth opening a ticket for.
> >>
> >> 1. the first read is always 65536, then it is followed by read of the size of parquet.
> >> I wonder if there is a way to have the size of the first read defined and have just 1 read.
> >> I pretty much know how large is the footer in parquet files I am getting
> >> and I would like to read it in one request.
> >>
> >> 2. looks like parquet footer is read on almost every subsequent call. Is there a way to cache
> >> parquet footer so it is not read every time?
> >>
> >> Thanks in advance for your insights,
> >>
> >> Jacek
