Re: Fine tuning pyarrow.dataset.dataset with adlfs

Jacek Pliszka Thu, 07 Mar 2024 07:03:39 -0800

Thank you for your response,

My footers are 110-132k, so I always get the same two-call pattern: a 65k
read followed by a ~120k read.
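
To illustrate why a footer larger than 64 KiB produces two reads, here is a
minimal sketch of the optimistic tail read described in the quoted reply
below. It is not Arrow's code: the function name plan_footer_reads, the
helper make_fake_parquet, and the tail_size parameter are hypothetical. It
relies only on the Parquet file layout (the file ends with a 4-byte
little-endian footer length followed by the magic bytes PAR1).

```python
import io
import struct

DEFAULT_TAIL_SIZE = 64 * 1024  # 65536 bytes, the size of the first read
MAGIC = b"PAR1"
TRAILER_SIZE = 8  # 4-byte footer length + 4-byte magic


def plan_footer_reads(f, file_size, tail_size=DEFAULT_TAIL_SIZE):
    """Return the (offset, length) reads needed to fetch the footer."""
    tail_len = min(tail_size, file_size)
    f.seek(file_size - tail_len)
    tail = f.read(tail_len)
    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    reads = [(file_size - tail_len, tail_len)]
    if footer_len + TRAILER_SIZE > tail_len:
        # The footer did not fit in the tail: a second round trip is needed.
        reads.append((file_size - TRAILER_SIZE - footer_len, footer_len))
    return reads


def make_fake_parquet(footer_len, body_len=300_000):
    """Build bytes with a Parquet-like trailer (hypothetical test helper)."""
    trailer = struct.pack("<I", footer_len) + MAGIC
    return MAGIC + b"\0" * body_len + b"\1" * footer_len + trailer


buf = make_fake_parquet(120_000)
# With the default 64 KiB tail, a 120k footer needs two reads.
two_reads = plan_footer_reads(io.BytesIO(buf), len(buf))
# With a (hypothetical) 128 KiB tail, the same footer needs one read.
one_read = plan_footer_reads(io.BytesIO(buf), len(buf), tail_size=128 * 1024)
```

With a configurable tail size set above the known footer size, the second
read would disappear, which is what I am asking for.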


By subsequent calls I mean calls to read specific columns. I never read
all columns at once. I usually read the PK column first,
then make 1 or 2 calls for the FK columns and a final one for the data columns.

Each of these calls starts with the same 2 reads: first the 65k one, then
the 120k one.

BR

J

On Thu, 7 Mar 2024 at 15:58, David Li <[email protected]> wrote:

> There is a hardcoded 65k read [1]. The idea being, instead of making two
> round trips to read the footer (one to read the size, one to read the
> data), you can optimistically read the tail of the file. If the footer is
> in the last 65k, then you save a round trip.
>
> It is not configurable. I think it would be reasonable to let it be
> configurable. I believe I've seen systems where this was patched as a
> workaround.
>
> [1]:
> https://github.com/apache/arrow/blob/1c9a3122c603e2766a793766a11ff4c706efb2aa/cpp/src/parquet/file_reader.cc#L87-L88
>
> On Thu, Mar 7, 2024, at 09:45, Felipe Oliveira Carvalho wrote:
> > 1. the first read is always 65536, then it is followed by read of the
> > size of parquet.
> >
> > This might be a constant inside adlfs or the Azure SDK itself (?). I
> > don't know off the top of my head if Parquet always reads 64k or
> > that's an Azure SDK thing.
> >
> > 2. looks like parquet footer is read on almost every subsequent call
> >
> > It might be a good idea to post a sample of code so the meaning of
> > "subsequent call" becomes more clear. Caching can be problematic
> > because it's easy to use too much memory with data that doesn't get
> > re-used and/or become outdated compared to the source.
> >
> > PS: Arrow 16 (the next release) is going to have almost-complete Azure
> > Data Lake FS support built-in [1] which might allow us to tweak the
> > way it interacts with Parquet reader more deeply.
> >
> > --
> > Felipe
> >
> > [1] https://github.com/apache/arrow/issues/18014 (Python bindings and
> > URI parsing are still work in progress)
> >
> > On Tue, Mar 5, 2024 at 2:44 PM Jacek Pliszka <[email protected]> wrote:
> >>
> >> Hi!
> >>
> >> I have noticed 2 things while using pyarrow.dataset.dataset with ADLFS
> >> with parquet and I wonder if this is something worth opening a ticket for.
> >>
> >> 1. the first read is always 65536 bytes, then it is followed by a read
> >> of the size of the parquet footer.
> >> I wonder if there is a way to define the size of the first read and
> >> have just 1 read.
> >> I pretty much know how large the footer is in the parquet files I am
> >> getting, and I would like to read it in one request.
> >>
> >> 2. looks like parquet footer is read on almost every subsequent call.
> >> Is there a way to cache parquet footer so it is not read every time?
> >>
> >> Thanks in advance for your insights,
> >>
> >> Jacek
> >>
> >>
>
