> interesting. just curious, but is there a way to do it without extending the
> format? like it would still scan through all the batches but it wouldn’t load
> all rows into memory or maybe it’s already possible? for example, if i write
> a 1GB arrow table as a feather file and it’s 1GB uncompressed on disk. can i
> read_table on that file but apply filtering as it scans so the resulting
> table is never 1GB? not sure if this conflicts with zero-copy and/or SIMD
> design.
That is definitely possible today. In pyarrow you can get this functionality
with the datasets API, and it doesn't matter whether you use parquet or
feather. This is done automatically by pyarrow.dataset.Dataset.to_table and
pyarrow.dataset.Dataset.to_batches if you provide a filter. The datasets API
will first try to use the file format to reduce the amount of data it has to
load (predicate pushdown). If it cannot do this (e.g. with the feather
format), then it will still apply the filter in a streaming / out-of-core
fashion.

One important note is that your feather file will need to have more than one
row group for this to work. If you have a 100GB feather file stored as a
single record batch and you try to apply a filter to it, then it will occupy
100GB of RAM (note: this is a limitation of the datasets API / IPC reader and
not of the feather file format, as the feather file format supports random
slicing). If you have a 100GB feather file stored as 1,000 record batches of
100MB each, then it will never occupy 100GB of RAM.

Maybe size your record batches so that you can fit about 100 of them into RAM
(note: how many batches are actually held in memory depends on how many files
you have, how many record batches each file has, your file format, your
fragment readahead parameter, and the batch readahead parameter; these last
two parameters are not exposed in pyarrow today). If you only have one file
it shouldn't load more than ~32 batches into RAM at once.

Finally, the resulting table (if you are running to_table) will not be 1GB.
The filter passed to the `to_table` function is applied as the table is
constructed.
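In case a concrete sketch helps, here is roughly what that looks like in
pyarrow (the file name, the toy column contents, and the chunksize value are
made up for illustration; `col1` is borrowed from your earlier example):

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.feather as feather

    # Stand-in for your 1GB table.
    table = pa.table({"col1": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

    # chunksize splits the table into multiple record batches on disk,
    # which is what lets the reader stream it batch-by-batch later.
    feather.write_feather(table, "data.feather", chunksize=2)

    # The datasets API scans the file and applies the filter as it scans,
    # so the full table is never materialized in memory at once.
    dataset = ds.dataset("data.feather", format="feather")
    result = dataset.to_table(filter=ds.field("col1") == "a")

    # Or consume the filtered batches one at a time:
    for batch in dataset.to_batches(filter=ds.field("col1") == "a"):
        print(batch.num_rows)  # each batch arrives already filtered

For a real 1GB table you would pick a much larger chunksize (it counts rows,
not bytes) so that each record batch ends up in the size range described
above.
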
On Wed, Nov 3, 2021 at 12:30 PM gordon chung <[email protected]> wrote:
>
> thanks for the details! really helpful.
>
> > However, there is a spot for it (record batch metadata) and there has
> > been discussion in the past of adding similar capabilities someday.
>
> interesting. just curious, but is there a way to do it without extending the
> format? like it would still scan through all the batches but it wouldn’t load
> all rows into memory or maybe it’s already possible? for example, if i write
> a 1GB arrow table as a feather file and it’s 1GB uncompressed on disk. can i
> read_table on that file but apply filtering as it scans so the resulting
> table is never 1GB? not sure if this conflicts with zero-copy and/or SIMD
> design.
>
> > As a general rule of thumb parquet is more space-efficient
> > and should be used when you are limited by I/O bandwidth. Feather is
> > more CPU-efficient and should be used when you are limited by CPU
> > bandwidth.
>
> this was a really useful way of describing when to choose which at a
> high level. i’ll definitely need to take a closer look at feather now to test
> this out.
>
> ________________________________________
> From: Weston Pace <[email protected]>
> Sent: November 3, 2021 16:54
> To: [email protected]
> Subject: Re: feather file and arrow internals
>
> Great questions.
>
> > is this because internally there is no metadata as to what a RecordBatch
> > contains and it has to iterate through all batches or is it just something
> > unsupported by the api?
>
> The former. Row filtering in parquet relies on row group statistics
> (min & max values) and someday we may also support using bloom filters
> (more than just min/max) and data page statistics (still min/max but
> at a finer resolution). The feather-v2 format (a.k.a. Arrow IPC) does
> not have any defined standard for storing row group statistics.
> However, there is a spot for it (record batch metadata) and there has
> been discussion in the past of adding similar capabilities someday.
> If someone had enough motivation I think all the necessary parts are
> ready, so it is mainly just waiting for someone with motivation and
> engineering time.
>
> > should I use featherv2 in production if I'm ok with "drawbacks" (larger
> > file, less adoption, other stuff I'm not aware of...) or is feather just a
> > poc?
>
> Feather-v1 is something of a proof of concept (although we are
> maintaining backwards compatibility with it). Feather-v2, which is
> sometimes just called the Arrow IPC format, is definitely intended to
> be maintained and not just a proof of concept.
>
> > most references to feather/storing arrow on disk have historically had a
> > disclaimer saying it's not meant to replace parquet.
>
> Feather and parquet have different use cases and it's difficult to
> describe which is more appropriate as it can depend on a lot of
> details. As a general rule of thumb parquet is more space-efficient
> and should be used when you are limited by I/O bandwidth. Feather is
> more CPU-efficient and should be used when you are limited by CPU
> bandwidth. However, this is only a rule of thumb and there are plenty
> of exceptions.
>
> On Wed, Nov 3, 2021 at 9:01 AM gordon chung <[email protected]> wrote:
> >
> > hi,
> >
> > apologies if this is in the doc or mailing list somewhere and I missed it,
> > but I was hoping to understand the arrow file format a bit more.
> >
> > I noticed that when reading a feather file, the API, at least for Python,
> > doesn't support filtering. is this because internally there is no metadata
> > as to what a RecordBatch contains and it has to iterate through all batches
> > or is it just something unsupported by the api? there are references that
> > it supports slicing but I'm thinking more like filtering to only get rows
> > fitting a specific condition (get rows where col1 == 'a' vs get rows
> > 1,3,5...).
> >
> > also, most references to feather/storing arrow on disk have historically
> > had a disclaimer saying it's not meant to replace parquet. that said, the
> > featherv2 post does have a comparison against parquet and my limited
> > testing does show featherv2 performing favourably against it. i guess the
> > question is, should I use featherv2 in production if I'm ok with the
> > "drawbacks" (larger file, less adoption, other stuff I'm not aware of...)
> > or is feather just a poc?
> >
> > thanks,
> >
> > gord
