Re: feather file and arrow internals

Weston Pace Wed, 03 Nov 2021 14:09:59 -0700

Sorry to double-post but I should also add one more difference.

Feather supports random access (either a single index or a slice) and
parquet doesn't.  That's another big factor in choosing which format
you want to go with.


On Wed, Nov 3, 2021 at 10:54 AM Weston Pace <[email protected]> wrote:
>
> Great questions.
>
> > is this because internally there is no metadata as to what a RecordBatch 
> > contains and it has to iterate through all batches or it is just something 
> > unsupported by api?
>
> The former.  Row filtering in parquet relies on row group statistics
> (min & max values) and someday we may also support using bloom filters
> (more than just min/max) and data page statistics (still min/max but
> at a finer resolution).  The feather-v2 format (a.k.a. Arrow IPC) does
> not have any defined standard for storing row group statistics.
> However, there is a spot for it (record batch metadata) and there has
> been discussion in the past of adding similar capabilities someday.
> I think all the necessary parts are ready, so it is mainly waiting
> for someone with the motivation and engineering time.
>
> > should I use featherv2 in production if I'm ok with "drawbacks" (larger 
> > file, less adoption, other stuff I'm not aware of...) or is feather just a 
> > poc?
>
> Feather-v1 is something of a proof of concept (although we are
> maintaining backwards compatibility with it).  Feather-v2, which is
> sometimes just called the Arrow IPC format, is definitely intended to
> be maintained and not just a proof of concept.
>
> > most references to feather/storing arrow on disk have historically had a 
> > disclaimer saying it's not meant to replace parquet.
>
> Feather and parquet have different use cases and it's difficult to
> describe which is more appropriate as it can depend on a lot of
> details.  As a general rule of thumb parquet is more space-efficient
> and should be used when you are limited by I/O bandwidth.  Feather is
> more CPU-efficient and should be used when you are limited by CPU.
> However, this is only a rule of thumb and there are plenty of
> exceptions.
>
> On Wed, Nov 3, 2021 at 9:01 AM gordon chung <[email protected]> wrote:
> >
> > hi,
> >
> > apologies if this in the doc or mailing list somewhere and I missed it but 
> > I was hoping to understand the arrow file format a bit more.
> >
> > I noticed that when reading a feather file, the API, at least for Python, 
> > doesn't support filtering. is this because internally there is no metadata 
> > as to what a RecordBatch contains and it has to iterate through all batches 
> > or it is just something unsupported by api? there are references that it 
> > supports slicing but I'm thinking more like filtering to only get rows 
> > fitting a specific condition (get rows where col1 == 'a' vs get rows 
> > 1,3,5...).
> >
> > also, most references to feather/storing arrow on disk have historically 
> > had a disclaimer saying it's not meant to replace parquet. that said, the 
> > featherv2 post does have comparison against parquet and my limited testing 
> > does show featherv2 performing favourably against it. i guess the question 
> > is, should I use featherv2 in production if I'm ok with "drawbacks" (larger 
> > file, less adoption, other stuff I'm not aware of...) or is feather just a 
> > poc?
> >
> > thanks,
> >
> > gord
