Sorry to double-post, but I should add one more difference. Feather supports random access to rows (either a single index or a slice) and parquet doesn't. That's another big factor in choosing between the two formats.
On Wed, Nov 3, 2021 at 10:54 AM Weston Pace <[email protected]> wrote:
>
> Great questions.
>
> > is this because internally there is no metadata as to what a RecordBatch
> > contains and it has to iterate through all batches or it is just something
> > unsupported by api?
>
> The former. Row filtering in parquet relies on row group statistics
> (min & max values) and someday we may also support using bloom filters
> (more than just min/max) and data page statistics (still min/max but
> at a finer resolution). The feather-v2 format (a.k.a Arrow-IPC) does
> not have any defined standard for storing row group statistics.
> However, there is a spot for it (record batch metadata) and there has
> been discussion in the past of adding similar capabilities someday.
> If someone had enough motivation I think all the necessary parts are
> ready so it is mainly just waiting for someone with motivation and
> engineering time.
>
> > should I use featherv2 in production if I'm ok with "drawbacks" (larger
> > file, less adoption, other stuff I'm not aware of...) or is feather just a
> > poc?
>
> Feather-v1 is something of a proof of concept (although we are
> maintaining backwards compatibility with it). Feather-v2, which is
> sometimes just called the Arrow IPC format, is definitely intended to
> be maintained and not just a proof of concept.
>
> > most references to feather/storing arrow on disk have historically had a
> > disclaimer saying it's not meant to replace parquet.
>
> Feather and parquet have different use cases and it's difficult to
> describe which is more appropriate as it can depend on a lot of
> details. As a general rule of thumb parquet is more space-efficient
> and should be used when you are limited by I/O bandwidth. Feather is
> more CPU-efficient and should be used when you are limited by CPU
> bandwidth. However, this is only a rule of thumb and there are plenty
> of exceptions.
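To make the row group statistics point above a bit more concrete, here is a toy sketch in plain Python of how min/max values let a reader skip whole groups when filtering. This is not how any parquet reader is actually implemented; `RowGroup`, `make_row_group`, and `filter_equals` are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for one row group of a single column (hypothetical)."""
    rows: list
    min_value: str
    max_value: str


def make_row_group(values):
    # Statistics are computed once at write time and stored with the group.
    values = list(values)
    return RowGroup(rows=values, min_value=min(values), max_value=max(values))


def filter_equals(row_groups, target):
    """Return matching values and how many groups actually had to be scanned."""
    matches = []
    scanned = 0
    for group in row_groups:
        # If the target lies outside [min, max], no row in the group can match,
        # so the group is skipped without looking at its rows.
        if target < group.min_value or target > group.max_value:
            continue
        scanned += 1
        matches.extend(v for v in group.rows if v == target)
    return matches, scanned


groups = [
    make_row_group(["a", "a", "b"]),
    make_row_group(["c", "d"]),
    make_row_group(["e", "f", "g"]),
]
matches, scanned = filter_equals(groups, "a")
print(matches, scanned)
```

Without stored statistics, as in Feather today, the same filter has to look at every batch, which is why filtering there amounts to iterating through all of them.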
>
> On Wed, Nov 3, 2021 at 9:01 AM gordon chung <[email protected]> wrote:
> >
> > hi,
> >
> > apologies if this in the doc or mailing list somewhere and I missed it but
> > I was hoping to understand the arrow file format a bit more.
> >
> > I noticed that when reading a feather file, the API, at least for Python,
> > doesn't support filtering. is this because internally there is no metadata
> > as to what a RecordBatch contains and it has to iterate through all batches
> > or it is just something unsupported by api? there are references that it
> > supports slicing but I'm thinking more like filtering to only get rows
> > fitting a specific condition (get rows where col1 == 'a' vs get rows
> > 1,3,5...).
> >
> > also, most references to feather/storing arrow on disk have historically
> > had a disclaimer saying it's not meant to replace parquet. that said, the
> > featherv2 post does have comparison against parquet and my limited testing
> > does show featherv2 performing favourably against it. i guess the question
> > is, should I use featherv2 in production if I'm ok with "drawbacks" (larger
> > file, less adoption, other stuff I'm not aware of...) or is feather just a
> > poc?
> >
> > thanks,
> >
> > gord
