feather file and arrow internals

gordon chung Wed, 03 Nov 2021 12:01:29 -0700

hi,

apologies if this is in the docs or on the mailing list somewhere and I missed 
it, but I was hoping to understand the arrow file format a bit more.


I noticed that when reading a feather file, the API, at least for Python, 
doesn't support filtering. is this because internally there is no metadata 
describing what each RecordBatch contains, so it would have to iterate through 
all batches, or is it just something the API doesn't support? there are 
references saying it supports slicing, but I'm thinking more of filtering to 
get only the rows matching a specific condition (get rows where col1 == 'a' vs 
get rows 1,3,5...).
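
to make the slicing vs filtering distinction concrete, here is a toy sketch in 
plain Python. it is not pyarrow and not the actual on-disk layout: the 
list-of-batches model, the per-batch min/max "stats", and the helper names 
(take_rows, filter_rows, filter_rows_with_stats) are all made up for 
illustration. it only shows why positional selection needs nothing beyond batch 
lengths, while a value filter has to scan every batch unless some per-batch 
metadata lets it skip some.

```python
# toy model: a "file" is a list of batches, each batch is a list of row dicts.
# hypothetical structure for illustration only, not Arrow's real format or API.
batches = [
    [{"col1": "a", "col2": 1}, {"col1": "b", "col2": 2}],
    [{"col1": "b", "col2": 3}, {"col1": "a", "col2": 4}],
    [{"col1": "c", "col2": 5}, {"col1": "c", "col2": 6}],
]

# hypothetical per-batch metadata: (min, max) of col1, computed once at "write time"
stats = [
    (min(row["col1"] for row in batch), max(row["col1"] for row in batch))
    for batch in batches
]


def take_rows(batches, positions):
    """slicing: locate each row by position using only the batch lengths."""
    out = []
    for pos in sorted(positions):
        offset = 0
        for batch in batches:
            if pos < offset + len(batch):
                out.append(batch[pos - offset])
                break
            offset += len(batch)
    return out


def filter_rows(batches, column, value):
    """filtering with no metadata: every batch has to be scanned."""
    out, scanned = [], 0
    for batch in batches:
        scanned += 1
        out.extend(row for row in batch if row[column] == value)
    return out, scanned


def filter_rows_with_stats(batches, stats, column, value):
    """same filter, but skip batches whose min/max range excludes the value."""
    out, scanned = [], 0
    for batch, (lo, hi) in zip(batches, stats):
        if not (lo <= value <= hi):
            continue  # this batch cannot contain the value
        scanned += 1
        out.extend(row for row in batch if row[column] == value)
    return out, scanned


picked = take_rows(batches, [1, 3, 5])                       # "get rows 1,3,5"
rows, scanned = filter_rows(batches, "col1", "a")            # scans 3 batches
rows2, scanned2 = filter_rows_with_stats(batches, stats, "col1", "a")  # scans 2
```

both filters return the same rows; the only difference is how many batches get 
touched, which is the part my question about RecordBatch metadata is really 
about.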

also, most references to feather/storing arrow on disk have historically had a 
disclaimer saying it's not meant to replace parquet. that said, the feather v2 
post does include a comparison against parquet, and my limited testing does 
show feather v2 performing favourably against it. i guess the question is: 
should I use feather v2 in production if I'm ok with the "drawbacks" (larger 
files, less adoption, other stuff I'm not aware of...), or is feather just a 
proof of concept?

thanks,

gord
