Hello there,

I am using Arrow to store data on disk temporarily, so disk space is not a
concern (I understand that Parquet is preferable when efficient disk
storage matters). It seems that Arrow's memory-mapping/zero-copy
capabilities would give better performance for this use case.

Here are my questions:

1. For new applications, should we prefer the pa.ipc.new_file interface
over write_feather? My understanding from reading [0] is that
pa.feather.write_feather is kept mainly for backward compatibility, and
with compression disabled it seems to produce files of the same size as
the RecordBatchFileWriter (the files appear to be byte-identical; see the
sketch for question 1 below).

2. Does compression affect the need to make copies? I imagine that a
compressed file has to be decompressed into freshly allocated buffers, so
reading it can no longer be zero-copy (the sketch for question 2 below
shows how I have been checking this).

3. When using pandas to analyze the data, is there a way to load it via
memory mapping (the sketch for question 3 below is the pattern I have in
mind), and if so, should that improve deserialization performance and
memory utilization when multiple processes read the same table
simultaneously? Assume I'm running on a modern server-class SSD.
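
For question 1, here is roughly how I compared the two writers (a minimal
sketch; the sample table and the file names are placeholders I made up):

import os

import pyarrow as pa
import pyarrow.feather as feather

# A small sample table; my real data is larger but shaped similarly.
table = pa.table({"id": list(range(100_000)),
                  "value": [0.5] * 100_000})

# Write with the IPC file interface (RecordBatchFileWriter underneath).
with pa.OSFile("ipc_file.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Write with the Feather API, compression explicitly disabled.
feather.write_feather(table, "feather_file.arrow",
                      compression="uncompressed")

print(os.path.getsize("ipc_file.arrow"),
      os.path.getsize("feather_file.arrow"))

Both files come out the same size for me.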
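
For question 2, this is how I have been checking whether a read allocates
new buffers (again a sketch; I am assuming pa.total_allocated_bytes() is a
fair proxy for whether copies are made):

import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"x": list(range(100_000))})
feather.write_feather(table, "plain.arrow", compression="uncompressed")
feather.write_feather(table, "zstd.arrow", compression="zstd")

def allocated_by_read(path):
    # Memory-map the file and rebuild the table from the mapping.
    with pa.memory_map(path, "r") as source:
        before = pa.total_allocated_bytes()
        pa.ipc.open_file(source).read_all()
        return pa.total_allocated_bytes() - before

# Uncompressed: buffers can point straight into the mapping.
print("uncompressed:", allocated_by_read("plain.arrow"))
# Compressed: decompression has to write into fresh buffers.
print("zstd:", allocated_by_read("zstd.arrow"))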
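
For question 3, this is the pattern I have in mind, reusing the file from
the first sketch (I understand that to_pandas() generally still copies for
most dtypes, so any zero-copy benefit would apply to the Arrow table
rather than to the DataFrame itself):

import pyarrow as pa

# Map the file and read it back as an Arrow table without copying.
with pa.memory_map("ipc_file.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()

# The conversion to pandas copies; split_blocks/self_destruct can at
# least reduce the peak memory used during the conversion.
df = table.to_pandas(split_blocks=True, self_destruct=True)
print(df.head())

My hope is that, since the kernel page cache backs the mapping, multiple
processes reading the same file would share the same physical pages.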

Thank you!

Jonathan

[0] https://arrow.apache.org/faq/#what-about-the-feather-file-format
