Hi

I'm looking at using Arrow primarily on low-resource instances with 
larger-than-memory datasets. This is the workflow I'm trying to implement:


  *   Write record batches in the IPC streaming format to a file from a C runtime.
  *   Consume it one row at a time from Python/C by loading the file in chunks 
(a sketch of the Python side follows this list).
  *   If the schema is simple enough to support zero-copy operations, make the 
table readable from pandas. This requires me to:
     *   convert it into a Table with a single chunk per column (since pandas 
can't use mmap with chunked arrays), and
     *   write the table in the IPC random access (file) format.
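For concreteness, here is a minimal sketch of the chunked consumer I have in 
mind on the Python side; the file name data.arrows and the per-row slice are 
placeholders:

    import pyarrow as pa

    # Open the streaming-format file and read it one record batch at a
    # time, so only the current batch needs to be resident in memory.
    with pa.memory_map("data.arrows", "r") as source:
        reader = pa.ipc.open_stream(source)
        for batch in reader:
            for i in range(batch.num_rows):
                row = batch.slice(i, 1)  # zero-copy one-row view
                # ... process the row ...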

PyArrow provides a Table method `combine_chunks` to concatenate the chunks of 
each column into a single chunk. However, it needs to materialize the entire 
table in memory (I suspect peak usage is 2x, since it holds both versions of 
the table in memory at once, though that could presumably be avoided).
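A minimal sketch of that path as I understand it, with data.arrows/data.arrow 
as placeholder file names; the `combine_chunks` call is where the second copy 
would appear:

    import pyarrow as pa

    with pa.OSFile("data.arrows", "rb") as source:
        table = pa.ipc.open_stream(source).read_all()  # chunked table, fully in memory

    combined = table.combine_chunks()  # contiguous copy: ~2x peak here
    del table                          # drop the chunked original

    # Write the single-chunk table in the IPC random access (file) format.
    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, combined.schema) as writer:
            writer.write_table(combined)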

Since the Arrow layout is columnar, I'm curious whether it is possible to write 
the table one column at a time, and whether the existing GLib/Python APIs 
support this. The C++ file writer objects only seem to go down to serializing a 
single record batch at a time, not a single column.
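For reference, the finest write granularity I can find in the Python API is a 
whole record batch; everything below except `write_batch` itself is 
placeholder:

    import pyarrow as pa

    batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})  # placeholder data

    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)  # per batch; no per-column entry point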


Thank you,
Ishan
