I'm working with some large-ish datasets that I want to share across a
Python multiprocessing worker pool. Is multiprocessing.shared_memory a safe
fit for this job?
Here's a sketch:
```
import multiprocessing.shared_memory

import pyarrow as pa

# Make some data in a RecordBatch
data = pa.array([1, 2, 3, 4, 5])
batch = pa.record_batch([data], names=["x"])

# Write it to a temporary buffer
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Allocate some shared memory
shm = multiprocessing.shared_memory.SharedMemory(
    "arrow_shmem_1", create=True, size=buf.size)

# Copy the buffer into shared memory
shm.buf[:buf.size] = buf.to_pybytes()
```
Then, in a separate process:
```
import multiprocessing.shared_memory

import pyarrow as pa

# Connect to the existing shared memory
shm = multiprocessing.shared_memory.SharedMemory(
    "arrow_shmem_1", create=False)
# Read a batch out
r = pa.ipc.RecordBatchStreamReader(source=shm.buf.obj)
result = r.read_all()
print(result)
```
That second process prints the expected output:
```
pyarrow.Table
x: int64
----
x: [[1,2,3,4,5]]
```
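For context, here's roughly how I intend to use this from the worker pool I
mentioned at the top. This is just a self-contained toy version I put
together to show the shape of it (the worker function name and pool size are
placeholders):
```
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory

import pyarrow as pa

def read_rows(shm_name):
    # Attach to the existing segment by name and read the batch out of it.
    shm = SharedMemory(shm_name, create=False)
    reader = pa.ipc.RecordBatchStreamReader(source=shm.buf.obj)
    table = reader.read_all()
    n = table.num_rows
    # Drop the Arrow references into the mapping before closing it,
    # otherwise close() can fail while buffers are still exported.
    del reader, table
    shm.close()
    return n

if __name__ == "__main__":
    # Build the segment exactly as in the first snippet.
    batch = pa.record_batch([pa.array([1, 2, 3, 4, 5])], names=["x"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    buf = sink.getvalue()
    shm = SharedMemory("arrow_shmem_1", create=True, size=buf.size)
    shm.buf[:buf.size] = buf.to_pybytes()

    with Pool(4) as pool:
        print(pool.map(read_rows, ["arrow_shmem_1"] * 4))

    shm.close()
    shm.unlink()
```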
I have a few questions about this.
1. Is there a way to avoid the temporary buffer? Could I allocate straight
into a bigger slab of shared memory and have Arrow write into it directly?
(See the sketch after this list for roughly what I mean.)
2. Relatedly, is there a way to predict the IPC-serialized message size
without actually serializing it? I'm thinking of something like
`pa.ipc.get_record_batch_size(batch)` (also used in the sketch below).
3. Is it possible to implement a pyarrow MemoryPool in pure Python? I
don't see obvious "allocate" and "deallocate" methods to hook into.
4. Are there other reasons this might be unsafe?
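To make questions 1 and 2 concrete, here's the kind of thing I'm imagining.
I haven't tested this: `pa.ipc.get_record_batch_size` is just my guess at a
name for "size of the serialized stream", and I'm assuming `pa.py_buffer` /
`pa.FixedSizeBufferWriter` will accept the writable shared-memory view:
```
import multiprocessing.shared_memory

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3, 4, 5])], names=["x"])

# Hypothetical: ask Arrow up front how many bytes the serialized stream
# (schema + batch + end-of-stream marker) will take.
nbytes = pa.ipc.get_record_batch_size(batch)  # guessing at name/semantics

# Size the segment to match and write the IPC stream straight into it,
# skipping the temporary BufferOutputStream copy.
shm = multiprocessing.shared_memory.SharedMemory(
    "arrow_shmem_2", create=True, size=nbytes)
sink = pa.FixedSizeBufferWriter(pa.py_buffer(shm.buf))
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
```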
Thanks,
Spencer