I'm working with some large-ish datasets that I want to share across a
Python multiprocessing worker pool. Is multiprocessing.shared_memory a safe
fit for this job?
Here's a sketch:
```
import multiprocessing.shared_memory

import pyarrow as pa

# Make some data in a RecordBatch
data = pa.array([1, 2, 3, 4, 5])
batch = pa.record_batch([data], names=["x"])

# Write it to a temporary buffer
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Allocate some shared memory
shm = multiprocessing.shared_memory.SharedMemory(
    "arrow_shmem_1", create=True, size=buf.size)

# Copy the buffer into shared memory
shm.buf[:buf.size] = buf.to_pybytes()
```
Then, in a separate process:
```
import multiprocessing.shared_memory

import pyarrow as pa

# Connect to the existing shared memory
shm = multiprocessing.shared_memory.SharedMemory(
    "arrow_shmem_1", create=False)
# Read a batch out
r = pa.ipc.RecordBatchStreamReader(source=shm.buf.obj)
result = r.read_all()
print(result)
```
That second process prints the expected output:
```
pyarrow.Table
x: int64
----
x: [[1,2,3,4,5]]
```
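For context, here's roughly how I intend to use this from the worker pool I
mentioned at the top. This is just a self-contained toy version I put
together to show the shape of it (the worker function name and pool size are
placeholders):
```
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory

import pyarrow as pa

def read_rows(shm_name):
    # Attach to the existing segment by name and read the batch out of it.
    shm = SharedMemory(shm_name, create=False)
    reader = pa.ipc.RecordBatchStreamReader(source=shm.buf.obj)
    table = reader.read_all()
    n = table.num_rows
    # Drop the Arrow references into the mapping before closing it,
    # otherwise close() can fail while buffers are still exported.
    del reader, table
    shm.close()
    return n

if __name__ == "__main__":
    # Build the segment exactly as in the first snippet.
    batch = pa.record_batch([pa.array([1, 2, 3, 4, 5])], names=["x"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    buf = sink.getvalue()
    shm = SharedMemory("arrow_shmem_1", create=True, size=buf.size)
    shm.buf[:buf.size] = buf.to_pybytes()

    with Pool(4) as pool:
        print(pool.map(read_rows, ["arrow_shmem_1"] * 4))

    shm.close()
    shm.unlink()
```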
I have a few questions about this.
1. Is there a way to avoid the temporary buffer? Could I allocate straight
into a bigger slab of shared memory and have Arrow write into it directly?
(See the sketch after this list for roughly what I mean.)
2. Relatedly, is there a way to predict the IPC-serialized message size
without actually serializing it? I'm thinking of something like
`pa.ipc.get_record_batch_size(batch)` (also used in the sketch below).
3. Is it possible to implement a pyarrow MemoryPool in pure Python? I
don't see obvious "allocate" and "deallocate" methods to hook into.
4. Are there other reasons this might be unsafe?
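To make questions 1 and 2 concrete, here's the kind of thing I'm imagining.
I haven't tested this: `pa.ipc.get_record_batch_size` is just my guess at a
name for "size of the serialized stream", and I'm assuming `pa.py_buffer` /
`pa.FixedSizeBufferWriter` will accept the writable shared-memory view:
```
import multiprocessing.shared_memory

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3, 4, 5])], names=["x"])

# Hypothetical: ask Arrow up front how many bytes the serialized stream
# (schema + batch + end-of-stream marker) will take.
nbytes = pa.ipc.get_record_batch_size(batch)  # guessing at name/semantics

# Size the segment to match and write the IPC stream straight into it,
# skipping the temporary BufferOutputStream copy.
shm = multiprocessing.shared_memory.SharedMemory(
    "arrow_shmem_2", create=True, size=nbytes)
sink = pa.FixedSizeBufferWriter(pa.py_buffer(shm.buf))
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
```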
Thanks,
Spencer