Hey folks, I've been using arrow/pyarrow to write a multithreaded loader for arbitrary IPC files (a RecordBatchStreamReader on top of a MemoryMappedFile) that processes batches to calculate custom statistics without loading the whole file into memory - and I'm running into issues with IPC outsmarting me.
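
Roughly, the loader is shaped like this (a stripped-down sketch - the path, worker count, and statistics function are placeholders, not my real code):

import threading
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.ipc as ipc

PATH = "data.arrow"  # placeholder path

def update_stats(batch: pa.RecordBatch) -> None:
    # stand-in for the real (CPU-intensive) statistics calculation
    _ = batch.num_rows

source = pa.memory_map(PATH, "r")
reader = ipc.open_stream(source)
reader_lock = threading.Lock()

def worker() -> None:
    while True:
        # The reader isn't thread-safe, so batch fetches are serialized.
        with reader_lock:
            try:
                batch = reader.read_next_batch()
            except StopIteration:
                return
        # The per-batch work happens outside the lock.
        update_stats(batch)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(worker) for _ in range(4)]
    for f in futures:
        f.result()  # surface any worker exceptions
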
I want to know how I can identify when the StreamReader is actually reading the full batch from disk, and when it's just lazily pulling the batch metadata and deferring the actual read. I know that in Arrow C++ you can specify cache options and turn lazy reading off, but that isn't exposed in the pyarrow interface (and I'm not sure I would want it anyway) - I'm more interested in checking what the stream reader is doing rather than manipulating how it's doing it.

Example: since IPC identifies when a data source is clearly too big for memory and doesn't try to load it into memory until the data is requested, I'm running into situations where:

- For 15GB files, each iteration of the stream reader loads the batch from disk, which is slow and IO intensive, and a naive pass of just looping over the reader takes a minute or so - but since the stream reader locks on each batch fetch, that's effectively the lower bound for calculating the statistics.
- For 300GB files, each iteration of the stream reader just grabs the metadata and a data summary, so a full pass over the file takes 3s.

So I have an algorithm that is extremely efficient in the 15GB, non-lazy context, because it balances the IO dependency of the batch reader against the various CPU-intensive workers. For the 300GB case this is sub-optimal, because the workers become IO limited and fight with each other for the lock. I have a different algorithm that works better in that scenario.

So, any idea how I can identify when the stream reader is deciding to load in-memory or out-of-memory batches? I'm also already binding to some of the C++ interface, so if it's not in pyarrow but is in the C++ interface, I can work with that too.

Thanks,
Kevin
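
P.S. In case it helps, this is the crude probe I've been using to compare the two cases: it just times each read_next_batch() call and checks pyarrow's allocator before and after the pass. It's only a sketch (the path is a placeholder), and I'm not sure the allocator delta actually tells me anything reliable for a memory-mapped source, which is part of why I'm asking.

import time

import pyarrow as pa
import pyarrow.ipc as ipc

PATH = "data.arrow"  # placeholder path; I run this against both file sizes

source = pa.memory_map(PATH, "r")
reader = ipc.open_stream(source)

alloc_before = pa.total_allocated_bytes()
i = 0
while True:
    start = time.perf_counter()
    try:
        batch = reader.read_next_batch()  # time just the fetch itself
    except StopIteration:
        break
    fetch_s = time.perf_counter() - start
    print(f"batch {i}: {batch.num_rows} rows, {batch.nbytes} bytes, fetch {fetch_s:.4f}s")
    i += 1

print("allocator delta:", pa.total_allocated_bytes() - alloc_before, "bytes")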
