Hey folks, I've been using arrow/pyarrow to write a multithreaded loader for arbitrary IPC files (a RecordBatchStreamReader on top of a MemoryMappedFile) that processes batches to calculate custom statistics without loading the whole file into memory - and I'm running into issues with IPC outsmarting me.
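
Roughly, the loader is shaped like this (a stripped-down sketch - the path, worker count, and statistics function are placeholders, not my real code):

import threading
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.ipc as ipc

PATH = "data.arrow"  # placeholder path

def update_stats(batch: pa.RecordBatch) -> None:
    # stand-in for the real (CPU-intensive) statistics calculation
    _ = batch.num_rows

source = pa.memory_map(PATH, "r")
reader = ipc.open_stream(source)
reader_lock = threading.Lock()

def worker() -> None:
    while True:
        # The reader isn't thread-safe, so batch fetches are serialized.
        with reader_lock:
            try:
                batch = reader.read_next_batch()
            except StopIteration:
                return
        # The per-batch work happens outside the lock.
        update_stats(batch)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(worker) for _ in range(4)]
    for f in futures:
        f.result()  # surface any worker exceptions
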
I want to know how I can identify when the StreamReader is actually reading the full batch from disk, and when it's just lazily pulling the batch metadata and deferring the actual read. I know that in Arrow C++ you can specify cache options and turn lazy reading off, but that isn't exposed in the pyarrow interface (and I'm not sure I would want it anyway) - I'm more interested in checking what the stream reader is doing rather than manipulating how it's doing it.

Example: since IPC identifies when a data source is clearly too big for memory and doesn't try to load it into memory until the data is requested, I'm running into situations where:

- For 15GB files, each iteration of the stream reader loads the batch from disk, which is slow and IO intensive, and a naive pass of just looping over the reader takes a minute or so - but since the stream reader locks on each batch fetch, that's effectively the lower bound for calculating the statistics.
- For 300GB files, each iteration of the stream reader just grabs the metadata and a data summary, so a full pass over the file takes 3s.

So I have an algorithm that is extremely efficient in the 15GB, non-lazy context, because it balances the IO dependency of the batch reader against the various CPU-intensive workers. For the 300GB case this is sub-optimal, because the workers become IO limited and fight with each other for the lock. I have a different algorithm that works better in that scenario.

So, any idea how I can identify when the stream reader is deciding to load in-memory or out-of-memory batches? I'm also already binding to some of the C++ interface, so if it's not in pyarrow but is in the C++ interface, I can work with that too.

Thanks,
Kevin
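
P.S. In case it helps, this is the crude probe I've been using to compare the two cases: it just times each read_next_batch() call and checks pyarrow's allocator before and after the pass. It's only a sketch (the path is a placeholder), and I'm not sure the allocator delta actually tells me anything reliable for a memory-mapped source, which is part of why I'm asking.

import time

import pyarrow as pa
import pyarrow.ipc as ipc

PATH = "data.arrow"  # placeholder path; I run this against both file sizes

source = pa.memory_map(PATH, "r")
reader = ipc.open_stream(source)

alloc_before = pa.total_allocated_bytes()
i = 0
while True:
    start = time.perf_counter()
    try:
        batch = reader.read_next_batch()  # time just the fetch itself
    except StopIteration:
        break
    fetch_s = time.perf_counter() - start
    print(f"batch {i}: {batch.num_rows} rows, {batch.nbytes} bytes, fetch {fetch_s:.4f}s")
    i += 1

print("allocator delta:", pa.total_allocated_bytes() - alloc_before, "bytes")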
