If you are using the IPC reader to read memory-mapped files, then the data will not be read into memory until you access it. I'm not sure why 15GB files would behave differently from 300GB files.
> since IPC identifies when a data source is clearly too big for memory and
> doesn't try to load it from disk until the data is requested

> So, any idea on how I can identify when the stream reader is deciding to
> load in-memory or out-of-memory batches?

The IPC reader does not have any special logic to detect when a data source is clearly too big for memory. If you read (and try to access) all contents of a file that is bigger than memory, then you will run out of memory and crash (or start swapping).

On Fri, Jun 7, 2024 at 1:05 PM Kevin Crouse <[email protected]> wrote:

> Hey folks,
>
> I've been using arrow/pyarrow and am writing a multithreaded loader for
> arbitrary IPC files (RecordBatchStreamReader based on a MemoryMappedFile)
> that processes batches to calculate custom statistics without loading into
> memory - and I'm running into issues with IPC outsmarting me.
>
> I want to know how I can identify when the StreamReader is actually
> reading the full batch from disk and when it's just lazily pulling the
> batch metadata versus actually reading the batch. I know in Arrow C++ you
> can specify cache options and turn lazy off, but that isn't in the pyarrow
> interface (and I'm not sure that I would want it anyway) - I'm more
> interested in checking what the stream reader is doing rather than
> manipulating how it's doing it.
>
> Example: since IPC identifies when a data source is clearly too big for
> memory and doesn't try to load it from disk until the data is requested,
> I'm running into situations where:
>
> - For 15GB files, each iteration of the stream reader loads the batch from
> disk, which is a slow and IO-intensive process, and the full naive
> iteration of just looping over the reader takes a minute or so - but since
> the stream reader locks on each batch fetch, it's effectively the minimum
> bound for calculating the statistics.
> - For 300GB files, each iteration of the stream reader just grabs the
> metadata and a data summary, so the full iteration of the file takes 3s.
>
> So, I have an algorithm that is extremely efficient for the 15GB, not-lazy
> context, because it balances the IO dependency of the batch reader with
> the various CPU-intensive workers.
>
> For the 300GB files, this is sub-optimal because now the workers have
> become IO limited and are fighting with each other for the lock. I have a
> different algorithm that works better in that scenario.
>
> So, any idea on how I can identify when the stream reader is deciding to
> load in-memory or out-of-memory batches?
>
> I'm also already binding some to the C++ interface, so if it's not in
> pyarrow but in the C++ interface, I can work with that too.
>
> Thanks,
>
> Kevin
