Thanks for the response, @Weston. That was my understanding of what IPC was
supposed to be doing, but something interesting/odd is indeed going on.
Could it be related to how the files were constructed or to their
dimensions (I did not build them)? Or to the fact that I'm on an arm64
(Mac) machine?

If I just have a simple test function as follows (with a bunch of
benchmarking and memory usage logging that I've omitted):

```
import pyarrow as pa

def reader(filename):
    # Memory-map the file and stream record batches out of it via IPC.
    fh = pa.memory_map(filename)
    rdr = pa.ipc.RecordBatchStreamReader(fh)
    batches = []
    for batch in rdr:
        batches.append(batch)
    fh.close()
    tbl = pa.Table.from_batches(batches)  # just to see what happens
```
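
The omitted benchmarking/memory logging is nothing fancy - essentially
wall-clock timing plus a comparison of OS-reported RSS against Arrow's
allocator, along these lines (simplified sketch; `log_mem` is just a helper
name for illustration):

```
import psutil
import pyarrow as pa

def log_mem(label):
    # Compare what the OS reports (RSS) with what Arrow's allocator reports.
    rss_mb = psutil.Process().memory_info().rss / 2**20
    arrow_mb = pa.total_allocated_bytes() / 2**20
    print(f"{label}: rss={rss_mb:.0f} MB, arrow={arrow_mb:.0f} MB")

# Called before/after the batch loop and after Table.from_batches, with
# time.perf_counter() around each step for the timings quoted below.
```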

For the 15-20GB file (400 columns, ~5.5M rows):
- About 35s to iterate the batches; the process occupies 4GB of RSS memory.
- 70s to build the table; the process now occupies 6GB.
- Notably, the real function, which does a bunch of calculations on the
batches and merges them together after full iteration, is also in the
60-80s range (which includes the batch iteration but not the table
instantiation).
- Strangely, I can only get a real memory reading from psutil.
- pyarrow.total_allocated_bytes() doesn't report more than a trivial
allocation.
- It doesn't matter if I attempt to delete intermediate objects or
call release_unused() on the memory pool (see the sketch below) - the
memory is there, not recognized by Arrow, and it's only released when the
function ends and the objects go out of scope.
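
Concretely, the cleanup attempt was roughly this (a sketch;
`reader_with_cleanup` is just for illustration, not my real code):

```
import psutil
import pyarrow as pa

def reader_with_cleanup(filename):
    # Same as reader() above, plus the explicit cleanup that made no difference.
    fh = pa.memory_map(filename)
    batches = list(pa.ipc.RecordBatchStreamReader(fh))
    fh.close()
    tbl = pa.Table.from_batches(batches)
    del batches, tbl
    pool = pa.default_memory_pool()
    pool.release_unused()                      # hand unused pool memory back to the OS
    print(pa.total_allocated_bytes())          # stays near zero throughout
    print(psutil.Process().memory_info().rss)  # RSS stays high until the function returns
```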

For the 315GB file (45 columns, ~725M rows):
- 2.7s to iterate batches, 0 MB allocated by any measure
- Essentially 0s to build the table, 0 MB allocated
- The delay on the first calculation for every batch confirms that the
batches aren't loaded into memory until accessed (see the probe sketch
below).
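
That "delay on the first calculation" check is just timing the first touch
of column data per batch, roughly like this (sketch; assumes the first
column is numeric so pc.sum forces a real read):

```
import time
import pyarrow as pa
import pyarrow.compute as pc

def probe_first_access(filename):
    # If the batches are just views over the mapping, iteration is near-instant
    # and the cost only shows up the first time column data is actually touched.
    with pa.memory_map(filename) as fh:
        rdr = pa.ipc.RecordBatchStreamReader(fh)
        for i, batch in enumerate(rdr):
            t0 = time.perf_counter()
            pc.sum(batch.column(0))  # first real access (assumes a numeric column)
            print(f"batch {i}: first access took {time.perf_counter() - t0:.3f}s")
            if i >= 2:               # a few batches are enough to see the pattern
                break
```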

So, I'd love to hear any thoughts you'd have on why these behave so
differently.

On Fri, Jun 7, 2024 at 6:36 PM Weston Pace <[email protected]> wrote:

> If you are using the IPC reader to read memory mapped files then the data
> will not be read into memory until you access the data.  I'm not really
> sure why 15GB files would act differently than 300GB files.
>
> > since IPC identifies when a data source is clearly too big for memory
> and doesn't try to load it from disk until the data is requested
> > So, any idea on how I can identify when the stream reader is deciding to
> load in-memory or out-of-memory batches?
>
> The IPC reader does not have any special logic to detect when a data
> source is clearly too big for memory.  If you read (and try to access) all
> contents of a file that is bigger than memory then you will run out of
> memory and crash (or start swapping).
>
> On Fri, Jun 7, 2024 at 1:05 PM Kevin Crouse <[email protected]> wrote:
>
>> Hey folks,
>>
>> I've been using arrow/pyarrow and am writing a multithreaded loader for
>> arbitrary IPC files (RecordBatchStreamReader based on a MemoryMappedFile)
>> that processes batches to calculate custom statistics without loading into
>> memory - and I'm running into issues with IPC outsmarting me.
>>
>> I want to know how I can identify when the StreamReader is actually reading
>> the full batch from disk and when it's just lazily pulling the batch
>> metadata and deferring the actual read. I know in Arrow C++ you can
>> specify cache options and turn lazy off, but that isn't in the pyarrow
>> interface (and I'm not sure that I would want it anyway) - I'm more
>> interested in checking what the stream reader is doing rather than
>> manipulating how it's doing it.
>>
>> Example: since IPC identifies when a data source is clearly too big for
>> memory and doesn't try to load it from disk until the data is requested, I'm
>> running into situations where:
>> - For 15GB files, each iteration of the stream reader loads the batch
>> from disk, which is a slow and IO intensive process, and the full naive
>> iteration of just looping over the reader takes a minute or so - but since
>> the stream reader locks on each batch fetch, that's effectively the lower
>> bound for calculating the statistics.
>> - For 300GB files, each iteration of the stream reader just grabs the
>> metadata and a data summary, so the full iteration of the file takes 3s.
>>
>> So, I have an algorithm that is extremely efficient for the 15GB,
>> not-lazy context, because it balances the IO dependency of the batch reader
>> with the various CPU-intensive workers.
>>
>> For the 300GB file, this is sub-optimal because now the workers have become IO
>> limited and are fighting with each other for the lock. I have a different
>> algorithm that works better in that scenario.
>>
>> So, any idea on how I can identify when the stream reader is deciding to
>> load in-memory or out-of-memory batches?
>>
>> I'm also already binding some to the C++ interface, so if it's not in
>> pyarrow but in the C++ interface, I can work with that too.
>>
>> Thanks,
>>
>> Kevin
>>
>>
