By default we use jemalloc as our memory allocator, which empirically has
been seen to yield better application performance. jemalloc does not
release memory to the operating system right away; this can be altered by
using a different default allocator (for example, the system allocator may
return memory to the OS right away):

https://arrow.apache.org/docs/cpp/memory.html#overriding-the-default-memory-pool
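
For example, roughly along these lines (a minimal sketch; note that
pa.system_memory_pool() is only available if your pyarrow build includes the
system allocator, and jemalloc_set_decay_ms only has an effect when jemalloc
is the active allocator):

import pyarrow as pa

# Option 1: switch the process-wide default allocator to the system
# allocator, which tends to return freed memory to the OS more eagerly.
pa.set_memory_pool(pa.system_memory_pool())

# Option 2: keep jemalloc but ask it to release unused pages back to
# the OS immediately (dirty page decay time of 0 ms).
pa.jemalloc_set_decay_ms(0)

If your pyarrow version supports it, you can also select the default pool
without code changes by setting the ARROW_DEFAULT_MEMORY_POOL environment
variable (e.g. to "system") before pyarrow is imported.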

I expect that the reason the psutil-reported allocated memory is higher in
the last case is that some temporary allocations made during the
filtering process are raising the "high water mark". I believe you can see
what is reported as the peak memory allocation by looking at
pyarrow.default_memory_pool().max_memory()
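
For instance, something like the sketch below (reusing the file, filter, and
projection from your script) should show how much the temporaries contributed
by comparing the pool's current allocation with its high-water mark:

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("tmp/uber.parquet", format="parquet")
table = dataset.to_table(
    filter=ds.field("locationID") == 100,
    columns={"Dispatching_base_num": ds.field("Dispatching_base_num")},
)

pool = pa.default_memory_pool()
# bytes currently held by live Arrow objects (the filtered table)
print(f"current: {pool.bytes_allocated() >> 10}K")
# high-water mark, including temporaries allocated while filtering
print(f"peak:    {pool.max_memory() >> 10}K")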

On Mon, Dec 20, 2021 at 5:10 AM Yp Xie <[email protected]> wrote:

> Hi guys,
>
> I'm seeing some weird memory usage numbers now that I've started using
> pyarrow to read a parquet file.
>
> I wrote a simple script to show how much memory is consumed after each
> step.
> the result is illustrated in the table:
>
> case                                        row number  pa.total_allocated_bytes  memory usage by psutil
> without filters                             5131100     177M                      323M
> with field filter                           57340       2041K                     323M
> with column pruning                         5131100     48M                       154M
> with both field filter and column pruning   57340       567K                      204M
>
> the weird part is: the total memory usage when I apply both the field filter
> and column pruning is *larger* than when only column pruning is applied.
>
> I don't know how that happens; do you guys know the reason for this?
>
> thanks.
>
> env info:
>
> platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.10
> distro info: ('Ubuntu', '20.04', 'focal')
> pyarrow: 6.0.1
>
>
> script code:
>
> import pyarrow as pa
> import psutil
> import os
> import pyarrow.dataset as ds
>
> pid = os.getpid()
>
> def show_mem(action: str) -> None:
>     # resident set size (RSS) in MiB
>     mem = psutil.Process(pid).memory_info().rss >> 20
>     print(f"******* memory usage after {action} **********")
>     print(f"*                   {mem}M                    *")
>     print(f"**********************************************")
>
> dataset = ds.dataset("tmp/uber.parquet", format="parquet")
> show_mem("read dataset")
> projection = {
>     "Dispatching_base_num": ds.field("Dispatching_base_num")
> }
> filter = ds.field("locationID") == 100
> table = dataset.to_table(
>     filter=filter,
>     columns=projection
>     )
> print(f"table row number: {table.num_rows}")
> print(f"total bytes: {pa.total_allocated_bytes() >> 10}K")
> show_mem("dataset.to_table")
>
