Wes's theory seems sound. Perhaps the easiest way to test it would be to put a five-second sleep after the to_table call and before you run show_mem. In theory one second is long enough, but five seconds removes any doubt.
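For example, a minimal tweak to the script quoted below (only the time import and the sleep call are new; everything else is from the original script):

import time

table = dataset.to_table(
    filter=filter,
    columns=projection
)
time.sleep(5)  # give the allocator time to release freed memory back to the OS
show_mem("dataset.to_table")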
If there is a filter (one that cannot be serviced by Parquet row-group statistics) there will be more total allocation. This is because we first need to read in the full row group, and then we need to filter it, which is a copy operation into a (hopefully) smaller row group.

The filtering should happen after the column pruning, but if the filter references any columns that are not included in the final result, then we will need to load those additional columns, use them for the filter, and then drop them. This is another way you might end up with more total allocation if you use a filter.

(There is a short sketch after the quoted thread below showing how to switch allocators and read the peak allocation Wes mentions.)

-Weston

On Mon, Jan 3, 2022 at 3:10 AM Wes McKinney <[email protected]> wrote:

> By default we use jemalloc as our memory allocator, which empirically has
> been seen to yield better application performance. jemalloc does not
> release memory to the operating system right away; this can be altered by
> using a different default allocator (for example, the system allocator may
> return memory to the OS right away):
>
> https://arrow.apache.org/docs/cpp/memory.html#overriding-the-default-memory-pool
>
> I expect that the reason that psutil-reported allocated memory is higher
> in the last case is because some temporary allocations made during the
> filtering process are raising the "high water mark". I believe you can see
> what is reported as the peak memory allocation by looking at
> pyarrow.default_memory_pool().max_memory()
>
> On Mon, Dec 20, 2021 at 5:10 AM Yp Xie <[email protected]> wrote:
>
>> Hi guys,
>>
>> I'm getting weird memory usage numbers when trying to use pyarrow to
>> read a parquet file.
>>
>> I wrote a simple script to show how much memory is consumed after each
>> step. The results are shown in the table below:
>>
>>                                             row number  pa.total_allocated_bytes  memory usage by psutil
>> without filters                                5131100                      177M                    323M
>> with field filter                                57340                     2041K                    323M
>> with column pruning                            5131100                       48M                    154M
>> with both field filter and column pruning        57340                      567K                    204M
>>
>> The weird part is: the total memory usage when I apply both the field
>> filter and column pruning is *larger* than when only column pruning is
>> applied.
>>
>> I don't know how that happened. Do you guys know the reason for this?
>>
>> thanks.
>>
>> env info:
>>
>> platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.10
>> distro info: ('Ubuntu', '20.04', 'focal')
>> pyarrow: 6.0.1
>>
>> script code:
>>
>> import pyarrow as pa
>> import psutil
>> import os
>> import pyarrow.dataset as ds
>>
>> pid = os.getpid()
>>
>> def show_mem(action: str) -> None:
>>     # RSS in bytes, shifted right by 20 bits to convert to MiB
>>     mem = psutil.Process(pid).memory_info()[0] >> 20
>>     print(f"******* memory usage after {action} **********")
>>     print(f"* {mem}M *")
>>     print(f"**********************************************")
>>
>> dataset = ds.dataset("tmp/uber.parquet", format="parquet")
>> show_mem("read dataset")
>>
>> projection = {
>>     "Dispatching_base_num": ds.field("Dispatching_base_num")
>> }
>> filter = ds.field("locationID") == 100
>> table = dataset.to_table(
>>     filter=filter,
>>     columns=projection
>> )
>> print(f"table row number: {table.num_rows}")
>> # allocated bytes shifted right by 10 bits to convert to KiB
>> print(f"total bytes: {pa.total_allocated_bytes() >> 10}K")
>> show_mem("dataset.to_table")
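A minimal sketch of the two checks suggested above, in the spirit of the original script: pa.set_memory_pool(), pa.system_memory_pool(), and MemoryPool.max_memory() are documented pyarrow APIs, and the file path and field names are taken from the script quoted above.

import pyarrow as pa
import pyarrow.dataset as ds

# Swap the default jemalloc pool for the system allocator, which may
# return freed memory to the OS right away. (Setting the environment
# variable ARROW_DEFAULT_MEMORY_POOL=system before importing pyarrow
# achieves the same thing.)
pa.set_memory_pool(pa.system_memory_pool())

dataset = ds.dataset("tmp/uber.parquet", format="parquet")
table = dataset.to_table(
    filter=ds.field("locationID") == 100,
    columns={"Dispatching_base_num": ds.field("Dispatching_base_num")},
)

# Current vs. peak allocation on the default pool; temporary allocations
# made while filtering raise the peak even after they have been freed.
print(f"current: {pa.total_allocated_bytes() >> 10}K")
print(f"peak:    {pa.default_memory_pool().max_memory() >> 10}K")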
