Thanks, Wes and Weston, for the explanations. I just tried waiting 5s after the to_table call, and indeed the memory usage reported by psutil dropped to a reasonable size.
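In case it helps anyone else, the check looks roughly like this (a minimal sketch on top of the script quoted below; the file path and column names come from my test data, and the 5s value follows Weston's suggestion):

    import os
    import time

    import psutil
    import pyarrow.dataset as ds

    dataset = ds.dataset("tmp/uber.parquet", format="parquet")
    table = dataset.to_table(
        filter=ds.field("locationID") == 100,
        columns={"Dispatching_base_num": ds.field("Dispatching_base_num")},
    )
    # Give jemalloc a moment to return freed pages to the OS before
    # measuring; without the sleep, RSS still includes buffers that
    # pyarrow has already released back to the allocator.
    time.sleep(5)
    rss = psutil.Process(os.getpid()).memory_info().rss >> 20
    print(f"RSS after to_table + 5s sleep: {rss}M")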
thanks again.

- xyp

On Mon, Jan 3, 2022 at 23:02, Weston Pace <[email protected]> wrote:

> Wes' theory seems sound. Perhaps the easiest way to test that theory
> would be to put a five-second sleep after the to_table call and before
> you run show_mem. In theory 1s is long enough, but 5s is nice to remove
> any doubt.
>
> If there is a filter (that cannot be serviced by parquet row group
> statistics) there will be more total allocation. This is because we
> first need to read in the full row group and then we need to filter it,
> which is a copy operation to a (hopefully) smaller row group.
>
> The filtering should happen after the column pruning, but if the filter
> references any columns that are not included in the final result then we
> will need to load in those additional columns, use them for the filter,
> and then drop them. This is another way you might end up with more total
> allocation if you use a filter.
>
> -Weston
>
> On Mon, Jan 3, 2022 at 3:10 AM Wes McKinney <[email protected]> wrote:
>
>> By default we use jemalloc as our memory allocator, which empirically
>> has been seen to yield better application performance. jemalloc does
>> not release memory to the operating system right away; this can be
>> altered by using a different default allocator (for example, the system
>> allocator may return memory to the OS right away):
>>
>> https://arrow.apache.org/docs/cpp/memory.html#overriding-the-default-memory-pool
>>
>> I expect that the reason the psutil-reported allocated memory is higher
>> in the last case is that some temporary allocations made during the
>> filtering process are raising the "high water mark". I believe you can
>> see what is reported as the peak memory allocation by looking at
>> pyarrow.default_memory_pool().max_memory()
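A minimal, untested sketch of the two knobs Wes points at, i.e. overriding the default allocator (per the linked docs page) and reading the pool's high-water mark; the file path and filter are taken from the script further down, and the jemalloc decay call is an additional jemalloc-specific option, not something Wes mentioned:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Use the system allocator instead of jemalloc; it tends to hand
    # freed memory back to the OS sooner. The same switch is available
    # via the ARROW_DEFAULT_MEMORY_POOL=system environment variable.
    pa.set_memory_pool(pa.system_memory_pool())

    # Alternatively, stay on jemalloc but ask it to release freed
    # memory eagerly instead of holding it for reuse:
    # pa.jemalloc_set_decay_ms(0)

    dataset = ds.dataset("tmp/uber.parquet", format="parquet")
    table = dataset.to_table(filter=ds.field("locationID") == 100)

    pool = pa.default_memory_pool()
    print(f"backend: {pool.backend_name}")
    print(f"current: {pool.bytes_allocated() >> 10}K")
    # Peak allocation of the pool; temporary buffers created while
    # filtering raise this high-water mark even after they are freed.
    print(f"peak:    {pool.max_memory() >> 10}K")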
>>
>> On Mon, Dec 20, 2021 at 5:10 AM Yp Xie <[email protected]> wrote:
>>
>>> Hi guys,
>>>
>>> I'm getting weird memory usage numbers now that I've started using
>>> pyarrow to read a parquet file.
>>>
>>> I wrote a simple script to show how much memory is consumed after each
>>> step. The results are shown in this table:
>>>
>>>                                              row number  pa.total_allocated_bytes  memory usage by psutil (RSS)
>>> without filters                              5131100     177M                      323M
>>> with field filter                            57340       2041K                     323M
>>> with column pruning                          5131100     48M                       154M
>>> with both field filter and column pruning    57340       567K                      204M
>>>
>>> the weird part is that the total memory usage when I apply both the
>>> field filter and column pruning is *larger* than with only column
>>> pruning applied.
>>>
>>> I don't know how that happened, do you guys know the reason for this?
>>>
>>> thanks.
>>>
>>> env info:
>>>
>>> platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.10
>>> distro info: ('Ubuntu', '20.04', 'focal')
>>> pyarrow: 6.0.1
>>>
>>> script code:
>>>
>>> import pyarrow as pa
>>> import psutil
>>> import os
>>> import pyarrow.dataset as ds
>>>
>>> pid = os.getpid()
>>>
>>> def show_mem(action: str) -> None:
>>>     # resident set size, shifted down to MiB
>>>     mem = psutil.Process(pid).memory_info()[0] >> 20
>>>     print(f"******* memory usage after {action} **********")
>>>     print(f"* {mem}M *")
>>>     print(f"**********************************************")
>>>
>>> dataset = ds.dataset("tmp/uber.parquet", format="parquet")
>>> show_mem("read dataset")
>>> projection = {
>>>     "Dispatching_base_num": ds.field("Dispatching_base_num")
>>> }
>>> filter = ds.field("locationID") == 100
>>> table = dataset.to_table(
>>>     filter=filter,
>>>     columns=projection
>>> )
>>> print(f"table row number: {table.num_rows}")
>>> print(f"total bytes: {pa.total_allocated_bytes() >> 10}K")
>>> show_mem("dataset.to_table")
