[ https://issues.apache.org/jira/browse/ARROW-5993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney closed ARROW-5993.
-------------------------------
    Resolution: Duplicate

I confirmed that this is a duplicate of ARROW-6060. On master, peak memory use reading _the entire file_ is 85MB; in 0.14.1 it is a little under 6GB. This code can be used to reproduce the issue:

{code}
import gc

import pyarrow as pa
import pyarrow.parquet as pq


class memory_use:
    """Context manager that reports the change in Arrow allocated and
    peak memory across the enclosed block."""

    def __init__(self):
        self.start_use = pa.total_allocated_bytes()
        self.pool = pa.default_memory_pool()
        self.start_peak_use = self.pool.max_memory()

    def __enter__(self):
        return

    def __exit__(self, type, value, traceback):
        gc.collect()
        print("Change in memory use: {}"
              .format(pa.total_allocated_bytes() - self.start_use))
        print("Change in peak use: {}"
              .format(self.pool.max_memory() - self.start_peak_use))


with memory_use():
    table = pq.read_table('/home/wesm/Downloads/demofile.parquet')
{code}

> [Python] Reading a dictionary column from Parquet results in disproportionate
> memory usage
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-5993
>                 URL: https://issues.apache.org/jira/browse/ARROW-5993
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.0
>            Reporter: Daniel Haviv
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: memory, parquet
>             Fix For: 0.15.0
>
>
> I'm using pyarrow to read a 40MB Parquet file.
> When reading all of the columns besides the "body" column, the process peaks
> at 170MB.
> Reading only the "body" column results in over 6GB of memory used.
> I made the file publicly accessible:
> s3://dhavivresearch/pyarrow/demofile.parquet
>

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
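A related note for readers hitting this on released versions: pyarrow's {{read_dictionary}} option to {{pq.read_table}} asks the reader to keep a string column dictionary-encoded in Arrow memory rather than fully materializing every value, which can greatly reduce memory use for repetitive columns like "body". The sketch below is a minimal, self-contained illustration; the file path and column name are stand-ins for the S3 demofile, which is not reproduced here.

{code}
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small file with a repetitive string column as a stand-in for
# the "body" column from the report.
table = pa.table({'body': ['spam', 'eggs'] * 1000})
pq.write_table(table, '/tmp/demo.parquet')

# Ask the reader to keep the column dictionary-encoded in Arrow memory.
result = pq.read_table('/tmp/demo.parquet', read_dictionary=['body'])

# The column comes back as a dictionary type rather than plain string.
print(result.column('body').type)
{code}

Whether this avoids the peak-memory blowup on 0.14.x depends on the decoding path fixed in ARROW-6060, so treat it as a mitigation to test, not a guaranteed workaround.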