Sergey Mozharov created ARROW-6874:
--------------------------------------

             Summary: Memory leak in Table.to_pandas() when nested columns are 
present
                 Key: ARROW-6874
                 URL: https://issues.apache.org/jira/browse/ARROW-6874
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.0
         Environment: Operating system: Windows 10
pyarrow installed via conda
Both Python environments were identical except for the pyarrow version:
python: 3.6.7
numpy: 1.17.2
pandas: 0.25.1
            Reporter: Sergey Mozharov


I upgraded from pyarrow 0.14.1 to 0.15.0, and during testing my Python 
interpreter ran out of memory.

I narrowed the issue down to the pyarrow.Table.to_pandas() call, which appears 
to leak memory in 0.15.0 when the table contains a nested column. The script 
below reproduces the issue.

 
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# create a table with one nested array column
nested_array = pa.array([np.random.rand(1000) for i in range(500)])
nested_array.type  # ListType(list<item: double>)
table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])

# convert it to a pandas DataFrame in a loop to monitor memory consumption
num_iterations = 10000
# pyarrow v0.14.1: Memory allocation does not grow during loop execution
# pyarrow v0.15.0: ~550 MB of RAM is allocated and never released
for i in range(num_iterations):
    df = pa.Table.to_pandas(table)


# When the table column is not nested, no memory leak is observed
array = pa.array(np.random.rand(500 * 1000))
table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
# no memory leak:
for i in range(num_iterations):
    df = pa.Table.to_pandas(table){code}
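
The growth described above was observed by watching process memory. As a rough way to put a number on it, here is a minimal sketch using the standard-library tracemalloc module. The helper names measure_growth and run_pyarrow_demo are hypothetical and not part of pyarrow. tracemalloc only sees allocations made through Python's allocators (NumPy array buffers should be included on recent NumPy versions, as far as I know), so memory held by Arrow's own C++ pool is reported separately via pa.total_allocated_bytes().

```python
import gc
import tracemalloc


def measure_growth(func, iterations=100):
    """Return the net change in traced Python heap bytes over repeated calls."""
    gc.collect()
    tracemalloc.start()
    try:
        func()  # warm-up call so one-time caches are not counted as growth
        gc.collect()
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            func()  # the result is discarded, so it should be freed each time
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


def run_pyarrow_demo(iterations=200):
    """Apply measure_growth to the nested-column table from the report."""
    import numpy as np
    import pyarrow as pa

    nested = pa.array([np.random.rand(1000) for _ in range(500)])
    table = pa.Table.from_arrays([nested], names=["my_arrays"])
    growth = measure_growth(table.to_pandas, iterations)
    print(f"traced growth after {iterations} calls: {growth / 1e6:.1f} MB")
    # Arrow's default memory pool is not visible to tracemalloc.
    print(f"Arrow pool allocated: {pa.total_allocated_bytes() / 1e6:.1f} MB")
```

Calling run_pyarrow_demo() under 0.14.1 and 0.15.0 should give comparable figures for the two versions. A growth value that scales with the iteration count would be consistent with the leak described here, while a value near zero would not.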



