[jira] [Commented] (ARROW-6874) [Python] Memory leak in Table.to_pandas() when nested columns are present

2019-10-16 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953065#comment-16953065 ]

Antoine Pitrou commented on ARROW-6874:
---------------------------------------

[~jorisvandenbossche] You were right. The attached PR is a bit more careful when
using Arrow to allocate NumPy data.
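
For readers following along, here is a minimal sketch of the contract the fix is concerned with, assuming the conversion allocates NumPy data from Arrow's default memory pool: a NumPy array viewing Arrow-allocated memory has to keep the underlying buffer alive, and the pool should report the bytes as released once all references are gone. This is only an illustration with made-up data, not the change in the PR.

{code:python}
# A minimal sketch (not the actual patch) of the lifetime/accounting contract
# involved when NumPy data views Arrow-allocated memory.
import numpy as np
import pyarrow as pa

# Building from a Python list forces the values buffer to be allocated from
# Arrow's default memory pool.
arr = pa.array([float(i) for i in range(500 * 1000)])
values_buf = arr.buffers()[1]                        # Arrow-allocated values buffer
view = np.frombuffer(values_buf, dtype=np.float64)   # zero-copy NumPy view; keeps
                                                     # a reference to the buffer

print(pa.total_allocated_bytes())  # bytes currently held by the default pool

# Once every reference (Arrow array, buffer, NumPy view) is dropped, the pool
# should report the memory as released; if it does not, something is leaking.
del arr, values_buf, view
print(pa.total_allocated_bytes())
{code}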

> [Python] Memory leak in Table.to_pandas() when nested columns are present
> --------------------------------------------------------------------------
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I upgraded from pyarrow 0.14.1 to 0.15.0, and during some testing my Python
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which
> appears to have a memory leak in the latest version. See the details below
> to reproduce the issue.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
>
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 1
> # pyarrow v0.14.1: memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
>
> # when the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6874) [Python] Memory leak in Table.to_pandas() when nested columns are present

2019-10-14 Thread Joris Van den Bossche (Jira)


[ https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951314#comment-16951314 ]

Joris Van den Bossche commented on ARROW-6874:
----------------------------------------------

This seems to be caused by ARROW-6570
(https://github.com/apache/arrow/commit/19545f878d17f99a07e51e818eddc8c77f38f56b).
The problem comes up for object-dtype arrays (so for list, struct, and string
dtypes).
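
One hedged way to check that observation is to watch the Arrow memory pool while converting a column that yields an object-dtype result, e.g. a string column; the column contents and iteration count below are made up for illustration. On an affected 0.15.0 build the pool usage would be expected to keep growing across iterations, while on a fixed build it should stay roughly flat.

{code:python}
# Illustrative check (hypothetical data): object-dtype results such as string
# columns should show the same growth pattern as the nested-list reproducer,
# while primitive columns should not.
import pyarrow as pa

string_table = pa.Table.from_arrays(
    [pa.array(['x' * 100 for _ in range(50000)])], names=['strings'])

before = pa.total_allocated_bytes()
for _ in range(100):
    df = string_table.to_pandas()
print(pa.total_allocated_bytes() - before)  # expected to stay near zero once fixed
{code}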

> [Python] Memory leak in Table.to_pandas() when nested columns are present
> --------------------------------------------------------------------------
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
>
> I upgraded from pyarrow 0.14.1 to 0.15.0, and during some testing my Python
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which
> appears to have a memory leak in the latest version. See the details below
> to reproduce the issue.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list<item: double>)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
>
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 1
> # pyarrow v0.14.1: memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
>
> # when the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)