[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-11-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968683#comment-16968683
 ] 

Wes McKinney commented on ARROW-6910:
-

The place to start will be twiddling with the jemalloc conf settings here:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L48

> [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not 
> released until program exits
> --
>
> Key: ARROW-6910
> URL: https://issues.apache.org/jira/browse/ARROW-6910
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.15.0
> Reporter: V Luong
> Assignee: Wes McKinney
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
> Attachments: arrow6910.png
>
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> I notice that when I read a lot of Parquet files using 
> pyarrow.parquet.read_table(...), my program's memory usage becomes very 
> bloated, even though I don't keep the table objects after converting them to 
> Pandas DataFrames.
> You can try this in an interactive Python shell to reproduce the problem:
> ```{python}
> from tqdm import tqdm
> from pyarrow.parquet import read_table
>
> PATH = '/tmp/big.snappy.parquet'
>
> for _ in tqdm(range(10)):
>     read_table(PATH, use_threads=False, memory_map=False)
> ```
> (Note that I'm not assigning the read_table(...) result to anything, so I'm not 
> creating any new objects at all.)
> During the for loop above, if you watch the memory usage (e.g. with the htop 
> program), you'll see that it keeps creeping up. Either the program crashes 
> during the 10 iterations, or, if the 10 iterations complete, the program still 
> occupies a huge amount of memory even though no objects are kept. That memory 
> is only released when you exit() from Python.
> This problem means that my compute jobs using PyArrow currently need bigger 
> server instances than should be necessary, which translates to significant 
> extra cost.





[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-11-06 Thread V Luong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968675#comment-16968675
 ] 

V Luong commented on ARROW-6910:


OK [~wesm], let me create a new JIRA ticket for 0.15.1.



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-11-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968670#comment-16968670
 ] 

Wes McKinney commented on ARROW-6910:
-

If you can open a new JIRA for further investigation, that would be helpful. The 
original issue you reported is no longer present, as chronicled in the linked 
pull request.



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-11-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968667#comment-16968667
 ] 

Wes McKinney commented on ARROW-6910:
-

What platform are you on? It's possible that background thread reclamation is 
not enabled in your build.

Adding {{import gc; gc.collect()}} to your scripts may not be a bad idea.
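
A minimal sketch of that suggestion (illustrative only; it assumes the same local 
test file as in the description, and gc.collect() only drops Python-level 
references, so memory retained inside the allocator itself is not necessarily 
returned to the OS):

{code:python}
import gc

from pyarrow.parquet import read_table

PATH = '/tmp/big.snappy.parquet'  # hypothetical local test file

for _ in range(10):
    read_table(PATH, use_threads=False, memory_map=False)
    gc.collect()  # drop any lingering Python-level references after each read
{code}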



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-11-06 Thread V Luong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968654#comment-16968654
 ] 

V Luong commented on ARROW-6910:


[~apitrou] [~wesm] I'm re-testing this issue using the newly-released 0.15.1, 
with the following code, in an interactive Python 3.7 shell:

{code:python}
from pyarrow.parquet import read_table
import os
from tqdm import tqdm


PARQUET_S3_PATH = 's3://public-parquet-test-data/big.snappy.parquet'
PARQUET_HTTP_PATH = 'http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet'
PARQUET_TMP_PATH = '/tmp/big.snappy.parquet'


os.system('wget --output-document={} {}'.format(PARQUET_TMP_PATH, PARQUET_HTTP_PATH))


for _ in tqdm(range(10)):
    read_table(
        source=PARQUET_TMP_PATH,
        columns=None,
        use_threads=False,
        metadata=None,
        use_pandas_metadata=False,
        memory_map=False,
        filesystem=None,
        filters=None)
{code}

I observe the following mysterious behavior:
- If I don't do anything after the above loop, the program still occupies 8-10 GB 
of memory and does not release it. I kept it in this idle state for a good 10-15 
minutes and confirmed that the memory was still occupied.
- Then, when I do something trivial, like "import pyarrow; 
print(pyarrow.__version__)" in the interactive shell, the memory is immediately 
released.

This behavior remains unintuitive to me, and it seems users still don't have firm 
control over the memory used by PyArrow. As of 0.15.1, each read_table(...) call 
still does not appear to be memory-neutral by default. This means that 
long-running iterative programs, especially ML training jobs that repeatedly load 
these files, will inevitably run out of memory.
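
One way to narrow down where the memory sits is to compare pyarrow's own pool 
accounting with the process RSS: if pyarrow.total_allocated_bytes() drops back to 
near zero after each iteration while RSS stays high, the memory is being retained 
by the allocator rather than by live Arrow buffers. A rough sketch (the file path 
and the psutil dependency are assumptions for illustration):

{code:python}
import psutil  # third-party; used only to sample the process RSS
import pyarrow as pa
from pyarrow.parquet import read_table

PATH = '/tmp/big.snappy.parquet'  # hypothetical local test file

for i in range(10):
    read_table(PATH, use_threads=False, memory_map=False)
    rss_mib = psutil.Process().memory_info().rss / 2**20
    pool_mib = pa.total_allocated_bytes() / 2**20
    print('iteration {}: rss={:.0f} MiB, arrow pool={:.0f} MiB'.format(i, rss_mib, pool_mib))
{code}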



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-19 Thread V Luong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955242#comment-16955242
 ] 

V Luong commented on ARROW-6910:


Great, thank you a great deal [~wesm]!



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-19 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955206#comment-16955206
 ] 

Wes McKinney commented on ARROW-6910:
-

I can confirm that setting the jemalloc "dirty_decay_ms" option to 0 causes 
memory to be released to the OS right away. This is likely to reduce application 
performance, but it may make sense to make this the default while allowing it to 
be configured at runtime. I'm working on a patch.

See http://jemalloc.net/jemalloc.3.html
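
For reference, later pyarrow releases expose a runtime setter for this decay 
behavior; assuming a jemalloc-enabled build and a version that ships the function, 
lowering the decay time from Python might look like this:

{code:python}
import pyarrow as pa

# Ask jemalloc to return dirty pages to the OS immediately instead of keeping
# them around for reuse; this trades some allocation speed for a lower RSS.
pa.jemalloc_set_decay_ms(0)
{code}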



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955038#comment-16955038
 ] 

Wes McKinney commented on ARROW-6910:
-

I can access it. I'll try to have a closer look in the next couple of days to 
see if I can determine what is going on. 



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread V Luong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954147#comment-16954147
 ] 

V Luong commented on ARROW-6910:


Using the code above, after just 10 iterations of reading the file with a single 
thread, the program has grown to occupy 15-18 GB of memory and does not release 
it.



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread V Luong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954134#comment-16954134
 ] 

V Luong commented on ARROW-6910:


[~wesm] [~jorisvandenbossche] [~apitrou] can you try "wget 
http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet" now? I'll 
also edit the code in the description to reproduce the problem.



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread V Luong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954029#comment-16954029
 ] 

V Luong commented on ARROW-6910:


OK, let me check again on another machine, [~wesm], and let you know.



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954026#comment-16954026
 ] 

Wes McKinney commented on ARROW-6910:
-

{code}
$ aws s3 cp s3://public-parquet-test-data/big.parquet . --recursive
fatal error: Unable to locate credentials
{code}

Not a regular S3 user, sorry



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread V Luong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954025#comment-16954025
 ] 

V Luong commented on ARROW-6910:


[~wesm] could you try "aws s3 sync s3://public-parquet-test-data/big.parquet ..." 
or "aws s3 cp s3://public-parquet-test-data/big.parquet ... --recursive" in your 
terminal? I have not used wget with S3 data before, so I am not sure how to make 
it work.



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954022#comment-16954022
 ] 

Wes McKinney commented on ARROW-6910:
-

Can you give me an HTTPS link to download that file? I tried {{wget 
https://public-parquet-test-data.s3.amazonaws.com/big.parquet}} and it didn't 
work



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread V Luong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954021#comment-16954021
 ] 

V Luong commented on ARROW-6910:


[~wesm] [~jorisvandenbossche] I've made a Parquet data set available at 
s3://public-parquet-test-data/big.parquet for testing. It's only moderately big. 
I repeatedly load various files thousands of times during iterative model-training 
jobs that last for days. In 0.14.1 my long-running jobs succeeded, but in 0.15.0 
the same jobs crashed after 30 minutes to an hour. My inspection, as shared above, 
indicates that memory usage increases with the number of times read_table(...) is 
called and is not released, so long-running jobs inevitably die.



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954005#comment-16954005
 ] 

Wes McKinney commented on ARROW-6910:
-

I don't think this is a bug. I wrote a script that creates and reads a ~1 GB file 
in a loop and watches the process's RSS. Here is a plot of what the RSS looks 
like over the course of 100 iterations:

 !arrow6910.png! 

This suggests the behavior is related to the internals of our allocator (jemalloc 
here), which retains unused heap memory to speed up future in-process allocations 
rather than releasing it to the operating system. I'm not an expert on these 
kinds of system matters; [~apitrou] or others would know more.
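
The script itself isn't attached, but a rough reconstruction of that kind of 
measurement could look like the following (file size, schema, and the psutil 
dependency are assumptions made here for illustration):

{code:python}
import numpy as np
import pandas as pd
import psutil  # third-party; used to sample the process RSS
import pyarrow as pa
import pyarrow.parquet as pq

PATH = '/tmp/rss_test.parquet'  # hypothetical scratch file

# Write a Parquet file on the order of 1 GB (10M rows x 16 float64 columns).
df = pd.DataFrame(np.random.randn(10_000_000, 16),
                  columns=['c{}'.format(i) for i in range(16)])
pq.write_table(pa.Table.from_pandas(df), PATH)

# Read it back repeatedly and record the resident set size after each read.
rss_mib = []
for _ in range(100):
    pq.read_table(PATH, use_threads=False, memory_map=False)
    rss_mib.append(psutil.Process().memory_info().rss / 2**20)

print(rss_mib)
{code}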



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953751#comment-16953751
 ] 

Joris Van den Bossche commented on ARROW-6910:
--

[~MBALearnsToCode] If it might not be a duplicate, could you try to provide a 
reproducible example? 



[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-17 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953748#comment-16953748
 ] 

Wes McKinney commented on ARROW-6910:
-

I see. Is there something you can do to make the issue more reproducible, like 
one or more example files? 

> [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not 
> released until program exits
> --
>
> Key: ARROW-6910
> URL: https://issues.apache.org/jira/browse/ARROW-6910
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.15.0
> Reporter: V Luong
> Priority: Critical
> Fix For: 1.0.0, 0.15.1
>
>
> I notice that when I read a lot of Parquet files using 
> pyarrow.parquet.read_table(...), my program's memory usage becomes very 
> bloated, even though I don't keep the table objects after converting them to 
> Pandas DataFrames.
> You can try this in an interactive Python shell to reproduce the problem:
> ```{python}
> from pyarrow.parquet import read_table
>
> for path in paths_of_a_bunch_of_big_parquet_files:
>     read_table(path, use_threads=True, memory_map=False)
> ```
> (Note that I'm not assigning the read_table(...) result to anything, so I'm not 
> creating any new objects at all.)
> After the for loop above, if you view the memory usage (e.g. with the htop 
> program), you'll see that the Python program has taken up a lot of memory. That 
> memory is only released when you exit() from Python.
> This problem means that my compute jobs using PyArrow currently need bigger 
> server instances than should be necessary, which translates to significant 
> extra cost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)