[jira] [Commented] (ARROW-6910) [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968683#comment-16968683 ]

Wes McKinney commented on ARROW-6910:

The place to start will be twiddling the jemalloc conf settings here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L48

> [Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6910
>                 URL: https://issues.apache.org/jira/browse/ARROW-6910
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.0
>            Reporter: V Luong
>            Assignee: Wes McKinney
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.15.1
>         Attachments: arrow6910.png
>
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> I realize that when I read a lot of Parquet files using pyarrow.parquet.read_table(...), my program's memory usage becomes very bloated, although I don't keep the table objects after converting them to Pandas DFs.
>
> You can try this in an interactive Python shell to reproduce this problem:
>
> ```python
> from tqdm import tqdm
> from pyarrow.parquet import read_table
>
> PATH = '/tmp/big.snappy.parquet'
> for _ in tqdm(range(10)):
>     read_table(PATH, use_threads=False, memory_map=False)
> ```
>
> (Note that I'm not assigning the read_table(...) result to anything, so I'm not creating any new objects at all.)
>
> During the for loop above, if you watch the memory usage (e.g. with htop), you'll see that it keeps creeping up. Either the program crashes during the 10 iterations, or, if the 10 iterations complete, the program will still occupy a huge amount of memory although no objects are kept. That memory is only released when you exit() from Python.
>
> This problem means that my compute jobs using PyArrow currently need to use bigger server instances than I think is necessary, which translates to significant extra cost.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968675#comment-16968675 ]

V Luong commented on ARROW-6910:

OK [~wesm], let me create a new JIRA ticket for 0.15.1.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968670#comment-16968670 ]

Wes McKinney commented on ARROW-6910:

If you can open a new JIRA for further investigation, that would be helpful. The original issue you reported is no longer present, as chronicled in the linked pull request.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968667#comment-16968667 ]

Wes McKinney commented on ARROW-6910:

What platform are you on? It's possible that background thread reclamation is not enabled in your build.

Adding {{import gc; gc.collect()}} to your scripts may not be a bad idea.
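The {{gc.collect()}} suggestion matters because objects kept alive only by reference cycles are not reclaimed by reference counting, so any native buffers they own stay allocated until the cyclic collector runs. A stdlib-only sketch of forcing a collection (the Node class is illustrative, not part of pyarrow):

```python
import gc

gc.disable()  # make collection timing explicit for this demo

class Node:
    """Illustrative object that can participate in a reference cycle."""
    pass

a, b = Node(), Node()
a.other, b.other = b, a   # cycle: reference counting alone cannot reclaim these
del a, b                  # unreachable, but still alive until the collector runs

freed = gc.collect()      # force a full collection pass
print('cyclic objects reclaimed:', freed)
gc.enable()
```

In a long-running loop over read_table(...) results, an explicit gc.collect() between iterations makes the release of such cycles, and the native memory they pin, deterministic.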
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968654#comment-16968654 ]

V Luong commented on ARROW-6910:

[~apitrou] [~wesm] I'm re-testing this issue using the newly released 0.15.1, with the following code, in an interactive Python 3.7 shell:

```python
import os

from tqdm import tqdm
from pyarrow.parquet import read_table

PARQUET_S3_PATH = 's3://public-parquet-test-data/big.snappy.parquet'
PARQUET_HTTP_PATH = 'http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet'
PARQUET_TMP_PATH = '/tmp/big.snappy.parquet'

os.system('wget --output-document={} {}'.format(PARQUET_TMP_PATH, PARQUET_HTTP_PATH))

for _ in tqdm(range(10)):
    read_table(
        source=PARQUET_TMP_PATH,
        columns=None,
        use_threads=False,
        metadata=None,
        use_pandas_metadata=False,
        memory_map=False,
        filesystem=None,
        filters=None)
```

I observe the following mysterious behavior:

- If I don't do anything after the above for loop, the program still occupies 8-10 GB of memory and does not release it. I keep it in this idle state for a good 10-15 minutes and confirm that the memory is still occupied.
- Then, if I do something random, like "import pyarrow; print(pyarrow.__version__)" in the interactive shell, the memory is immediately released.

This behavior remains unintuitive to me, and it seems users still don't have firm control over the memory used by PyArrow. Each read_table(...) call does not yet seem memory-neutral by default as of 0.15.1. This means long-running iterative programs, especially ML training jobs that repeatedly load these files, will inevitably OOM.
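One way to check whether the retained gigabytes are Python objects or native Arrow buffers is to compare the Python-level heap against the process's total footprint: tracemalloc only sees allocations made through Python's allocator, so if it reports megabytes while htop reports gigabytes, the difference is held by the native allocator. A stdlib sketch (the list of bytes objects is a stand-in for ordinary Python work):

```python
import tracemalloc

tracemalloc.start()

# Simulate some ordinary Python-level allocations (~1 MB total).
data = [bytes(1024) for _ in range(1024)]

current, peak = tracemalloc.get_traced_memory()
print('python-level heap: current=%d bytes, peak=%d bytes' % (current, peak))

# If the interpreter's RSS (as seen in htop) is far above `current`,
# the gap lives in native allocations (e.g. Arrow's memory pool),
# which tracemalloc cannot observe.
tracemalloc.stop()
```

In the scenario above, one would expect tracemalloc to stay small across the read_table(...) loop while RSS balloons, pointing at the native pool rather than a Python-side leak.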
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955242#comment-16955242 ]

V Luong commented on ARROW-6910:

Great, thank you a great deal [~wesm]!
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955206#comment-16955206 ]

Wes McKinney commented on ARROW-6910:

I can confirm that setting the "dirty_decay_ms" jemalloc option to 0 causes memory to be released to the OS right away. This is likely to reduce application performance, but it may make sense to make it the default option while allowing it to be configured at runtime. I'm working on a patch; see http://jemalloc.net/jemalloc.3.html
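For a stock (system) jemalloc, the same decay options can be set at process start through the MALLOC_CONF environment variable. This is only an illustrative sketch: Arrow statically bundles jemalloc under its own symbol prefix, so this variable may not reach Arrow's copy, and the build-time/runtime configuration in memory_pool.cc is the reliable path; "my_script.py" is a hypothetical program name.

```shell
# Stock jemalloc: return dirty (and muzzy) pages to the OS immediately,
# at the cost of reusing fewer cached pages for future allocations.
MALLOC_CONF="dirty_decay_ms:0,muzzy_decay_ms:0" python my_script.py
```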
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955038#comment-16955038 ]

Wes McKinney commented on ARROW-6910:

I can access it. I'll try to have a closer look in the next couple of days to see if I can determine what is going on.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954147#comment-16954147 ]

V Luong commented on ARROW-6910:

Using the code above, after just 10 iterations of reading the file with 1 thread, the program has grown to occupy 15-18 GB of memory and does not release it.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954134#comment-16954134 ]

V Luong commented on ARROW-6910:

[~wesm] [~jorisvandenbossche] [~apitrou] can you try "wget http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet" now? I'll also edit the code in the description to reproduce the problem.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954029#comment-16954029 ]

V Luong commented on ARROW-6910:

OK, let me check again on another machine, [~wesm], and let you know.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954026#comment-16954026 ]

Wes McKinney commented on ARROW-6910:

{code}
$ aws s3 cp s3://public-parquet-test-data/big.parquet . --recursive
fatal error: Unable to locate credentials
{code}

Not a regular S3 user, sorry.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954025#comment-16954025 ]

V Luong commented on ARROW-6910:

[~wesm] could you try "aws s3 sync s3://public-parquet-test-data/big.parquet ..." or "aws s3 cp s3://public-parquet-test-data/big.parquet ... --recursive" in your terminal? I have not used wget with S3 data before, so I'm not sure how to make it work.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954022#comment-16954022 ]

Wes McKinney commented on ARROW-6910:

Can you give me an HTTPS link to download that file? I tried {{wget https://public-parquet-test-data.s3.amazonaws.com/big.parquet}} and it didn't work.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954021#comment-16954021 ]

V Luong commented on ARROW-6910:

[~wesm] [~jorisvandenbossche] I've made a Parquet data set available at s3://public-parquet-test-data/big.parquet for testing. It's only moderately big; I repeatedly load various files thousands of times during iterative model-training jobs that last for days. In 0.14.1 my long-running jobs succeeded, but in 0.15.0 the same jobs crashed after 30 minutes to an hour. My inspection, as shared above, indicates that memory usage increases with the number of read_table(...) calls and is not released, so long-running jobs inevitably die.
> This problem means that my compute jobs using PyArrow currently need to use
> bigger server instances than I think is necessary, which translates to
> significant extra cost.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954005#comment-16954005 ] Wes McKinney commented on ARROW-6910:

I don't think this is a bug. I wrote a script that makes and reads a ~1+ GB file in a loop and looked at the process's RSS. Here is a plot of what the RSS looks like over the course of 100 iterations:

!arrow6910.png!

This suggests the behavior is internal to our allocator (jemalloc here), which retains unused heap memory to speed up future in-process allocations rather than releasing it to the operating system. I'm not an expert on these kinds of system matters; [~apitrou] or others would know more.
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953751#comment-16953751 ] Joris Van den Bossche commented on ARROW-6910:

[~MBALearnsToCode] If it might not be a duplicate, could you try to provide a reproducible example?
[ https://issues.apache.org/jira/browse/ARROW-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953748#comment-16953748 ] Wes McKinney commented on ARROW-6910:

I see. Is there something you can do to make the issue more reproducible, like one or more example files?