[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2022-01-21 Thread Gregory P. Smith


Gregory P. Smith added the comment:

fyi - this appears to have caused https://bugs.python.org/issue46464

--
nosy: +gregory.p.smith

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-04-19 Thread miss-islington


miss-islington added the comment:


New changeset 1ac6e379297cc1cf8acf6c1b011fccc7b3da2cbe by Kyle Stanley in 
branch 'master':
bpo-39207: Spawn workers on demand in ProcessPoolExecutor (GH-19453)
https://github.com/python/cpython/commit/1ac6e379297cc1cf8acf6c1b011fccc7b3da2cbe


--
nosy: +miss-islington

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-04-19 Thread Antoine Pitrou


Change by Antoine Pitrou:


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-04-09 Thread Kyle Stanley


Change by Kyle Stanley:


--
keywords: +patch
pull_requests: +18807
stage: needs patch -> patch review
pull_request: https://github.com/python/cpython/pull/19453

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-03-29 Thread Kyle Stanley


Kyle Stanley added the comment:

So, I think a potential approach to this issue with ProcessPoolExecutor would 
be to make it a bit more similar to ThreadPoolExecutor: instead of spawning 
*max_workers* processes on startup (see `_adjust_process_count()`), spawn each 
new process in submit() (see `_adjust_thread_count()`), and only do so when 
there are no idle processes (and, of course, only while the current count is 
below max_workers).
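
A minimal sketch of that idea (illustrative only; names like 
`_idle_worker_semaphore` and `_spawn_process` are assumptions here, although 
the change eventually merged in GH-19453 is similar in spirit):

```
# Hypothetical on-demand version of _adjust_process_count(), called
# from submit(). Attribute names are illustrative, not CPython's.
def _adjust_process_count(self):
    # An idle worker is available; reuse it instead of spawning.
    if self._idle_worker_semaphore.acquire(blocking=False):
        return
    # No idle workers: spawn a new one, but never exceed max_workers.
    if len(self._processes) < self._max_workers:
        self._spawn_process()
```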

This wouldn't reap the idle processes, but I think re-using them is a behavior 
we want to retain. It seems like it would be quite costly to constantly close 
and start new processes each time a work item is completed.

From my perspective, the main issue is that the processes are being spawned 
all at once instead of being spawned as needed. This can result in a 
substantial amount of extra cumulative idle time throughout the lifespan of 
the ProcessPoolExecutor.

I should have some time to work on this in the next month or sooner, so I'll 
assign myself to this issue.

(I also changed the version to just Python 3.9, as this seems like too 
significant a behavioral change to backport to 3.8. Let me know if anyone 
disagrees.)

--
assignee:  -> aeros
stage:  -> needs patch
versions: +Python 3.9 -Python 3.8

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-01-14 Thread Kyle Stanley


Kyle Stanley added the comment:

> It would certainly be better to start the worker processes on demand. It  
> probably also requires careful thought about how to detect that more workers 
> are required.

Alright. In that case, I'll do some experimentation when I get the chance and 
see if I can come up with an effective way to spawn the worker processes as 
needed, without incurring a significant performance penalty.  

Note: I have a few other projects and obligations in the meantime, so I'm not 
100% certain when I'll have the time to work on this. If anyone else is 
interested in working on this as well, certainly feel free to do so.

--

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-01-14 Thread Antoine Pitrou


Antoine Pitrou added the comment:

It would certainly be better to start the worker processes on demand. It  
probably also requires careful thought about how to detect that more workers 
are required.

--

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-01-14 Thread Kyle Stanley


Kyle Stanley added the comment:

> I think this should be fixed like ThreadPoolExecutor.

Are there any downsides or complications to consider in changing this 
behavior for ProcessPoolExecutor, such as what I mentioned above? From my 
understanding, there would be a performance penalty associated with spawning 
the processes on demand, as opposed to the current behavior of spawning 
*max_workers* processes at the same time and using each of them as needed.
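
As a rough illustration of that cost, one can time raw process startup (a 
minimal sketch; absolute numbers vary widely by platform and by the 
multiprocessing start method, fork vs. spawn):

```
import multiprocessing as mp
import time

def noop():
    pass

if __name__ == "__main__":
    start = time.perf_counter()
    # Start and join 10 trivial worker processes to gauge spawn overhead.
    procs = [mp.Process(target=noop) for _ in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    elapsed = time.perf_counter() - start
    print(f"started and joined 10 processes in {elapsed:.3f}s")
```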

Also, I think it's worth considering the following: do the same arguments for 
changing the behavior for ThreadPoolExecutor also apply to ProcessPoolExecutor? 
Although they share the same executor API, they have rather different use cases.

That being said, if it's decided that we do want to make this behavior 
consistent with ThreadPoolExecutor, I would be willing to look into 
implementing it. I just want to make sure that it's carefully considered first.

--

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-01-14 Thread Inada Naoki


Inada Naoki added the comment:

Uh, my earlier statement ("ProcessPoolExecutor has started worker processes 
on demand from early on") was wrong.

I think this should be fixed like ThreadPoolExecutor.

--

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-01-10 Thread Kyle Stanley


Kyle Stanley added the comment:

> What does "ignores the max_workers argument" mean?

From my understanding, their argument was that the parameter name 
"max_workers" and the documentation imply that processes will be spawned as 
needed, up to *max_workers*, based on the number of jobs scheduled.

> And could you create a simple reproducible example?

I can't speak directly for the OP, but this simple example may demonstrate what 
they're talking about:

Linux 5.4.8
Python 3.8.1

```
import concurrent.futures as cf
import os
import random

def get_rand_nums(ls, n):
    return [random.randint(1, 100) for i in range(n)]

def show_processes():
    print("All python processes:")
    os.system("ps -C python")

def main():
    nums = []
    with cf.ProcessPoolExecutor(max_workers=6) as executor:
        futs = []
        show_processes()
        for _ in range(3):
            fut = executor.submit(get_rand_nums, nums, 10_000_000)
            futs.append(fut)
        show_processes()
        for fut in cf.as_completed(futs):
            nums.extend(fut.result())
        show_processes()

    assert len(nums) == 30_000_000

if __name__ == '__main__':
    main()
```

Output:

```
[aeros:~/programming/python]$ python ppe_max_workers.py
All python processes:  # Main python process
    PID TTY          TIME CMD
  23683 pts/1    00:00:00 python
All python processes:  # Main python process + 6 unused subprocesses
    PID TTY          TIME CMD
  23683 pts/1    00:00:00 python
  23685 pts/1    00:00:00 python
  23686 pts/1    00:00:00 python
  23687 pts/1    00:00:00 python
  23688 pts/1    00:00:00 python
  23689 pts/1    00:00:00 python
  23690 pts/1    00:00:00 python
All python processes:  # Main python process + 3 used + 3 unused subprocesses
    PID TTY          TIME CMD
  23683 pts/1    00:00:00 python
  23685 pts/1    00:00:07 python
  23686 pts/1    00:00:07 python
  23687 pts/1    00:00:07 python
  23688 pts/1    00:00:00 python
  23689 pts/1    00:00:00 python
  23690 pts/1    00:00:00 python
```

As seen above, all processes up to *max_workers* were spawned immediately after 
the jobs were submitted to ProcessPoolExecutor, regardless of the actual number 
of jobs (3). It is also apparent that only three of those spawned processes 
were utilized by the CPU, as indicated by the values in the TIME field. The 
other three processes were not used.

If it weren't for this behavior, I think there would be a significant 
performance loss, as the executor would have to continuously calculate how 
many processes are needed and spawn them throughout its lifespan. AFAIK, it 
_seems_ more efficient to spawn *max_workers* processes when the jobs are 
scheduled and then use them as needed, rather than spawning the processes as 
needed.

As a result, I think the current behavior should remain the same; unless 
someone can come up with a backwards-compatible alternative version and 
demonstrate its advantages over the current one.

However, I do think the current documentation could do a better job of 
explaining how max_workers actually behaves. See the current explanation: 
https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor.

The current version does not address any of the above points. In fact, the 
first line seems like it might imply the opposite of what it's actually doing 
(at least based on my above example):

"An Executor subclass that executes calls asynchronously *using a pool of at 
most max_workers processes*." (asterisks added for emphasis)

"using a pool of at most max_workers processes" could imply to users that 
*max_workers* sets an upper bound limit on the number of processes in the pool, 
but that *max_workers* is only reached if all of those processes are _needed_. 
Unless I'm misunderstanding something, that's not the case.

I would suggest converting this into a documentation issue, assuming the 
experts for concurrent.futures confirm that the present behavior is 
intentional and that I'm correctly understanding the OP.

--
nosy: +aeros

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-01-10 Thread Terry J. Reedy


Change by Terry J. Reedy:


--
nosy: +bquinlan, pitrou

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-01-03 Thread Inada Naoki


Inada Naoki added the comment:

> ProcessPoolExecutor does not properly spin down and spin up new processes.

That is because a process "pool" exists to reuse processes. If you don't want 
to reuse processes, you can use Process directly:
https://docs.python.org/3/library/multiprocessing.html#the-process-class

Alternatively, you can create a ProcessPoolExecutor just before starting a 
batch of jobs and shut it down after the jobs complete.
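
For example, a minimal sketch of that second suggestion (the `run_batch` 
helper is illustrative, not a stdlib API):

```
from concurrent.futures import ProcessPoolExecutor

def run_batch(func, args_list, workers=4):
    # Scope the executor to one batch of jobs: every worker process is
    # joined (and its memory released) when the with-block exits.
    with ProcessPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(func, args_list))
    return results
```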


> ProcessPoolExecutor also spins up too many processes and ignores the 
> max_workers argument.  An example is my setting max_workers=10,
[snip]
> Instead, ProcessPoolExecutor spawns all 10 max_workers

What does "ignores the max_workers argument" mean?

And could you create a simple reproducible example?

The old ThreadPoolExecutor had this behavior (#24882), but ProcessPoolExecutor 
has started worker processes on demand from early on.

--
nosy: +inada.naoki

[issue39207] concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers

2020-01-03 Thread Yusef Shaban


New submission from Yusef Shaban:

This came up from a supporting library, but the actual issue is within 
concurrent.futures.ProcessPoolExecutor.

Discussion can be found at https://github.com/agronholm/apscheduler/issues/414

ProcessPoolExecutor does not properly spin down and spin up new processes. 
Instead, it simply re-claims existing processes and re-purposes them for new 
jobs. Is there no option or way to make it so that, instead of re-claiming 
existing processes, it spins down a process and then spins up another one? 
That behavior would be a lot better for garbage collection and would help to 
prevent memory leaks.

ProcessPoolExecutor also spins up too many processes and ignores the 
max_workers argument. An example is my setting max_workers=10 while only 
utilizing 3 processes. One would expect, given the documentation, that I 
would have at most 4 processes: the main process and the 3 worker processes. 
Instead, ProcessPoolExecutor spawns all 10 max_workers and lets the other 7 
just sit there, even though they are not necessary.

--
components: Library (Lib)
messages: 359260
nosy: yus2047889
priority: normal
severity: normal
status: open
title: concurrent.futures.ProcessPoolExecutor does not properly reap jobs and 
spawns too many workers
type: behavior
versions: Python 3.8
