[Python-Dev] Re: Latest PEP 554 updates.

2020-05-05 Thread Eric Snow
On Mon, May 4, 2020 at 11:30 AM Eric Snow wrote:
> Further feedback is welcome, though I feel like the PR is ready (or
> very close to ready) for pronouncement.  Thanks again to all.

FYI, after consulting with the steering council I've decided to change
the target release to 3.10, when we expect to have the per-interpreter
GIL landed.  That will help maximize the impact of the module and
avoid any confusion.  I'm still undecided on releasing a 3.9-only
module on PyPI.  If I do, it will only be for folks to try it out
early, and I probably won't advertise it much.

-eric


[Python-Dev] PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-05 Thread Victor Stinner
Hi,

I wrote a "per-interpreter GIL" proof-of-concept: each interpreter
gets its own GIL. I chose to benchmark a factorial function in pure
Python to simulate a CPU-bound workload, and I wrote the simplest
possible function just to be able to run a benchmark and check whether
PEP 554 would be relevant.

The proof-of-concept shows that subinterpreters can make a CPU-bound
workload faster than sequential execution or threads, and that they
run at the same speed as multiprocessing. The performance scales well
with the number of CPUs.


Performance
===========

Factorial:

n = 50_000
fact = 1
for i in range(1, n + 1):
    fact = fact * i

2 CPUs:

Sequential: 1.00 sec +- 0.01 sec
Threads: 1.08 sec +- 0.01 sec
Multiprocessing: 529 ms +- 6 ms
Subinterpreters: 553 ms +- 6 ms

4 CPUs:

Sequential: 1.99 sec +- 0.01 sec
Threads: 3.15 sec +- 0.97 sec
Multiprocessing: 560 ms +- 12 ms
Subinterpreters: 583 ms +- 7 ms

8 CPUs:

Sequential: 4.01 sec +- 0.02 sec
Threads: 9.91 sec +- 0.54 sec
Multiprocessing: 1.02 sec +- 0.01 sec
Subinterpreters: 1.10 sec +- 0.00 sec

Benchmarks were run on my laptop, which has 8 logical CPUs (4 physical
CPU cores with Hyper-Threading).

Threads are between 1.1x (2 CPUs) and 2.5x (8 CPUs) SLOWER than
sequential execution.

Subinterpreters are between 1.8x (2 CPUs) and 3.6x (8 CPUs) FASTER
than sequential execution.

Subinterpreters and multiprocessing have basically the same speed on
this benchmark.

See demo-pyperf.py attached to https://bugs.python.org/issue40512 for
the code of the benchmark.
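
For reference, here is a minimal sketch of the subinterpreter variant
(this is not the attached demo-pyperf.py, which uses pyperf; it assumes
the experimental _xxsubinterpreters module on a
--with-experimental-isolated-subinterpreters build)::

    import textwrap
    import threading
    import time
    import _xxsubinterpreters as interpreters

    WORKLOAD = textwrap.dedent("""
        n = 50_000
        fact = 1
        for i in range(1, n + 1):
            fact = fact * i
    """)

    def run_workload():
        # Each thread drives its own interpreter; with the PoC's
        # per-interpreter GIL, the interpreters run truly in parallel.
        interp = interpreters.create()
        try:
            interpreters.run_string(interp, WORKLOAD)
        finally:
            interpreters.destroy(interp)

    def bench(num_cpus):
        threads = [threading.Thread(target=run_workload)
                   for _ in range(num_cpus)]
        start = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.perf_counter() - start

    print(f"Subinterpreters, 8 CPUs: {bench(8):.2f} sec")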


Implementation
==============

See https://bugs.python.org/issue40512 and related issues for the
implementation. I have already merged some changes, but most of the
code is disabled by default: a new, special, undocumented
--with-experimental-isolated-subinterpreters build option is required
to test it.

To reproduce the benchmark, use::

    # in an up-to-date checkout of the Python master branch
    ./configure \
        --with-experimental-isolated-subinterpreters \
        --enable-optimizations \
        --with-lto
    make
    ./python demo-pyperf.py


Limits of subinterpreters design
================================

Subinterpreters have a few design limits:

* A Python object must not be shared between two interpreters; data
crosses the boundary by value, through channels (see the sketch below).
* Each interpreter has a minimum memory footprint, since the Python
internal state and imported modules are duplicated per interpreter.
* Others that I forgot :-)
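
A minimal sketch of passing data between interpreters, assuming the
undocumented _xxsubinterpreters names as they exist on master today
(including the int-like channel IDs); they may well change before
PEP 554 is final::

    import textwrap
    import _xxsubinterpreters as interpreters

    cid = interpreters.channel_create()
    interp = interpreters.create()

    # Only simple "shareable" values (e.g. bytes, str, int, None) can
    # be sent; data is copied by value, never shared by reference.
    interpreters.run_string(interp, textwrap.dedent(f"""
        import _xxsubinterpreters as interpreters
        interpreters.channel_send({int(cid)}, b'hello from a subinterpreter')
    """))

    print(interpreters.channel_recv(cid))  # b'hello from a subinterpreter'

    interpreters.destroy(interp)
    interpreters.channel_destroy(cid)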


Incomplete implementation
=========================

My proof-of-concept is just good enough to run the factorial code
above :-) Any other code is very likely to crash in various funny
ways.

I added a few "#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS" blocks
for the proof-of-concept. Most are temporary workarounds until the
relevant parts of the code, such as the tuple free list and interned
Unicode strings, are made compatible with subinterpreters.

Right now, some state is still shared between subinterpreters: the
None and True singletons, for example, but also statically allocated
types. Removing the remaining shared state should further improve
performance.

See https://bugs.python.org/issue40512 for the current status and a
list of tasks.

Most of these tasks are already tracked in Eric Snow's "Multi Core
Python" project:
https://github.com/ericsnowcurrently/multi-core-python/issues

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.


[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-05 Thread Brett Cannon
Just to be clear, this is executing the **same** workload in parallel,
**not** trying to parallelize factorial. E.g. the 8-CPU run calculates
50,000! eight separate times; it does not calculate 50,000! once by
spreading the work across 8 CPUs. This measurement still demonstrates
parallel execution, but now I'm really curious to see the first
measurement of how much faster a single calculation gets thanks to
sub-interpreters. :)

I also realize this is not optimized in any way, so being this close to 
multiprocessing already is very encouraging!
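
To make that concrete, here is an untested sketch of what such a
measurement could look like -- splitting a **single** factorial across
workers -- using multiprocessing, since that runs on a stock build:

    from functools import reduce
    from math import factorial, prod
    from multiprocessing import Pool

    def partial_product(bounds):
        lo, hi = bounds
        return prod(range(lo, hi))

    def parallel_factorial(n, workers=8):
        # Split 1..n into contiguous chunks, multiply each chunk in its
        # own process, then combine the partial products.
        step = n // workers
        bounds = [(i * step + 1, (i + 1) * step + 1) for i in range(workers)]
        bounds[-1] = (bounds[-1][0], n + 1)  # last chunk absorbs any remainder
        with Pool(workers) as pool:
            return reduce(lambda a, b: a * b,
                          pool.map(partial_product, bounds))

    if __name__ == "__main__":
        assert parallel_factorial(50_000) == factorial(50_000)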


[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-05 Thread Guido van Rossum
This sounds like a significant milestone!

Is there some kind of optimized communication possible yet between
subinterpreters? (Otherwise I still worry that it's no better than
subprocesses -- and it could be worse because when one subinterpreter
experiences a hard crash or runs out of memory, all others have to die with
it.)

On Tue, May 5, 2020 at 2:54 PM Victor Stinner wrote:

> [Victor's full message snipped; see it above in the thread]


-- 
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)



[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-05 Thread Joseph Jenne via Python-Dev
In the 8-CPU case, the performance of both the multiprocessing- and
subinterpreter-based runs drops by about half despite there being
enough logical CPUs, while the other cases scale quite well. Is there
some issue with Python multiprocessing/subinterpreters sharing the
same logical core?


On 5/5/20 2:46 PM, Victor Stinner wrote:

[Victor's full message snipped; see it above in the thread]



[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-05 Thread Nathaniel Smith
On Tue, May 5, 2020 at 3:47 PM Guido van Rossum wrote:
>
> This sounds like a significant milestone!
>
> Is there some kind of optimized communication possible yet between 
> subinterpreters? (Otherwise I still worry that it's no better than 
> subprocesses -- and it could be worse because when one subinterpreter 
> experiences a hard crash or runs out of memory, all others have to die with 
> it.)

As far as I understand it, the subinterpreter folks have given up on
optimized passing of objects, and are only hoping to do optimized
(zero-copy) passing of raw memory buffers.

On my laptop, some rough measurements [1] suggest that simply piping
bytes between processes runs at ~2.8 gigabytes/second, and that
pickle/unpickle is ~10x slower than that. That suggests that once
subinterpreters are fully optimized, they might provide at most a ~10%
speedup over multiprocessing, for a program that does nothing except
pass pickled objects back and forth. Of course, any real program that
spawns parallel workers will presumably be designed so its workers
spend most of their time doing work on that data, not just passing it
back and forth. That makes even a 10% speedup unrealistic: in
real-world programs the gain will be much smaller.
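
A rough sketch of the kind of measurement I mean (a reconstruction,
not my actual script; the payload size and test object are arbitrary
choices):

    import pickle
    import time
    from multiprocessing import Pipe, Process

    CHUNK = b"x" * (1 << 20)   # 1 MiB payload
    N_CHUNKS = 1024            # ~1 GiB moved in total

    def _writer(conn):
        for _ in range(N_CHUNKS):
            conn.send_bytes(CHUNK)
        conn.close()

    def pipe_gb_per_sec():
        recv_conn, send_conn = Pipe(duplex=False)
        proc = Process(target=_writer, args=(send_conn,))
        proc.start()
        send_conn.close()  # keep only the child's copy of the write end
        start = time.perf_counter()
        for _ in range(N_CHUNKS):
            recv_conn.recv_bytes()
        elapsed = time.perf_counter() - start
        proc.join()
        return N_CHUNKS * len(CHUNK) / elapsed / 1e9

    def pickle_gb_per_sec(obj, reps=20):
        payload = pickle.dumps(obj)
        start = time.perf_counter()
        for _ in range(reps):
            pickle.loads(pickle.dumps(obj))
        elapsed = time.perf_counter() - start
        return reps * len(payload) / elapsed / 1e9

    if __name__ == "__main__":
        print(f"raw pipe: {pipe_gb_per_sec():.2f} GB/s")
        print(f"pickle round-trip: "
              f"{pickle_gb_per_sec(list(range(1_000_000))):.2f} GB/s")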

So IIUC, subinterpreter communication is currently about the same
speed as multiprocessing communication, and the plan is to keep it
that way.

-n

[1] Of course there are a lot of assumptions in my quick
back-of-the-envelope calculation: pickle speed depends on the details
of the objects being pickled, there are other serialization formats,
there are other IPC methods that might be faster but are more
complicated (shared memory), the stdlib 'multiprocessing' library
might not be as good as it could be (the above measurements are for an
ideal multiprocessing library, I haven't tested the one we currently
have in the stdlib), etc. So maybe there's some situation where
subinterpreters look better. But I've been pointing out this issue to
Eric et al for years and they haven't disputed it, so I guess they
haven't found one yet.

-- 
Nathaniel J. Smith -- https://vorpus.org