[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-18 Thread Nick Coghlan
On Mon., 18 Nov. 2019, 8:19 am Nathaniel Smith,  wrote:

>
> > - Eventually make it easier for embedding applications to control which
> Python code runs in which thread state by moving the thread state
> activation dance out of the application and into the CPython shared library
>
> That seems like a good goal, but I don't understand how it's related
> to passing threadstate explicitly as a function argument. If the plan
> is to move towards passing threadstates both implicitly AND explicitly
> everywhere, that seems like it would make things more error-prone, not
> less, because the two states could get out of sync. Could you
> elaborate?
>

What I said in my original reply: if an API that accepts an explicit thread
state ever calls an API that expects an implicit one, we'll need to
internally implement the dance to activate the supplied thread state before
making that call.
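
A minimal sketch of that dance in plain C11, for illustration only. It is not CPython's code: `ThreadState`, `current_tstate`, `implicit_api` and `explicit_api` are hypothetical names, and a thread-local pointer stands in for the ambient thread state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PyThreadState; not the real CPython struct. */
typedef struct ThreadState {
    int id;
} ThreadState;

/* Models the "ambient" (implicit) thread state of the running OS thread. */
static _Thread_local ThreadState *current_tstate = NULL;

/* An API that expects the implicit state: it reads the ambient pointer. */
static int implicit_api(void)
{
    return current_tstate != NULL ? current_tstate->id : -1;
}

/* An API that accepts an explicit state. Before calling code that relies
 * on the ambient state, it activates the supplied state, and it restores
 * the previous one afterwards, so the two views cannot disagree. */
static int explicit_api(ThreadState *tstate)
{
    assert(tstate != NULL);
    ThreadState *saved = current_tstate;   /* remember the active state */
    current_tstate = tstate;               /* activate the supplied one */
    int result = implicit_api();
    current_tstate = saved;                /* restore on the way out */
    return result;
}
```

The save/restore pair is the part that callers of the public API currently have to get right themselves.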

At the moment, we expect callers of the public API to do that dance, and
it's tricky to get it right in all cases.

My hope (and it's a subjective hope, not an objective fact) is that
implementing the dance more often ourselves will help us identify future
abstractions that will make the public API easier to use correctly in
multi-threaded applications.

Cheers,
Nick.



___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/W3DJG5NTRCQJ45SYGYSJLGWC5AM2Z3W5/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-18 Thread Antoine Pitrou
On Mon, 18 Nov 2019 12:39:00 -0500
Random832  wrote:
> On Mon, Nov 18, 2019, at 05:26, Antoine Pitrou wrote:
> > > For the first goal, I don't think this is possible, or desirable.
> > > Obviously if we remove the GIL somehow then at a minimum we'll need to
> > > make the global threadstate a thread-local. But I think we'll always
> > > have to keep it around as a thread-local, at least, because there are
> > > situations where you simply cannot pass in the threadstate as an
> > > argument. One example comes up when doing FFI: there are C libraries
> > > that take callbacks, and will run them later in some arbitrary thread.
> > > When wrapping these in Python, we need a way to bundle up a Python
> > > function into a C function that can be called from any thread. So,
> > > ctypes and cffi and cython all have ways to do this bundling, and they
> > > all start with some delicate dance to figure out whether or not the
> > > current thread holds the GIL, acquiring the GIL if not, then checking
> > > whether or not this thread has a Python threadstate assigned, creating
> > > it if not, etc. This is completely dependent on having the threadstate
> > > available in ambient context. If threadstates were always passed as
> > > arguments, then it would become impossible to wrap these C libraries.  
> > 
> > Most well-designed C libraries let you pass an additional "void*"
> > parameter for user callbacks to be called with.  A couple of them
> > don't, unfortunately (OpenSSL perhaps?  I don't remember).  
> 
> I think you've missed the fact that the C library runs the callback on an 
> arbitrary thread. The threadstate associated with the thread that made the 
> original call is therefore *not the one you want*; you want a threadstate 
> associated with the thread the callback is run on.

Ah, right, I had overlooked that mention.  This does complicate things
a bit.  In that case you would want to pass the interpreter state and
then use this particular interpreter's mapping of OS thread to
threadstate.

(assuming that per-interpreter mapping exists, which is another
question; but it will have to exist at some point for PEP 554)
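
A rough sketch of that idea, assuming a per-interpreter table exists. Everything here is hypothetical: `Interpreter`, `interp_get_tstate`, `on_event`, and especially `my_thread_slot`, a small per-thread integer that stands in for a real OS thread identity. A real table shared between threads would also need locking.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Hypothetical stand-in for PyThreadState. */
typedef struct ThreadState {
    int thread_slot;
    int calls;
} ThreadState;

/* Hypothetical interpreter that owns a thread -> thread state mapping. */
typedef struct Interpreter {
    ThreadState *by_slot[MAX_THREADS];
    ThreadState storage[MAX_THREADS];
} Interpreter;

/* Assumption: each OS thread has a small integer identity. */
static _Thread_local int my_thread_slot = 0;

/* Find (or lazily create) this thread's state inside the interpreter. */
static ThreadState *interp_get_tstate(Interpreter *interp)
{
    int slot = my_thread_slot;
    if (slot < 0 || slot >= MAX_THREADS) {
        return NULL;
    }
    if (interp->by_slot[slot] == NULL) {
        ThreadState *ts = &interp->storage[slot];
        ts->thread_slot = slot;
        ts->calls = 0;
        interp->by_slot[slot] = ts;
    }
    return interp->by_slot[slot];
}

/* A C callback with the usual void* user-data parameter. It receives
 * the interpreter, not a thread state, and resolves the state that
 * belongs to whichever thread the library chose to run it on. */
static int on_event(void *user_data)
{
    assert(user_data != NULL);
    Interpreter *interp = user_data;
    ThreadState *ts = interp_get_tstate(interp);
    if (ts == NULL) {
        return -1;
    }
    ts->calls++;
    return ts->thread_slot;
}
```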

Regards

Antoine.

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/S5RDSBW7NZ3RAIHCFHQYQW4X6JD4N3G5/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-18 Thread Random832
On Mon, Nov 18, 2019, at 05:26, Antoine Pitrou wrote:
> > For the first goal, I don't think this is possible, or desirable.
> > Obviously if we remove the GIL somehow then at a minimum we'll need to
> > make the global threadstate a thread-local. But I think we'll always
> > have to keep it around as a thread-local, at least, because there are
> > situations where you simply cannot pass in the threadstate as an
> > argument. One example comes up when doing FFI: there are C libraries
> > that take callbacks, and will run them later in some arbitrary thread.
> > When wrapping these in Python, we need a way to bundle up a Python
> > function into a C function that can be called from any thread. So,
> > ctypes and cffi and cython all have ways to do this bundling, and they
> > all start with some delicate dance to figure out whether or not the
> > current thread holds the GIL, acquiring the GIL if not, then checking
> > whether or not this thread has a Python threadstate assigned, creating
> > it if not, etc. This is completely dependent on having the threadstate
> > available in ambient context. If threadstates were always passed as
> > arguments, then it would become impossible to wrap these C libraries.
> 
> Most well-designed C libraries let you pass an additional "void*"
> parameter for user callbacks to be called with.  A couple of them
> don't, unfortunately (OpenSSL perhaps?  I don't remember).

I think you've missed the fact that the C library runs the callback on an 
arbitrary thread. The threadstate associated with the thread that made the 
original call is therefore *not the one you want*; you want a threadstate 
associated with the thread the callback is run on.

Alternately, if a thread state is not in any sense associated with a thread 
(would these situations then mean you simply always create a brand-new 
interpreter state?), maybe it shouldn't be called a thread state at all.

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NSRJVU7PZOVBWWYS3R5QLRRYP2N6NKTY/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-18 Thread Victor Stinner
On Sat, 16 Nov 2019 at 20:55, Neil Schemenauer wrote:
>  If you use threadstate often,
> passing it explicitly (which likely uses a CPU register) could be a
> win.  If you use it rarely, that CPU register would be better
> utilized for passing function arguments you actually use.

Currently, I would say that it's used "rarely". But if we want to
implement subinterpreters, we will have to use it far more often. Since each
interpreter must have its own isolated namespace, I expect that even 1+1
would have to use tstate to get the "2" singleton from its private namespace,
rather than using a "global" singleton. Basically, all builtin types
and all builtin modules should be modified to have one namespace per
interpreter.
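
To make the 1+1 example concrete, here is a small sketch under stated assumptions. `Object`, `Interpreter`, `ThreadState`, `interp_init` and `small_int_add` are all hypothetical names, not CPython's; the point is only that the cached "2" is reached through tstate, so each interpreter hands out its own object.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_INT_COUNT 8

/* Hypothetical object with a reference count. */
typedef struct Object {
    long refcount;
    long value;
} Object;

/* Each interpreter owns a private cache of small integer singletons. */
typedef struct Interpreter {
    Object small_ints[SMALL_INT_COUNT];
} Interpreter;

typedef struct ThreadState {
    Interpreter *interp;
} ThreadState;

static void interp_init(Interpreter *interp)
{
    for (long i = 0; i < SMALL_INT_COUNT; i++) {
        interp->small_ints[i].refcount = 1;
        interp->small_ints[i].value = i;
    }
}

/* Add two numbers and return a new reference to the interpreter-private
 * singleton for the result, or NULL when the result is outside the
 * cached range (a real implementation would allocate instead). */
static Object *small_int_add(ThreadState *tstate, long a, long b)
{
    assert(tstate != NULL && tstate->interp != NULL);
    long sum = a + b;
    if (sum < 0 || sum >= SMALL_INT_COUNT) {
        return NULL;
    }
    Object *obj = &tstate->interp->small_ints[sum];
    obj->refcount++;   /* touches only this interpreter's memory */
    return obj;
}
```

Two interpreters computing 1+1 get equal values but distinct objects, so their reference counts never contend.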

For C extensions, it's an old project to have a "state" passed to
module functions, making it possible to have two separate instances of the
same C extension rather than a single global namespace.
Examples:

https://www.python.org/dev/peps/pep-0489/
https://www.python.org/dev/peps/pep-0573/

I would like to implement subinterpreters. IMHO the project is
feasible and if it works, it would make Python more competitive with
other programming languages!

IMHO, fixing the C API (or writing a new one) and subinterpreters are the
two most feasible and realistic projects to optimize CPython
right now.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SZ7TP3AO3L4MT7RMZ53BIVXX6IZIVGDR/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-18 Thread Antoine Pitrou
On Fri, 15 Nov 2019 14:21:53 -0800
Nathaniel Smith  wrote:
> As you know, I'm skeptical that PEP 554 will produce benefits that are
> worth the effort, but let's assume for the moment that it is, and
> we're all 100% committed to moving all globals into the threadstate.
> Even given that, the motivation for this change seems a bit unclear to
> me.
> 
> I guess the possible goals are:
> 
> - Get rid of the "ambient" threadstate entirely
> - Make accessing the threadstate faster
> 
> For the first goal, I don't think this is possible, or desirable.
> Obviously if we remove the GIL somehow then at a minimum we'll need to
> make the global threadstate a thread-local. But I think we'll always
> have to keep it around as a thread-local, at least, because there are
> situations where you simply cannot pass in the threadstate as an
> argument. One example comes up when doing FFI: there are C libraries
> that take callbacks, and will run them later in some arbitrary thread.
> When wrapping these in Python, we need a way to bundle up a Python
> function into a C function that can be called from any thread. So,
> ctypes and cffi and cython all have ways to do this bundling, and they
> all start with some delicate dance to figure out whether or not the
> current thread holds the GIL, acquiring the GIL if not, then checking
> whether or not this thread has a Python threadstate assigned, creating
> it if not, etc. This is completely dependent on having the threadstate
> available in ambient context. If threadstates were always passed as
> arguments, then it would become impossible to wrap these C libraries.

Most well-designed C libraries let you pass an additional "void*"
parameter for user callbacks to be called with.  A couple of them
don't, unfortunately (OpenSSL perhaps?  I don't remember).

Regards

Antoine.

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/PUUP6NN6U6L7XTVYJQGPUW6LT5P6Y253/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-17 Thread Nathaniel Smith
On Sun, Nov 17, 2019 at 1:58 PM Nick Coghlan  wrote:
> On Sat., 16 Nov. 2019, 8:26 am Nathaniel Smith,  wrote:
>>
>> As you know, I'm skeptical that PEP 554 will produce benefits that are
>> worth the effort, but let's assume for the moment that it is, and
>> we're all 100% committed to moving all globals into the threadstate.
>> Even given that, the motivation for this change seems a bit unclear to
>> me.
>>
>> I guess the possible goals are:
>>
>> - Get rid of the "ambient" threadstate entirely
>> - Make accessing the threadstate faster
>
> - Eventually make it easier for CPython maintainers to know which functions 
> require access to a live thread state, and which are stateless helper 
> functions

So the idea would be that eventually we'd remove all uses of implicit
state lookup inside CPython, and add some kind of CI check to make
sure that they're never used?

> - Eventually make it easier for embedding applications to control which 
> Python code runs in which thread state by moving the thread state activation 
> dance out of the application and into the CPython shared library

That seems like a good goal, but I don't understand how it's related
to passing threadstate explicitly as a function argument. If the plan
is to move towards passing threadstates both implicitly AND explicitly
everywhere, that seems like it would make things more error-prone, not
less, because the two states could get out of sync. Could you
elaborate?

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5JKNEYXI6ZILC3P6JBXW7NKAUVMXBRQN/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-17 Thread Nick Coghlan
On Sat., 16 Nov. 2019, 8:26 am Nathaniel Smith,  wrote:

> As you know, I'm skeptical that PEP 554 will produce benefits that are
> worth the effort, but let's assume for the moment that it is, and
> we're all 100% committed to moving all globals into the threadstate.
> Even given that, the motivation for this change seems a bit unclear to
> me.
>
> I guess the possible goals are:
>
> - Get rid of the "ambient" threadstate entirely
> - Make accessing the threadstate faster
>

- Eventually make it easier for CPython maintainers to know which functions
require access to a live thread state, and which are stateless helper
functions
- Eventually make it easier for embedding applications to control which
Python code runs in which thread state by moving the thread state
activation dance out of the application and into the CPython shared library

(We actually broke the thread state activation in hexchat not that long ago
- there was a subtle latent defect in how they were handling it, and the
changes to interpreter cleanup escalated it to a full blown crash)

The need for the implicit thread state is never going to go away, but there
are definitely opportunities to make the way we manage it less bug prone.
(e.g. In the HPy work, I would expect each handle to be at least bound to
an interpreter, and there could even be a higher level construct to
associate callbacks with a specific thread state)

Cheers,
Nick.



Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YIAVY4QFRNLO6FBVFHZNSFCBNLJ4WIGV/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-16 Thread Neil Schemenauer
On AMD64 Linux, the location of the thread-local data seems to be
stored in the FS CPU register[1].  It seems likely other platforms
and other operating systems could do something similar.  Passing
threadstate as an explicit argument could be either faster or slower
depending on how often you use it.  If you use threadstate often,
passing it explicitly (which likely uses a CPU register) could be a
win.  If you use it rarely, that CPU register would be better
utilized for passing function arguments you actually use.

Doing some experiments with optimized (i.e. using platform specific)
TLS would seem a useful step before undertaking a major refactoring.
Explicit passing could be a lot of code churn for no practical gain.
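
As a starting point for such an experiment, the two calling conventions can be put side by side. This is only a sketch with hypothetical names (`ThreadState`, `tls_tstate`, `bump_implicit`, `bump_explicit`); it makes no claim about which variant is faster, and a real benchmark would have to stop the compiler from inlining the calls and hoisting the TLS read out of the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PyThreadState. */
typedef struct ThreadState {
    long counter;
} ThreadState;

/* Compiler-supported thread-local storage (C11 _Thread_local). */
static _Thread_local ThreadState *tls_tstate = NULL;

/* Variant 1: implicit. Every call reads the thread-local pointer. */
static long bump_implicit(void)
{
    ThreadState *tstate = tls_tstate;
    assert(tstate != NULL);
    return ++tstate->counter;
}

/* Variant 2: explicit. The caller passes the pointer as an argument. */
static long bump_explicit(ThreadState *tstate)
{
    return ++tstate->counter;
}

/* Loops that a timing harness could wrap. */
static long run_implicit(long n)
{
    long last = 0;
    for (long i = 0; i < n; i++) {
        last = bump_implicit();
    }
    return last;
}

static long run_explicit(ThreadState *tstate, long n)
{
    long last = 0;
    for (long i = 0; i < n; i++) {
        last = bump_explicit(tstate);
    }
    return last;
}
```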

1. https://stackoverflow.com/questions/6611346/how-are-the-fs-gs-registers-used-in-linux-amd64

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/R3XFSL5F6ZOV7VJYYZDEKA7JY327DYLD/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-15 Thread Nick Coghlan
On Sat., 16 Nov. 2019, 7:29 am Eric Snow, 
wrote:

> On Thu, Nov 14, 2019 at 4:12 AM Victor Stinner 
> wrote:
> > Another approach would be to pass a "PyContext*" pointer which
> > contains tstate, but also additional fields. But I chose to start with
> > a direct "PyThreadState* tstate" to avoid one indirection to every
> > tstate access. Currently, tstate seems to be enough for the current
> > code base.
>
> FWIW, I favor this approach as well.  As long as it is an opaque type,
> a PyContext allows us to be more flexible in adapting to the future.
> For now it could even be a simple alias for PyThreadState.
> Regardless, I'm not convinced that using a PyContext will have a real
> impact on runtime performance.
>
> Also, we already use "context" in a number of ways in Python.  So
> "PyContext" might not be the best name.  It probably needs to be a
> name without "context" in it or one with a concrete clue (e.g.
> "PyRuntimeContext").
>

I think we should just stick with "PyThreadState", as that makes it clear
that in normal circumstances, it means "the Python State for the currently
running Thread".

If a function accepting this parameter needs to call back in to Python
code, or invokes a function pointer that might call back into the public C
API, it's going to need to enforce that assumption by switching the active
thread state if necessary.

You can already navigate from the thread state to the interpreter state and
runtime state, so it should cover everything that we need.
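
The navigation described above can be sketched as a chain of pointers. The struct and function names below are hypothetical simplifications, not CPython's actual definitions; they only show why a single thread state argument is enough to reach the wider states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process-wide runtime state. */
typedef struct RuntimeState {
    int initialized;
} RuntimeState;

/* Hypothetical per-interpreter state, pointing at its runtime. */
typedef struct InterpreterState {
    RuntimeState *runtime;
    int id;
} InterpreterState;

/* Hypothetical per-thread state, pointing at its interpreter. */
typedef struct ThreadState {
    InterpreterState *interp;
} ThreadState;

/* One hop: thread state -> interpreter state. */
static InterpreterState *tstate_get_interp(const ThreadState *tstate)
{
    assert(tstate != NULL);
    return tstate->interp;
}

/* Two hops: thread state -> interpreter state -> runtime state. */
static RuntimeState *tstate_get_runtime(const ThreadState *tstate)
{
    assert(tstate != NULL && tstate->interp != NULL);
    return tstate->interp->runtime;
}
```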

Cheers,
Nick.


Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/2XQQCEYCYKUFEJMSMO324NC3IOBKEOQ4/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-15 Thread Nathaniel Smith
As you know, I'm skeptical that PEP 554 will produce benefits that are
worth the effort, but let's assume for the moment that it is, and
we're all 100% committed to moving all globals into the threadstate.
Even given that, the motivation for this change seems a bit unclear to
me.

I guess the possible goals are:

- Get rid of the "ambient" threadstate entirely
- Make accessing the threadstate faster

For the first goal, I don't think this is possible, or desirable.
Obviously if we remove the GIL somehow then at a minimum we'll need to
make the global threadstate a thread-local. But I think we'll always
have to keep it around as a thread-local, at least, because there are
situations where you simply cannot pass in the threadstate as an
argument. One example comes up when doing FFI: there are C libraries
that take callbacks, and will run them later in some arbitrary thread.
When wrapping these in Python, we need a way to bundle up a Python
function into a C function that can be called from any thread. So,
ctypes and cffi and cython all have ways to do this bundling, and they
all start with some delicate dance to figure out whether or not the
current thread holds the GIL, acquiring the GIL if not, then checking
whether or not this thread has a Python threadstate assigned, creating
it if not, etc. This is completely dependent on having the threadstate
available in ambient context. If threadstates were always passed as
arguments, then it would become impossible to wrap these C libraries.
So we can't do that.

That said, it's fine – even if we do remove the GIL, we still won't
have a *single OS thread* executing code from two different
interpreters at the same time! So storing the threadstate in a
thread-local is fine, and we can keep the ability to grab the
threadstate at any moment, regardless of whether it was passed as an
argument.

But that means the only reason for passing the threadstate around as
an argument is if it's faster than looking it up. And AFAICT, no-one
in this thread actually knows if that's true? You mentioned that
there's an "atomic operation" there currently, but I think on x86 at
least _Py_atomic_load_relaxed is literally a no-op. Larry did some
experiments with the old pthreads thread-local storage API, but no-one
seems to have done any measurements on the new, much-faster
thread-local storage API, and no-one's done any measurements of the
cost of passing around threadstates explicitly. For all we know,
passing the threadstate around is actually slower than looking it up
every time. And we don't even know yet whether the threadstate even
will move into thread-local storage.

It seems a bit weird to start doing massive internal refactoring
before measuring those things.

-n

On Tue, Nov 12, 2019 at 2:03 PM Victor Stinner  wrote:
>
> Hi,
>
> Are you OK with modifying internal C functions to pass tstate explicitly?
>
> --
>
> I started to modify internal C functions to pass "tstate" explicitly
> when calling C functions: the Python thread state (PyThreadState).
> Example of C code (after my changes):
>
>     if (_Py_EnterRecursiveCall(tstate, " while calling a Python object")) {
>         return NULL;
>     }
>     PyObject *result = (*call)(callable, args, kwargs);
>     _Py_LeaveRecursiveCall(tstate);
>     return _Py_CheckFunctionResult(tstate, callable, result, NULL);
>
> In Python 3.8, the tstate is implicit:
>
>     if (Py_EnterRecursiveCall(" while calling a Python object")) {
>         return NULL;
>     }
>     PyObject *result = (*call)(callable, args, kwargs);
>     Py_LeaveRecursiveCall();
>     return _Py_CheckFunctionResult(callable, result, NULL);
>
> There are different reasons to pass tstate explicitly, but my main
> motivation is to rework the Python code base to move away from implicit
> global states to states passed explicitly, to implement the PEP 554
> "Multiple Interpreters in the Stdlib". In short, the final goal is to
> run multiple isolated Python interpreters in the same process: run
> pure Python code on multiple CPUs in parallel with a single process
> (whereas multiprocessing runs multiple processes).
>
> Currently, subinterpreters are a hack: they still share a lot of
> things, the code base is not ready to implement isolated interpreters
> with one "GIL" (interpreter lock) per interpreter, and to run multiple
> interpreters in parallel. Many _PyRuntimeState fields (the global
> _PyRuntime variable) should be moved to PyInterpreterState (or maybe
> PyThreadState): per interpreter.
>
> Another simpler but more annoying example is the Py_None and Py_True
> singletons, which are globals. We cannot share these singletons between
> interpreters because updating their reference counter would be a
> performance bottleneck. If we put a "superglobal-GIL" to ensure that
> Py_None reference counter remains consistent, it would basically
> "serialize" all threads, rather than running them in parallel.
>
> The idea of passing tstate 

[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-15 Thread Eric Snow
On Tue, Nov 12, 2019 at 3:11 PM Victor Stinner  wrote:
> Are you OK with modifying internal C functions to pass tstate explicitly?

I'm also in favor (strongly)!  (no surprises there)

The only concern I've heard is that on some platforms there is a
measurable overhead once you hit a threshold of a specific small
number of parameters.  Adding this extra parameter will put some
functions over that threshold.  I don't have any more information than
that.

> There are different reasons to pass tstate explicitly, but my main
> motivation is to rework the Python code base to move away from implicit
> global states to states passed explicitly, to implement the PEP 554
> "Multiple Interpreters in the Stdlib". In short, the final goal is to
> run multiple isolated Python interpreters in the same process: run
> pure Python code on multiple CPUs in parallel with a single process
> (whereas multiprocessing runs multiple processes).

FTR, PEP 554 is explicitly independent of efforts to stop sharing the
GIL between interpreters.  I argue there that it is a good idea
regardless.

The existing functionality the PEP exposes, though, clearly benefits
from better isolation between interpreters (including not sharing the
GIL). :)

On Thu, Nov 14, 2019 at 4:12 AM Victor Stinner  wrote:
> Another approach would be to pass a "PyContext*" pointer which
> contains tstate, but also additional fields. But I chose to start with
> a direct "PyThreadState* tstate" to avoid one indirection to every
> tstate access. Currently, tstate seems to be enough for the current
> code base.

FWIW, I favor this approach as well.  As long as it is an opaque type,
a PyContext allows us to be more flexible in adapting to the future.
For now it could even be a simple alias for PyThreadState.
Regardless, I'm not convinced that using a PyContext will have a real
impact on runtime performance.

Also, we already use "context" in a number of ways in Python.  So
"PyContext" might not be the best name.  It probably needs to be a
name without "context" in it or one with a concrete clue (e.g.
"PyRuntimeContext").
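
A sketch of the "opaque type that is just an alias for now" idea, under stated assumptions. `RuntimeContext` is a hypothetical name picked only for this example, and `ThreadState` is a stand-in for PyThreadState. Callers see an incomplete type, so fields can be added behind it later without touching them.

```c
#include <assert.h>
#include <stddef.h>

/* What callers see: an opaque handle whose layout is never exposed. */
typedef struct RuntimeContext RuntimeContext;

/* Internal: for now the context is nothing more than the thread state. */
typedef struct ThreadState {
    int recursion_depth;
} ThreadState;

/* Internal conversions; today these are plain pointer casts. */
static RuntimeContext *context_from_tstate(ThreadState *tstate)
{
    return (RuntimeContext *)tstate;
}

static ThreadState *context_get_tstate(RuntimeContext *ctx)
{
    assert(ctx != NULL);
    return (ThreadState *)ctx;
}

/* An internal function written against the opaque handle. */
static int context_enter_call(RuntimeContext *ctx)
{
    return ++context_get_tstate(ctx)->recursion_depth;
}
```

If the context later grows its own struct, only the two conversion helpers change, at the price of the extra indirection mentioned above.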

Anyway, thanks for driving this discussion, Victor!

-eric
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/C7EQXGL3RCOLQNBCK7CVRDT52FWJFAVT/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-15 Thread Stefan Behnel
Victor Stinner wrote on 12.11.19 at 23:03:
> Are you OK with modifying internal C functions to pass tstate explicitly?

FWIW, I started doing the same internally in Cython a while back, because
like others, I also considered it wasteful to look it up all over the
place, often multiple times inside of one function (usually related to
try-finally and exception handling). I think it similarly makes sense
inside of CPython. I would also find it reasonable to make it part of a new
C-API.

Stefan
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OZMEP27S6Q4OQ4CMCFPSRPM4FGUI2ZHQ/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-15 Thread Nick Coghlan
On Wed., 13 Nov. 2019, 8:06 am Victor Stinner,  wrote:

> Hi,
>
> Are you OK with modifying internal C functions to pass tstate explicitly?
>

I'll join the chorus of +1's.

With the work you've already done to clearly separate the public APIs from
the internal ones, it's now much clearer which functions should be
accepting an explicit thread state, and which ones should be looking it up
implicitly.

Cheers,
Nick.




Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/Q4IPXMQIM5YRLZLHADUGSUT4ZLXQ6MYY/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-14 Thread Steve Dower

On 13Nov2019 1954, Larry Hastings wrote:
> On 11/13/19 5:52 AM, Victor Stinner wrote:
> > On Wed, 13 Nov 2019 at 14:28, Larry Hastings wrote:
> > > I did exactly that in the Gilectomy prototype.  Pulling it out of TLS
> > > was too slow,
> >
> > What do you mean? Getting tstate from a TLS was a performance
> > bottleneck by itself? Reading a TLS variable seems to be quite
> > efficient.
>
> I'm pretty sure you understand the sentence "Pulling it out of TLS was
> too slow".  At the time CPython used the POSIX APIs for accessing thread
> local storage, and I didn't know about and therefore did not try this
> "__thread" GCC extension.  I do remember trying some other API that was
> purported to be faster--maybe a GCC library function for faster TLS
> access?--but I didn't get that to work either before I gave up on it out
> of frustration.
>
> Also, I dimly recall that I moved several things from globals into the
> ThreadState structure, and probably added one or two of my own.  So
> nearly every function call was referencing ThreadState at one point or
> another.  Passing it as a parameter was a definite win over calling the
> POSIX TLS APIs.


Passing it as a parameter is also a huge win for embedders, as it gets 
very complicated to merge locking/threading models when the host 
application has its own requirements.


Overall, I'm very supportive of passing context through parameters 
rather than implicitly through TLS.


(Though we've got a long way to go before it'll be possible for 
embedders to not be held hostage by CPython's threading model... one 
step at a time! :) )


Cheers,
Steve
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TLMDK7JZQIUWQUUKFHOPNEFQCJKFL5JM/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-14 Thread Random832
On Thu, Nov 14, 2019, at 07:43, Antoine Pitrou wrote:
> On Wed, 13 Nov 2019 14:52:32 +0100
> Victor Stinner  wrote:
> > 
> > #define _PyRuntimeState_GetThreadState(runtime) \
> >     ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
> > #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)
> > 
> > _PyThreadState_GET() uses "_Py_atomic_load_relaxed". I'm not used to
> > C99 atomic conventions. The "memory_order_relaxed" documentation says:
> > 
> > "Relaxed operation: there are no synchronization or ordering
> > constraints imposed on other reads or writes, only this operation's
> > atomicity is guaranteed (see Relaxed ordering below)"
> > 
> > Note: I'm not even sure why Python currently uses an atomic operation.
> 
> Is it protected by a lock?  If not, you need to use an atomic.
> Since it's theoretically possible to read the current thread state
> without the GIL held (though not very useful), then an atomic is
> required.

It sounds like you are saying _PyRuntimeState_GetThreadState has two duties,
then: "get this thread's thread state" (from the GIL holder; how do other
threads get their own thread state?), and "get the GIL-holding thread's thread
state" (from a non-GIL-holding thread).

The former shouldn't need atomic/overhead locking (unless the thread state can 
be written from other threads), even if the latter does.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/4BNRSO47Z54MRR3ZS32W6DXYRVZ7U53W/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-14 Thread Antoine Pitrou
On Wed, 13 Nov 2019 14:52:32 +0100
Victor Stinner  wrote:
> 
> #define _PyRuntimeState_GetThreadState(runtime) \
>     ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
> #define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)
> 
> _PyThreadState_GET() uses "_Py_atomic_load_relaxed". I'm not used to
> C99 atomic conventions. The "memory_order_relaxed" documentation says:
> 
> "Relaxed operation: there are no synchronization or ordering
> constraints imposed on other reads or writes, only this operation's
> atomicity is guaranteed (see Relaxed ordering below)"
> 
> Note: I'm not even sure why Python currently uses an atomic operation.

Is it protected by a lock?  If not, you need to use an atomic.
Since it's theoretically possible to read the current thread state
without the GIL held (though not very useful), then an atomic is
required.

Regards

Antoine.

___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/7VL3QKACQLDL3QCWKCTUHUCIERFNE6R7/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-14 Thread Victor Stinner
On Thu, 14 Nov 2019 at 04:55, Larry Hastings wrote:
> I'm pretty sure you understand the sentence "Pulling it out of TLS was too 
> slow".  At the time CPython used the POSIX APIs for accessing thread local 
> storage, and I didn't know about and therefore did not try this "__thread" 
> GCC extension.  I do remember trying some other API that was purported to be 
> faster--maybe a GCC library function for faster TLS access?--but I didn't get 
> that to work either before I gave up on it out of frustration.

I asked for confirmation, since I was surprised. But when I looked at
the assembly with my friend, we played with __thread, not with
pthread_getspecific().

So thanks for confirming that "getting tstate" can be a performance
bottleneck: that's a very good reason to pass it explicitly.

> I also took the opportunity to pass my "reference count manager" data as a 
> separate parameter, which again was per-thread and again was a major win at 
> the time.

Another approach would be to pass a "PyContext*" pointer which
contains tstate, but also additional fields. But I chose to start with
a direct "PyThreadState* tstate" to avoid one indirection on every
tstate access. Currently, tstate seems to be enough for the current
code base.
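
To make the indirection cost concrete, here is a minimal sketch. It is not
CPython code: Context, ThreadState and refcount_manager_id are hypothetical
stand-ins, assumed only for illustration.

```c
/* Hypothetical stand-in for PyThreadState (not the CPython definition). */
typedef struct {
    int recursion_depth;
} ThreadState;

/* Hypothetical "context" bundling tstate with extra per-thread fields. */
typedef struct {
    ThreadState *tstate;
    int refcount_manager_id;   /* placeholder for additional fields */
} Context;

/* Via a context: two dependent loads (ctx->tstate, then the field). */
static int depth_via_context(const Context *ctx)
{
    return ctx->tstate->recursion_depth;
}

/* Direct: a single load through the pointer that was passed in. */
static int depth_direct(const ThreadState *tstate)
{
    return tstate->recursion_depth;
}
```

Both functions return the same value; the difference is only the extra
pointer hop on each access when going through the context.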

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/SHPPBIERUHCAH5UFW6WAVOQ2Z2NEKAH3/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-13 Thread Larry Hastings


On 11/13/19 5:52 AM, Victor Stinner wrote:
> On Wed, 13 Nov 2019 at 14:28, Larry Hastings wrote:
> > I did exactly that in the Gilectomy prototype.  Pulling it out of TLS was too
> > slow,
>
> What do you mean? Getting tstate from a TLS was a performance
> bottleneck by itself? Reading a TLS variable seems to be quite
> efficient.


I'm pretty sure you understand the sentence "Pulling it out of TLS was 
too slow".  At the time CPython used the POSIX APIs for accessing thread 
local storage, and I didn't know about and therefore did not try this 
"__thread" GCC extension.  I do remember trying some other API that was 
purported to be faster--maybe a GCC library function for faster TLS 
access?--but I didn't get that to work either before I gave up on it out 
of frustration.


Also, I dimly recall that I moved several things from globals into the 
ThreadState structure, and probably added one or two of my own.  So 
nearly every function call was referencing ThreadState at one point or 
another.  Passing it as a parameter was a definite win over calling the 
POSIX TLS APIs.


I also took the opportunity to pass my "reference count manager" data as 
a separate parameter, which again was per-thread and again was a major 
win at the time.



//arry/

___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/CIGL2NQGXUSUJNWW3FCAEVWTWL2QGVY2/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-13 Thread Jim J. Jewett
I wouldn't worry too much about the singletons in this issue; they could be 
solved in any of several ways, all of which would be improvements conceptually 
-- if performance and backwards compatibility were resolved.

In theory, the incr/decr pair should be delegated to the memory store, with 
Petr's suggestion of immortal immutables being one example.  The catch is that 
the current scheme is really fast in the normal case; even hardcoding just 
True/False/None to magic addresses might be slower.  

You don't have to solve that just to speed up access to state variables that 
are not exposed directly to python code.
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/3KDVELTYRTY72RL7X24VZHBXSKAOY2YH/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-13 Thread Victor Stinner
Petr, Eric: sure, my question is only about the internal C functions.
I have no plan to change the existing C API.

On Wed, 13 Nov 2019 at 14:52, Eric V. Smith wrote:
> The last time we discussed this, there was pushback due to performance
> concerns. I don't recall if that was actually measured, or just a vague
> unease.

Maybe I was the one who raised a concern about the atomic variable
performance. But I never ran a benchmark on that.


> I agree with Petr that not breaking existing
> APIs is of course critical. A parallel set of APIs is needed. But the
> existing APIs should become thin wrappers, until Python 5000 (aka never)
> when they can go away.

There is a project of a new C API for Python:
https://github.com/pyhandle/hpy

I suggested adding a mandatory "context" parameter from day 1. See
the current API draft, which has a "ctx" argument:
https://github.com/pyhandle/hpy/blob/3266dc295b0be20b41c99f4f4e944d117b3fc875/api.md

Example: "HPy v = HPy_Something(ctx);"

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/H23H6U7JTXEESZAGGGCTF5JHINKLGHJP/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-13 Thread Victor Stinner
On Wed, 13 Nov 2019 at 14:28, Larry Hastings wrote:
> I did exactly that in the Gilectomy prototype.  Pulling it out of TLS was too 
> slow,

What do you mean? Getting tstate from a TLS was a performance
bottleneck by itself? Reading a TLS variable seems to be quite
efficient.

Mark Shannon wrote: "The current means of accessing the thread state
does seem rather convoluted, whereas accessing from a thread local is
quite efficient (at least with GCC) https://godbolt.org/z/z-vNPN "
https://github.com/python/cpython/pull/17052#issuecomment-552538438

Copy of his C code:
"""
extern __thread int extern_tl;
int get_extern_thread_local(void) {
    return extern_tl;
}

__thread int tl;
int get_thread_local(void) {
    return tl;
}
"""

And the generated assembly (by godbolt.org service):
"""
get_extern_thread_local():
        mov     rax, QWORD PTR extern_tl@gottpoff[rip]
        mov     eax, DWORD PTR fs:[rax]
        ret

get_thread_local():
        mov     eax, DWORD PTR fs:tl@tpoff
        ret

tl:
        .zero   4
"""

A TLS variable read is basically one or two MOV instructions in Intel x86
assembly (using GCC 9.2).

With a friend, I looked at the assembly to read and write atomic
variables. In short, only the write requires a memory fence, whereas
the read is basically just a MOV (again, in Intel x86).

#define _PyRuntimeState_GetThreadState(runtime) \
    ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
#define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)

_PyThreadState_GET() uses "_Py_atomic_load_relaxed". I'm not used to
C11 atomic conventions. The "memory_order_relaxed" documentation says:

"Relaxed operation: there are no synchronization or ordering
constraints imposed on other reads or writes, only this operation's
atomicity is guaranteed (see Relaxed ordering below)"

Note: I'm not even sure why Python currently uses an atomic operation.
Why not just a regular global variable? But if we change something, I
would prefer to move to a TLS variable instead, to support
subinterpreters.
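
For readers who, like me, are not used to the atomic conventions, here is a
minimal sketch of what a relaxed load looks like in standard C11. It is not
the CPython implementation: Runtime, ThreadState, runtime_get_tstate() and
runtime_swap_tstate() are hypothetical names, loosely modeled on the macros
above.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for PyThreadState. */
typedef struct {
    int id;
} ThreadState;

/* Hypothetical runtime struct holding the "current" thread state. */
typedef struct {
    _Atomic(ThreadState *) tstate_current;
} Runtime;

/* Relaxed load: the read itself is atomic, but it imposes no ordering
   on surrounding reads or writes. */
static ThreadState *runtime_get_tstate(Runtime *runtime)
{
    return atomic_load_explicit(&runtime->tstate_current,
                                memory_order_relaxed);
}

/* Install a new current tstate and return the previous one. */
static ThreadState *runtime_swap_tstate(Runtime *runtime, ThreadState *newts)
{
    return atomic_exchange_explicit(&runtime->tstate_current, newts,
                                    memory_order_relaxed);
}
```

The atomic type is what keeps a read from a thread that does not hold the
lock well defined in C11; a plain pointer read racing with a write would be
a data race.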

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/WBCBLDGZ7QBWPOQUIWFNYG7L4UMDIXU5/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-13 Thread Eric V. Smith

On 11/12/2019 5:03 PM, Victor Stinner wrote:
> Hi,
>
> Are you ok to modify internal C functions to pass explicitly tstate?


The last time we discussed this, there was pushback due to performance 
concerns. I don't recall if that was actually measured, or just a vague 
unease.


I've long advocated (mostly to myself, and Larry when he would listen!) 
that we should do this. I agree with Petr that not breaking existing 
APIs is of course critical. A parallel set of APIs is needed. But the 
existing APIs should become thin wrappers, until Python 5000 (aka never) 
when they can go away.


And this not only helps with being explicit, it should help with 
testing. No more depending on some hidden global state.


Eric
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/TFCJUGXONTYQXYDHG2OSWLCFNZBAHUFT/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-13 Thread Larry Hastings


On 11/12/19 2:03 PM, Victor Stinner wrote:
> Hi,
>
> Are you ok to modify internal C functions to pass explicitly tstate?


I did exactly that in the Gilectomy prototype.  Pulling it out of TLS 
was too slow, and storing it in a global wouldn't work with multiple 
actually-concurrent threads.



//arry/

___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/VP2SJAKF7EZFDS2W6N5WDGQAXAS3CMFF/


[Python-Dev] Re: Pass the Python thread state to internal C functions

2019-11-13 Thread Petr Viktorin

On 2019-11-12 23:03, Victor Stinner wrote:
> Hi,
>
> Are you ok to modify internal C functions to pass explicitly tstate?


In short, yes, but:
- don't make things slower :)
- don't break the public API or the stable ABI


I'm a fan of explicitly passing state everywhere, rather than keeping it 
in "global" variables.


Currently, surprisingly many internal functions do a PyThreadState_GET 
for themselves, then call another function that does the same. That's 
wasteful, but impossible to change in the public API.
Your changes (of which I only saw a very limited subset) seem to follow 
a simple rule: public API functions call PyThreadState_GET, and then 
call internal functions that pass it around.
That sounds beautifully easy to explain! Later, we'll just need to 
find a way to make the tstate API public (and opt-in).
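
A minimal sketch of that rule, as I understand it. This is not CPython code:
ThreadState, get_current_tstate(), the internal_* helpers and public_call()
are hypothetical names, and the lookup counter exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PyThreadState. */
typedef struct {
    int recursion_depth;
} ThreadState;

static ThreadState main_tstate = { 0 };
static int implicit_lookups = 0;   /* counts implicit fetches */

/* Stand-in for PyThreadState_GET(): the implicit lookup. */
static ThreadState *get_current_tstate(void)
{
    implicit_lookups++;
    return &main_tstate;
}

/* Internal helpers receive tstate explicitly and never look it up. */
static int internal_enter(ThreadState *tstate)
{
    assert(tstate != NULL);
    return ++tstate->recursion_depth;
}

static void internal_leave(ThreadState *tstate)
{
    assert(tstate != NULL);
    tstate->recursion_depth--;
}

static int internal_call(ThreadState *tstate)
{
    int depth = internal_enter(tstate);
    internal_leave(tstate);
    return depth;
}

/* Public entry point: fetch tstate once, then pass it down. */
int public_call(void)
{
    ThreadState *tstate = get_current_tstate();
    return internal_call(tstate);
}
```

One call to public_call() performs a single implicit lookup, however many
internal helpers it goes through.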



The "per-interpreter None", however, is a different issue. I don't see 
how that can be done without breaking the stable ABI. I still think 
immortal immutable objects could be shared across interpreters.






--

I started to modify internal C functions to pass explicitly "tstate"
when calling C functions: the Python thread state (PyThreadState).
Example of C code (after my changes):

    if (_Py_EnterRecursiveCall(tstate, " while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    _Py_LeaveRecursiveCall(tstate);
    return _Py_CheckFunctionResult(tstate, callable, result, NULL);

In Python 3.8, the tstate is implicit:

    if (Py_EnterRecursiveCall(" while calling a Python object")) {
        return NULL;
    }
    PyObject *result = (*call)(callable, args, kwargs);
    Py_LeaveRecursiveCall();
    return _Py_CheckFunctionResult(callable, result, NULL);

There are different reasons to pass tstate explicitly, but my main
motivation is to rework the Python code base to move away from implicit
global states to states passed explicitly, to implement PEP 554
"Multiple Interpreters in the Stdlib". In short, the final goal is to
run multiple isolated Python interpreters in the same process: run
pure Python code on multiple CPUs in parallel with a single process
(whereas multiprocessing runs multiple processes).

Currently, subinterpreters are a hack: they still share a lot of
things, the code base is not ready to implement isolated interpreters
with one "GIL" (interpreter lock) per interpreter, and to run multiple
interpreters in parallel. Many _PyRuntimeState fields (the global
_PyRuntime variable) should be moved to PyInterpreterState (or maybe
PyThreadState): per interpreter.

Another simpler but more annoying example is the Py_None and Py_True
singletons, which are globals. We cannot share these singletons between
interpreters because updating their reference counters would be a
performance bottleneck. If we added a "superglobal-GIL" to ensure that
the Py_None reference counter remains consistent, it would basically
"serialize" all threads, rather than running them in parallel.

The idea of passing tstate to internal C functions is to prepare code
to get the per-interpreter None from tstate.

tstate is basically the "root" to access all states which are per
interpreter. For example, PyInterpreterState can be read from
tstate->interp.
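
A minimal sketch of that idea, not actual CPython code: Object,
InterpreterState, ThreadState, get_none() and new_none_ref() are hypothetical,
heavily simplified stand-ins, and a per-interpreter None stored inside the
interpreter state is an assumption made only for illustration.

```c
/* Hypothetical, simplified object header. */
typedef struct {
    long refcnt;
} Object;

/* Hypothetical interpreter state owning its own None. */
typedef struct {
    Object none;
} InterpreterState;

/* Hypothetical thread state; interp mirrors tstate->interp. */
typedef struct {
    InterpreterState *interp;
} ThreadState;

/* Reach per-interpreter data from tstate, with no global involved. */
static Object *get_none(ThreadState *tstate)
{
    return &tstate->interp->none;
}

/* Taking a reference touches only this interpreter's object. */
static Object *new_none_ref(ThreadState *tstate)
{
    Object *none = get_none(tstate);
    none->refcnt++;
    return none;
}
```

With two interpreters, each thread state resolves to a different None, so
the two reference counters are never written by the same call.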

Right now, tstate is only passed to a few functions, but you should
expect to see it passed to way more functions later, once more
structures are moved to PyInterpreterState.

--

On my latest merged PR 17052 ("Add _PyObject_VectorcallTstate()"),
Mark Shannon wrote: "I don't see how this could ever be faster, nor do
I see how it is more correct."
https://github.com/python/cpython/pull/17052#issuecomment-552538438

Currently, tstate is obtained using these internal APIs:

#define _PyRuntimeState_GetThreadState(runtime) \
    ((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
#define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)

or using public APIs:

PyAPI_FUNC(PyThreadState *) PyThreadState_Get(void);
#define PyThreadState_GET() PyThreadState_Get()

I dislike _PyThreadState_GET() for 2 reasons:

* it relies on the _PyRuntime global variable: I would prefer to avoid
global variables
* it uses an atomic operation which can become a performance issue
as more and more code requires tstate

--

An alternative would be to use PyGILState_GetThisThreadState() which
uses a thread local storage (TLS) variable to get the Python thread
state ("tstate"), rather than the _PyRuntime atomic variable. Except that
the PyGILState API doesn't support subinterpreters yet :-(

https://bugs.python.org/issue15751 "Support subinterpreters in the GIL
state API" has been open since 2012.

Note: While the GIL is released, _PyThreadState_GET() is NULL, whereas
PyGILState_GetThisThreadState() is non-NULL.
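
The note above can be modeled with a toy sketch. It is not how CPython
implements the GIL: there is no real lock here, and bind_thread(),
acquire_gil(), release_gil(), current_tstate() and this_thread_state() are
hypothetical names that only mimic the two lookups.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for PyThreadState. */
typedef struct {
    int id;
} ThreadState;

/* Process-wide slot, playing the role of gilstate.tstate_current. */
static _Atomic(ThreadState *) gil_holder_tstate = NULL;

/* Per-thread slot, playing the role of the PyGILState TLS variable. */
static _Thread_local ThreadState *this_thread_tstate = NULL;

/* Bind a tstate to the calling OS thread (done once per thread). */
static void bind_thread(ThreadState *ts)
{
    this_thread_tstate = ts;
}

/* Toy "take the GIL": publish this thread's tstate as the current one. */
static void acquire_gil(void)
{
    atomic_store_explicit(&gil_holder_tstate, this_thread_tstate,
                          memory_order_relaxed);
}

/* Toy "release the GIL": no thread state is current any more. */
static void release_gil(void)
{
    atomic_store_explicit(&gil_holder_tstate, NULL, memory_order_relaxed);
}

/* Models _PyThreadState_GET(). */
static ThreadState *current_tstate(void)
{
    return atomic_load_explicit(&gil_holder_tstate, memory_order_relaxed);
}

/* Models PyGILState_GetThisThreadState(). */
static ThreadState *this_thread_state(void)
{
    return this_thread_tstate;
}
```

After release_gil(), current_tstate() returns NULL while this_thread_state()
still returns the bound tstate, which is the difference the note describes.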

--

Links:

* https://pythoncapi.readthedocs.io/runtime.html : my notes on moving
globals to per interpreter states
*