[issue8299] Improve GIL in 2.7


David Beazley  added the comment:

One other comment.  Running the modified fair.py file on my Linux system using 
Python compiled with semaphores shows that they are *definitely* not fair.  
Here's the relevant part of your test:

Treaded, balanced execution, with quickstop:
fast C: 1.580815 (0 left)
fast B: 1.636923 (158919 left)
fast A: 1.788634 (310323 left)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue8299] Improve GIL in 2.7


David Beazley  added the comment:

I'm not trying to be a pain here, but do you have any explanation as to why, 
with fair scheduling, the observed execution time of multiple CPU-bound threads 
is substantially worse than with unfair scheduling?

From your own benchmarks, consider this result (fair scheduling):

Treaded, balanced execution:
fast A: 0.973000 (0 left)
fast C: 0.992000 (0 left)
fast B: 1.013000 (0 left)

Versus this result with unfair scheduling:

Treaded, balanced execution:
fast A: 0.362000 (0 left)
fast B: 0.464000 (0 left)
fast C: 0.549000 (0 left)

If I'm reading this right, it takes the three threads with fair locking almost 
twice as long to complete (1.01s) as the three threads with unfair locking 
(0.55s) .  If so, why would I want fair locking?   Wouldn't I want the solution 
that offers the fastest overall execution time?


[issue8299] Improve GIL in 2.7

2010-04-16 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

What your fair.py is doing is demonstrating the superior behaviour of a 
time-based GIL interrupt to a bytecode based one.  I have no quibbles with that 
and I agree that it is superior.  But I also think that your example is a very 
artificial one.  On average the duration of the bytecodes evens out between 
threads and so they should give a fair first approximation.

But this is not what I was talking about when considering fairness.  Your test 
only measures each thread end to end, and it demonstrates how a bytecode-based 
GIL yielding system will let the two threads work in lockstep, even though each 
loop in one thread is cheaper than the other.  Fair enough.  But fair 
scheduling between threads doesn't show up here.

To demonstrate fair / unfair, I've modified fair.py to add two more runs, where 
three threads of the "fast" variety are run for an identical number of rounds.  
This is when you will see a difference between the Linux and the Mac based GIL 
implementations.  For your info, on my windows dual core office box, with 
regular windows gil:

D:\pydev\python\trunk\PCbuild>python.exe d:\pyscript\fair.py
Sequential execution
slow: 3.384000 (0 left)
fast: 0.177000 (0 left)
Threaded execution
slow: 3.435000 (0 left)
fast: 3.568000 (0 left)
Treaded, balanced execution:
fast A: 0.973000 (0 left)
fast C: 0.992000 (0 left)
fast B: 1.013000 (0 left)
Treaded, balanced execution, with quickstop:
fast A: 0.977000 (0 left)
fast C: 0.976000 (252 left)
fast B: 0.978000 (17601 left)

And now, same box, with the unfair GIL:

D:\pydev\python\trunk\PCbuild>python.exe d:\pyscript\fair.py
Sequential execution
slow: 3.338000 (0 left)
fast: 0.177000 (0 left)
Threaded execution
fast: 0.382000 (0 left)
slow: 3.539000 (0 left)
Treaded, balanced execution:
fast A: 0.362000 (0 left)
fast B: 0.464000 (0 left)
fast C: 0.549000 (0 left)
Treaded, balanced execution, with quickstop:
fast B: 0.389000 (0 left)
fast A: 0.447000 (240480 left)
fast C: 0.360000 (613098 left)

The two last cases are the interesting ones.  With unfair scheduling, one 
thread takes almost twice as long to complete its 100 inserts as another.  
And if they are all stopped when the quickest one finishes, one thread has 
more than 600000 iterations to go.

This is what I mean by fair/unfair scheduling.
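
The two "balanced execution" runs above can be compared with a little arithmetic. The sketch below copies the finish times from those runs; the helper names `spread` and `wall_time` are made up for illustration and are not part of fair.py.

```python
# Finish times in seconds, copied from the "balanced execution" runs above.
fair = {"A": 0.973, "C": 0.992, "B": 1.013}
unfair = {"A": 0.362, "B": 0.464, "C": 0.549}


def spread(times):
    """Ratio of slowest to fastest finish time; 1.0 means perfectly even."""
    return max(times.values()) / min(times.values())


def wall_time(times):
    """Time until the last thread has finished."""
    return max(times.values())


print("fair   spread: %.2f  wall: %.3fs" % (spread(fair), wall_time(fair)))
print("unfair spread: %.2f  wall: %.3fs" % (spread(unfair), wall_time(unfair)))
print("fair/unfair wall-time ratio: %.2f" % (wall_time(fair) / wall_time(unfair)))
```

The spread is close to 1 for the fair GIL and roughly 1.5 for the unfair one, while the fair run takes a bit under twice as long overall. Both effects are visible in the same numbers.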

Cheers,

Kristján

p.s.  Yes, I agree that time based GIL yielding is better.  I intentionally 
didn't want to confuse the matter with that in 2.x.  I wanted to address the 
other issues that are wrong.

--
Added file: http://bugs.python.org/file16951/fair.py


[issue8299] Improve GIL in 2.7


David Beazley  added the comment:

I've attached a test "fair.py" that gives an example of the fair CPU scheduling 
issue.  In this test, there are two threads, one of which has fast-running 
ticks, one of which has slow-running ticks.  

Here is their sequential performance (OS-X, Python 2.6):

slow: 5.71
fast: 0.32

Here is their threaded performance (OS-X, Python 2.6.4):

slow : 5.99
fast : 6.04   (Notice : Huge jump in execution, unfair CPU)

Here is their threaded performance using the Py3K New GIL:

slow : 5.96
fast : 0.67   (Notice : Fair CPU use--time only doubled)

Using Linux with semaphores gives no benefit here.  The fast code is stalled in 
the same way.   For example: here are my Linux results (Ubuntu 8.10, 
Python-2.6.4, dual-core, using semaphores):

Sequential:
slow : 6.24
fast : 0.59
Threaded:
slow : 6.40
fast : 6.69   (even slower than the slow code!)
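
The attached fair.py is not reproduced in this thread. The following is only a guess at the general shape of such a test, assuming one loop made of cheap ticks and one loop whose ticks each spend a long time in a single C call. The function names and loop bodies are hypothetical.

```python
import threading
import time


def fast_loop(n):
    # Many cheap ticks: each bytecode finishes almost instantly.
    while n > 0:
        n -= 1


def slow_loop(n):
    # Few expensive ticks: sum(range(...)) runs inside one C call,
    # so the thread keeps the GIL for the whole computation.
    while n > 0:
        sum(range(2000))
        n -= 1


def timed(name, func, n, results):
    # Record the elapsed time of func(n) under the given name.
    start = time.time()
    func(n)
    results[name] = time.time() - start


def run_sequential(fast_n, slow_n):
    results = {}
    timed("slow", slow_loop, slow_n, results)
    timed("fast", fast_loop, fast_n, results)
    return results


def run_threaded(fast_n, slow_n):
    results = {}
    threads = [
        threading.Thread(target=timed, args=("slow", slow_loop, slow_n, results)),
        threading.Thread(target=timed, args=("fast", fast_loop, fast_n, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for label, run in (("Sequential", run_sequential), ("Threaded", run_threaded)):
        print(label)
        for name, elapsed in sorted(run(200000, 2000).items()):
            print("%s : %.2f" % (name, elapsed))
```

On a tick-based GIL the "fast" time in the threaded run would be expected to approach the "slow" time, as in the figures above. The actual numbers depend on the interpreter version and platform.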

--
Added file: http://bugs.python.org/file16946/fair.py


[issue8299] Improve GIL in 2.7


David Beazley  added the comment:

I'm sorry, but even in the presence of fair locking, I still don't like this 
patch.   The main problem is that it confuses fair locking with fair CPU 
use---something that this patch does not and can not achieve on any platform.

The main problem is that everything is still based on the execution of 
interpreter ticks.  However, interpreter ticks have wildly varying execution 
times dependent upon the code that's running.   Thus, executing 1000 ticks 
might take significantly longer in one thread than another.   Under a FIFO 
scheduler based on "fair" locking, the thread with the longer-running ticks is 
going to unfairly hog the GIL and the CPU.  For example, if thread 1 takes 95 
usec to execute 1000 ticks and thread 2 takes 5 usec to execute 1000 ticks, 
then thread 1 is going to end up hogging about 95% of the CPU cycles, starving 
thread 2.   To me, that doesn't sound especially "fair." 
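
The 95% figure follows from simple proportions. This sketch assumes strict alternation, where each thread runs exactly one check interval (1000 ticks) per turn and switching overhead is ignored. `cpu_share` is an illustrative helper, not part of any patch.

```python
def cpu_share(usec_per_turn):
    """Fraction of CPU time each thread gets when turns strictly alternate."""
    total = float(sum(usec_per_turn.values()))
    return dict((name, usec / total) for name, usec in usec_per_turn.items())


# Thread 1 needs 95 usec for its 1000 ticks, thread 2 needs 5 usec.
shares = cpu_share({"thread 1": 95, "thread 2": 5})
for name in sorted(shares):
    print("%s: %.0f%%" % (name, shares[name] * 100))
```

Equal tick counts per turn therefore translate into a 95/5 split of CPU time, not 50/50.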

It would be much better to have fairness where threads are guaranteed to get an 
equal time slice of CPU cycles regardless of how many ticks they're executing.  
In other words, it would be much better if the two threads above each got 50% 
of the CPU cycles.   The only way you're ever going to be able to do that is to 
base thread scheduling on timing.   The new GIL in Python 3 makes an effort to 
do this even though some issues are still being worked out with it.

On a slightly unrelated note, I just tried some experiments on Linux with the 
GIL implemented as condition variables and with semaphores.   I honestly didn't 
see any noticeable performance difference between the two versions.  I also 
didn't see any kind of purported "fair" scheduling of threads using the 
semaphore version.  Both versions exhibit the same performance problems as 
described in my GIL talk (albeit not to the same extreme as on OS-X).  Based on 
my own reading of the pthreads source code (yes, I have looked), I can't really 
draw any conclusion about the fairness of semaphores.   Under the covers, it's 
all based on futex locks (the "f" in futex referring to "fast", not "fair" by 
the way). I know that the original paper on futexes has some experiments with 
fair lock scheduling, but I honestly don't know if that is being used in the 
Linux kernel or by pthreads.   My understanding is that by default, futexes do 
not guarantee fairness.  To know for certain with semaphores, much more 
low-level investigation would be required.


[issue8299] Improve GIL in 2.7


Antoine Pitrou  added the comment:

> Googling a bit gave me this:
> http://lists.apple.com/archives/darwin-kernel/2005/Dec/msg00022.html
> It would appear that mac os X was at least lacking full posix semaphore 
> support in 2005.

Hmm.  OS X really sucks.


[issue8299] Improve GIL in 2.7


Kristján Valur Jónsson  added the comment:

Googling a bit gave me this:
http://lists.apple.com/archives/darwin-kernel/2005/Dec/msg00022.html
It would appear that mac os X was at least lacking full posix semaphore support 
in 2005.


[issue8299] Improve GIL in 2.7


Kristján Valur Jónsson  added the comment:

David, I urge you to reconsider:
The "emulated" semaphore is broken because it is unfair.  It is clearly a 
programming error, born out of naivete about how to implement such primitives.  
Proper semaphores therefore cannot be implemented using the "exact same 
mechanism" because proper semaphores are fair, this one isn't.  You do 
understand why exactly it is unfair, don't you?

Second, with a fair GIL you still get poor performance on multicore with low 
values of "tickinterval" but at least you get predictable scheduling.  The 
emulated semaphore is bad in two ways:  Unpredictable scheduling with thread 
starvation _and_ poor multicore performance.  I don't understand why you prefer 
having two problems to one.

I also think it is worth investigating when exactly the "emulation" semaphore 
became the "standard".  Did something break in the config script at some point?


[issue8299] Improve GIL in 2.7

2010-04-15 Thread David Beazley


David Beazley  added the comment:

I hope everyone realizes that all of this bike-shedding about emulated 
semaphores versus "real" semaphores is mostly a non-issue.  For one thing,  go 
look at how a "real" semaphore is implemented by reading the source code to 
pthreads or some other thread library.  You'll find that semaphores are 
implemented using the exact same mechanisms that underlie condition variables 
and in some cases, are actually implemented using a mutex lock and a condition 
variable exactly as Python is doing.

Second, the performance of using "real" semaphores still sucks.   So, all of 
the arguing about "fairness" and whatnot seems to be a total waste of time in 
my opinion because it doesn't address the underlying problem.


[issue8299] Improve GIL in 2.7

2010-04-15 Thread R. David Murray


R. David Murray  added the comment:

Also note that his results were much worse on MacOS than anyone was seeing on 
Linux, which may support this theory :)


[issue8299] Improve GIL in 2.7

2010-04-15 Thread R. David Murray


R. David Murray  added the comment:

My understanding is that David noticed the problem originally on MacOS.  If the 
emulation is indeed being used on that platform (and a little googling 
indicates the MacOS posix semaphore implementation is considered at least 
slightly broken, and FreeBSD didn't support it until 7.2), then perhaps that is 
why he was looking at that code.

--
nosy: +r.david.murray


[issue8299] Improve GIL in 2.7


Antoine Pitrou  added the comment:

> You do realize, that if we enable the USE_SEMAPHORE, we get the GIL
> behaviour as seen on windows and with my ROUNDROBIN_GIL
> implementation, right?

I haven't studied this argument, but I don't see how that contradicts
anything. The main issue witnessed with the 2.x GIL -- and the point of
Dave Beazley's original talk -- is CPU inefficiency (due to far too many
lock operations).


[issue8299] Improve GIL in 2.7


Kristján Valur Jónsson  added the comment:

You do realize, that if we enable the USE_SEMAPHORE, we get the GIL behaviour 
as seen on windows and with my ROUNDROBIN_GIL implementation, right?

Also, at the GIL open space talk at PyCon, David did show us the "emulation" 
source code as if it were _the_ gil.  Well, maybe he can explain it.


[issue8299] Improve GIL in 2.7


Antoine Pitrou  added the comment:

> Yes, we put #error in both places (defining and undefining
> USE_SEMAPHORES).  The colleague in question is Christian Tismer, he is
> unlikely to have gotten it wrong.

Ok, so can you or Christian open an issue about it? We should try to fix
it.

> I am also curious why David Beazley kept talking about the "binary
> semaphore" when it is apparent that that is supposed to be a "hack" to
> use on platforms that don't have posix semaphores.

I think David often uses technical terms (such as "semaphore" or
"signal") in more generic meanings than what you might expect :)


[issue8299] Improve GIL in 2.7


Kristján Valur Jónsson  added the comment:

Yes, we put #error in both places (defining and undefining USE_SEMAPHORES).  
The colleague in question is Christian Tismer, he is unlikely to have gotten it 
wrong.  I am also curious why David Beazley kept talking about the "binary 
semaphore" when it is apparent that that is supposed to be a "hack" to use on 
platforms that don't have posix semaphores.

This gets curiouser and curiouser.


[issue8299] Improve GIL in 2.7


Antoine Pitrou  added the comment:

> However, I just asked a colleague with a os X to compile python 2.7
> and _POSIX_SEMAPHORES isn't defined, and so, it is running using the
> emulation.  Why, I wonder?  Isn't it defined in unistd.h?

Perhaps a bad combination of defines. Has he checked that the semaphore
path isn't used at all? (just put a #error in the other path) If so,
opening an issue would be good.

I would hope we can drop the emulation path one day.


[issue8299] Improve GIL in 2.7


Kristján Valur Jónsson  added the comment:

Oh dear.  I was assuming that the mutex+condition variable were the actual 
implementation mostly in use on pthreads.  This is because of David's GIL open 
talk at pycon, where we were looking at the source and bickering about the 
placement of "pthread_cond_signal()" being after the "pthread_mutex_unlock()" 
call.  

In which case, more than half of this thread is invalid.  I could, perhaps, 
start a new defect: "semaphore emulation using condition variable is broken".

However, I just asked a colleague with an OS X machine to compile python 2.7 and 
_POSIX_SEMAPHORES isn't defined, and so, it is running using the emulation.  
Why, I wonder?  Isn't it defined in unistd.h?

Martin, I don't know if you were suggesting that a "fair" mutex would make the 
emulated semaphore fair too.  You probably weren't, but just in case, the 
fairness of the mutex is immaterial because it is only held for a short time to 
guard the internal state of the "semaphore".  You won't see threads queuing up 
on it, but they will queue on the condition variable.


[issue8299] Improve GIL in 2.7


Antoine Pitrou  added the comment:

> if _POSIX_SEMAPHORES is defined, thread_pthread.h is designed to use
> the (fair) semaphore.  If it is not present, or
> HAVE_BROKEN_POSIX_SEMAPHORES defined, the semaphore is supposed to be
> emulated using a condition variable.
> Now, I don't have access to a mac or linux machine, but does a modern
> python build perhaps actually have USE_SEMAPHORES defined?

Yes, it does.
Actually, I find it unlikely that any modern Unix would fall back on the
non-semaphore version. All this code is (mostly) very old.


[issue8299] Improve GIL in 2.7

Kristján Valur Jónsson added the comment:

Here is yet another point:
if _POSIX_SEMAPHORES is defined, thread_pthread.h is designed to use the (fair)
semaphore. If it is not present, or HAVE_BROKEN_POSIX_SEMAPHORES defined, the
semaphore is supposed to be emulated using a condition variable.
Now, I don't have access to a mac or linux machine, but does a modern python
build perhaps actually have USE_SEMAPHORES defined? if so, then this entire
rant about a broken lock on pthreads is nonsense.

Please note that the "emulated" semaphore is unfair, as I've pointed out,
whereas a posix_sem object strives to be fair. So this "emulation" is not
working.

Martin, you are right that some mutexes are indeed fair. There has been a move
towards using unfair mutexes, particularly on multicore machines. This is
because they reduce the "lock convoying" problem.
A fair mutex hands off the lock to a waiting thread. That thread is then made
runnable. But on a busy system, it may take a while for that thread to
actually start running and use the locked resource. The result is that the
locked resource is unavailable for a longer time. An unfair mutex will wake up
a waiting thread, yet have that thread compete for the mutex with any
interloper that might arrive and claim it. See e.g.
http://www.bluebytesoftware.com/blog/PermaLink,guid,e40c2675-43a3-410f-8f85-616ef7b031aa.aspx
and http://developer.amd.com/documentation/articles/Pages/282007123.aspx


[issue8299] Improve GIL in 2.7

2010-04-13 Thread David Beazley


David Beazley  added the comment:

What bothers me most about this discussion is that the Windows implementation 
(legacy GIL) is being held up as an example of what we should be doing on 
posix.  Yet, if I go run the same thread tests that I presented in my GIL talks 
on a multicore Windows machine, the performance is every bit as bad as, if not 
worse than, what I reported in my talk.  Therefore, why would we want that?   
I just don't get it.


[issue8299] Improve GIL in 2.7

2010-04-13 Thread Martin v . Löwis


Martin v. Löwis  added the comment:

> Maybe the state of this discussion is my fault for not being clear
> enough. Let's abandon terms such as "broken" and "roundrobin."  CS
> theory has the perfectly useful terms "fair" and "unfair."  The fact
> of the matter is this: the pthread GIL (implemented as LEGACY gil) is
> an "unfair" synchronization primitive.

That's not really true. The Linux condition variable (from glibc
linuxthreads), for example, implements "fair" synchronization. Other
implementations may do the same.


[issue8299] Improve GIL in 2.7

2010-04-13 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

Kristjan,

> Maybe the state of this discussion is my fault for not being clear enough.

It's quite a bit simpler. The first 2.7 beta has been released and
there's IMO no way such patches will be accepted. It doesn't seem to be
a pressing enough issue to be considered a real bug. As you said
yourself, most people actually aren't really affected, or not enough.

> CPU threads are scheduled fairly on windows, and incredibly unfairly
> on pthreads.

pthreads doesn't schedule anything. The kernel does. I'm sure that on
non-tiny periods (>= 5s) they are scheduled quite fairly. It's just that
switching occurs less often and less regularly than you might hope.

> Antoine, I understand your point about do_yield, yet the results
> for 3 seconds without it are telling on their own, and worthy of being
> studied, which is why I suggested disabling it.

As I said, they will render 2.x results completely wrong (at least under
Linux).

> Also, I think you will find that the imbalance in the throughput of the
> threads won't go away even after 30 seconds.

I'm actually not really interested in confirming this, but as I said
there's no reason to think that the Linux kernel does a bad job.
(the one reputed to do a bad job at scheduling, especially for desktop
environments, is the Windows kernel)

> I've improved my patch some more.  I'll upload it soon.

If you are interested in taking it further, I would recommend publishing
your patch (and prebuilt binaries, if you care) somewhere else as well,
because as I said there's probably no way it gets integrated during what
remains of the 2.x timeline.

Of course, other developers might disagree with me, in which case your
patch /can/ be integrated. But I don't see a lot of interest showing
honestly.

> We just need to have an order of magnitude thing there.

Duration of opcodes can vary by more than an order of magnitude.
ccbench includes such testing by the way (different CPU-bound workloads)


[issue8299] Improve GIL in 2.7

2010-04-13 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

Maybe the state of this discussion is my fault for not being clear enough.
Let's abandon terms such as "broken" and "roundrobin."  CS theory has the 
perfectly useful terms "fair" and "unfair."  The fact of the matter is this:
the pthread GIL (implemented as LEGACY gil) is an "unfair" synchronization 
primitive.  The GIL on windows, (and the poorly named ROUNDROBIN_GIL) is a  
"fair" synchronization primitive.

Unfair mutexes have their place, and such is the behaviour of the windows 
condition variable (and the pthreads mutex, I suspect).  But they are not 
useful if you want to provide fair access to a resource that is held all the 
time.

Until after that GIL business at PyCon I wasn't aware of this fundamental 
difference between the GIL on windows and pthreads platforms in 2.x.  It is 
astonishing to me that no one appears to have noticed the difference, or made 
much of it.  CPU threads are scheduled fairly on windows, and incredibly 
unfairly on pthreads.

with the ROUNDROBIN_GIL I'm not proposing anything radical, I'm just suggesting 
that we adopt the superior behaviour that has been on windows all along.  Yes, 
people actually do use windows.

Antoine, I understand your point about do_yield, yet the results for 3 
seconds without it are telling on their own, and worthy of being studied, which 
is why I suggested disabling it.

Also, I think you will find that the imbalance in the throughput of the threads 
won't go away even after 30 seconds.  Unfortunately, the unfairness is such 
that it may actually diverge.

I've improved my patch some more.  I'll upload it soon.  In particular, I've 
added a PyThread_gil_yield() method to enable whatever underlying gil there is 
to possibly deal with this particular locking case differently, 
perhaps suggesting to the OS not to switch cores.


I've also created a simple program in visual studio to examine a GIL outside 
the context of python.  I'll put it here tomorrow too, for those interested.  
It allows for simpler experimentation, although because the loop is small, you 
won't see the effect of the instruction cache problems.


David, I actually think that the checkinterval is a perfectly good mechanism, 
especially if augmented with an interrupt mechanism.  What does it matter if 
some opcodes are slower than others?  When we are checking every 100 or 1000 
(or 1 as I am proposing) that hardly matters.  We just need to have an 
order of magnitude thing there.  But there are other ways to do it.  You can 
use a timer on windows, and on pthreads too, I think.

But the whole point of this patch is to take a step back, and to see if there 
is a way to fix the "gil problem" in a simpler way by first trying to 
understand it fully, and then apply minimal changes to solve it.


[issue8299] Improve GIL in 2.7

2010-04-11 Thread David Beazley


David Beazley  added the comment:

I'm sorry, I still don't get the supposed benefits of this round-robin patch 
over the legacy GIL.   Given that using interpreter ticks as a basis for thread 
scheduling is problematic to begin with (mostly due to the fact that ticks have 
totally unpredictable execution times), I'd much rather see further GIL work 
continue to build upon the time-based scheduler that's been implemented in 
Python 3.2.  For instance, I think being able to specify a thread-switching 
interval in seconds (sys.setswitchinterval) makes much more sense than 
continuing to fool around with check intervals and all of this tick business.

The new GIL implementation is by no means perfect, but people are working on 
it.   I'd much rather know if anything that you've worked out with this patch 
can be applied to that version of the GIL.
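
For reference, in Python 3.2 and later the time-based switching interval is exposed through `sys.getswitchinterval()` and `sys.setswitchinterval()`. The snippet below is a small usage sketch for Python 3 only; the value 0.01 is an arbitrary example, not a recommendation.

```python
import sys

# Remember the interpreter's current setting so it can be restored.
old_interval = sys.getswitchinterval()

# Ask the interpreter to request a thread switch roughly every 10 ms.
sys.setswitchinterval(0.01)
try:
    current = sys.getswitchinterval()
    print("switch interval now: %.3f s" % current)
finally:
    # Restore the previous value.
    sys.setswitchinterval(old_interval)
```

The interval is expressed in seconds, so it does not depend on how long individual bytecodes take.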


[issue8299] Improve GIL in 2.7

2010-04-11 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

> Antoine (2):  The need to have do_yield is a symptom of the brokenness
> of the GIL.

Of course it is. But the point of the benchmark is to give valid results
even with the old broken GIL.

I could remove do_yield and still have it give valid results, but that
would mean running each step for 30 seconds instead of 2. I don't like
having to wait several minutes for benchmark numbers :-)


[issue8299] Improve GIL in 2.7

2010-04-11 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

David, I don't necessarily think it is reasonable to yield every 100 opcodes, 
but that is the _intent_ of the current code base. Checkinterval is set to 100. 
 If you don't want that, then set it higher.  Your statement is like saying: 
"Why would you want to have your windows fit tightly, it sounds like a horrible 
thing for the air quality indoors" (I actually got this when living in 
Germany).  The answer is, of course, that a snugly fitting window can still be 
opened if you want, but more importantly, you _can_ close it properly.

And because the condition variable isn't strictly FIFO, it actually doesn't 
switch every time (an observation.  The scheduler may decide to do its own 
things inside the condition variable / semaphore).  What the ROUNDROBIN_GIL 
ensures, however, is that the condition variable is _entered_ every 
checkinterval.  

What I'm trying to demonstrate to you is the brokenness of the legacy GIL (as 
observed by Antoine long ago) and how it is not broken on windows.  It is 
broken because the currently running thread is biased to reacquire the GIL 
immediately in an unpredictable fashion that is not being managed by the (OS) 
thread scheduler.  Because it doesn't enter the condition variable wait when 
others are competing for it, the scheduler has no means of providing "fairness" 
to the application.

So, to summarise this:  I'm not proposing that we context switch every 100 
opcodes, but I am proposing that we context switch consistently according to 
whatever checkinterval is put in place.

Antoine, in case you misunderstood:  I'm saying that the ROUNDROBIN_GIL and the 
Windows GIL are the same.  If you don't believe me, take a look at the 
NonRecursiveLock implementation for windows.  I'm also starting to think that 
you didn't actually bother to look at the patch.  Please compare 
PyLock_gil_acquire() for LEGACY_GIL and ROUNDROBIN_GIL and see if you can spot 
the difference.  Really, it's just two lines of code.

Maybe it needs restating. The bug is this (python pseudocode):

with gil.cond:
  while gil.locked: #this line is the bug
    gil.cond.wait()
  gil.locked = True

vs.

with gil.cond:
  if gil.n_waiting or gil.locked:
    gil.n_waiting += 1
    while True:
      gil.cond.wait() #always wait at least once
      if not gil.locked:
        break
    gil.n_waiting -= 1
  gil.locked = True

The cond.wait() is where fairness ensues, where the OS can decide to serve 
threads roughly on a first come, first serve basis. If you are biased towards 
not entering it at all (when yielding the GIL), then you have taken away the 
OS' chance of scheduling. 
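
The two acquire paths in the pseudocode above can be turned into runnable Python for experimentation. This is only a sketch: it assumes the legacy path waits while the lock is held, the `release()` method and the class names `LegacyGil` and `RoundRobinGil` are invented here, and the patch's actual C code is not shown in this thread.

```python
import threading


class LegacyGil(object):
    """Waits only while the lock is held, so a thread that has just
    released the lock can usually take it straight back."""

    def __init__(self):
        self.cond = threading.Condition()
        self.locked = False

    def acquire(self):
        with self.cond:
            while self.locked:
                self.cond.wait()
            self.locked = True

    def release(self):
        with self.cond:
            self.locked = False
            self.cond.notify()  # wake one waiter, if there is one


class RoundRobinGil(LegacyGil):
    """Always enters the wait when other threads are already queued."""

    def __init__(self):
        LegacyGil.__init__(self)
        self.n_waiting = 0

    def acquire(self):
        with self.cond:
            if self.n_waiting or self.locked:
                self.n_waiting += 1
                while True:
                    self.cond.wait()  # always wait at least once
                    if not self.locked:
                        break
                self.n_waiting -= 1
            self.locked = True
```

Both classes give mutual exclusion. They differ only in whether a thread that releases and immediately re-acquires is forced to queue behind threads that are already waiting.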

Antoine (2):  The need to have do_yield is a symptom of the brokenness of the 
GIL.  You have a checkinterval of 100, which elapses some 1000 times per 
second, and yet you have to put in place special fudge code to ensure that we 
do get switches every few seconds?  The whole point of the checkinterval is for 
you _not_ to have to dot the code with sleep() calls.  Surely you don't expect 
the average application developer to do that if he wants his two cpu bound 
threads to compete fairly for the GIL?  This is why I added the -y switch:  To 
emulate normal application code.

Also, the 0.7 imbalance observed in the SHA1 disappears on windows, (and using 
ROUNDROBIN_GIL).  It is not due to the windows scheduler, it is due to the 
broken legacy_gil.


This last slew of comments has been about the ROUNDROBIN_GIL only.  I haven't 
dazzled you yet with PRIORITY_GIL, but that solves both problems because it is 
_fair_, and it allows us to increase the checkinterval to 1, thus 
eliminating the rapid switching overhead, and yet gives fast response to IO.


[issue8299] Improve GIL in 2.7

2010-04-11 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

> SHA1 hashing (C)
> 
> threads= 1:  1275 iterations/s.   balance
> threads= 2:  1267 ( 99%)          0.7238
> threads= 3:  1271 ( 99%)          0.2405
> threads= 4:  1270 ( 99%)          0.1508
> 
> Using the forced "do_yield" helps balance things, but not much.  We
> still have a .7 balance in SHA1 hashing for two threads.

Which is not unreasonable, since SHA1 releases the GIL. The unbalance
would be produced by the Windows scheduler, not by Python.

Note: "do_yield" is not meant to "balance" things as much as to make
measurements meaningful at all. Without switching at all during say 2
seconds, the numbers become totally worthless.

> If no one objects, I'd like to submit this changed ccbench.py to the trunk.

Please let me take a look.

--


[issue8299] Improve GIL in 2.7

2010-04-11 Thread David Beazley


David Beazley  added the comment:

Sorry, but I don't see how you can say that the round-robin GIL and the legacy 
GIL have the same behavior based solely on the result of a performance 
benchmark.   Do you have any kind of thread scheduling trace that proves they 
are scheduling threads in exactly the same manner?   Maybe they both have lousy 
performance, but for different reasons.

--


[issue8299] Improve GIL in 2.7

2010-04-11 Thread David Beazley


David Beazley  added the comment:

I must be missing something, but why, exactly, would you want multiple CPU-bound 
threads to yield every 100 ticks?   Frankly, that sounds like a horrible idea 
that is going to hammer your system with excessive context switching overhead 
and cache performance problems---an effect that you yourself have actually 
observed.   The results of ccbench also show worse performance for the 
round-robin GIL because of this.

Although the legacy GIL signals every 100 ticks, threads do not context switch 
that rapidly.  In fact, on single CPU systems, they context switch at about the 
same rate as the system time-slice (5-10 milliseconds on most systems). The 
new GIL implemented by Antoine also does not rapidly switch CPU-bound threads.

Again, I must be missing something, but I don't see how this round-robin GIL 
and all of this forced thread switching is anything that you would ever 
want--especially for CPU-bound threads. It seems to go against just about every 
design goal that people usually have for schedulers (especially the goal of 
minimizing context switching overhead).

Again, maybe I'm just being dense and missing something.

--


[issue8299] Improve GIL in 2.7

2010-04-11 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

Fyi, here is the output using the unmodified Windows GIL, i.e. without my patch 
being active:
C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -t -y
== CPython 2.7a4+.0 (trunk) ==
== AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' ==

--- Throughput ---

Pi calculation (Python)

threads= 1:   623 iterations/s.   balance
threads= 2:   489 ( 78%)          0.0289
threads= 3:   461 ( 74%)          0.0369
threads= 4:   460 ( 73%)          0.0426

regular expression (C)

threads= 1:   515 iterations/s.   balance
threads= 2:   548 (106%)          0.0771
threads= 3:   532 (103%)          0.0556
threads= 4:   523 (101%)          0.1132

SHA1 hashing (C)

threads= 1:  1188 iterations/s.   balance
threads= 2:  1212 (102%)          0.0232
threads= 3:  1198 (100%)          0.0250
threads= 4:  1215 (102%)          0.0163

You see results virtually identical to the ROUNDROBIN_GIL implementation.  This 
is just to demonstrate that Windows has had the ROUNDROBIN_GIL behaviour all 
along.

--


[issue8299] Improve GIL in 2.7

2010-04-11 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

I looked at ccbench.  It's a great tool.  I've added two features to it (see 
the attached patch): a -y option to turn off the "do_yield" option in 
throughput, and so measure thread scheduling without assistance; and a 
"balance" figure in the throughput output, which is the standard deviation of 
the throughput of each thread normalized by the average.
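
As a rough illustration of the "balance" figure described above, here is a 
minimal sketch in Python.  It assumes the sample (n-1) form of the standard 
deviation; the function name balance and its input list are hypothetical, and 
the actual code in ccbench.patch may well differ.

```python
import math

def balance(per_thread_throughput):
    """Standard deviation of per-thread throughput divided by the mean.

    Assumption: sample (n-1) standard deviation; the real patch may
    use a different estimator.
    """
    n = len(per_thread_throughput)
    if n < 2:
        return 0.0  # a single thread has nothing to be unbalanced against
    mean = sum(per_thread_throughput) / float(n)
    if mean == 0:
        return 0.0  # no work done at all; avoid dividing by zero
    variance = sum((x - mean) ** 2 for x in per_thread_throughput) / (n - 1)
    return math.sqrt(variance) / mean
```

A value near 0 means the threads progressed at about the same rate.  Under this 
definition, two threads doing 150 and 50 iterations respectively would give 
roughly 0.71.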

I give you three results for throughput, to demonstrate the ROUNDROBIN_GIL 
implementation:
1) LEGACY_GIL, no forced switching
C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -y -t
== CPython 2.7a4+.0 (trunk) ==
== AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' ==

--- Throughput ---

Pi calculation (Python)

threads= 1:   672 iterations/s.   balance
threads= 2:   597 ( 88%)          0.4243
threads= 3:   603 ( 89%)          0.2475
threads= 4:   596 ( 88%)          0.4776

regular expression (C)

threads= 1:   571 iterations/s.   balance
threads= 2:   565 ( 98%)          0.6203
threads= 3:   567 ( 99%)          1.6867
threads= 4:   570 ( 99%)          1.1670

SHA1 hashing (C)

threads= 1:  1269 iterations/s.   balance
threads= 2:  1268 ( 99%)          1.1470
threads= 3:  1270 (100%)          0.6024
threads= 4:  1263 ( 99%)          0.7419

2) LEGACY_GIL, with forced switching
C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -t
== CPython 2.7a4+.0 (trunk) ==
== AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' ==

--- Throughput ---

Pi calculation (Python)

threads= 1:   663 iterations/s.   balance
threads= 2:   605 ( 91%)          0.0232
threads= 3:   599 ( 90%)          0.1988
threads= 4:   601 ( 90%)          0.4648

regular expression (C)

threads= 1:   568 iterations/s.   balance
threads= 2:   562 ( 99%)          0.1737
threads= 3:   571 (100%)          0.3950
threads= 4:   566 ( 99%)          0.3158

SHA1 hashing (C)

threads= 1:  1275 iterations/s.   balance
threads= 2:  1267 ( 99%)          0.7238
threads= 3:  1271 ( 99%)          0.2405
threads= 4:  1270 ( 99%)          0.1508

Using the forced "do_yield" helps balance things, but not much.  We still have 
a .7 balance in SHA1 hashing for two threads.

Now, for ROUNDROBIN_GIL, and no forced switching:
C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -t -y
== CPython 2.7a4+.0 (trunk) ==
== AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' ==

--- Throughput ---

Pi calculation (Python)

threads= 1:   672 iterations/s.   balance
threads= 2:   485 ( 72%)          0.0289
threads= 3:   448 ( 66%)          0.0737
threads= 4:   476 ( 70%)          0.0408

regular expression (C)

threads= 1:   569 iterations/s.   balance
threads= 2:   551 ( 96%)          0.0505
threads= 3:   551 ( 96%)          0.1637
threads= 4:   551 ( 96%)          0.2020

SHA1 hashing (C)

threads= 1:  1271 iterations/s.   balance
threads= 2:  1262 ( 99%)          0.0111
threads= 3:  1207 ( 94%)          0.0143
threads= 4:  1202 ( 94%)          0.0317

Notice the much better balance value, and this is without the forced sleep.
Also note a lower throughput when computing pi with threads.  This is because 
yielding every 100 opcodes now actually works, and the aforementioned 
instruction cache problem kicks in.  Increasing the checkinterval to 1000 
solves this:
C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -t -y -i1000
== CPython 2.7a4+.0 (trunk) ==
== AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' ==

--- Throughput ---

Pi calculation (Python)

threads= 1:   673 iterations/s.   balance
threads= 2:   628 ( 93%)          0.
threads= 3:   603 ( 89%)          0.0284
threads= 4:   606 ( 90%)          0.0328

regular expression (C)

threads= 1:   570 iterations/s.   balance
threads= 2:   569 ( 99%)          0.2729
threads= 3:   562 ( 98%)          0.6595
threads= 4:   560 ( 98%)          1.2440

SHA1 hashing (C)

threads= 1:  1265 iterations/s.   balance
threads= 2:  1256 ( 99%)          0.
threads= 3:  1264 ( 99%)          0.0759
threads= 4:  1255 ( 99%)          0.1309

If no one objects, I'd like to submit this changed ccbench.py to the trunk.

--
Added file: http://bugs.python.org/file16867/ccbench.patch


[issue8299] Improve GIL in 2.7

2010-04-09 Thread anatoly techtonik


anatoly techtonik  added the comment:

If it really improves multicore performance and none of our tests fail (even in 
memory/resource/time survival tests) then I'd give it a try even after a beta. 
2.x is still the best practical version out there.

--
nosy: +techtonik


[issue8299] Improve GIL in 2.7

2010-04-09 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

David, yes, messing about with processor affinities is certainly not nice,
especially since the issue is cross-platform.
The pthreads API doesn't offer much.  There is pthread_setschedparam(), and 
pthread_setconcurrency().  Unfortunately I don't have a pthreads machine to 
test that with.
On Windows, one possibility would be to switch to fibers, in the case of a 
yielding thread.  I don't know if that would change anything, or if the 
thread-to-fiber and vice versa conversion is lightweight enough to be used 
dynamically.

Antoine: I'm not familiar with ccbench.  I'll look into it.   As for my FIFO 
fix, py3k is trying to do more, namely get rid of the checkinterval. It is most 
certainly a more complex solution and brings with it its own set of problems.  
The only thing that needs fixing is to add "fairness" to the GIL.

I know that this is coming a bit late for 2.7 and I'm not pushing it as such 
for 2.7.  But after 2.7 comes 2.8 (and so on ad infinitum).  I'm also 
pointing out the obvious problem and an obvious simple fix which doesn't 
involve inventing a whole new system.  I would have thought that this should at 
least spark some enthusiasm.

It's unfortunate, maybe, that I only realized so late that the pythread GIL was 
implemented using a homebrew condition variable mechanism.  I always thought 
(being a Windows guy) that it was simply using a pthread mutex and thus 
the greedy behaviour of the GIL could be ascribed to that.

Anyway, I'll continue giving this patch some love.  I wouldn't be surprised if 
it, and especially the "priority" variant, would be appealing to people doing 
e.g. webservers with 2.x technology.

Another thing that the "priority" patch has done is convince me that I really 
need to implement this scheduling mode in stackless, since it does appear to 
help network latency when using FIFO scheduling of threads / tasklets.

Cheers!

--


[issue8299] Improve GIL in 2.7

2010-04-06 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

> The counter is "stall cycles".
> During the 10 second run on my 2.4 GHz cpu, we had instruction cache
> miss stalls for 2 billion cycles (2000 samples of 100 cycles per
> sample).  That does account for around 10% of the available cpu.

Ok, thanks.

> 2) The poor performance of competing CPU threads on multicore machines
> is due to the instruction cache behaviour of non-overlapping thread
> execution on different cores.

Have you tried your measurement approach with ccbench?

> We can fix 1) easily, even with a much less invasive patch than the
> ones I have put in here.  I'm a bit surprised at the apparent
> disinterest in such an obvious bug / fix.

As already said, it's too late for 2.7. And the fix in 3.2 is most
probably better.

--


[issue8299] Improve GIL in 2.7

2010-04-06 Thread David Beazley


David Beazley  added the comment:

The analysis of instruction cache behavior is interesting---I could definitely 
see that coming into play given the heavy penalty that one sees going to 
multiple cores (it's a side effect in addition to everything else that goes 
wrong, such as a huge increase in the number of system calls).

I will only point out that messing around with processor affinities is going to 
be problematic.  There are C/C++ extensions to Python that intentionally 
release the GIL and want to run fully multithreaded across as many cores as 
might be available.  Setting processor affinities is going to be the exact 
opposite of what you want for code like that.

--


[issue8299] Improve GIL in 2.7

2010-04-06 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

The counter is "stall cycles".
During the 10 second run on my 2.4 GHz cpu, we had instruction cache miss stalls 
for 2 billion cycles (2000 samples of 100 cycles per sample).  That does 
account for around 10% of the available cpu.

I'm observing something like 20% slowdown, though, so there are probably other 
causes.

Profiling another counter, "instruction fetches", I see this, for a "fast run":
Functions Causing Most Work
Name              Samples  %
Unknown Frame(s)  10.733   99,49

and for a slow run:
Functions Causing Most Work
Name              Samples  %
Unknown Frame(s)  8.056    99,48

This shows a 20% drop in fetched instructions in the interval (five seconds 
this time).  Ideally, we should see 12000 samples in the fast case (2.4 GHz, 
5s) but we see 1 due to what cache misses there are in this case.  The 
cache misses in the "slow" case cause effective instruction fetches to drop by 
20% on top of that.

I think that this is proof positive that the slowdown is due to instruction 
cache misses, at least on this dual core intel machine that I am using.

As for "the OS should handle this", I agree.  But it doesn't.  We are doing 
something unusual:  Convoying two (or more) threads allowing only one to run at 
a time.  The OS scheduler isn't built for that.  It can only assume that there 
will be some parallel execution and so it thinks that it is best to put the two 
sequential threads on different cpus.  But it is wrong, so the cost associated 
with cache misses outweighs the benefit of running on another core (zero, in 
our case).

So, the OS won't handle it, no matter how hard we wish that it would.  It is us 
that know how these gridlocked threads behave, and we do so much better than 
any OS scheduler can guess.  So, rather than beat our heads against the rock, 
I'm going to try to come up with a useful heuristic as to when to switch cores, 
and when not.  It would be useful as a diagnostic tool, if nothing more.

Ok, so we have established two things, I think:
1) the poor response of IO threads in the presence of CPU threads on 
thread_pthreads.h implementations (on multicore) is because of greedy gil wait 
semantics in the current gil.  It's easily fixable by using the 
ROUNDROBIN_GIL implementation I've shown.
2) The poor performance of competing CPU threads on multicore machines is due 
to the instruction cache behaviour of non-overlapping thread execution on 
different cores.

We can fix 1) easily, even with a much less invasive patch than the ones I have 
put in here.  I'm a bit surprised at the apparent disinterest in such an 
obvious bug / fix.

As for 2), well, see above.  Nothing we can do, really, except identify those 
cases where we are releasing GIL just to yield (one case, actually, ceval.c) 
and try to instruct the OS not to switch cores in that case.  I'll see what I 
can come up with.

Cheers.

--


[issue8299] Improve GIL in 2.7

2010-04-06 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

[...]
> _PyObject_Call  403      99,02
[...]
> affinity off:
> Functions Causing Most Work
> Name            Samples  %
[...]
> _PyObject_Call  1.936    99,23
[...]
> _threadstartex  1.934    99,13
> 
> When we run on both cores, we get four times as many L1 instruction cache 
> hits!

You mean we get 4x the number of cache /misses/, right?

This analysis is gratuitous if you can't evaluate/measure/calculate the
actual cost (in proportion of total elapsed or CPU time) of the
instruction cache misses. Perhaps it is actually negligible and the
slowdown is caused by something else.

> How best to combat this?  I'll do some experiments on Windows.
> Perhaps we can identify cpu-bound threads and group them on a single
> core.

IMHO, the OS should handle this. I don't think ad-hoc platform-specific
CPU affinity tweaks belong in the Python core.

--


[issue8299] Improve GIL in 2.7

2010-04-06 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

I just did some profiling.  I'm using Visual Studio Team Edition, which has some 
fancy built-in profiling.  I decided to compare the performance of the 
iotest.py script with two cpu threads, running for 10 seconds with processor 
affinity enabled and disabled.  I added this code to the script:
if affinity:
    import ctypes
    i = ctypes.c_int()
    i.value = 1
    # -1 is the pseudo-handle for the current process; mask 1 pins it to CPU 0
    ctypes.windll.kernel32.SetProcessAffinityMask(-1, 1)

Regular instruction counter sampling showed no differences.  There were no 
indications of excessive time being used in the GIL or any strangeness with the 
locking primitives.  So, I decided to sample on cpu performance counters.  
Following up on my conjecture from yesterday, that this was due to 
inefficiencies in switching between cpus, I settled on sampling the instruction 
fetch stall cycles from the instruction fetch unit.  I sample every 100 
stalls.  I get interesting results.

With affinity:
Functions Causing Most Work
Name                            Samples  %
_PyObject_Call                  403      99,02
_PyEval_EvalFrameEx             402      98,77
_PyEval_EvalCodeEx              402      98,77
_PyEval_CallObjectWithKeywords  400      98,28
call_function                   395      97,05

affinity off:
Functions Causing Most Work
Name                            Samples  %
_PyEval_EvalFrameEx             1.937    99,28
_PyEval_EvalCodeEx              1.937    99,28
_PyEval_CallObjectWithKeywords  1.936    99,23
_PyObject_Call                  1.936    99,23
_threadstartex                  1.934    99,13

When we run on both cores, we get four times as many L1 instruction cache hits! 
 So, what appears to be happening is that each time that a switch occurs the L1 
instruction cache for each core must be repopulated with the python evaluation 
loop, it having been evacuated on that core during the hiatus.

Note that for this effect to kick in we need a large piece of code exercising 
the cache, such as the evaluation loop.  Earlier today, I wrote a simple 
(python free) C program to do similar testing, using a GIL, and found no 
performance degradation due to multi core, but that program only had a very 
simple "work" function.

So, this confirms my hypothesis:  The downgrading of the performance of python 
cpu bound threads on multicore machines stems from the shuttling about of the 
python evaluation loop between the instruction caches of the individual cores.

How best to combat this?  I'll do some experiments on Windows.  Perhaps we can 
identify cpu-bound threads and group them on a single core.

--


[issue8299] Improve GIL in 2.7

2010-04-05 Thread Florent Xicluna


Changes by Florent Xicluna :


--
nosy: +flox


[issue8299] Improve GIL in 2.7

2010-04-05 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

Sorry, what I meant with the "original problem" was the phenomenon observed by 
Antoine (IIRC) that the same CPU thread tends to hog the gil, even when 
releasing it in ceval.c.
What I have been looking at up to now is chiefly IO performance using David's 
iotest.py, and improving the poor performance of IO.  IO will not suffer as 
badly on Windows because the IO thread will get its fair slice of execution 
time.  Prompted by you, I added this bit of code to the iotest.py:
spins = 0
laststat = 0
def spin():
    global spins, laststat
    task, args = task_pidigits()
    while True:
        r = task(*args)
        spins += 1
        t = time.clock()
        # about once per second, print the iterations/s since the last report
        if t - laststat > 1:
            print spins/(t-laststat)
            spins = 0
            laststat = t

You are right, however, that the CPU throughput of multiple CPU-bound threads 
suffers.  And in fact, on Windows, it appears to suffer the least using the 
LEGACY_GIL implementation.  This is, I conjecture, because there are far fewer 
context switches (because relinquishing the GIL fails).  My conjecture is that 
context switches between threads on two cores are so expensive as to 
dramatically affect performance.  Normal multithreaded programs don't suffer 
from this because the threads are kept busy.  But in our case, we are stopping 
one thread on one core, and starting another on a separate core, and this 
causes latency.

Now, I've improved my patch somewhat.  First off, I fixed some minor errors in 
the PRIORITY_GIL implementation.  But more importantly, I added something 
called FIFOCOND.  It is a condition variable that guarantees the FIFO property.
This was prompted by my observation that even Windows' Semaphore doesn't do 
that; rather, the Windows scheduler may allow the currently executing thread to 
jump ahead in the semaphore queue.  The FIFOCOND condition variable fixes that 
using explicit scheduling, and is intended as a diagnostic tool.
(Antoine, your comment from 13:04 about "roundrobin" is valid insofar as we 
don't know anything about the condition variable behaviour.  I was assuming FIFO 
behaviour for the sake of argument, and I thought I'd put it into the comments 
that we assume a general 'fairness' there.  Put in the FIFOCOND and you will 
have that fairness guaranteed.)
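
To make the FIFO idea concrete, here is a minimal sketch of a ticket-based lock 
in Python.  It is not the FIFOCOND code from the patch (which is C); the class 
name FifoLock, its methods and the demo helper are all hypothetical, and the 
ticket scheme is just one simple way to get strict first-come, first-served 
hand-off on top of an ordinary condition variable.

```python
import threading
import time

class FifoLock(object):
    """Lock that is granted strictly in order of arrival (ticket scheme)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0   # next ticket to hand out
        self._now_serving = 0   # ticket currently allowed to hold the lock

    def acquire(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            while ticket != self._now_serving:
                self._cond.wait()

    def release(self):
        with self._cond:
            self._now_serving += 1
            # Everyone wakes up, but only the next ticket holder proceeds.
            self._cond.notify_all()

    def tickets_issued(self):
        with self._cond:
            return self._next_ticket

def demo(n_workers=4):
    """Queue workers one by one behind a held lock; return the grant order."""
    lock = FifoLock()
    order = []
    lock.acquire()                      # the main thread holds ticket 0

    def work(i):
        lock.acquire()
        order.append(i)
        lock.release()

    workers = []
    for i in range(n_workers):
        t = threading.Thread(target=work, args=(i,))
        t.start()
        # Wait until worker i has taken its ticket before starting the next.
        while lock.tickets_issued() < i + 2:
            time.sleep(0.001)
        workers.append(t)
    lock.release()
    for t in workers:
        t.join()
    return order
```

Calling demo() should return the workers in the order they queued up.  The 
trade-off is that notify_all() wakes every waiter on each release, which is 
acceptable for a diagnostic tool but would be wasteful with many threads.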


At any rate, I believe my patch provides a useful platform for further 
experimentation.
1) Factoring out the gil as a separate type of lock (which it must be)
2) allowing for different implementation of the GIL
3) shoring up the Condition variable implementation on Windows
4) Providing a FIFOCOND_T type to enforce a particular scheduling order, and 
demonstrating how we can be explicit about thread scheduling.

I have already demonstrated that using the PRIORITY_GIL method fixes the 
problem with IO threads in the presence of CPU bound threads.  Your iotest.py 
script is perfect for this, using 2 worker threads.  On windows, the problem 
with IO wasn't so grave as I have explained (windows by default works as the 
ROUNDROBIN_GIL implementation, not the LEGACY_GIL mode used on pthreads).  The 
PRIORITY_GIL solution is particularly effective with multicore on Windows, but 
it also improves IO throughput if cpu affinity of the server is fixed to one 
CPU, i.e. on singlecore.

I have no fix for CPU bound threads, and I honestly don't think such a fix 
exists, except by causing switches to happen far less frequently, e.g. by 
raising the checkinterval, and so mitigating the problem (which is what the new 
gil in py3k does with its timeout implementation)  But the IO fix for pthreads

To summarise then:
1) The GIL has two problems on multicore machines
 a) performance of CPU threads goes down
 b) performance of IO in the presence of CPU threads is abysmal, but not on 
Windows
2) We can fix problem b) on pthreads with the ROUNDROBIN_GIL implementation.
3) We can improve IO performance in the presence of CPU threads on pthreads and 
Windows using the PRIORITY_GIL implementation, even to become faster than on a 
single core.
4) We cannot do anything about decreased performance of co-operatively 
switching CPU threads on multicore except switching less frequently.   But this 
is quite feasible now with the PRIORITY_GIL implementation because it can 
request an immediate gil drop when IO is ready.  So raising the checkinterval 
will not affect IO performance in a negative way.


Please have a look at the latest patch with IO thread performance in mind.  It 
is currently configured to enable the PRIORITY_GIL implementation without the 
FIFOCOND on windows and pthreads.

--
Added file: http://bugs.python.org/file16770/gil2.patch


[issue8299] Improve GIL in 2.7


David Beazley  added the comment:

It's not a simple mutex because if you did that, you would have performance 
problems much worse than those described in issue 7946.  
http://bugs.python.org/issue7946

--


[issue8299] Improve GIL in 2.7

2010-04-03 Thread Torsten Landschoff


Torsten Landschoff  added the comment:

Silly question, I know, but why isn't the GIL just implemented as a lock of the 
host operating system? After all, we want mutual exclusion, I don't see why 
condition variables are required for this.

I have to admit that I did not look at the source, so the reason might be 
documented there.

--
nosy: +torsten


[issue8299] Improve GIL in 2.7

2010-04-03 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

Kristjan, I agree with Martin, it's probably too late to make such
changes for 2.7.
Additionally, your "round-robin" scheme only seems round-robin when
there are two threads competing. Otherwise, you could have three threads
A, B and C, and the GIL bouncing between A and B.

I would advocate opening a separate issue to improve the Windows
condition variable code under 3.x.

--


[issue8299] Improve GIL in 2.7


David Beazley  added the comment:

Just ran the CPU-bound GIL test on my wife's dual core Windows Vista machine.  
The code runs twice as slow using two threads as it does using no threads 
(original observed behavior in my GIL talk).

--


[issue8299] Improve GIL in 2.7


David Beazley  added the comment:

I'm not sure where you're getting your information, but the original GIL 
problem *DEFINITELY* exists on multicore Windows machines.   I've had numerous 
participants try it in training classes and workshops, and they've all observed 
severely degraded performance for CPU-bound threads on Windows (XP, Vista, and 
Windows 7).

--


[issue8299] Improve GIL in 2.7

2010-04-03 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

Antoine:  Please take a look, the change is really simple, particularly the 
ROUNDROBIN_GIL variant which fixes the originally observed problem.
The GIL is still a lock, implemented using a mutex and a semaphore.  It is 
modified to work exactly as the lock always has done on Windows (which is why 
the original problem isn't present on that platform).

The simplicity of the change stems from the fact that the gil is still just a 
mutex-type object, which is acquired and released just as it has always been.  
The change is in the internal rules of the mutex, making sure that threads 
queue up properly and (optionally) that they are released in a priority based 
order.
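
As an illustration of what "released in a priority based order" can mean, here 
is a minimal Python sketch of a mutex that hands itself to the waiter with the 
lowest priority number.  It is not the PRIORITY_GIL code from the patch; the 
class name PriorityLock, the priority parameter and the demo helper are 
hypothetical, chosen only to show the queueing rule.

```python
import heapq
import itertools
import threading
import time

class PriorityLock(object):
    """Mutex whose waiters are released lowest priority number first."""

    def __init__(self):
        self._cond = threading.Condition()
        self._locked = False
        self._waiters = []                  # heap of (priority, arrival number)
        self._arrival = itertools.count()   # breaks ties in arrival order

    def acquire(self, priority=1):
        with self._cond:
            me = (priority, next(self._arrival))
            heapq.heappush(self._waiters, me)
            # Proceed only when the lock is free and we head the queue.
            while self._locked or self._waiters[0] != me:
                self._cond.wait()
            heapq.heappop(self._waiters)
            self._locked = True

    def release(self):
        with self._cond:
            self._locked = False
            self._cond.notify_all()

    def waiting(self):
        with self._cond:
            return len(self._waiters)

def demo():
    """A late "io" waiter (priority 0) overtakes an earlier "cpu" waiter."""
    lock = PriorityLock()
    order = []
    lock.acquire()                          # main thread holds the lock

    def work(name, priority):
        lock.acquire(priority)
        order.append(name)
        lock.release()

    threads = []
    for expected, (name, prio) in enumerate([("cpu", 1), ("io", 0)], 1):
        t = threading.Thread(target=work, args=(name, prio))
        t.start()
        while lock.waiting() < expected:    # wait until this thread is queued
            time.sleep(0.001)
        threads.append(t)
    lock.release()
    for t in threads:
        t.join()
    return order
```

In demo(), the "cpu" thread queues first but the "io" thread is served first, 
which is the behaviour one would want for an IO thread competing with CPU-bound 
threads.  Waiters with equal priority are served in arrival order.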

--


[issue8299] Improve GIL in 2.7

2010-04-03 Thread Kristján Valur Jónsson


Kristján Valur Jónsson  added the comment:

Martin: Well, this patch was originally conceived more as a demonstration of 
the GIL problem and an alternative fix proposal.
However, it is possible to configure it so that there is no change from 
existing functionality, simply by not including thread_gil.h in 
thread_pthread.h and thread_nt.h.  The only change would then be the presence 
of the new PyThread_type_gil and associated locking functions which delegate 
directly to the old PyThread_type_lock functions.

--


[issue8299] Improve GIL in 2.7