Re: [BUG] long freezes on thinkpad t60

2007-07-02 Thread Nick Piggin

Linus Torvalds wrote:
> Nick,
>  call me a worry-wart, but I slept on this, and started worrying..
> 
> On Tue, 26 Jun 2007, Linus Torvalds wrote:
> > So try it with just a byte counter, and test some stupid micro-benchmark 
> > on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
> > it the normal spinlock sequence just because it isn't noticeably slower.
> 
> So I thought about this a bit more, and I like your sequence counter 
> approach, but it still worried me.
> 
> In the current spinlock code, we have a very simple setup for a 
> successful grab of the spinlock:
> 
>   CPU#0   CPU#1
> 
>   ...

Yeah, thanks.

> Now, I have good reason to believe that all Intel and AMD CPU's have a 
> stricter-than-documented memory ordering, and that your spinlock may 
> actually work perfectly well. But it still worries me. As far as I can 
> tell, there's a theoretical problem with your spinlock implementation.
> 
> So I'd like you to ask around some CPU people, and get people from both 
> Intel and AMD to sign off on your spinlocks as safe. I suspect you already 
> have the required contacts, but if you don't, I can send things off to the 
> appropriate people at least inside Intel.


Haven't made too much progress on this, but I have asked [EMAIL PROTECTED] who
might be able to at least know the right person to ask :P (might be faster
to ask Andi to ask :))

If you know someone at Intel then that would be appreciated.

It would be nice if it is safe (and can be guaranteed to be safe in future).

OTOH, the fastpath may be even faster if we do it in a "definitely
safe" way.

That is, do the xaddw against 16 bits with the head in 8 of those and the
tail in the other 8. Then compare the byte registers of the register
returned by xaddw for the test. Although the xaddw is going to be slower
than an xaddb, this way we subsequently avoid the extra load completely,
while avoiding ordering issues.

In the slowpath we would have to have a token locked op in there (like
the current spinlocks do), but this could be taken out iff our inquiries
come back positive.

Anyway, I'll try redoing the patch and getting some numbers.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Davide Libenzi
On Wed, 27 Jun 2007, Linus Torvalds wrote:

> On Wed, 27 Jun 2007, Davide Libenzi wrote:
> 
> > On Wed, 27 Jun 2007, Linus Torvalds wrote:
> > > 
> > > Stores never "leak up". They only ever leak down (ie past subsequent 
> > > loads 
> > > or stores), so you don't need to worry about them. That's actually 
> > > already 
> > > documented (although not in those terms), and if it wasn't true, then we 
> > > couldn't do the spin unlock with just a regular store anyway.
> > 
> > Yes, Intel has never done that. They'll probably never do it since it'll 
> > break a lot of system software (unless they use a new mode-bit that 
> > allows system software to enable loose ordering). Although I clearly 
> > remember having read in one of their P4 optimization manuals not to 
> > assume this in the future.
> 
> That optimization manual was confused. 
> 
> The Intel memory ordering documentation *clearly* states that only reads 
> pass writes, not the other way around.

Yes, they were stating that clearly. IIWNOC (If I Were Not On Crack) I 
remember them saying to not assume any ordering besides the data 
dependency and the CPU self-consistency in the future CPUs, and to use 
*fence instructions when certain semantics were required.
But Google did not help me find that doc, so maybe I really was on 
crack :)



- Davide




Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Linus Torvalds


On Wed, 27 Jun 2007, Davide Libenzi wrote:

> On Wed, 27 Jun 2007, Linus Torvalds wrote:
> > 
> > Stores never "leak up". They only ever leak down (ie past subsequent loads 
> > or stores), so you don't need to worry about them. That's actually already 
> > documented (although not in those terms), and if it wasn't true, then we 
> > couldn't do the spin unlock with just a regular store anyway.
> 
> Yes, Intel has never done that. They'll probably never do it since it'll 
> break a lot of system software (unless they use a new mode-bit that 
> allows system software to enable loose ordering). Although I clearly 
> remember having read in one of their P4 optimization manuals not to 
> assume this in the future.

That optimization manual was confused. 

The Intel memory ordering documentation *clearly* states that only reads 
pass writes, not the other way around.

Some very confused people have thought that "pass" is a two-way thing. 
It's not. "Passing" in the Intel memory ordering means "go _ahead_ of", 
exactly the same way it means in traffic. You don't "pass" people by 
falling behind them.

It's also obvious from reading the manual, because any other reading would 
be very strange: it says

 1. Reads can be carried out speculatively and in any order

 2. Reads can pass buffered writes, but the processor is self-consistent

 3. Writes to memory are always carried out in program order [.. and then 
lists exceptions that are not interesting - it's clflush and the 
non-temporal stores, not any normal writes ]

 4. Writes can be buffered

 5. Writes are not performed speculatively; they are only performed for 
instructions that have actually been retired.

 6. Data from buffered writes can be forwarded to waiting reads within the 
processor.

 7. Reads or writes cannot pass (be carried out ahead of) I/O 
instructions, locked instructions or serializing instructions.

 8. Reads cannot pass LFENCE and MFENCE instructions.

 9. Writes cannot pass SFENCE or MFENCE instructions.

The thing to note is:

 a) in (1), Intel says that reads can occur in any order, but (2) makes it 
clear that that is only relevant wrt other _reads_

 b) in (2), they say "pass", but then they actually explain that "pass" 
means "be carried out ahead of" in (7). 

HOWEVER, it should be obvious in (2) even _without_ the explicit 
clarification in (7) that "pass" is a one-way thing, because otherwise 
(2) is totally _meaningless_. It would be meaningless for two reasons:

 - (1) already said that reads can be done in any order, so if that 
   was a "any order wrt writes", then (2) would be pointless. So (2) 
   must mean something *else* than "any order", and the only sane 
   reading of it that isn't "any order" is that "pass" is a one-way 
   thing: you pass somebody when you go ahead of them, you do *not* 
   pass somebody when you fall behind them!

 - if (2) really meant that reads and writes can just be re-ordered, 
   then the choice of words makes no sense. It would be much more 
   sensible to say that "reads can be carried out in any order wrt 
   writes", instead of talking explicitly about "passing buffered 
   writes"

Anyway, I'm pretty damn sure my reading is correct. And no, it's not a "it 
happens to work". It's _architecturally_required_ to work, and nobody has 
ever complained about the use of a simple store to unlock a spinlock 
(which would only work if the "reads can pass" only means "*later* reads 
can pass *earlier* writes").

And it turns out that I think #1 is going away. Yes, the uarch will 
internally re-order reads, wrt each other, but if it isn't architecturally 
visible, then from an architectural standpoint #1 simply doesn't happen.

I can't guarantee that will happen, of course, but from talking to both 
AMD and Intel people, I think that they'll just document the stricter 
rules as the de-facto rules.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Davide Libenzi
On Wed, 27 Jun 2007, Linus Torvalds wrote:

> > IOW shouldn't an mfence always be there? Not only loads could leak up 
> > into the wait phase, but stores too, if they have no dependency with the 
> > "head" and "tail" loads.
> 
> Stores never "leak up". They only ever leak down (ie past subsequent loads 
> or stores), so you don't need to worry about them. That's actually already 
> documented (although not in those terms), and if it wasn't true, then we 
> couldn't do the spin unlock with just a regular store anyway.

Yes, Intel has never done that. They'll probably never do it since it'll 
break a lot of system software (unless they use a new mode-bit that 
allows system software to enable loose ordering). Although I clearly 
remember having read in one of their P4 optimization manuals not to 
assume this in the future.



- Davide




Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Linus Torvalds


On Wed, 27 Jun 2007, Davide Libenzi wrote:
> > 
> > Now, I have good reason to believe that all Intel and AMD CPU's have a 
> > stricter-than-documented memory ordering, and that your spinlock may 
> > actually work perfectly well. But it still worries me. As far as I can 
> > tell, there's a theoretical problem with your spinlock implementation.
> 
> Nice catch ;) But wasn't Intel suggesting not to rely on the old 
> "strict" ordering rules?

Actually, both Intel and AMD engineers have been talking about making the 
documentation _stricter_, rather than looser. They apparently already are 
pretty damn strict, because not being stricter than the docs imply just 
ends up exposing too many potential problems in software that didn't 
really follow the rules.

For example, it's quite possible to do loads out of order, but guarantee 
that the result is 100% equivalent with a totally in-order machine. One 
way you do that is to keep track of the cacheline for any speculative 
loads, and if it gets invalidated before the speculative instruction has 
completed, you just throw the speculation away.

End result: you can do any amount of speculation you damn well please at a 
micro-architectural level, but if the speculation would ever have been 
architecturally _visible_, it never happens!

(Yeah, that is just me in my non-professional capacity of hw engineer 
wanna-be: I'm not saying that that is necessarily what Intel or AMD 
actually ever do, and they may have other approaches entirely).

> IOW shouldn't an mfence always be there? Not only loads could leak up 
> into the wait phase, but stores too, if they have no dependency with the 
> "head" and "tail" loads.

Stores never "leak up". They only ever leak down (ie past subsequent loads 
or stores), so you don't need to worry about them. That's actually already 
documented (although not in those terms), and if it wasn't true, then we 
couldn't do the spin unlock with just a regular store anyway.

(There's basically never any reason to "speculate" stores before other mem 
ops. It's hard, and pointless. Stores you want to just buffer and move as 
_late_ as possible, loads you want to speculate and move as _early_ as 
possible. Anything else doesn't make sense).

So I'm fairly sure that the only thing you really need to worry about in 
this thing is the load-load ordering (the load for the spinlock compare vs 
any loads "inside" the spinlock), and I'm reasonably certain that no 
existing x86 (and likely no future x86) will make load-load reordering 
effects architecturally visible, even if the implementation may do so 
*internally* when it's not possible to see it in the end result.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Davide Libenzi
On Wed, 27 Jun 2007, Linus Torvalds wrote:

> On Tue, 26 Jun 2007, Linus Torvalds wrote:
> > 
> > So try it with just a byte counter, and test some stupid micro-benchmark 
> > on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
> > it the normal spinlock sequence just because it isn't noticeably slower.
> 
> So I thought about this a bit more, and I like your sequence counter 
> approach, but it still worried me.
> 
> In the current spinlock code, we have a very simple setup for a 
> successful grab of the spinlock:
> 
>   CPU#0                           CPU#1
> 
>   A (= code before the spinlock)
>                                   lock release
> 
>   lock decb mem   (serializing instruction)
> 
>   B (= code after the spinlock)
> 
> and there is no question that memory operations in B cannot leak into A.
> 
> With the sequence counters, the situation is more complex:
> 
>   CPU #0                          CPU #1
> 
>   A (= code before the spinlock)
> 
>   lock xadd mem   (serializing instruction)
> 
>   B (= code after xadd, but not inside lock)
> 
>                                   lock release
> 
>   cmp head, tail
> 
>   C (= code inside the lock)
> 
> Now, B is basically the empty set, but that's not the issue I worry about. 
> The thing is, I can guarantee by the Intel memory ordering rules that 
> neither B nor C will ever have memops that leak past the "xadd", but I'm 
> not at all as sure that we cannot have memops in C that leak into B!
> 
> And B really isn't protected by the lock - it may run while another CPU 
> still holds the lock, and we know the other CPU released it only as part 
> of the compare. But that compare isn't a serializing instruction!
> 
> IOW, I could imagine a load inside C being speculated, and being moved 
> *ahead* of the load that compares the spinlock head with the tail! IOW, 
> the load that is _inside_ the spinlock has effectively moved to outside 
> the protected region, and the spinlock isn't really a reliable mutual 
> exclusion barrier any more!
> 
> (Yes, there is a data-dependency on the compare, but it is only used for a 
> conditional branch, and conditional branches are control dependencies and 
> can be speculated, so CPU speculation can easily break that apparent 
> dependency chain and do later loads *before* the spinlock load completes!)
> 
> Now, I have good reason to believe that all Intel and AMD CPU's have a 
> stricter-than-documented memory ordering, and that your spinlock may 
> actually work perfectly well. But it still worries me. As far as I can 
> tell, there's a theoretical problem with your spinlock implementation.

Nice catch ;) But wasn't Intel suggesting not to rely on the old 
"strict" ordering rules? IOW shouldn't an mfence always be there? Not only 
loads could leak up into the wait phase, but stores too, if they have no 
dependency with the "head" and "tail" loads.



- Davide




Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> With the sequence counters, the situation is more complex:
> 
>   CPU #0                          CPU #1
> 
>   A (= code before the spinlock)
> 
>   lock xadd mem   (serializing instruction)
> 
>   B (= code after xadd, but not inside lock)
> 
>                                   lock release
> 
>   cmp head, tail
> 
>   C (= code inside the lock)
> 
> Now, B is basically the empty set, but that's not the issue I worry 
> about. The thing is, I can guarantee by the Intel memory ordering 
> rules that neither B nor C will ever have memops that leak past the 
> "xadd", but I'm not at all as sure that we cannot have memops in C 
> that leak into B!
> 
> And B really isn't protected by the lock - it may run while another 
> CPU still holds the lock, and we know the other CPU released it only 
> as part of the compare. But that compare isn't a serializing 
> instruction!
> 
> IOW, I could imagine a load inside C being speculated, and being moved 
> *ahead* of the load that compares the spinlock head with the tail! 
> IOW, the load that is _inside_ the spinlock has effectively moved to 
> outside the protected region, and the spinlock isn't really a reliable 
> mutual exclusion barrier any more!
> 
> (Yes, there is a data-dependency on the compare, but it is only used 
> for a conditional branch, and conditional branches are control 
> dependencies and can be speculated, so CPU speculation can easily 
> break that apparent dependency chain and do later loads *before* the 
> spinlock load completes!)
> 
> Now, I have good reason to believe that all Intel and AMD CPU's have a 
> stricter-than-documented memory ordering, and that your spinlock may 
> actually work perfectly well. But it still worries me. As far as I can 
> tell, there's a theoretical problem with your spinlock implementation.

hm, i agree with you that this is problematic. Especially on an SMT CPU 
it would be a big architectural restriction if prefetches couldnt cross 
cache misses. (and that's the only way i could see Nick's scheme 
working: MESI coherency coupled with the speculative use of that 
cacheline's value never "surviving" a MESI invalidation of that 
cacheline. That would guarantee that once we have the lock, any 
speculative result is fully coherent and no other CPU has modified it.)

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Linus Torvalds

Nick,
 call me a worry-wart, but I slept on this, and started worrying..

On Tue, 26 Jun 2007, Linus Torvalds wrote:
> 
> So try it with just a byte counter, and test some stupid micro-benchmark 
> on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
> it the normal spinlock sequence just because it isn't noticeably slower.

So I thought about this a bit more, and I like your sequence counter 
approach, but it still worried me.

In the current spinlock code, we have a very simple setup for a 
successful grab of the spinlock:

    CPU#0                           CPU#1

    A (= code before the spinlock)
                                    lock release

    lock decb mem   (serializing instruction)

    B (= code after the spinlock)

and there is no question that memory operations in B cannot leak into A.

With the sequence counters, the situation is more complex:

    CPU #0                          CPU #1

    A (= code before the spinlock)

    lock xadd mem   (serializing instruction)

    B (= code after xadd, but not inside lock)

                                    lock release

cmp head, tail

C (= code inside the lock)

Now, B is basically the empty set, but that's not the issue I worry about. 
The thing is, I can guarantee by the Intel memory ordering rules that 
neither B nor C will ever have memops that leak past the "xadd", but I'm 
not at all as sure that we cannot have memops in C that leak into B!

And B really isn't protected by the lock - it may run while another CPU 
still holds the lock, and we know the other CPU released it only as part 
of the compare. But that compare isn't a serializing instruction!

IOW, I could imagine a load inside C being speculated, and being moved 
*ahead* of the load that compares the spinlock head with the tail! IOW, 
the load that is _inside_ the spinlock has effectively moved to outside 
the protected region, and the spinlock isn't really a reliable mutual 
exclusion barrier any more!

(Yes, there is a data-dependency on the compare, but it is only used for a 
conditional branch, and conditional branches are control dependencies and 
can be speculated, so CPU speculation can easily break that apparent 
dependency chain and do later loads *before* the spinlock load completes!)

Now, I have good reason to believe that all Intel and AMD CPU's have a 
stricter-than-documented memory ordering, and that your spinlock may 
actually work perfectly well. But it still worries me. As far as I can 
tell, there's a theoretical problem with your spinlock implementation.

So I'd like you to ask around some CPU people, and get people from both 
Intel and AMD to sign off on your spinlocks as safe. I suspect you already 
have the required contacts, but if you don't, I can send things off to the 
appropriate people at least inside Intel.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Nick Piggin

Linus Torvalds wrote:
> On Wed, 27 Jun 2007, Nick Piggin wrote:
> > I don't know why my unlock sequence should be that much slower? Unlocked
> > mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
> > twice as slow (IIRC).
> 
> Oh, that releasing "add" can be unlocked, and only the holder of the lock 
> ever touches that field?

Right.

> I must not have looked closely enough. In that case, I withdraw that 
> objection, and the sequence-number-based spinlock sounds like a perfectly 
> fine one.
> 
> Yes, the add will be slightly slower than the plain byte move, and the 
> locked xadd will be slightly slower than a regular locked add, but 
> compared to the serialization cost, that should be small. For some reason 
> I thought you needed a locked instruction for the unlock too.
> 
> So try it with just a byte counter, and test some stupid micro-benchmark 
> on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
> it the normal spinlock sequence just because it isn't noticeably slower.
> 
> In fact, I think a "incb mem" instruction is even a byte shorter than 
> "movb $1,mem", and with "unlock" being inlined, that could actually be a 
> slight _win_.

OK, I'll try running some tests and get back to you on it.

--
SUSE Labs, Novell Inc.


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Linus Torvalds


On Wed, 27 Jun 2007, Nick Piggin wrote:
> 
> I don't know why my unlock sequence should be that much slower? Unlocked
> mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
> twice as slow (IIRC).

Oh, that releasing "add" can be unlocked, and only the holder of the lock 
ever touches that field?

I must not have looked closely enough. In that case, I withdraw that 
objection, and the sequence-number-based spinlock sounds like a perfectly 
fine one.

Yes, the add will be slightly slower than the plain byte move, and the 
locked xadd will be slightly slower than a regular locked add, but 
compared to the serialization cost, that should be small. For some reason 
I thought you needed a locked instruction for the unlock too.

So try it with just a byte counter, and test some stupid micro-benchmark 
on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
it the normal spinlock sequence just because it isn't noticeably slower.

In fact, I think a "incb mem" instruction is even a byte shorter than 
"movb $1,mem", and with "unlock" being inlined, that could actually be a 
slight _win_.

Linus




Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Linus Torvalds

Nick,
 call me a worry-wart, but I slept on this, and started worrying..

On Tue, 26 Jun 2007, Linus Torvalds wrote:
 
 So try it with just a byte counter, and test some stupid micro-benchmark 
 on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
 it the normal spinlock sequence just because it isn't noticeably slower.

So I thought about this a bit more, and I like your sequence counter 
approach, but it still worried me.

In the current spinlock code, we have a very simple setup for a 
successful grab of the spinlock:

CPU#0   CPU#1

A (= code before the spinlock)
lock release

lock decb mem   (serializing instruction)

B (= code after the spinlock)

and there is no question that memory operations in B cannot leak into A.

With the sequence counters, the situation is more complex:

CPU #0  CPU #1

A (= code before the spinlock)

lock xadd mem   (serializing instruction)

B (= code afte xadd, but not inside lock)

lock release

cmp head, tail

C (= code inside the lock)

Now, B is basically the empty set, but that's not the issue I worry about. 
The thing is, I can guarantee by the Intel memory ordering rules that 
neither B nor C will ever have memops that leak past the xadd, but I'm 
not at all as sure that we cannot have memops in C that leak into B!

And B really isn't protected by the lock - it may run while another CPU 
still holds the lock, and we know the other CPU released it only as part 
of the compare. But that compare isn't a serializing instruction!

IOW, I could imagine a load inside C being speculated, and being moved 
*ahead* of the load that compares the spinlock head with the tail! IOW, 
the load that is _inside_ the spinlock has effectively moved to outside 
the protected region, and the spinlock isn't really a reliable mutual 
exclusion barrier any more!

(Yes, there is a data-dependency on the compare, but it is only used for a 
conditional branch, and conditional branches are control dependencies and 
can be speculated, so CPU speculation can easily break that apparent 
dependency chain and do later loads *before* the spinlock load completes!)

Now, I have good reason to believe that all Intel and AMD CPU's have a 
stricter-than-documented memory ordering, and that your spinlock may 
actually work perfectly well. But it still worries me. As far as I can 
tell, there's a theoretical problem with your spinlock implementation.

So I'd like you to ask around some CPU people, and get people from both 
Intel and AMD to sign off on your spinlocks as safe. I suspect you already 
have the required contacts, but if you don't, I can send things off to the 
appropriate people at least inside Intel.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 With the sequence counters, the situation is more complex:
 
   CPU #0  CPU #1
 
   A (= code before the spinlock)
 
   lock xadd mem   (serializing instruction)
 
   B (= code after xadd, but not inside lock)
 
   lock release
 
   cmp head, tail
 
   C (= code inside the lock)
 
 Now, B is basically the empty set, but that's not the issue I worry 
 about. The thing is, I can guarantee by the Intel memory ordering 
 rules that neither B nor C will ever have memops that leak past the 
 xadd, but I'm not at all as sure that we cannot have memops in C 
 that leak into B!
 
 And B really isn't protected by the lock - it may run while another 
 CPU still holds the lock, and we know the other CPU released it only 
 as part of the compare. But that compare isn't a serializing 
 instruction!
 
 IOW, I could imagine a load inside C being speculated, and being moved 
 *ahead* of the load that compares the spinlock head with the tail! 
 IOW, the load that is _inside_ the spinlock has effectively moved to 
 outside the protected region, and the spinlock isn't really a reliable 
 mutual exclusion barrier any more!
 
 (Yes, there is a data-dependency on the compare, but it is only used 
 for a conditional branch, and conditional branches are control 
 dependencies and can be speculated, so CPU speculation can easily 
 break that apparent dependency chain and do later loads *before* the 
 spinlock load completes!)
 
 Now, I have good reason to believe that all Intel and AMD CPU's have a 
 stricter-than-documented memory ordering, and that your spinlock may 
 actually work perfectly well. But it still worries me. As far as I can 
 tell, there's a theoretical problem with your spinlock implementation.

hm, i agree with you that this is problematic. Especially on an SMT CPU 
it would be a big architectural restriction if prefetches couldn't cross 
cache misses. (and that's the only way i could see Nick's scheme 
working: MESI coherency coupled with the speculative use of that 
cacheline's value never surviving a MESI invalidation of that 
cacheline. That would guarantee that once we have the lock, any 
speculative result is fully coherent and no other CPU has modified it.)

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Davide Libenzi
On Wed, 27 Jun 2007, Linus Torvalds wrote:

 On Tue, 26 Jun 2007, Linus Torvalds wrote:
  
  So try it with just a byte counter, and test some stupid micro-benchmark 
  on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
  it the normal spinlock sequence just because it isn't noticeably slower.
 
 So I thought about this a bit more, and I like your sequence counter 
 approach, but it still worried me.
 
 In the current spinlock code, we have a very simple setup for a 
 successful grab of the spinlock:
 
   CPU#0   CPU#1
 
   A (= code before the spinlock)
   lock release
 
   lock decb mem   (serializing instruction)
 
   B (= code after the spinlock)
 
 and there is no question that memory operations in B cannot leak into A.
 
 With the sequence counters, the situation is more complex:
 
   CPU #0  CPU #1
 
   A (= code before the spinlock)
 
   lock xadd mem   (serializing instruction)
 
   B (= code after xadd, but not inside lock)
 
   lock release
 
   cmp head, tail
 
   C (= code inside the lock)
 
 Now, B is basically the empty set, but that's not the issue I worry about. 
 The thing is, I can guarantee by the Intel memory ordering rules that 
 neither B nor C will ever have memops that leak past the xadd, but I'm 
 not at all as sure that we cannot have memops in C that leak into B!
 
 And B really isn't protected by the lock - it may run while another CPU 
 still holds the lock, and we know the other CPU released it only as part 
 of the compare. But that compare isn't a serializing instruction!
 
 IOW, I could imagine a load inside C being speculated, and being moved 
 *ahead* of the load that compares the spinlock head with the tail! IOW, 
 the load that is _inside_ the spinlock has effectively moved to outside 
 the protected region, and the spinlock isn't really a reliable mutual 
 exclusion barrier any more!
 
 (Yes, there is a data-dependency on the compare, but it is only used for a 
 conditional branch, and conditional branches are control dependencies and 
 can be speculated, so CPU speculation can easily break that apparent 
 dependency chain and do later loads *before* the spinlock load completes!)
 
 Now, I have good reason to believe that all Intel and AMD CPU's have a 
 stricter-than-documented memory ordering, and that your spinlock may 
 actually work perfectly well. But it still worries me. As far as I can 
 tell, there's a theoretical problem with your spinlock implementation.

Nice catch ;) But wasn't Intel suggesting not to rely on the old 
strict ordering rules? IOW shouldn't an mfence always be there? Not only 
loads could leak up into the wait phase, but stores too, if they have no 
dependency with the head and tail loads.



- Davide




Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Linus Torvalds


On Wed, 27 Jun 2007, Davide Libenzi wrote:
  
  Now, I have good reason to believe that all Intel and AMD CPU's have a 
  stricter-than-documented memory ordering, and that your spinlock may 
  actually work perfectly well. But it still worries me. As far as I can 
  tell, there's a theoretical problem with your spinlock implementation.
 
 Nice catch ;) But wasn't Intel suggesting not to rely on the old 
 strict ordering rules?

Actually, both Intel and AMD engineers have been talking about making the 
documentation _stricter_, rather than looser. They apparently already are 
pretty damn strict, because not being stricter than the docs imply just 
ends up exposing too many potential problems in software that didn't 
really follow the rules.

For example, it's quite possible to do loads out of order, but guarantee 
that the result is 100% equivalent with a totally in-order machine. One 
way you do that is to keep track of the cacheline for any speculative 
loads, and if it gets invalidated before the speculative instruction has 
completed, you just throw the speculation away.

End result: you can do any amount of speculation you damn well please at a 
micro-architectural level, but if the speculation would ever have been 
architecturally _visible_, it never happens!

(Yeah, that is just me in my non-professional capacity of hw engineer 
wanna-be: I'm not saying that that is necessarily what Intel or AMD 
actually ever do, and they may have other approaches entirely).

 IOW shouldn't an mfence always be there? Not only loads could leak up 
 into the wait phase, but stores too, if they have no dependency with the 
 head and tail loads.

Stores never leak up. They only ever leak down (ie past subsequent loads 
or stores), so you don't need to worry about them. That's actually already 
documented (although not in those terms), and if it wasn't true, then we 
couldn't do the spin unlock with just a regular store anyway.

(There's basically never any reason to speculate stores before other mem 
ops. It's hard, and pointless. Stores you want to just buffer and move as 
_late_ as possible, loads you want to speculate and move as _early_ as 
possible. Anything else doesn't make sense).

So I'm fairly sure that the only thing you really need to worry about in 
this thing is the load-load ordering (the load for the spinlock compare vs 
any loads inside the spinlock), and I'm reasonably certain that no 
existing x86 (and likely no future x86) will make load-load reordering 
effects architecturally visible, even if the implementation may do so 
*internally* when it's not possible to see it in the end result.
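The load-load concern boils down to classic release/acquire message passing, which can be sketched as follows (an abstraction of the x86 rules being discussed, not kernel code; names are hypothetical):

```c
#include <stdatomic.h>

/* CPU#1 writes data inside the lock, then releases the lock with a
 * plain store; CPU#0's spin load of "head" must not let the later data
 * load be satisfied early.  In C11 terms that is exactly a release
 * store paired with an acquire load. */
static atomic_int  data;   /* protected by the "lock" */
static atomic_uint head;   /* lock-release indicator */

static void cpu1_unlock_path(void)
{
    atomic_store_explicit(&data, 42, memory_order_relaxed); /* store in C */
    atomic_store_explicit(&head, 1, memory_order_release);  /* lock release */
}

static int cpu0_lock_path(void)
{
    /* the spin: the "cmp head, tail" load */
    while (atomic_load_explicit(&head, memory_order_acquire) != 1)
        ;
    /* guaranteed to see 42: the acquire load forbids this later load
     * from being (visibly) hoisted above the spin */
    return atomic_load_explicit(&data, memory_order_relaxed);
}
```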

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Davide Libenzi
On Wed, 27 Jun 2007, Linus Torvalds wrote:

  IOW shouldn't an mfence always be there? Not only loads could leak up 
  into the wait phase, but stores too, if they have no dependency with the 
  head and tail loads.
 
 Stores never leak up. They only ever leak down (ie past subsequent loads 
 or stores), so you don't need to worry about them. That's actually already 
 documented (although not in those terms), and if it wasn't true, then we 
 couldn't do the spin unlock with just a regular store anyway.

Yes, Intel has never done that. They'll probably never do it since it'll 
break a lot of system software (unless they use a new mode-bit that 
allows system software to enable loose ordering). Although I clearly 
remember to have read in one of their P4 optimization manuals to not 
assume this in the future.



- Davide




Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Linus Torvalds


On Wed, 27 Jun 2007, Davide Libenzi wrote:

 On Wed, 27 Jun 2007, Linus Torvalds wrote:
  
  Stores never leak up. They only ever leak down (ie past subsequent loads 
  or stores), so you don't need to worry about them. That's actually already 
  documented (although not in those terms), and if it wasn't true, then we 
  couldn't do the spin unlock with just a regular store anyway.
 
 Yes, Intel has never done that. They'll probably never do it since it'll 
 break a lot of system software (unless they use a new mode-bit that 
 allows system software to enable loose ordering). Although I clearly 
 remember to have read in one of their P4 optimization manuals to not 
 assume this in the future.

That optimization manual was confused. 

The Intel memory ordering documentation *clearly* states that only reads 
pass writes, not the other way around.

Some very confused people have thought that "pass" is a two-way thing. 
It's not. "Passing" in the Intel memory ordering means "go _ahead_ of", 
exactly the same way it means in traffic. You don't pass people by 
falling behind them.

It's also obvious from reading the manual, because any other reading would 
be very strange: it says

 1. Reads can be carried out speculatively and in any order

 2. Reads can pass buffered writes, but the processor is self-consistent

 3. Writes to memory are always carried out in program order [.. and then 
lists exceptions that are not interesting - it's clflush and the 
non-temporal stores, not any normal writes ]

 4. Writes can be buffered

 5. Writes are not performed speculatively; they are only performed for 
instructions that have actually been retired.

 6. Data from buffered writes can be forwarded to waiting reads within the 
processor.

 7. Reads or writes cannot pass (be carried out ahead of) I/O 
instructions, locked instructions or serializing instructions.

 8. Reads cannot pass LFENCE and MFENCE instructions.

 9. Writes cannot pass SFENCE or MFENCE instructions.

The thing to note is:

 a) in (1), Intel says that reads can occur in any order, but (2) makes it 
clear that that is only relevant wrt other _reads_

 b) in (2), they say "pass", but then they actually explain that "pass" 
means "be carried out ahead of" in (7). 

HOWEVER, it should be obvious in (2) even _without_ the explicit 
clarification in (7) that "pass" is a one-way thing, because otherwise 
(2) is totally _meaningless_. It would be meaningless for two reasons:

 - (1) already said that reads can be done in any order, so if that 
   was "any order wrt writes", then (2) would be pointless. So (2) 
   must mean something *else* than "any order", and the only sane 
   reading of it that isn't "any order" is that "pass" is a one-way 
   thing: you "pass" somebody when you go ahead of them, you do *not* 
   "pass" somebody when you fall behind them!

 - if (2) really meant that reads and writes can just be re-ordered, 
   then the choice of words makes no sense. It would be much more 
   sensible to say that "reads can be carried out in any order wrt 
   writes", instead of talking explicitly about "passing" buffered 
   writes

Anyway, I'm pretty damn sure my reading is correct. And no, it's not a "it 
happens to work". It's _architecturally_required_ to work, and nobody has 
ever complained about the use of a simple store to unlock a spinlock 
(which would only work if the "reads can pass" only means *later* reads 
can pass *earlier* writes).

And it turns out that I think #1 is going away. Yes, the uarch will 
internally re-order reads, wrt each other, but if it isn't architecturally 
visible, then from an architectural standpoint #1 simply doesn't happen.

I can't guarantee that will happen, of course, but from talking to both 
AMD and Intel people, I think that they'll just document the stricter 
rules as the de-facto rules.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-27 Thread Davide Libenzi
On Wed, 27 Jun 2007, Linus Torvalds wrote:

 On Wed, 27 Jun 2007, Davide Libenzi wrote:
 
  On Wed, 27 Jun 2007, Linus Torvalds wrote:
   
   Stores never leak up. They only ever leak down (ie past subsequent 
   loads 
   or stores), so you don't need to worry about them. That's actually 
   already 
   documented (although not in those terms), and if it wasn't true, then we 
   couldn't do the spin unlock with just a regular store anyway.
  
  Yes, Intel has never done that. They'll probably never do it since it'll 
  break a lot of system software (unless they use a new mode-bit that 
  allows system software to enable loose ordering). Although I clearly 
  remember to have read in one of their P4 optimization manuals to not 
  assume this in the future.
 
 That optimization manual was confused. 
 
 The Intel memory ordering documentation *clearly* states that only reads 
 pass writes, not the other way around.

Yes, they were stating that clearly. IIWNOC (If I Were Not On Crack) I 
remember them saying to not assume any ordering besides the data 
dependency and the CPU self-consistency in the future CPUs, and to use 
*fence instructions when certain semantics were required.
But google did not help me in finding that doc, so maybe I was really on 
crack :)



- Davide




Re: [BUG] long freezes on thinkpad t60

2007-06-26 Thread Nick Piggin

Linus Torvalds wrote:


On Tue, 26 Jun 2007, Nick Piggin wrote:


Hmm, not that I have a strong opinion one way or the other, but I
don't know that they would encourage bad code. They are not going to
reduce latency under a locked section, but will improve determinism
in the contended case.



xadd really generally *is* slower than an add. One is often microcoded, 
the other is not.


Oh. I found xadd to be not hugely slower on my P4, but it was a little
bit.


But the real problem is that your "unlock" sequence is now about two 
orders of magnitude slower than it used to be. So it used to be that a 
spinlocked sequence only had a single synchronization point, now it has 
two. *That* is really bad, and I guarantee that it makes your spinlocks 
effectively twice as slow for the non-contended parts.


I don't know why my unlock sequence should be that much slower? Unlocked
mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
twice as slow (IIRC).


But your xadd thing might be worth looking at, just to see how expensive 
it is. As an _alternative_ to spinlocks, it's certainly viable.


(Side note: why make it a word? Word operations are slower on many x86 
implementations, because they add yet another prefix. You only need a 
byte)


No real reason I guess. I'll change it.

--
SUSE Labs, Novell Inc.


Re: [BUG] long freezes on thinkpad t60

2007-06-26 Thread Linus Torvalds


On Tue, 26 Jun 2007, Nick Piggin wrote:
> 
> Hmm, not that I have a strong opinion one way or the other, but I
> don't know that they would encourage bad code. They are not going to
> reduce latency under a locked section, but will improve determinism
> in the contended case.

xadd really generally *is* slower than an add. One is often microcoded, 
the other is not.

But the real problem is that your "unlock" sequence is now about two 
orders of magnitude slower than it used to be. So it used to be that a 
spinlocked sequence only had a single synchronization point, now it has 
two. *That* is really bad, and I guarantee that it makes your spinlocks 
effectively twice as slow for the non-contended parts.

But your xadd thing might be worth looking at, just to see how expensive 
it is. As an _alternative_ to spinlocks, it's certainly viable.

(Side note: why make it a word? Word operations are slower on many x86 
implementations, because they add yet another prefix. You only need a 
byte)

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-26 Thread Jarek Poplawski
On Tue, Jun 26, 2007 at 06:42:10PM +1000, Nick Piggin wrote:
...
> They should also improve performance in heavily contended case due to
> the nature of how they spin, but I know that's not something you want
> to hear about. And theoretically there should be no reason why xadd is
> any slower than dec and look at the status flags, should there? I never
> implemented it in optimised assembly to test though...
...

BTW, could you explain why the below diagnose doesn't relate
to your solution?

On 06/21/2007 12:08 PM, Ingo Molnar wrote:
...
> So it seems the problem was that if a core kept _truly_ modifying a 
> cacheline via atomics in a high enough frequency, it could artificially 
> starve the other core. (which would keep waiting for the cacheline to be 
> released one day, and which kept the first core from ever making any 
> progress) To me that looks like a real problem on the hardware side - 
> shouldnt cacheline ownership be arbitrated a bit better than that?
> 

Thanks & regards,
Jarek P.


Re: [BUG] long freezes on thinkpad t60

2007-06-26 Thread Nick Piggin

Linus Torvalds wrote:


On Thu, 21 Jun 2007, Eric Dumazet wrote:


This reminds me Nick's proposal of 'queued spinlocks' 3 months ago

Maybe this should be re-considered ? (unlock is still a non atomic op, 
so we dont pay the serializing cost twice)



No. The point is simple:

IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG!

I don't understand why this is even controversial. Especially since we 
have a patch for the problem that proves my point: the _proper_ way to fix 
things is to just not do the bad thing, instead of trying to allow the bad 
behaviour and try to handle it.


Things like queued spinlocks just make excuses for bad code. 

We don't do nesting locking either, for exactly the same reason. Are 
nesting locks "easier"? Absolutely. They are also almost always a sign of 
a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
to encourage bad programming!


Hmm, not that I have a strong opinion one way or the other, but I
don't know that they would encourage bad code. They are not going to
reduce latency under a locked section, but will improve determinism
in the contended case.

They should also improve performance in heavily contended case due to
the nature of how they spin, but I know that's not something you want
to hear about. And theoretically there should be no reason why xadd is
any slower than dec and look at the status flags, should there? I never
implemented it in optimised assembly to test though...

Some hardware seems to have no idea of fair cacheline scheduling, and
especially when there are more than 2 CPUs contending for the
cacheline, there can be large starvations.

And actually some times we have code that really wants to drop the
lock and queue behind other contenders. Most of the lockbreak stuff
for example.

Suppose we could have a completely fair spinlock primitive that has
*virtually* no downsides over the unfair version, you'd take the fair
one, right?

Not that I'm saying they'd ever be a good solution to bad code, but I
do think fairness is better than none, all else being equal.


extract : 


Implement queued spinlocks for i386. This shouldn't increase the size of
the spinlock structure, while still able to handle 2^16 CPUs.



Umm. i386 spinlocks could and should be *one*byte*.

In fact, I don't even know why they are wasting four bytes right now: the 
fact that somebody made them an "int" just wastes memory. All the actual 
code uses "decb", so it's not even a question of safety. I wonder why we 
have that 32-bit thing and the ugly casts.


Anyway, I think the important point is that they can remain within 4
bytes, which is obviously the most important boundary (they could be
2 bytes on i386).
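The space point can be made concrete with a sketched layout (illustrative only, not the actual patch): head and tail as one byte each handle up to 256 CPUs while keeping the whole lock in 2 bytes, well within the 4-byte word today's `int` lock wastes.

```c
#include <stdint.h>

/* Hypothetical compact ticket-lock layout: one byte of "owner" ticket
 * and one byte of "next free" ticket.  sizeof == 2, so it still fits
 * the existing 4-byte spinlock slot with room to spare. */
typedef struct {
    uint8_t head;   /* ticket currently holding the lock */
    uint8_t tail;   /* next ticket to hand out */
} tiny_ticket_t;
```

Widening both fields to `uint16_t` gives the 2^16-CPU variant in exactly the 4 bytes the current spinlock already occupies.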

--
SUSE Labs, Novell Inc.



Re: [BUG] long freezes on thinkpad t60

2007-06-25 Thread Jarek Poplawski
On Sat, Jun 23, 2007 at 12:36:08PM +0200, Miklos Szeredi wrote:
...
> And it's not a NO_HZ kernel.
...

BTW, maybe I've missed this and it's unconnected, but I hope the
first config has been changed - especially this CONFIG_AGP_AMD64 = y,
and this bug from mm/slab.c has gone long ago...

Jarek P.



Re: [BUG] long freezes on thinkpad t60

2007-06-23 Thread Linus Torvalds


On Sat, 23 Jun 2007, Miklos Szeredi wrote:
> 
> What I notice is that the interrupt distribution between the CPUs is
> very asymmetric like this:
> 
>CPU0   CPU1
>   0: 220496 42   IO-APIC-edge  timer
>   1:   3841  0   IO-APIC-edge  i8042
...
> LOC: 220499 220463
> ERR:  0
> 
> and the freezes don't really change that.  And the NMI traces show,
> that it's always CPU1 which is spinning in wait_task_inactive().

Well, the LOC thing is for the local apic timer, so while regular 
interrupts are indeed very skewed, both CPUs are nicely getting the local 
apic timer thing..

That said, the timer interrupt generally happens just a few hundred times 
a second, and if there's just a higher likelihood that it happens when the 
spinlock is taken, then half-a-second pauses could easily be just because 
even when the interrupt happens, it could be skewed to happen when the 
lock is held.

And that definitely is the case: the most expensive instruction _by_far_ 
in that loop is the actual locked instruction that acquires the lock 
(especially with the cache-line bouncing around), so an interrupt would be 
much more likely to happen right after that one rather than after the 
store that releases the lock, which can be buffered.

It can be quite interesting to look at instruction-level cycle profiling 
with oprofile, just to see where the costs are..
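For anyone wanting to reproduce that kind of listing, a typical oprofile session of the era looks roughly like this (a sketch: the vmlinux path is an example, and event defaults vary by CPU and distribution):

```shell
# point oprofile at an uncompressed kernel image with symbols
opcontrol --vmlinux=/usr/src/linux/vmlinux
opcontrol --start

# ... reproduce the workload / freeze here ...

opcontrol --stop
opcontrol --dump

# per-symbol cycle counts
opreport --symbols vmlinux | head -20

# per-instruction annotation, like the listing Ingo posted
opannotate --assembly vmlinux | less
```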

Linus




Re: [BUG] long freezes on thinkpad t60

2007-06-23 Thread Miklos Szeredi
> > the freezes that Miklos was seeing were hardirq contexts blocking in 
> > task_rq_lock() - that is done with interrupts disabled. (Miklos i 
> > think also tried !NOHZ kernels and older kernels, with a similar 
> > result.)
> > 
> > plus on the ptrace side, the wait_task_inactive() code had most of its 
> > overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
> > likely while we were still holding the runqueue lock!
> > 
> > i think the only thing that eventually got Miklos' laptop out of the 
> > wedge were timer irqs hitting the ptrace CPU in exactly those 
> > instructions where it was not holding the runqueue lock. (or perhaps 
> > an asynchronous SMM event delaying it for a long time)
> 
> even considering that the 'LOCK'-ed instruction was the heaviest in the 
> busy-loop, the numbers still just don't add up to 'tens of seconds of 
> lockups', so there must be something else happening too.
> 
> So here's an addition to the existing theories: the Core2Duo is a 
> 4-issue CPU architecture. Now, why does this matter? It matters to the 
> timing of the delivery of interrupts. For example, on a 3-issue 
> architecture, the instruction level profile of well-cached workloads 
> often looks like this:
> 
> c05a3b71:  710  89 d6                mov    %edx,%esi
> c05a3b73:    0  8b 55 c0             mov    0xffc0(%ebp),%edx
> c05a3b76:    0  89 c3                mov    %eax,%ebx
> c05a3b78:  775  8b 82 e8 00 00 00    mov    0xe8(%edx),%eax
> c05a3b7e:    0  8b 48 18             mov    0x18(%eax),%ecx
> c05a3b81:    0  8b 45 c8             mov    0xffc8(%ebp),%eax
> c05a3b84:  792  89 1c 24             mov    %ebx,(%esp)
> c05a3b87:    0  89 74 24 04          mov    %esi,0x4(%esp)
> c05a3b8b:    0  ff d1                call   *%ecx
> c05a3b8d:    0  8b 4d c8             mov    0xffc8(%ebp),%ecx
> c05a3b90:  925  8b 41 6c             mov    0x6c(%ecx),%eax
> c05a3b93:    0  39 41 10             cmp    %eax,0x10(%ecx)
> c05a3b96:    0  0f 85 a8 01 00 00    jne    c05a3d44 <schedule+0x2a4>
> c05a3b9c:  949  89 da                mov    %ebx,%edx
> c05a3b9e:    0  89 f1                mov    %esi,%ecx
> c05a3ba0:    0  8b 45 c8             mov    0xffc8(%ebp),%eax
> 
> the second column is the number of times the profiling interrupt has hit 
> that particular instruction.
> 
> Note the many zero entries - this means that for instructions that are 
> well-cached, the issue order _prevents_ interrupts from _ever_ hitting 
> to within a bundle of micro-ops that the decoder will issue! The above 
> workload was a plain lat_ctx, so nothing special, and interrupts and DMA 
> traffic were coming and going. Still the bundling of instructions was 
> very strong.
> 
> There's no guarantee of 'instruction bundling': a cachemiss can still 
> stall the pipeline and allow an interrupt to hit any instruction [where 
> interrupt delivery is valid], but on a well-cached workload like the 
> above, even a 3-issue architecture can effectively 'merge' instructions 
> to each other, and can make them essentially 'atomic' as far as external 
> interrupts go.
> 
> [ also note another interesting thing in the profile above: the
>   CALL *%ecx was likely BTB-optimized and hence we have a 'bundling' 
>   effect that is even larger than 3 instructions. ]
> 
> i think that is what might have happened on Miklos's laptop too: the 
> 'movb' of the spin_unlock() done by the wait_task_inactive() got 
> 'bundled' together with the first LOCK instruction that took it again, 
> making it very unlikely for a timer interrupt to ever hit that small 
> window in wait_task_inactive(). The cpu_relax()'s "REP; NOP" was likely 
> a simple NOP, because the Core2Duo is not an SMT platform.
> 
> to check this theory, adding 3 NOPs to the critical section should make 
> the lockups a lot less prominent too. (While NOPs are not actually 
> 'issued', they do take up decoder bandwidth, so they hopefully are able 
> to break up any 'bundle' of instructions.)
> 
> Miklos, if you've got some time to test this - could you revert the 
> fa490cfd15d7 commit and apply the patch below - does it have any impact 
> on the lockups you were experiencing?

No.  If anything it made the freezes somewhat more frequent.

And it's not a NO_HZ kernel.

What I notice is that the interrupt distribution between the CPUs is
very asymmetric like this:

   CPU0   CPU1
  0: 220496 42   IO-APIC-edge  timer
  1:   3841  0   IO-APIC-edge  i8042
  8:  1  0   IO-APIC-edge  rtc
  9:   2756  0   IO-APIC-fasteoi   acpi
 12:   2638  0   IO-APIC-edge  i8042
 14:   7776  0   IO-APIC-edge  ide0
 16:   6083  0   IO-APIC-fasteoi   uhci_hcd:usb2
 17:  34414  3   IO-APIC-fasteoi   uhci_hcd:usb3, HDA Intel




Re: [BUG] long freezes on thinkpad t60

2007-06-22 Thread Jarek Poplawski
On Thu, Jun 21, 2007 at 09:01:28AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 21 Jun 2007, Jarek Poplawski wrote:
...
> So I don't see how you could possibly have two different CPUs getting 
> into some lock-step in that loop: changing "task_rq()" is a really quite 
> heavy operation (it's about migrating between CPU's), and generally 
> happens at a fairly low frequency (ie "a couple of times a second" kind of 
> thing, not "tight CPU loop").

Yes, I've agreed with Ingo it was only one of my illusions...

> But bugs happen..
> 
> > Another possible problem could be a result of some wrong optimization
> > or wrong propagation of change of this task_rq(p) value.
> 
> I agree, but that kind of bug would likely not cause temporary hangs, but 
> actual "the machine is dead" operations. If you get the totally *wrong* 
> value due to some systematic bug, you'd be waiting forever for it to 
> match, not get into a loop for a half second and then it clearing up..
> 
> But I don't think we can throw the "it's another bug" theory _entirely_ 
> out the window.

Alas, I'm the last one here who should talk with you or Ingo about
hardware, but my point is that until it's 100% proven that this
is a spinlocks-vs-CPU case, any nearby possibilities should be
considered. One of them is this loop, which can ... loop. Of
course we can't see any reason for this, but if something is
theoretically allowed, it can happen. It's enough that
task_rq(p) is, for some very improbable reason (maybe buggy
hardware, code or compilation), cached or read too late, and
maybe it's only on this one notebook in the world? If
you're sure this is not the case, let's forget about it.
Of course, I don't mean a direct optimization. I'm thinking
about something that could be triggered by something like this
tested smp_mb() in an entirely different place. IMHO, it would be
very interesting to verify whether this barrier fixed the spinlock
or maybe some variable.

There is also another interesting question: if it was only about
spinlocks (which may be), why do those watchdog traces show
mainly two "fighters": wait_task_inactive() and
try_to_wake_up()? It seems more clients of task_rq_lock()
should be visible. So, another unlikely idea of mine: maybe,
for some other strange, improbable but theoretically
possible reason, there is something wrong around this
yield() in wait_task_inactive(). The comment above this
function says the task *will* unschedule.
So, maybe it would be not nice if e.g. during this yield()
something woke it up?

...
> But it only happens with badly coded software: the rule simply is that you 
> MUST NOT release and immediately re-acquire the same spinlock on the same 
> core, because as far as other cores are concerned, that's basically the 
> same as never releasing it in the first place.

I totally agree! Let's just make certain there is nothing
more here. BTW, is there any reason not to add some test, under
spinlock debugging, that warns about such badly behaving
places?

Regards,
Jarek P.


Re: [BUG] long freezes on thinkpad t60

2007-06-22 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> the freezes that Miklos was seeing were hardirq contexts blocking in 
> task_rq_lock() - that is done with interrupts disabled. (Miklos i 
> think also tried !NOHZ kernels and older kernels, with a similar 
> result.)
> 
> plus on the ptrace side, the wait_task_inactive() code had most of its 
> overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
> likely while we were still holding the runqueue lock!
> 
> i think the only thing that eventually got Miklos' laptop out of the 
> wedge were timer irqs hitting the ptrace CPU in exactly those 
> instructions where it was not holding the runqueue lock. (or perhaps 
> an asynchronous SMM event delaying it for a long time)

even considering that the 'LOCK'-ed instruction was the heaviest in the 
busy-loop, the numbers still just don't add up to 'tens of seconds of 
lockups', so there must be something else happening too.

So here's an addition to the existing theories: the Core2Duo is a 
4-issue CPU architecture. Now, why does this matter? It matters to the 
timing of the delivery of interrupts. For example, on a 3-issue 
architecture, the instruction level profile of well-cached workloads 
often looks like this:

c05a3b71:  710  89 d6                mov    %edx,%esi
c05a3b73:    0  8b 55 c0             mov    0xffc0(%ebp),%edx
c05a3b76:    0  89 c3                mov    %eax,%ebx
c05a3b78:  775  8b 82 e8 00 00 00    mov    0xe8(%edx),%eax
c05a3b7e:    0  8b 48 18             mov    0x18(%eax),%ecx
c05a3b81:    0  8b 45 c8             mov    0xffc8(%ebp),%eax
c05a3b84:  792  89 1c 24             mov    %ebx,(%esp)
c05a3b87:    0  89 74 24 04          mov    %esi,0x4(%esp)
c05a3b8b:    0  ff d1                call   *%ecx
c05a3b8d:    0  8b 4d c8             mov    0xffc8(%ebp),%ecx
c05a3b90:  925  8b 41 6c             mov    0x6c(%ecx),%eax
c05a3b93:    0  39 41 10             cmp    %eax,0x10(%ecx)
c05a3b96:    0  0f 85 a8 01 00 00    jne    c05a3d44 <schedule+0x2a4>
c05a3b9c:  949  89 da                mov    %ebx,%edx
c05a3b9e:    0  89 f1                mov    %esi,%ecx
c05a3ba0:    0  8b 45 c8             mov    0xffc8(%ebp),%eax

the second column is the number of times the profiling interrupt has hit 
that particular instruction.

Note the many zero entries - this means that for instructions that are 
well-cached, the issue order _prevents_ interrupts from _ever_ hitting 
to within a bundle of micro-ops that the decoder will issue! The above 
workload was a plain lat_ctx, so nothing special, and interrupts and DMA 
traffic were coming and going. Still the bundling of instructions was 
very strong.

There's no guarantee of 'instruction bundling': a cachemiss can still 
stall the pipeline and allow an interrupt to hit any instruction [where 
interrupt delivery is valid], but on a well-cached workload like the 
above, even a 3-issue architecture can effectively 'merge' instructions 
to each other, and can make them essentially 'atomic' as far as external 
interrupts go.

[ also note another interesting thing in the profile above: the
  CALL *%ecx was likely BTB-optimized and hence we have a 'bundling' 
  effect that is even larger than 3 instructions. ]

i think that is what might have happened on Miklos's laptop too: the 
'movb' of the spin_unlock() done by the wait_task_inactive() got 
'bundled' together with the first LOCK instruction that took it again, 
making it very unlikely for a timer interrupt to ever hit that small 
window in wait_task_inactive(). The cpu_relax()'s "REP; NOP" was likely 
a simple NOP, because the Core2Duo is not an SMT platform.

to check this theory, adding 3 NOPs to the critical section should make 
the lockups a lot less prominent too. (While NOPs are not actually 
'issued', they do take up decoder bandwidth, so they hopefully are able 
to break up any 'bundle' of instructions.)

Miklos, if you've got some time to test this - could you revert the 
fa490cfd15d7 commit and apply the patch below - does it have any impact 
on the lockups you were experiencing?

Ingo

---
 kernel/sched.c |1 +
 1 file changed, 1 insertion(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1131,6 +1131,7 @@ repeat:
 	preempted = !task_running(rq, p);
 	task_rq_unlock(rq, &flags);
 	cpu_relax();
+	asm volatile ("nop; nop; nop;");
 	if (preempted)
 		yield();
 	goto repeat;





Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> > 
> > damn, i first wrote up an explanation about why that ugly __delay(1) is 
> > there (it almost hurts my eyes when i look at it!) but then deleted it 
> > as superfluous :-/
> 
> I'm fine with a delay, but the __delay(1) is simply not "correct". It 
> doesn't do anything.

it's a bit trickier than that. Yes, it's a simple 1-entry loop and thus 
makes little sense to call. But it's a loop that got boot-time 
calibrated, so we can do this in the spinlock-debug code:

u64 loops = loops_per_jiffy * HZ;

this guarantees that we will loop for _at least_ 1 second before 
printing a message. (in practice it's much longer, especially with the 
current naive trylock approach)

Why? Because as part of the activities that the spin-loop does, we also 
do everything that an mdelay(1000) would do. We do it 'piecemeal-wise', 
and we do it very inefficiently, but the lower time-bound should be 
guaranteed. This is done because most of the problems were caused by too 
short looping and bogus debug printouts. So this is basically an 
open-coded udelay implementation.

Furthermore, with the spin_is_locked() fix i just sent, the __delay(1) 
solution should actually be quite close to a real udelay() thing, 
without the delay effect. It would probably be more accurate to increase it 
to loops_per_jiffy*10*HZ/16 and to call a __delay(16) thing, to get 
closer timings. (but then some people would argue 'why don't you take the 
lock as soon as it's released'.)

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> damn, i first wrote up an explanation about why that ugly __delay(1) is 
> there (it almost hurts my eyes when i look at it!) but then deleted it 
> as superfluous :-/

I'm fine with a delay, but the __delay(1) is simply not "correct". It 
doesn't do anything.

"udelay()" waits for a certain time. Use that. 

> the reason for the __delay(1) was really mundane: to be able to figure 
> out when to print a 'we locked up' message to the user.

No it does not.

You may think it does, but it does nothing of the sort.

Use "udelay()" or somethign that actually takes a *time*.

Just __delay() is nothing but a loop, and calling it with an argument of 1 
is stupid and buggy. 

The only *possibly* valid use of "__delay()" implies using a counter that 
is based on the "loops_per_sec" thing, which depends on what the delay  
function actually is.

For example, the delay function may well turn out to be this:

__asm__ __volatile__(
"\tjmp 1f\n"
".align 16\n"
"1:\tjmp 2f\n"
".align 16\n"
"2:\tdecl %0\n\tjns 2b"
:"=" (d0)
:"0" (loops));

Notice? "Your code, it does nothing!"

When I said that the code was buggy, I meant it.

It has nothing to do with spinlocks. And "__delay(1)" is *always* a bug.

You might want to replace it with

smp_rmb();
udelay(1);

instead, at which point it *does* something: it has that read barrier 
(which is not actually needed on x86, but whatever), and it has a delay 
that is *meaningful*.

A plain "__delay(1)" is neither.

So let me repeat my statement: "What a piece of crap".

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Eric Dumazet

Linus Torvalds wrote:

> On Thu, 21 Jun 2007, Linus Torvalds wrote:
> > We don't do nesting locking either, for exactly the same reason. Are 
> > nesting locks "easier"? Absolutely. They are also almost always a sign of 
> > a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
> > to encourage bad programming!
> 
> Side note, and as a "truth in advertising" section: I'll have to admit 
> that I argued against fair semaphores on the same grounds. I was wrong 
> then (and eventually admitted it, and we obviously try to make our mutexes 
> and semaphores fair these days!), and maybe I'm wrong now.
> 
> If somebody can actually come up with a sequence where we have spinlock 
> starvation, and it's not about an example of bad locking, and nobody 
> really can come up with any other way to fix it, we may eventually have to 
> add the notion of "fair spinlocks".




I tried to find such a sequence, but I think it's more a matter of hardware 
evolution, and some degenerate cases.


In some years (months?), it might be possible to starve, say, the file struct 
spinlock of a process with an open()/close() infinite loop. This is because the 
number of instructions per 'memory cache line transfer between cpus/cores' is 
rising.


But then one can say it's a bug in user code :)

Another way to starve the kernel might be a loop doing settime(), since 
seqlocks are quite special in serialization:

Only seqlock writers perform atomic ops; readers could be starved because of 
some hardware 'optimization'.
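To make the seqlock point concrete, the pattern looks roughly like the toy C11 sketch below (names invented here; this is not the kernel's seqlock_t API). The writer only bumps a sequence counter around its update (odd while in flight), and readers retry until they observe a stable, even sequence; nothing ever forces a writer to wait for readers, which is why reader starvation is at least conceivable:

```c
#include <stdatomic.h>

struct toy_seqlock {
    atomic_uint seq;   /* odd while a write is in progress */
    int value;         /* the data being protected */
};

static void toy_seq_write(struct toy_seqlock *sl, int v)
{
    atomic_fetch_add_explicit(&sl->seq, 1, memory_order_acquire); /* -> odd */
    sl->value = v;
    atomic_fetch_add_explicit(&sl->seq, 1, memory_order_release); /* -> even */
}

static int toy_seq_read(struct toy_seqlock *sl)
{
    unsigned int s;
    int v;
    do {
        /* wait until no write is in flight (even sequence) */
        while ((s = atomic_load_explicit(&sl->seq,
                                         memory_order_acquire)) & 1)
            ;
        v = sl->value;
        /* retry if a writer ran while we were reading */
    } while (atomic_load_explicit(&sl->seq, memory_order_acquire) != s);
    return v;
}
```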



So my arguments are purely pragmatic. It's not that I hate fairness per 
se. I dislike it only when it's used to "solve" (aka hide) other problems.


In the end, some situations do need fairness, and the fact that aiming for 
fairness is often harder, slower, and more complicated than not doing so 
at that point turns into a non-argument. If you need it, you need it.


Maybe some *big* NUMA machines really want this fairness (even if it costs 
some cycles, as Davide pointed out in http://lkml.org/lkml/2007/3/29/246); I am 
just guessing, since I cannot test such monsters. I tested Davide's program on a 
dual Opteron and got some perf difference.



$ ./qspins  -n 2
now testing: TICKLOCK
timeres=4000
uscycles=1991
AVG[0]: 2195.25 cycles/loop
SIG[0]: 11.813657
AVG[1]: 2212.312500 cycles/loop
SIG[1]: 38.038991

$ ./qspins  -n 2 -s
now testing: SPINLOCK
timeres=4000
uscycles=1991
AVG[0]: 2066.00 cycles/loop
SIG[0]: 0.00
AVG[1]: 2115.687500 cycles/loop
SIG[1]: 63.083000




I just don't think we need it, and we're better off solving problems other 
ways.


(For example, we might also solve such problems by creating a separate
"fair_spin_lock" abstraction, and only making the particular users that 
need it actually use it. It would depend a bit on whether the cost of 
implementing the fairness is noticeable enough for it to be worth having 
a separate construct for it).


Linus






Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> yeah. I think Linux is the only OS on the planet that is using 
> the movb trick for unlock, it even triggered a hardware erratum ;)

I'm pretty sure others do it too.

Maybe not on an OS level (but I actually doubt that - I'd be surprised if 
Windows doesn't do the exact same thing), but I know for a fact that a lot 
of people in threaded libraries end up depending very much on the "simple 
store closes a locked section".

Just googling for "xchg" "mov" "spinlock" "-linux" shows discussion boards 
for Windows developers with open-coded spinlocks like


int ResourceFlag = 0; // 0=Free, 1=Inuse
...
// Wait until we get the resource
while(InterlockedExchange(&ResourceFlag, 1) != 0) {
   Sleep(0); } // Wait a tad
// Have the resource
... // do your thing
ResourceFlag = 0; // Release the resource


and that's definitely Windows code, not some Linux person doing it.

And this is from an OS2 forum

unsigned owned=0;

void request() {
  while(LockedExchanged(&owned,1)!=0)
;
}

void release() {
  owned = 0;
}

so it's not even something unusual.

So while arguably these people don't know (and don't care) about subtle 
issues like memory ordering, I can *guarantee* that a lot of programs 
depend on them, even if that dependency may often come from a lack of 
knowledge, rather than actively understanding what we do like in the Linux 
kernel community.

(And yes, they rely on compilers not reordering either. Tough.)

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > 	for (;;) {
> > 		for (i = 0; i < loops; i++) {
> > 			if (__raw_write_trylock(&lock->raw_lock))
> > 				return;
> > 			__delay(1);
> > 		}
> 
> What a piece of crap. 
> 
> Anybody who ever waits for a lock by busy-looping over it is BUGGY, 
> dammit!
> 
> The only correct way to wait for a lock is:
> 
>   (a) try it *once* with an atomic r-m-w 
>   (b) loop over just _reading_ it (and something that implies a memory 
>   barrier, _not_ "__delay()". Use "cpu_relax()" or "smp_rmb()")
>   (c) rinse and repeat.

damn, i first wrote up an explanation about why that ugly __delay(1) is 
there (it almost hurts my eyes when i look at it!) but then deleted it 
as superfluous :-/

really, it's not because i'm stupid (although i might still be stupid 
for other reasons ;-), it wasn't there in earlier spin-debug versions. We 
even had an inner spin_is_locked() loop at one stage (and should add it 
again).

the reason for the __delay(1) was really mundane: to be able to figure 
out when to print a 'we locked up' message to the user. If it's 1 
second, it causes false positive on some systems. If it's 10 minutes, 
people press reset before we print out any useful data. It used to be 
just a loop of rep_nop()s, but that was hard to calibrate: on certain 
newer hardware it was triggering as fast as in 2 seconds, causing many 
false positives. We cannot use jiffies nor any other clocksource in this 
debug code.

so i settled for the butt-ugly but working __delay(1) thing, to be able 
to time the debug messages.

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> > 
> > I can understand why no data is saved by this change: gcc is 
> > aligning the next field to a natural boundary anyway and we dont 
> > really have arrays of spinlocks (fortunately).
> 
> Actually, some data structures could well shrink.
> 
> Look at "struct task_struct", for example. Right now it has two 
> spinlocks right next to each other (alloc_lock and pi_lock).

yeah. We've got init_task's task-struct embedded in the vmlinux, but 
it's aligned to 32 bytes, which probably hides this effect. We'd only 
see it if the size change just happened to bring a key data structure 
(which is also embedded in the vmlinux) just below a modulo 32 bytes 
boundary. The chance for that is around 6:32 per 'merge' event. That 
means that there cannot be all that many such cases ;-)

anyway, the shorter init sequence is worth it already.

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> No, the cache line arbitration doesn't know anything about "locked" vs 
> "unlocked" instructions (it could, but there really is no point).
> 
> The real issue is that locked instructions on x86 are serializing, 
> which makes them extremely slow (compared to just a write), and then 
> by being slow they effectively just make the window for another core 
> bigger.
> 
> IOW, it doesn't "fix" anything, it just hides the bug with timing.

yeah. I think Linux is the only OS on the planet that is using 
the movb trick for unlock, it even triggered a hardware erratum ;) So it 
might surprise some hw makers who might rely on the heuristics that each 
critical section lock and unlock is a LOCK-ed instruction.

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > If somebody can actually come up with a sequence where we have 
> > spinlock starvation, and it's not about an example of bad locking, and 
> > nobody really can come up with any other way to fix it, we may 
> > eventually have to add the notion of "fair spinlocks".
> 
> there was one bad case i can remember: the spinlock debugging code had a 
> trylock open-coded loop and on certain Opterons CPUs were starving each 
> other.

But this is a perfect example of exactly what I'm talking about:

 THAT CODE IS HORRIBLY BUGGY!

It's not the spinlocks that are broken, it's that damn code.

> 	for (;;) {
> 		for (i = 0; i < loops; i++) {
> 			if (__raw_write_trylock(&lock->raw_lock))
> 				return;
> 			__delay(1);
> 		}

What a piece of crap. 

Anybody who ever waits for a lock by busy-looping over it is BUGGY, 
dammit!

The only correct way to wait for a lock is:

  (a) try it *once* with an atomic r-m-w 
  (b) loop over just _reading_ it (and something that implies a memory 
  barrier, _not_ "__delay()". Use "cpu_relax()" or "smp_rmb()")
  (c) rinse and repeat.

and code like the above should just be shot on sight.
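For illustration, the (a)/(b)/(c) recipe is the classic "test-and-test-and-set" loop. Below is a minimal C11 sketch (the type and function names are invented here; the kernel's real locks are hand-written assembly and use cpu_relax() in the read loop):

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *lock)
{
    for (;;) {
        /* (a) try it once with an atomic read-modify-write */
        if (atomic_exchange_explicit(&lock->locked, 1,
                                     memory_order_acquire) == 0)
            return;
        /* (b) spin on plain reads only: the cacheline can stay in
         * shared state, and no bus-locked traffic is generated while
         * the lock is held elsewhere (a real kernel would put
         * cpu_relax() in this loop) */
        while (atomic_load_explicit(&lock->locked,
                                    memory_order_relaxed) != 0)
            ;
        /* (c) rinse and repeat */
    }
}

static void toy_spin_unlock(toy_spinlock_t *lock)
{
    /* the release is a plain store, like the kernel's movb unlock */
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

The key point is that the atomic exchange happens at most once per observed release, instead of hammering the line with locked cycles.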

So don't blame the spinlocks or the hardware for crap code.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> It's in fact entirely possible that the long freezes have always been 
> there, but the NOHZ option meant that we had much longer stretches of 
> time without things like timer interrupts to jumble up the timing! So 
> maybe the freezes existed before, but with timer interrupts happening 
> hundreds of times a second, they weren't noticeable to humans.

the freezes that Miklos was seeing were hardirq contexts blocking in 
task_rq_lock() - that is done with interrupts disabled. (Miklos i think 
also tried !NOHZ kernels and older kernels, with a similar result.)

plus on the ptrace side, the wait_task_inactive() code had most of its 
overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
likely while we were still holding the runqueue lock!

i think the only thing that eventually got Miklos' laptop out of the 
wedge were timer irqs hitting the ptrace CPU in exactly those 
instructions where it was not holding the runqueue lock. (or perhaps an 
asynchronous SMM event delaying it for a long time)

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> (And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't 
> think such an insane machine has ever existed).

and if people _really_ want to boot a large-smp 32-bit kernel on some 
new, tons-of-cpus box, as a workaround they can enable the spinlock 
debugging code, which has no limitation on the number of CPUs supported.

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> I can understand why no data is saved by this change: gcc is aligning 
> the next field to a natural boundary anyway and we dont really have 
> arrays of spinlocks (fortunately).

Actually, some data structures could well shrink.

Look at "struct task_struct", for example. Right now it has two spinlocks 
right next to each other (alloc_lock and pi_lock).

Other data structures may have things like bitfields etc.

But yeah, I'd not expect that to be very common, and in some cases you 
might have to re-order data structures to take advantage of better 
packing, and even then it's probably not all that noticeable.

> but this is certainly not something for 2.6.22, it's an early 2.6.23 
> matter i suspect.

Oh, absolutely.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> If somebody can actually come up with a sequence where we have 
> spinlock starvation, and it's not about an example of bad locking, and 
> nobody really can come up with any other way to fix it, we may 
> eventually have to add the notion of "fair spinlocks".

there was one bad case i can remember: the spinlock debugging code had a 
trylock open-coded loop and on certain Opterons CPUs were starving each 
other. This used to trigger with the ->tree_lock rwlock i think, on 
heavy MM loads. The starvation got so bad that the NMI watchdog started 
triggering ...

interestingly, this only triggered for certain rwlocks. Thus we, after a 
few failed attempts to pacify this open-coded loop, currently have that 
code disabled in lib/spinlock_debug.c:

 #if 0   /* This can cause lockups */
 static void __write_lock_debug(rwlock_t *lock)
 {
 	u64 i;
 	u64 loops = loops_per_jiffy * HZ;
 	int print_once = 1;

 	for (;;) {
 		for (i = 0; i < loops; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
 		}

the weird thing is that we still have the _very same_ construct in 
__spin_lock_debug():

 	for (i = 0; i < loops; i++) {
 		if (__raw_spin_trylock(&lock->raw_lock))
 			return;
 		__delay(1);
 	}

if there are any problems with this then people are not complaining loud 
enough :-)

note that because this is a trylock-based loop, the acquire+release 
sequence problem should not apply here.

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Umm. i386 spinlocks could and should be *one*byte*.
> 
> In fact, I don't even know why they are wasting four bytes right now: 
> the fact that somebody made them an "int" just wastes memory. All the 
> actual code uses "decb", so it's not even a question of safety. I 
> wonder why we have that 32-bit thing and the ugly casts.
> 
> Ingo, any memory of that?

no real reason that i can recall - i guess nobody dared to touch it 
because it used to have that 'volatile', indicating black voodoo ;-) Now 
that the bad stigma has been removed, we could try the patch below. It 
boots fine here, and we save 1K of kernel text size:

     text    data     bss     dec     hex filename
  6236003  611992  401408 7249403  6e9dfb vmlinux.before
  6235075  611992  401408 7248475  6e9a5b vmlinux.after

I can understand why no data is saved by this change: gcc is aligning 
the next field to a natural boundary anyway and we dont really have 
arrays of spinlocks (fortunately). [and we save no data even if using 
the ((packed)) attribute.] Perhaps some data structure that is never in 
the kernel image itself still got smaller? Any good way to determine 
that?

But why is the text size different? Ah: i think it's spin_lock_init() 
getting shorter :-)

but this is certainly not something for 2.6.22, it's an early 2.6.23 
matter i suspect.

Ingo

--->
From: Ingo Molnar <[EMAIL PROTECTED]>
Subject: [patch] spinlocks i386: change them to byte fields

all spinlock ops are on byte operands, so change the spinlock field to 
be unsigned char. This saves a bit of kernel text size:

     text    data     bss     dec     hex filename
6236003  611992  401408 7249403  6e9dfb vmlinux.before
6235075  611992  401408 7248475  6e9a5b vmlinux.after

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/asm-i386/spinlock_types.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/asm-i386/spinlock_types.h
===================================================================
--- linux.orig/include/asm-i386/spinlock_types.h
+++ linux/include/asm-i386/spinlock_types.h
@@ -6,7 +6,7 @@
 #endif
 
 typedef struct {
-   unsigned int slock;
+   unsigned char slock;
 } raw_spinlock_t;
 
 #define __RAW_SPIN_LOCK_UNLOCKED   { 1 }


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Linus Torvalds wrote:
> 
> We don't do nesting locking either, for exactly the same reason. Are 
> nesting locks "easier"? Absolutely. They are also almost always a sign of 
> a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
> to encourage bad programming!

Side note, and as a "truth in advertising" section: I'll have to admit 
that I argued against fair semaphores on the same grounds. I was wrong 
then (and eventually admitted it, and we obviously try to make our mutexes 
and semaphores fair these days!), and maybe I'm wrong now.

If somebody can actually come up with a sequence where we have spinlock 
starvation, and it's not about an example of bad locking, and nobody 
really can come up with any other way to fix it, we may eventually have to 
add the notion of "fair spinlocks".

So my arguments are purely pragmatic. It's not that I hate fairness per 
se. I dislike it only when it's used to "solve" (aka hide) other problems.

In the end, some situations do need fairness, and the fact that aiming for 
fairness is often harder, slower, and more complicated than not doing so 
at that point turns into a non-argument. If you need it, you need it.

I just don't think we need it, and we're better off solving problems other 
ways.

(For example, we might also solve such problems by creating a separate
"fair_spin_lock" abstraction, and only making the particular users that 
need it actually use it. It would depend a bit on whether the cost of 
implementing the fairness is noticeable enough for it to be worth having 
a separate construct for it).

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Eric Dumazet wrote:
> 
> This reminds me of Nick's proposal of 'queued spinlocks' 3 months ago
> 
> Maybe this should be re-considered? (unlock is still a non-atomic op, 
> so we don't pay the serializing cost twice)

No. The point is simple:

IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG!

I don't understand why this is even controversial. Especially since we 
have a patch for the problem that proves my point: the _proper_ way to fix 
things is to just not do the bad thing, instead of trying to allow the bad 
behaviour and try to handle it.

Things like queued spinlocks just make excuses for bad code. 

We don't do nesting locking either, for exactly the same reason. Are 
nesting locks "easier"? Absolutely. They are also almost always a sign of 
a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
to encourage bad programming!

> extract : 
> 
> Implement queued spinlocks for i386. This shouldn't increase the size of
> the spinlock structure, while still able to handle 2^16 CPUs.

Umm. i386 spinlocks could and should be *one*byte*.

In fact, I don't even know why they are wasting four bytes right now: the 
fact that somebody made them an "int" just wastes memory. All the actual 
code uses "decb", so it's not even a question of safety. I wonder why we 
have that 32-bit thing and the ugly casts.

Ingo, any memory of that?

(And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't think 
such an insane machine has ever existed).

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Eric Dumazet
On Thu, 21 Jun 2007 10:31:53 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Thu, 21 Jun 2007, Chuck Ebbert wrote:
> > 
> > A while ago I showed that spinlocks were a lot more fair when doing
> > unlock with the xchg instruction on x86. Probably the arbitration is all
> > screwed up because we use a mov instruction, which while atomic is not
> > locked.
> 
> No, the cache line arbitration doesn't know anything about "locked" vs 
> "unlocked" instructions (it could, but there really is no point).
> 
> The real issue is that locked instructions on x86 are serializing, which 
> makes them extremely slow (compared to just a write), and then by being 
> slow they effectively just make the window for another core bigger.
> 

This reminds me of Nick's proposal of 'queued spinlocks' 3 months ago.

Maybe this should be re-considered? (unlock is still a non-atomic op, 
so we don't pay the serializing cost twice)

http://lwn.net/Articles/227506/

extract : 

Implement queued spinlocks for i386. This shouldn't increase the size of
the spinlock structure, while still able to handle 2^16 CPUs.

The queued spinlock has 2 fields, a head and a tail, which are indexes
into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
"atomic_inc_return" on the head index, and keeps the returned value as
a ticket. The CPU then spins until the tail index is equal to that
ticket.

To unlock a spinlock, the tail index is incremented (this can be non
atomic, because only the lock owner will modify tail).

Implementation inefficiencies aside, this change should have little
effect on performance for uncontended locks, but will have quite a
large cost for highly contended locks [O(N) cacheline transfers vs
O(1) per lock acquisition, where N is the number of CPUs contending].
The benefit is that contended locks will not cause any starvation.

Just an idea. Big NUMA hardware seems to have fairness logic that
prevents starvation for the regular spinlock logic. But it might be
interesting for -rt kernel or systems with starvation issues. 
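The head/tail scheme in the quoted extract amounts to a ticket lock. A toy C11 sketch of the idea (names invented here; Nick's actual patch is x86 assembly, and this version uses fetch-add, which returns the old value, as the "take a ticket" step):

```c
#include <stdatomic.h>

/* Toy ticket lock: grab a ticket from `head` with one atomic increment,
 * then spin until `tail` reaches it.  Unlocking bumps `tail` with plain
 * loads/stores, since only the lock owner ever modifies it. */
struct ticket_lock {
    atomic_uint head;   /* next ticket to hand out */
    atomic_uint tail;   /* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *lock)
{
    /* fetch_add returns the old value: our place in the FIFO */
    unsigned int ticket = atomic_fetch_add_explicit(&lock->head, 1,
                                                    memory_order_relaxed);
    while (atomic_load_explicit(&lock->tail,
                                memory_order_acquire) != ticket)
        ;  /* cpu_relax() in a real kernel */
}

static void ticket_unlock(struct ticket_lock *lock)
{
    /* non-atomic increment: no locked cycle on the unlock path */
    atomic_store_explicit(&lock->tail,
        atomic_load_explicit(&lock->tail, memory_order_relaxed) + 1,
        memory_order_release);
}
```

Because waiters are served strictly in ticket order, a core that releases and immediately re-takes the lock goes to the back of the queue, which is exactly the fairness property being discussed.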


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Chuck Ebbert wrote:
> 
> A while ago I showed that spinlocks were a lot more fair when doing
> unlock with the xchg instruction on x86. Probably the arbitration is all
> screwed up because we use a mov instruction, which while atomic is not
> locked.

No, the cache line arbitration doesn't know anything about "locked" vs 
"unlocked" instructions (it could, but there really is no point).

The real issue is that locked instructions on x86 are serializing, which 
makes them extremely slow (compared to just a write), and then by being 
slow they effectively just make the window for another core bigger.

IOW, it doesn't "fix" anything, it just hides the bug with timing.

You can hide the problem other ways by just increasing the delay between 
the unlock and the lock (and adding one or more serializing instruction in 
between is generally the best way of doing that, simply because otherwise 
micro-architecture may just be re-ordering things on you, so that your 
"delay" isn't actually in between any more!).

But adding delays doesn't really fix anything, of course. It makes things 
"fairer" by making *both* sides suck more, but especially if both sides 
are actually the same exact thing, I could well imagine that they'd both 
just suck equally, and get into some pattern where they are now both 
slower, but still see exactly the same problem!

Of course, as long as interrupts are on, or things like DMA happen etc, 
it's really *really* hard to be totally unlucky, and after a while you're 
likely to break out of the steplock on your own, just because the CPU's 
get interrupted by something else.

It's in fact entirely possible that the long freezes have always been 
there, but the NOHZ option meant that we had much longer stretches of time 
without things like timer interrupts to jumble up the timing! So maybe the 
freezes existed before, but with timer interrupts happening hundreds of 
times a second, they weren't noticeable to humans.

(Btw, that's just _one_ theory. Don't take it _too_ seriously, but it 
could be one of the reasons why this showed up as a "new" problem, even 
though I don't think the "wait_for_inactive()" thing has changed lately.)

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Chuck Ebbert
On 06/21/2007 12:08 PM, Ingo Molnar wrote:
> yeah - i'm not at all arguing in favor of the BTRL patch i did: i always 
> liked the 'nicer' inner loop of spinlocks, which could btw also easily 
> use MONITOR/MWAIT.

The "nice" inner loop is necessary or else it would generate huge amounts
of bus traffic while spinning.

> So it seems the problem was that if a core kept _truly_ modifying a 
> cacheline via atomics in a high enough frequency, it could artificially 
> starve the other core. (which would keep waiting for the cacheline to be 
> released one day, and which kept the first core from ever making any 
> progress) To me that looks like a real problem on the hardware side - 
> shouldnt cacheline ownership be arbitrated a bit better than that?
> 

A while ago I showed that spinlocks were a lot more fair when doing
unlock with the xchg instruction on x86. Probably the arbitration is all
screwed up because we use a mov instruction, which while atomic is not
locked.



Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> so the problem was not the trylock based spin_lock() itself (no matter 
> how it's structured in the assembly), the problem was actually modifying 
> the lock and re-modifying it again and again in a very tight 
> high-frequency loop, and hence not giving it to the other core?

That's what I think, yes.

And it matches the behaviour we've seen before (remember the whole thread 
about whether Opteron or Core 2 hardware is "more fair" between cores?)

It all simply boils down to the fact that releasing and almost immediately 
re-taking a lock is "invisible" to outside cores, because it will happen 
entirely within the L1 cache of one core if the cacheline is already in 
"owned" state.

Another core that spins wildly trying to get it ends up using a much 
slower "beat" (the cache coherency clock), and with just a bit of bad luck 
the L1 cache would simply never get probed, and the line never stolen, at 
the point where the lock happens to be released.

The fact that writes (as in the store that releases the lock) 
automatically get delayed by any x86 core by the store buffer, and the 
fact that atomic read-modify-write cycles do *not* get delayed, just means 
that if the "spinlock release" was "fairly close" to the "reacquire" in a 
software sense, the hardware will actually make them *even*closer*.

So you could make the spinlock release be a read-modify-write cycle, and 
it would make the spinlock much slower, but it would also make it much 
more likely that the other core will *see* the release.

For the same reasons, if you add a "smp_mb()" in between the release and 
the re-acquire, the other core is much more likely to see it: it means 
that the release won't be delayed, and thus it just won't be as "close" to 
the re-acquire.

So you can *hide* this problem in many ways, but it's still just hiding 
it.

The proper fix is to not do that kind of "release and re-acquire" 
behaviour in a loop.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> > 
> > what worries me a bit though is that my patch that made spinlocks 
> > equally agressive to that loop didnt solve the hangs!
> 
> Your parch kept doing "spin_trylock()", didn't it?

yeah - it changed spin_lock()'s assembly to do a "LOCK BTRL", which is a 
trylock which tries to dirty the cacheline. There was a "REP NOP" after 
it and a loop back to the "LOCK BTRL".

> That's a read-modify-write thing, and keeps bouncing the cacheline 
> back and forth, and together with the fact that even *after* you get 
> the spinlock the "wait_for_inactive()" would actually end up looping 
> back, releasing it, and re-getting it.
> 
> So the problem was that "wait_for_inactive()" kept the lock (because 
> it actually *got* it), and looped over getting it, and because it was 
> an exclusive cacheline ownership, that implies that somebody else is 
> not getting it, and is kept from ever getting it.

ok, it's not completely clear where exactly the other core was spinning, 
but i took it from Miklos' observations that the other core was hanging 
in the _very same_ task_rq_lock() - which is a true spin_lock() acquire 
as well. So on one core the spin_lock() was starving, on 
another one it was always succeeding.

> So trying to use "trylock" doesn't help. It still has all the same bad 
> sides - it still gets the lock (getting the lock wasn't the problem: 
> _holding_ the lock was the problem), and it still kept the cache line 
> for the lock on one core.

so the problem was not the trylock based spin_lock() itself (no matter 
how it's structured in the assembly), the problem was actually modifying 
the lock and re-modifying it again and again in a very tight 
high-frequency loop, and hence not giving it to the other core?

> The only way to avoid lock contention is to avoid any exclusive use at 
> all.

yeah - i'm not at all arguing in favor of the BTRL patch i did: i always 
liked the 'nicer' inner loop of spinlocks, which could btw also easily 
use MONITOR/MWAIT. (my patch is also quite close to what we did in 
spinlocks many years ago, so it's more of a step backwards than real 
progress.)

So it seems the problem was that if a core kept _truly_ modifying a 
cacheline via atomics at a high enough frequency, it could artificially 
starve the other core (which would keep waiting for the cacheline to be 
released some day, and which was kept from ever making any progress). 
To me that looks like a real problem on the hardware side - shouldn't 
cacheline ownership be arbitrated a bit better than that?

Up to the point where some external event (perhaps a periodic SMM 
related to thermal management) broke the deadlock/livelock scenario?

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Jarek Poplawski wrote:
> 
> BTW, I've looked a bit at these NMI watchdog traces, and now I'm not
> even sure it's necessarily the spinlock's problem (but I don't exclude
> this possibility yet). It seems both processors use task_rq_lock(), so
> there could be also a problem with that loop.

I agree that that is also this kind of "loop getting a lock" loop, but if 
you can see it actually happening there, we have some other (and major) 
bug.

That loop is indeed a loop, but realistically speaking it should never be 
run through more than once. The only reason it's a loop is that there's a 
small small race where we get the information at the wrong time, get the 
wrong lock, and try again.

So the loop certainly *can* trigger (or it would be pointless), but I'd 
normally expect it not to, and even if it does end up looping around it 
should happen maybe *one* more time, absolutely not "get stuck" in the 
loop for any appreciable number of iterations.

So I don't see how you could possibly have two different CPU's getting 
into some lock-step in that loop: changing "task_rq()" is a really quite 
heavy operation (it's about migrating between CPU's), and generally 
happens at a fairly low frequency (ie "a couple of times a second" kind of 
thing, not "tight CPU loop").

But bugs happen..

> Another possible problem could be a result of some wrong optimization
> or wrong propagation of change of this task_rq(p) value.

I agree, but that kind of bug would likely not cause temporary hangs, but 
actual "the machine is dead" operations. If you get the totally *wrong* 
value due to some systematic bug, you'd be waiting forever for it to 
match, not get into a loop for a half second and then it clearing up..

But I don't think we can throw the "it's another bug" theory _entirely_ 
out the window.

That said, both Ingo and I have done "fairness" testing of hardware, and 
yes, if you have a reasonably tight loop with spinlocks, existing hardware 
really *is* unfair. Not by a "small amount" either - I had this program 
that did statistics (and Ingo improved on it and had some other tests 
too), and basically if you have two CPU's that try to get the same 
spinlock, they really *can* get into situations where one of them gets it 
millions of times in a row, and the other _never_ gets it.

But it only happens with badly coded software: the rule simply is that you 
MUST NOT release and immediately re-acquire the same spinlock on the same 
core, because as far as other cores are concerned, that's basically the 
same as never releasing it in the first place.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> what worries me a bit though is that my patch that made spinlocks 
> equally agressive to that loop didnt solve the hangs!

Your patch kept doing "spin_trylock()", didn't it?

That's a read-modify-write thing, and keeps bouncing the cacheline back 
and forth, and together with the fact that even *after* you get the 
spinlock the "wait_for_inactive()" would actually end up looping back, 
releasing it, and re-getting it.

So the problem was that "wait_for_inactive()" kept the lock (because it 
actually *got* it), and looped over getting it, and because it was an 
exclusive cacheline ownership, that implies that somebody else is not 
getting it, and is kept from ever getting it.

So trying to use "trylock" doesn't help. It still has all the same bad 
sides - it still gets the lock (getting the lock wasn't the problem: 
_holding_ the lock was the problem), and it still kept the cache line for 
the lock on one core.

The only way to avoid lock contention is to avoid any exclusive use at 
all.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Jarek Poplawski
On Thu, Jun 21, 2007 at 10:39:31AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> 
> > BTW, I've looked a bit at these NMI watchdog traces, and now I'm not 
> > even sure it's necessarily the spinlock's problem (but I don't exclude 
> > this possibility yet). It seems both processors use task_rq_lock(), so 
> > there could be also a problem with that loop. The way the correctness 
> > of the taken lock is verified is racy: there is a small probability 
> > that if we have taken the wrong lock the check inside the loop is done 
> > just before the value is being changed elsewhere under the right 
> > lock. Another possible problem could be a result of some wrong 
> > optimization or wrong propagation of change of this task_rq(p) value.
> 
> ok, could you elaborate this in a bit more detail? You say it's racy - 
> any correctness bug in task_rq_lock() will cause the kernel to blow up 
> in spectacular ways. It's a fairly straightforward loop:
> 
>  static inline struct rq *__task_rq_lock(struct task_struct *p)
>  	__acquires(rq->lock)
>  {
>  	struct rq *rq;
> 
>  repeat_lock_task:
>  	rq = task_rq(p);
>  	spin_lock(&rq->lock);
>  	if (unlikely(rq != task_rq(p))) {
>  		spin_unlock(&rq->lock);
>  		goto repeat_lock_task;
>  	}
>  	return rq;
>  }
> 
> the result of task_rq() depends on p->thread_info->cpu which will only 
> change if a task has migrated over to another CPU. That is a 
> fundamentally 'slow' operation, but even if a task does it intentionally 
> in a high frequency way (for example via repeated calls to 
> sched_setaffinity) there's no way it could be faster than the spinlock 
> code here. So ... what problems can you see with it?

OK, you are right - I withdraw this "idea". Sorry!

Jarek P.


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:

> BTW, I've looked a bit at these NMI watchdog traces, and now I'm not 
> even sure it's necessarily the spinlock's problem (but I don't exclude 
> this possibility yet). It seems both processors use task_rq_lock(), so 
> there could be also a problem with that loop. The way the correctness 
> of the taken lock is verified is racy: there is a small probability 
> that if we have taken the wrong lock the check inside the loop is done 
> just before the value is being changed elsewhere under the right 
> lock. Another possible problem could be a result of some wrong 
> optimization or wrong propagation of change of this task_rq(p) value.

ok, could you elaborate this in a bit more detail? You say it's racy - 
any correctness bug in task_rq_lock() will cause the kernel to blow up 
in spectacular ways. It's a fairly straightforward loop:

 static inline struct rq *__task_rq_lock(struct task_struct *p)
 	__acquires(rq->lock)
 {
 	struct rq *rq;

 repeat_lock_task:
 	rq = task_rq(p);
 	spin_lock(&rq->lock);
 	if (unlikely(rq != task_rq(p))) {
 		spin_unlock(&rq->lock);
 		goto repeat_lock_task;
 	}
 	return rq;
 }

the result of task_rq() depends on p->thread_info->cpu which will only 
change if a task has migrated over to another CPU. That is a 
fundamentally 'slow' operation, but even if a task does it intentionally 
in a high frequency way (for example via repeated calls to 
sched_setaffinity) there's no way it could be faster than the spinlock 
code here. So ... what problems can you see with it?

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> In other words, spinlocks are optimized for *lack* of contention. If a 
> spinlock has contention, you don't try to make the spinlock "fair". 
> No, you try to fix the contention instead!

yeah, and if there's no easy solution, change it to a mutex. Fastpath 
performance of spinlocks and mutexes is essentially the same, and if 
there's any measurable contention then the scheduler is pretty good at 
sorting things out. Say if the average contention is longer than 10-20 
microseconds then likely we could already win by scheduling away to some 
other task. (the best is of course to have no contention at all - but 
there are cases where it is really hard, and there are cases where it's 
outright unmaintainable.)

Hw makers are currently producing transistors disproportionately faster 
than humans are producing parallel code, as a result of which we've got 
more CPU cache than ever, even taking natural application bloat into 
account. (it just makes no sense to spend those transistors on 
parallelism when applications are just not making use of it yet. Plus 
caches are a lot less power-intensive than functional units of the CPU, 
and the limit these days is power input.)

So scheduling more frequently and more aggressively makes more sense than 
ever before, and that trend will likely not stop for some time to come. 

> The patch I sent out was an example of that. You *can* fix contention 
> problems. Does it take clever approaches? Yes. It's why we have hashed 
> spinlocks, RCU, and code sequences that are entirely lockless and use 
> optimistic approaches. And suddenly you get fairness *and* 
> performance!

what worries me a bit though is that my patch that made spinlocks 
equally aggressive to that loop didn't solve the hangs! So there is some 
issue we don't understand yet - why was the wait_inactive_task() 
open-coded spin-trylock loop starving the other core which had ... an 
open-coded spin-trylock loop coded up in assembly? And we've got a 
handful of other open-coded loops in the kernel (networking for example), 
so this issue could come back and haunt us in a situation where we don't 
have a gifted hacker like Miklos being able to spend _weeks_ tracking 
down the problem...

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Jarek Poplawski
On Wed, Jun 20, 2007 at 10:34:15AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 20 Jun 2007, Jarek Poplawski wrote:
> > 
> > I don't agree with this (+ I know it doesn't matter).
> > 
> > The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
> > And here they are simply lawlessly not fair.
> 
> Well, that's certainly a valid standpoint. I wouldn't claim you're 
> _wrong_.
> 
> At least in theory.
> 
> In *practice*, fair spinlocks are simply not possible. Not as in "it's 
> hard to do", but as in "you simply cannot do it".

I think we can agree it was more about some minimal fairness.

...
> Which gets us to the next level: we can consider hardware that isn't 
> "fair" in its cache coherency to be buggy hardware.

IMHO, we shouldn't try to blame the hardware until we know exactly
the source of this bug.

...
> In other words, spinlocks are optimized for *lack* of contention. If a 
> spinlock has contention, you don't try to make the spinlock "fair". No, 
> you try to fix the contention instead!
> 
> That way, you fix many things. You probably speed things up, _and_ you 
> make it fairer. It might sometimes take some brains and effort, but it's 
> worth it.

I could agree with this, but only when we know exactly what place
should be fixed if we don't care about speed. Then, of course, the
cost could be estimated. But after Ingo's last patch I'm not sure
he, or anybody else here, has this knowledge. I'd also point out that
adding one smp_mb() also did the job, and it doesn't look like a
big performance hit. We should just understand better why this works.

> The patch I sent out was an example of that. You *can* fix contention 
> problems. Does it take clever approaches? Yes. It's why we have hashed 
> spinlocks, RCU, and code sequences that are entirely lockless and use 
> optimistic approaches. And suddenly you get fairness *and* performance!
> 
> It's a win-win situation. It does require a bit of effort, but hey, we're 
> good at effort.

Not necessarily so. Until the exact reason is known "for sure",
this one place could be fixed, but the same problem could appear
somewhere else in a more masked form, or be far less repeatable.

BTW, I've looked a bit at these NMI watchdog traces, and now I'm not
even sure it's necessarily the spinlock's problem (but I don't exclude
this possibility yet). It seems both processors use task_rq_lock(), so
there could be also a problem with that loop. The way the correctness
of the taken lock is verified is racy: there is a small probability
that if we have taken the wrong lock the check inside the loop is done
just before the value is being changed elsewhere under the right lock.
Another possible problem could be a result of some wrong optimization
or wrong propagation of change of this task_rq(p) value.

Thanks for response & best regards,
Jarek P.



Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> so the problem was not the trylock based spin_lock() itself (no matter 
> how it's structured in the assembly), the problem was actually modifying 
> the lock and re-modifying it again and again in a very tight 
> high-frequency loop, and hence not giving it to the other core?

That's what I think, yes.

And it matches the behaviour we've seen before (remember the whole thread 
about whether Opteron or Core 2 hardware is more fair between cores?)

It all simply boils down to the fact that releasing and almost immediately 
re-taking a lock is invisible to outside cores, because it will happen 
entirely within the L1 cache of one core if the cacheline is already in 
owned state.

Another core that spins wildly trying to get it ends up using a much 
slower beat (the cache coherency clock), and with just a bit of bad luck 
the L1 cache would simply never get probed, and the line never stolen, at 
the point where the lock happens to be released.

The fact that writes (as in the store that releases the lock) 
automatically get delayed by any x86 core by the store buffer, and the 
fact that atomic read-modify-write cycles do *not* get delayed, just means 
that if the spinlock release was fairly close to the reacquire in a 
software sense, the hardware will actually make them *even*closer*.

So you could make the spinlock release be a read-modify-write cycle, and 
it would make the spinlock much slower, but it would also make it much 
more likely that the other core will *see* the release.

For the same reasons, if you add a smp_mb() in between the release and 
the re-acquire, the other core is much more likely to see it: it means 
that the release won't be delayed, and thus it just won't be as close to 
the re-acquire.

So you can *hide* this problem in many ways, but it's still just hiding 
it.

The proper fix is to not do that kind of release and re-acquire 
behaviour in a loop.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Chuck Ebbert
On 06/21/2007 12:08 PM, Ingo Molnar wrote:
 yeah - i'm not at all arguing in favor of the BTRL patch i did: i always 
 liked the 'nicer' inner loop of spinlocks, which could btw also easily 
 use MONITOR/MWAIT.

The nice inner loop is necessary or else it would generate huge amounts
of bus traffic while spinning.

 So it seems the problem was that if a core kept _truly_ modifying a 
 cacheline via atomics in a high enough frequency, it could artificially 
 starve the other core. (which would keep waiting for the cacheline to be 
 released one day, and which kept the first core from ever making any 
 progress) To me that looks like a real problem on the hardware side - 
 shouldn't cacheline ownership be arbitrated a bit better than that?
 

A while ago I showed that spinlocks were a lot more fair when doing
unlock with the xchg instruction on x86. Probably the arbitration is all
screwed up because we use a mov instruction, which while atomic is not
locked.



Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Chuck Ebbert wrote:
 
 A while ago I showed that spinlocks were a lot more fair when doing
 unlock with the xchg instruction on x86. Probably the arbitration is all
 screwed up because we use a mov instruction, which while atomic is not
 locked.

No, the cache line arbitration doesn't know anything about locked vs 
unlocked instructions (it could, but there really is no point).

The real issue is that locked instructions on x86 are serializing, which 
makes them extremely slow (compared to just a write), and then by being 
slow they effectively just make the window for another core bigger.

IOW, it doesn't fix anything, it just hides the bug with timing.

You can hide the problem other ways by just increasing the delay between 
the unlock and the lock (and adding one or more serializing instruction in 
between is generally the best way of doing that, simply because otherwise 
micro-architecture may just be re-ordering things on you, so that your 
delay isn't actually in between any more!).

But adding delays doesn't really fix anything, of course. It makes things 
fairer by making *both* sides suck more, but especially if both sides 
are actually the same exact thing, I could well imagine that they'd both 
just suck equally, and get into some pattern where they are now both 
slower, but still see exactly the same problem!

Of course, as long as interrupts are on, or things like DMA happen etc, 
it's really *really* hard to be totally unlucky, and after a while you're 
likely to break out of the steplock on your own, just because the CPU's 
get interrupted by something else.

It's in fact entirely possible that the long freezes have always been 
there, but the NOHZ option meant that we had much longer stretches of time 
without things like timer interrupts to jumble up the timing! So maybe the 
freezes existed before, but with timer interrupts happening hundreds of 
times a second, they weren't noticeable to humans.

(Btw, that's just _one_ theory. Don't take it _too_ seriously, but it 
could be one of the reasons why this showed up as a new problem, even 
though I don't think the wait_for_inactive() thing has changed lately.)

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Eric Dumazet
On Thu, 21 Jun 2007 10:31:53 -0700 (PDT)
Linus Torvalds [EMAIL PROTECTED] wrote:

 
 
 On Thu, 21 Jun 2007, Chuck Ebbert wrote:
  
  A while ago I showed that spinlocks were a lot more fair when doing
  unlock with the xchg instruction on x86. Probably the arbitration is all
  screwed up because we use a mov instruction, which while atomic is not
  locked.
 
 No, the cache line arbitration doesn't know anything about locked vs 
 unlocked instructions (it could, but there really is no point).
 
 The real issue is that locked instructions on x86 are serializing, which 
 makes them extremely slow (compared to just a write), and then by being 
 slow they effectively just make the window for another core bigger.
 

This reminds me of Nick's proposal of 'queued spinlocks' 3 months ago.

Maybe this should be re-considered? (unlock is still a non-atomic op, 
so we don't pay the serializing cost twice)

http://lwn.net/Articles/227506/

extract : 

Implement queued spinlocks for i386. This shouldn't increase the size of
the spinlock structure, while still being able to handle 2^16 CPUs.

The queued spinlock has 2 fields, a head and a tail, which are indexes
into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
atomic_inc_return on the head index, and keeps the returned value as
a ticket. The CPU then spins until the tail index is equal to that
ticket.

To unlock a spinlock, the tail index is incremented (this can be non
atomic, because only the lock owner will modify tail).

Implementation inefficiencies aside, this change should have little
effect on performance for uncontended locks, but will have quite a
large cost for highly contended locks [O(N) cacheline transfers vs
O(1) per lock acquisition, where N is the number of CPUs contending].
The benefit is that contended locks will not cause any starvation.

Just an idea. Big NUMA hardware seems to have fairness logic that
prevents starvation for the regular spinlock logic. But it might be
interesting for -rt kernel or systems with starvation issues. 


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Eric Dumazet wrote:
 
 This reminds me of Nick's proposal of 'queued spinlocks' 3 months ago.
 
 Maybe this should be re-considered? (unlock is still a non-atomic op, 
 so we don't pay the serializing cost twice)

No. The point is simple:

IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG!

I don't understand why this is even controversial. Especially since we 
have a patch for the problem that proves my point: the _proper_ way to fix 
things is to just not do the bad thing, instead of trying to allow the bad 
behaviour and try to handle it.

Things like queued spinlocks just make excuses for bad code. 

We don't do nesting locking either, for exactly the same reason. Are 
nesting locks easier? Absolutely. They are also almost always a sign of 
a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
to encourage bad programming!

 extract : 
 
 Implement queued spinlocks for i386. This shouldn't increase the size of
 the spinlock structure, while still being able to handle 2^16 CPUs.

Umm. i386 spinlocks could and should be *one*byte*.

In fact, I don't even know why they are wasting four bytes right now: the 
fact that somebody made them an int just wastes memory. All the actual 
code uses decb, so it's not even a question of safety. I wonder why we 
have that 32-bit thing and the ugly casts.

Ingo, any memory of that?

(And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't think 
such an insane machine has ever existed).

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Linus Torvalds wrote:
 
 We don't do nesting locking either, for exactly the same reason. Are 
 nesting locks easier? Absolutely. They are also almost always a sign of 
 a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
 to encourage bad programming!

Side note, and as a truth in advertising section: I'll have to admit 
that I argued against fair semaphores on the same grounds. I was wrong 
then (and eventually admitted it, and we obviously try to make our mutexes 
and semaphores fair these days!), and maybe I'm wrong now.

If somebody can actually come up with a sequence where we have spinlock 
starvation, and it's not about an example of bad locking, and nobody 
really can come up with any other way to fix it, we may eventually have to 
add the notion of fair spinlocks.

So my arguments are purely pragmatic. It's not that I hate fairness per 
se. I dislike it only when it's used to solve (aka hide) other problems.

In the end, some situations do need fairness, and the fact that aiming for 
fairness is often harder, slower, and more complicated than not doing so 
at that point turns into a non-argument. If you need it, you need it.

I just don't think we need it, and we're better off solving problems other 
ways.

(For example, we might also solve such problems by creating a separate
fair_spin_lock abstraction, and only making the particular users that 
need it actually use it. It would depend a bit on whether the cost of 
implementing the fairness is noticeable enough for it to be worth having 
a separate construct for it).

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 Umm. i386 spinlocks could and should be *one*byte*.
 
 In fact, I don't even know why they are wasting four bytes right now: 
 the fact that somebody made them an int just wastes memory. All the 
 actual code uses decb, so it's not even a question of safety. I 
 wonder why we have that 32-bit thing and the ugly casts.
 
 Ingo, any memory of that?

no real reason that i can recall - i guess nobody dared to touch it 
because it used to have that 'volatile', indicating black voodoo ;-) Now 
that the bad stigma has been removed, we could try the patch below. It 
boots fine here, and we save 1K of kernel text size:

    text    data     bss     dec     hex filename
  6236003  611992  401408 7249403  6e9dfb vmlinux.before
  6235075  611992  401408 7248475  6e9a5b vmlinux.after

I can understand why no data is saved by this change: gcc is aligning 
the next field to a natural boundary anyway and we don't really have 
arrays of spinlocks (fortunately). [and we save no data even if using 
the ((packed)) attribute.] Perhaps some data structure that is never in 
the kernel image itself still got smaller? Any good way to determine 
that?

But why is the text size different? Ah: i think it's spin_lock_init() 
getting shorter :-)

but this is certainly not something for 2.6.22, it's an early 2.6.23 
matter i suspect.

Ingo

---
From: Ingo Molnar [EMAIL PROTECTED]
Subject: [patch] spinlocks i386: change them to byte fields

all spinlock ops are on byte operands, so change the spinlock field to 
be unsigned char. This saves a bit of kernel text size:

    text    data     bss     dec     hex filename
6236003  611992  401408 7249403  6e9dfb vmlinux.before
6235075  611992  401408 7248475  6e9a5b vmlinux.after

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 include/asm-i386/spinlock_types.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/asm-i386/spinlock_types.h
===================================================================
--- linux.orig/include/asm-i386/spinlock_types.h
+++ linux/include/asm-i386/spinlock_types.h
@@ -6,7 +6,7 @@
 #endif
 
 typedef struct {
-   unsigned int slock;
+   unsigned char slock;
 } raw_spinlock_t;
 
 #define __RAW_SPIN_LOCK_UNLOCKED   { 1 }


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 If somebody can actually come up with a sequence where we have 
 spinlock starvation, and it's not about an example of bad locking, and 
 nobody really can come up with any other way to fix it, we may 
 eventually have to add the notion of fair spinlocks.

there was one bad case i can remember: the spinlock debugging code had a 
trylock open-coded loop and on certain Opterons CPUs were starving each 
other. This used to trigger with the ->tree_lock rwlock i think, on 
heavy MM loads. The starvation got so bad that the NMI watchdog started 
triggering ...

interestingly, this only triggered for certain rwlocks. Thus we, after a 
few failed attempts to pacify this open-coded loop, currently have that 
code disabled in lib/spinlock_debug.c:

 #if 0   /* This can cause lockups */
 static void __write_lock_debug(rwlock_t *lock)
 {
 u64 i;
 u64 loops = loops_per_jiffy * HZ;
 int print_once = 1;

 	for (;;) {
 		for (i = 0; i < loops; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
 		}

the weird thing is that we still have the _very same_ construct in 
__spin_lock_debug():

 	for (i = 0; i < loops; i++) {
 		if (__raw_spin_trylock(&lock->raw_lock))
 			return;
 		__delay(1);
 	}

if there are any problems with this then people are not complaining loud 
enough :-)

note that because this is a trylock based loop, the acquire+release 
sequence problem should not apply to this problem.

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
 
 I can understand why no data is saved by this change: gcc is aligning 
 the next field to a natural boundary anyway and we don't really have 
 arrays of spinlocks (fortunately).

Actually, some data structures could well shrink.

Look at struct task_struct, for example. Right now it has two spinlocks 
right next to each other (alloc_lock and pi_lock).

Other data structures may have things like bitfields etc.

But yeah, I'd not expect that to be very common, and in some cases you 
might have to re-order data structures to take advantage of better 
packing, and even then it's probably not all that noticeable.

 but this is certainly not something for 2.6.22, it's an early 2.6.23 
 matter i suspect.

Oh, absolutely.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 (And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't 
 think such an insane machine has ever existed).

and if people _really_ want to boot a large-smp 32-bit kernel on some 
new, tons-of-cpus box, as a workaround they can enable the spinlock 
debugging code, which has no limitation on the number of CPUs supported.

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
 
 * Linus Torvalds [EMAIL PROTECTED] wrote:
 
  If somebody can actually come up with a sequence where we have 
  spinlock starvation, and it's not about an example of bad locking, and 
  nobody really can come up with any other way to fix it, we may 
  eventually have to add the notion of fair spinlocks.
 
 there was one bad case i can remember: the spinlock debugging code had a 
 trylock open-coded loop and on certain Opterons CPUs were starving each 
 other.

But this is a perfect example of exactly what I'm talking about:

 THAT CODE IS HORRIBLY BUGGY!

It's not the spinlocks that are broken, it's that damn code.

 	for (;;) {
 		for (i = 0; i < loops; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
 		}

What a piece of crap. 

Anybody who ever waits for a lock by busy-looping over it is BUGGY, 
dammit!

The only correct way to wait for a lock is:

  (a) try it *once* with an atomic r-m-w 
  (b) loop over just _reading_ it (and something that implies a memory 
  barrier, _not_ __delay(). Use cpu_relax() or smp_rmb())
  (c) rinse and repeat.

and code like the above should just be shot on sight.

So don't blame the spinlocks or the hardware for crap code.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 It's in fact entirely possible that the long freezes have always been 
 there, but the NOHZ option meant that we had much longer stretches of 
 time without things like timer interrupts to jumble up the timing! So 
 maybe the freezes existed before, but with timer interrupts happening 
 hundreds of times a second, they weren't noticeable to humans.

the freezes that Miklos was seeing were hardirq contexts blocking in 
task_rq_lock() - that is done with interrupts disabled. (Miklos i think 
also tried !NOHZ kernels and older kernels, with a similar result.)

plus on the ptrace side, the wait_task_inactive() code had most of its 
overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
likely while we were still holding the runqueue lock!

i think the only thing that eventually got Miklos' laptop out of the 
wedge were timer irqs hitting the ptrace CPU in exactly those 
instructions where it was not holding the runqueue lock. (or perhaps an 
asynchronous SMM event delaying it for a long time)

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 No, the cache line arbitration doesn't know anything about locked vs 
 unlocked instructions (it could, but there really is no point).
 
 The real issue is that locked instructions on x86 are serializing, 
 which makes them extremely slow (compared to just a write), and then 
 by being slow they effectively just make the window for another core 
 bigger.
 
 IOW, it doesn't fix anything, it just hides the bug with timing.

yeah. I think Linux is the only OS on the planet that is using 
the movb trick for unlock, it even triggered a hardware erratum ;) So it 
might surprise some hw makers who might rely on the heuristics that each 
critical section lock and unlock is a LOCK-ed instruction.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 On Thu, 21 Jun 2007, Ingo Molnar wrote:
  
  I can understand why no data is saved by this change: gcc is 
  aligning the next field to a natural boundary anyway and we don't 
  really have arrays of spinlocks (fortunately).
 
 Actually, some data structures could well shrink.
 
 Look at struct task_struct, for example. Right now it has two 
 spinlocks right next to each other (alloc_lock and pi_lock).

yeah. We've got init_task's task-struct embedded in the vmlinux, but 
it's aligned to 32 bytes, which probably hides this effect. We'd only 
see it if the size change just happened to bring a key data structure 
(which is also embedded in the vmlinux) just below a modulo 32 bytes 
boundary. The chance for that is around 6:32 per 'merge' event. That 
means that there cannot be all that many such cases ;-)

anyway, the shorter init sequence is worth it already.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 	for (;;) {
 		for (i = 0; i < loops; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
 		}
 
 What a piece of crap. 
 
 Anybody who ever waits for a lock by busy-looping over it is BUGGY, 
 dammit!
 
 The only correct way to wait for a lock is:
 
   (a) try it *once* with an atomic r-m-w 
   (b) loop over just _reading_ it (and something that implies a memory 
   barrier, _not_ __delay(). Use cpu_relax() or smp_rmb())
   (c) rinse and repeat.

damn, i first wrote up an explanation about why that ugly __delay(1) is 
there (it almost hurts my eyes when i look at it!) but then deleted it 
as superfluous :-/

really, it's not because i'm stupid (although i might still be stupid 
for other reasons ;-), it wasn't there in earlier spin-debug versions. We 
even had an inner spin_is_locked() loop at a stage (and should add it 
again).

the reason for the __delay(1) was really mundane: to be able to figure 
out when to print a 'we locked up' message to the user. If it's 1 
second, it causes false positive on some systems. If it's 10 minutes, 
people press reset before we print out any useful data. It used to be 
just a loop of rep_nop()s, but that was hard to calibrate: on certain 
newer hardware it was triggering as fast as in 2 seconds, causing many 
false positives. We cannot use jiffies nor any other clocksource in this 
debug code.

so i settled for the butt-ugly but working __delay(1) thing, to be able 
to time the debug messages.

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Eric Dumazet

Linus Torvalds wrote:


On Thu, 21 Jun 2007, Linus Torvalds wrote:
We don't do nesting locking either, for exactly the same reason. Are 
nesting locks easier? Absolutely. They are also almost always a sign of 
a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
to encourage bad programming!


Side note, and as a truth in advertising section: I'll have to admit 
that I argued against fair semaphores on the same grounds. I was wrong 
then (and eventually admitted it, and we obviously try to make our mutexes 
and semaphores fair these days!), and maybe I'm wrong now.


If somebody can actually come up with a sequence where we have spinlock 
starvation, and it's not about an example of bad locking, and nobody 
really can come up with any other way to fix it, we may eventually have to 
add the notion of fair spinlocks.




I tried to find such a sequence, but I think it's more a matter of hardware 
evolution, and some degenerate cases.


In some years (months?), it might be possible to starve, say, the file struct 
spinlock of a process with an open()/close() infinite loop, because the 
number of instructions per 'memory cache line transfer between cpus/cores' 
is rising.


But then one can say its a bug in user code :)

Another way to starve the kernel might be a loop doing settime(), since 
seqlocks are quite special in serialization: only seqlock writers perform 
atomic ops, so readers could be starved because of some hardware 
'optimization'.



So my arguments are purely pragmatic. It's not that I hate fairness per 
se. I dislike it only when it's used to solve (aka hide) other problems.


In the end, some situations do need fairness, and the fact that aiming for 
fairness is often harder, slower, and more complicated than not doing so 
at that point turns into a non-argument. If you need it, you need it.


Maybe some *big* NUMA machines really want this fairness (even if it costs 
some cycles, as pointed out by Davide in http://lkml.org/lkml/2007/3/29/246); 
I am just guessing, since I cannot test such monsters. I tested Davide's 
program on a Dual Opteron and got some perf difference.



$ ./qspins  -n 2
now testing: TICKLOCK
timeres=4000
uscycles=1991
AVG[0]: 2195.25 cycles/loop
SIG[0]: 11.813657
AVG[1]: 2212.312500 cycles/loop
SIG[1]: 38.038991

$ ./qspins  -n 2 -s
now testing: SPINLOCK
timeres=4000
uscycles=1991
AVG[0]: 2066.00 cycles/loop
SIG[0]: 0.00
AVG[1]: 2115.687500 cycles/loop
SIG[1]: 63.083000




I just don't think we need it, and we're better off solving problems other 
ways.


(For example, we might also solve such problems by creating a separate
fair_spin_lock abstraction, and only making the particular users that 
need it actually use it. It would depend a bit on whether the cost of 
implementing the fairness is noticeable enough for it to be worth having 
a separate construct for it).


Linus






Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
 
 yeah. I think Linux is the only OS on the planet that is using 
 the movb trick for unlock, it even triggered a hardware erratum ;)

I'm pretty sure others do it too.

Maybe not on an OS level (but I actually doubt that - I'd be surprised if 
Windows doesn't do the exact same thing), but I know for a fact that a lot 
of people in threaded libraries end up depending very much on "a simple 
store closes a locked section".

Just googling for "xchg mov spinlock" -linux shows discussion boards 
for Windows developers with open-coded spinlocks like


int ResourceFlag = 0; // 0=Free, 1=Inuse
...
// Wait until we get the resource
while(InterlockedExchange(&ResourceFlag, 1) != 0) {
   Sleep(0); } // Wait a tad
// Have the resource
... // do your thing
ResourceFlag = 0; // Release the resource


and that's definitely Windows code, not some Linux person doing it.

And this is from an OS2 forum

unsigned owned=0;

void request() {
  while(LockedExchanged(&owned,1)!=0)
;
}

void release() {
  owned = 0;
}

so it's not even something unusual.

So while arguably these people don't know (and don't care) about subtle 
issues like memory ordering, I can *guarantee* that a lot of programs 
depend on them, even if that dependency may often come from a lack of 
knowledge, rather than actively understanding what we do like in the Linux 
kernel community.

(And yes, they rely on compilers not reordering either. Tough.)

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Linus Torvalds


On Thu, 21 Jun 2007, Ingo Molnar wrote:
 
 damn, i first wrote up an explanation about why that ugly __delay(1) is 
 there (it almost hurts my eyes when i look at it!) but then deleted it 
 as superfluous :-/

I'm fine with a delay, but the __delay(1) is simply not correct. It 
doesn't do anything.

udelay() waits for a certain time. Use that. 

 the reason for the __delay(1) was really mundane: to be able to figure 
 out when to print a 'we locked up' message to the user.

No it does not.

You may think it does, but it does nothing of the sort.

Use udelay() or something that actually takes a *time*.

Just __delay() is nothing but a loop, and calling it with an argument of 1 
is stupid and buggy. 

The only *possibly* valid use of __delay() implies using a counter that 
is based on the loops_per_sec thing, which depends on what the delay  
function actually is.

For example, the delay function may well turn out to be this:

	__asm__ __volatile__(
		"\tjmp 1f\n"
		".align 16\n"
		"1:\tjmp 2f\n"
		".align 16\n"
		"2:\tdecl %0\n\tjns 2b"
		:"=&a" (d0)
		:"0" (loops));

Notice? Your code, it does nothing!

When I said that the code was buggy, I meant it.

It has nothing to do with spinlocks. And __delay(1) is *always* a bug.

You migth want to replace it with

smp_rmb();
udelay(1);

instead, at which point it *does* something: it has that read barrier 
(which is not actually needed on x86, but whatever), and it has a delay 
that is *meaningful*.

A plain __delay(1) is neither.

So let me repeat my statement: What a piece of crap.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-21 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 On Thu, 21 Jun 2007, Ingo Molnar wrote:
  
  damn, i first wrote up an explanation about why that ugly __delay(1) is 
  there (it almost hurts my eyes when i look at it!) but then deleted it 
  as superfluous :-/
 
 I'm fine with a delay, but the __delay(1) is simply not correct. It 
 doesn't do anything.

it's a bit trickier than that. Yes, it's a simple 1-entry loop and thus 
makes little sense to call. But it's a loop that got boot-time 
calibrated, so we can do this in the spinlock-debug code:

u64 loops = loops_per_jiffy * HZ;

this guarantees that we will loop for _at least_ 1 second before 
printing a message. (in practice it's much longer, especially with the 
current naive trylock approach)

Why? Because as part of the activities that the spin-loop does, we also 
do everything that an mdelay(1000) would do. We do it 'piecemeal-wise', 
and we do it very inefficiently, but the lower time-bound should be 
guaranteed. This is done because most of the problems were caused by too 
short looping and bogus debug printouts. So this is basically an 
open-coded udelay implementation.
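
The bound described here is easy to sanity-check. A sketch with an assumed calibration value (loops_per_jiffy is boot-measured in reality, and HZ varies by config):

```c
#include <assert.h>

#define HZ 1000ULL                                      /* assumed tick rate */
static unsigned long long loops_per_jiffy = 1234567ULL; /* assumed calibration */

/*
 * loops_per_jiffy iterations take about one jiffy, and there are HZ
 * jiffies per second, so spinning for loops_per_jiffy * HZ iterations
 * lasts at least one second -- provided each pass really does one
 * calibrated __delay(1) iteration's worth of work.
 */
static unsigned long long debug_loop_bound(void)
{
	return loops_per_jiffy * HZ;
}

/* How many jiffies a given iteration count corresponds to. */
static unsigned long long loops_to_jiffies(unsigned long long loops)
{
	return loops / loops_per_jiffy;
}
```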

Furthermore, with the spin_is_locked() fix i just sent, the __loop(1) 
solution should actually be quite close to a real udelay() thing, 
without the delay effect. It would probably be more accurate to increase it 
to loops_per_jiffy*10*HZ/16 and to call a __loop(16) thing, to get 
closer timings. (but then some people would argue 'why dont you take the 
lock as soon as it's released'.)

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-20 Thread Linus Torvalds


On Wed, 20 Jun 2007, Jarek Poplawski wrote:
> 
> I don't agree with this (+ I know it doesn't matter).
> 
> The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
> And here they are simply lawlessly not fair.

Well, that's certainly a valid standpoint. I wouldn't claim you're 
_wrong_.

At least in theory.

In *practice*, fair spinlocks are simply not possible. Not as in "it's 
hard to do", but as in "you simply cannot do it".

The thing is, most spinlocks get held for a rather short time (if that 
wasn't the case, you'd use something like a mutex), and it's actually a 
short time not on a "human scale", but on a "CPU core scale".

IOW, the cost of doing the cacheline bouncing is often equal to, or bigger 
than, the actual cost of the operation that happens inside the spinlock!

What does this result in? It automatically means that software simply 
*cannot* do fair spinlocks, because all the timing costs are in parts that 
software doesn't even have any visibility into, or control over!

Yeah, you could add artificial delays, by doing things like cycle counting 
around the spinlock operation to see if you got the lock when it was 
_contended_, or whether you got a lock that wasn't, and then adding some 
statistics etc, but at that point, you don't have a spinlock any more, you 
have something else. I don't know what to call it.

And you could add flags like "this spinlock is getting a lot of 
contention", try some other algorithm etc. But it's all complicated and 
against the whole *point* of a spinlock, which is to get in and out as 
fast as humanly possible.

So in practice, spinlock fairness is inherently tied to the hardware 
behaviour of the cache coherency algorithm.

Which gets us to the next level: we can consider hardware that isn't 
"fair" in its cache coherency to be buggy hardware.

That's also a perfectly fine standpoint, and it's actually one that I have 
a lot of sympathy for. I think it's much easier to some degree to do 
fairness in the cache coherency than at any other level, because it's 
something where the hardware really *could* do things like counting 
bouncing etc.

However, while it's a perfectly fine standpoint, it's also totally 
unrealistic. First off, the hardware doesn't even know whether the 
spinlock "failed" or not on any architecture platform I'm aware of. On 
x86, the spinlock operation under Linux is actually just an atomic 
decrement, and it so happens that the rules for failure was that it didn't 
go negative. But that's just a Linux internal rule - the hardware doesn't 
know.

So you'd have to actually have some specific sequence with specific 
fairness rules (maybe you could make the rule be that a failed atomic 
"cmpxchg" counts as a real failure, and if you give higher priority to 
cores with lots of failures etc).
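
That distinction can be made concrete with C11 atomics. The sketch below is an illustration of the argument, not kernel code: unlike the decrement scheme, a failed compare-exchange is an unambiguous "I lost" event, the kind of thing the hypothetical fairness hardware could count:

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_int lock_word;            /* 0 = free, 1 = held */
static unsigned long failed_attempts;   /* what such hardware could track */

static int try_lock(void)
{
	int expected = 0;
	if (atomic_compare_exchange_strong(&lock_word, &expected, 1))
		return 1;               /* we atomically went 0 -> 1 */
	failed_attempts++;              /* explicit, countable failure */
	return 0;
}

static void unlock(void)
{
	atomic_store(&lock_word, 0);
}
```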

IOW, it's certainly possible *in*theory* to try to make hardware that has 
fairness guarantees. However, any hardware designer will tell you that 
 (a) they have enough problems as it is
 (b) it's simply not worth their time, since there are other (simpler) 
 things that are much much more important.

So the end result:
 - hardware won't do it, and they'd be crazy to really even try (apart 
   from maybe some really simplistic stuff that doesn't _guarantee_ 
   anything, but maybe helps fairness a bit)
 - software cannot do it, without turning spinlocks into something really 
   slow and complicated, at which point they've lost all meaning, and you 
   should just use a higher-level construct like a mutex instead.

In other words, spinlocks are optimized for *lack* of contention. If a 
spinlock has contention, you don't try to make the spinlock "fair". No, 
you try to fix the contention instead!

That way, you fix many things. You probably speed things up, _and_ you 
make it fairer. It might sometimes take some brains and effort, but it's 
worth it.

The patch I sent out was an example of that. You *can* fix contention 
problems. Does it take clever approaches? Yes. It's why we have hashed 
spinlocks, RCU, and code sequences that are entirely lockless and use 
optimistic approaches. And suddenly you get fairness *and* performance!
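
One of the lockless patterns alluded to is the optimistic, retry-based read. Here is a from-scratch sketch of the idea (real kernel seqlocks add proper memory barriers around the data accesses, which this single-threaded illustration glosses over):

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_uint seq;            /* even = stable, odd = writer active */
static int shared_a, shared_b;     /* pair the writer keeps consistent */

static void write_pair(int a, int b)
{
	atomic_fetch_add(&seq, 1);     /* make it odd: readers will retry */
	shared_a = a;
	shared_b = b;
	atomic_fetch_add(&seq, 1);     /* even again: data stable */
}

/* Readers take no lock at all: they retry if a writer was, or became,
 * active between the two sequence reads. */
static void read_pair(int *a, int *b)
{
	unsigned s;
	do {
		s = atomic_load(&seq);
		*a = shared_a;
		*b = shared_b;
	} while ((s & 1) || s != atomic_load(&seq));
}
```

Readers never bounce the cacheline in exclusive mode, so there is nothing to be unfair about on the read side.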

It's a win-win situation. It does require a bit of effort, but hey, we're 
good at effort.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-20 Thread Jarek Poplawski
On 18-06-2007 18:34, Linus Torvalds wrote:
> 
> On Mon, 18 Jun 2007, Ingo Molnar wrote:
>> To test this theory, could you try the patch below, does this fix your 
>> hangs too?
> 
> I really think this is the wrong approach, although *testing* it makes 
> sense.
> 
> I think we need to handle loops that take, release, and then immediately 
> re-take differently.
> 
> Such loops are _usually_ of the form where they really just release the 
> lock for some latency reason, but in this case I think it's actually just 
> a bug.
> 
> That code does:
> 
>   if (unlikely(p->array || task_running(rq, p))) {
> 
> to decide if it needs to just unlock and repeat, but then to decide if it 
> need to *yield* it only uses *one* of those tests (namely 
> 
>   preempted = !task_running(rq, p);
>   ..
>   if (preempted)
>   yield();
> 
> and I think that's just broken. It basically says:
> 
>  - if the task is running, I will busy-loop on getting/releasing the 
>task_rq_lock
> 
> and that is the _real_ bug here.

I don't agree with this (+ I know it doesn't matter).

The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
And here they are simply lawlessly not fair.

I cannot see any reason why any of tasks doing simultaneously
"busy-loop on getting/releasing" a spinlock should starve almost
to death another one doing the same (or simply waiting to get this
lock) even without cpu_relax. Of course, lawfulness of such
behavior is questionable, and should be fixed like here.
 
> Trying to make the spinlocks do something else than what they do is just 
> papering over the real bug. The real bug is that anybody who just 
> busy-loops getting a lock is wasting resources so much that we should not 
> be at all surprised that some multi-core or NUMA situations will get 
> starvation.

On the other hand it seems spinlocks should be at least a little
more immune to such bugs: slowdown is OK but not freezing. Current
behavior could suggest this unfairness could harm some tasks even
without any loops present - but it's not visible enough.

So, I'm surprised this thread seems to stop after this patch, and
there is no attempt to make the most of this ideal test case to
improve spinlock design btw.

Regards,
Jarek P.


Re: [BUG] long freezes on thinkpad t60

2007-06-18 Thread Ravikiran G Thirumalai
On Mon, Jun 18, 2007 at 01:20:55AM -0700, Andrew Morton wrote:
> On Mon, 18 Jun 2007 10:12:04 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > >
> > Subject: [patch] x86: fix spin-loop starvation bug
> > From: Ingo Molnar <[EMAIL PROTECTED]>
> > 
> > Miklos Szeredi reported very long pauses (several seconds, sometimes 
> > more) on his T60 (with a Core2Duo) which he managed to track down to 
> > wait_task_inactive()'s open-coded busy-loop. He observed that an 
> > interrupt on one core tries to acquire the runqueue-lock but does not 
> > succeed in doing so for a very long time - while wait_task_inactive() on 
> > the other core loops waiting for the first core to deschedule a task 
> > (which it wont do while spinning in an interrupt handler).
> > 
> > The problem is: both the spin_lock() code and the wait_task_inactive() 
> > loop uses cpu_relax()/rep_nop(), so in theory the CPU should have 
> > guaranteed MESI-fairness to the two cores - but that didn't happen: one 
> > of the cores was able to monopolize the cacheline that holds the 
> > runqueue lock, for extended periods of time.
> > 
> > This patch changes the spin-loop to assert an atomic op after every REP 
> > NOP instance - this will cause the CPU to express its "MESI interest" in 
> > that cacheline after every REP NOP.
> 
> Kiran, if you're still able to reproduce that zone->lru_lock starvation 
> problem,
> this would be a good one to try...

We tried this approach a week back (speak of co-incidences), and it did not
help the problem.  I'd changed calls to the zone->lru_lock spin_lock
to do spin_trylock in a while loop with cpu_relax instead.  It did not help;
this was on top of 2.6.17 kernels.  But the good news is that 2.6.21
does not have the starvation issue -- that is, zone->lru_lock does not
seem to get contended that much under the same workload.

However, this was not on the same hardware I reported zone->lru_lock
contention on (8 socket dual core opteron).  I don't have access to it 
anymore :(

Thanks,
Kiran


Re: [BUG] long freezes on thinkpad t60

2007-06-18 Thread Linus Torvalds


On Mon, 18 Jun 2007, Ingo Molnar wrote:
> 
> ok. Do we have a guarantee that cpu_relax() is also an smp_rmb()?

The common use for cpu_relax() is basically for code that does

while (*ptr != val)
cpu_relax();

so yes, an architecture that doesn't notice writes by other CPU's on its 
own had *better* have an implied read barrier in its "cpu_relax()" 
implementation. For example, the irq handling code does

while (desc->status & IRQ_INPROGRESS)
cpu_relax();

which is explicitly about waiting for another CPU to get out of their 
interrupt handler. And one classic use for it in drivers is obviously the

while (time_before (jiffies, next))
cpu_relax();

kind of setup (and "jiffies" may well be updated on another CPU: the fact 
that it is "volatile" is just a *compiler* barrier just like cpu_relax() 
itself will also be, not a "smp_rmb()" kind of hardware barrier).
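
The pattern being described, reduced to a user-space sketch (this cpu_relax() is only a compiler barrier standing in for the kernel's rep-nop version; the name is reused purely for illustration):

```c
#include <assert.h>

/* Compiler barrier: forces *flag to be re-read on every pass, which is
 * the volatile/compiler-barrier effect discussed above.  The real x86
 * cpu_relax() additionally executes "rep; nop" (PAUSE). */
#define cpu_relax() __asm__ __volatile__("" ::: "memory")

/* Busy-wait until *flag becomes nonzero; returns the number of passes
 * spent spinning, just so the caller can observe the behaviour. */
static unsigned long spin_until_set(int *flag)
{
	unsigned long passes = 0;

	while (!*flag) {
		cpu_relax();
		passes++;
	}
	return passes;
}
```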

So we could certainly add the smp_rmb() to make it more explicit, and it 
wouldn't be *wrong*.

But quite frankly, I'd personally rather not - if it were to make a 
difference in some situation, it would just be papering over a bug in 
cpu_relax() itself.

The whole point of cpu_relax() is about busy-looping, after all. And the 
only thing you really *can* busy-loop on in a CPU is basically a memory 
value.

So the smp_rmb() would I think distract from the issue, and at best paper 
over some totally separate bug.

> hm, this might still go into a non-nice busy loop on SMP: one cpu runs 
> the strace, another one runs two tasks, one of which is runnable but not 
> on the runqueue (the one we are waiting for). In that case we'd call 
> yield() on this CPU in a loop

Sure. I agree - we can get into a loop that calls yield(). But I think a 
loop that calls yield() had better be ok - we're explicitly giving the 
scheduler the ability to schedule anything else that is relevant. 

So I think yield()'ing is fundamentally different from busy-looping any 
other way. 

Would it be better to be able to have a wait-queue, and actually *sleep* 
on it, and not even busy-loop using yield? Yeah, possibly. I cannot 
personally bring myself to care about that kind of corner-case situation, 
though.

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-18 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > Boots and runs fine.  Fixes the freezes as well, which is not such a 
> > big surprise, since basically any change in that function seems to 
> > do that ;)
> 
> Yeah, and that code really was *designed* to make all "locking in a 
> loop" go away. So unlike the patches adding random barriers and 
> cpu_relax() calls (which might fix it for some *particular* hw setup), 
> I pretty much guarantee that you can never get that code into a 
> situation where there is some lock being starved on *any* hw setup.
> 
> Ingo, an ack for the patch, and I'll just apply it?

yeah. It's definitely better than what we had before. Maybe one day 
someone frees us from this function =B-) We already had like 2-3 nasty 
SMP bugs related to it in the last 10 years.

Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

and kudos to Miklos for the patience to debug this bug!

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-18 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> That code does:
> 
>   if (unlikely(p->array || task_running(rq, p))) {
> 
> to decide if it needs to just unlock and repeat, but then to decide if 
> it need to *yield* it only uses *one* of those tests (namely
> 
>   preempted = !task_running(rq, p);
>   ..
>   if (preempted)
>   yield();
> 
> and I think that's just broken. It basically says:
> 
>  - if the task is running, I will busy-loop on getting/releasing the 
>task_rq_lock
> 
> and that is the _real_ bug here.
> 
> Trying to make the spinlocks do something else than what they do is 
> just papering over the real bug. The real bug is that anybody who just 
> busy-loops getting a lock is wasting resources so much that we should 
> not be at all surprised that some multi-core or NUMA situations will 
> get starvation.
> 
> Blaming some random Core 2 hardware implementation issue that just 
> makes it show up is wrong. It's a software bug, plain and simple.

yeah, agreed. wait_task_inactive() is butt-ugly, and Roland i think 
found a way to get rid of it in utrace (but it's not implemented yet, 
boggle) - but nevertheless this needs fixing for .22.

> So how about this diff? The diff looks big, but the *code* is actually 
> simpler and shorter, I just added tons of comments, which is what 
> blows it up.

> 
> The new *code* looks like this:
> 
>   repeat:
>   /* Unlocked, optimistic looping! */
>   rq = task_rq(p);
>   while (task_running(rq, p))
>   cpu_relax();

ok. Do we have a guarantee that cpu_relax() is also an smp_rmb()?

> 
>   /* Get the *real* values */
>   rq = task_rq_lock(p, &flags);
>   running = task_running(rq, p);
>   array = p->array;
>   task_rq_unlock(rq, &flags);
> 
>   /* Check them.. */
>   if (unlikely(running)) {
>   cpu_relax();
>   goto repeat;
>   }
> 
>   if (unlikely(array)) {
>   yield();
>   goto repeat;
>   }

hm, this might still go into a non-nice busy loop on SMP: one cpu runs 
the strace, another one runs two tasks, one of which is runnable but not 
on the runqueue (the one we are waiting for). In that case we'd call 
yield() on this CPU in a loop (and likely won't pull that task over from 
that CPU). And yield() itself is a high-frequency rq-lock touching thing 
too, just a bit heavier than the other path in the wait function.

> Hmm? Untested, I know. Maybe I overlooked something. But even the 
> generated assembly code looks fine (much better than it looked 
> before!)

it looks certainly better and cleaner than what we had before!

Ingo


Re: [BUG] long freezes on thinkpad t60

2007-06-18 Thread Linus Torvalds


On Mon, 18 Jun 2007, Miklos Szeredi wrote:
>
> > Hmm? Untested, I know. Maybe I overlooked something. But even the 
> > generated assembly code looks fine (much better than it looked before!)
> 
> Boots and runs fine.  Fixes the freezes as well, which is not such a
> big surprise, since basically any change in that function seems to do
> that ;)

Yeah, and that code really was *designed* to make all "locking in a loop" 
go away. So unlike the patches adding random barriers and cpu_relax() 
calls (which might fix it for some *particular* hw setup), I pretty much 
guarantee that you can never get that code into a situation where there is 
some lock being starved on *any* hw setup.

Ingo, an ack for the patch, and I'll just apply it?

Linus


Re: [BUG] long freezes on thinkpad t60

2007-06-18 Thread Miklos Szeredi
> Hmm? Untested, I know. Maybe I overlooked something. But even the 
> generated assembly code looks fine (much better than it looked before!)

Boots and runs fine.  Fixes the freezes as well, which is not such a
big surprise, since basically any change in that function seems to do
that ;)

Miklos

