Re: [BUG] long freezes on thinkpad t60
Linus Torvalds wrote:
> Nick, call me a worry-wart, but I slept on this, and started worrying..
>
> On Tue, 26 Jun 2007, Linus Torvalds wrote:
> >
> > So try it with just a byte counter, and test some stupid micro-benchmark
> > on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make
> > it the normal spinlock sequence just because it isn't noticeably slower.
>
> So I thought about this a bit more, and I like your sequence counter
> approach, but it still worried me.
>
> In the current spinlock code, we have a very simple setup for a
> successful grab of the spinlock:
>
>	CPU#0					CPU#1
> ...

Yeah, thanks.

> Now, I have good reason to believe that all Intel and AMD CPU's have a
> stricter-than-documented memory ordering, and that your spinlock may
> actually work perfectly well. But it still worries me. As far as I can
> tell, there's a theoretical problem with your spinlock implementation.
>
> So I'd like you to ask around some CPU people, and get people from both
> Intel and AMD to sign off on your spinlocks as safe. I suspect you
> already have the required contacts, but if you don't, I can send things
> off to the appropriate people at least inside Intel.

Haven't made too much progress on this, but I have asked [EMAIL PROTECTED], who might at least know the right person to ask :P (might be faster to ask Andi to ask :)) If you know someone at Intel then that would be appreciated. It would be nice if it is safe (and can be guaranteed to stay safe in future CPUs).

However, OTOH, the fastpath may be even faster if we do it in a "definitely safe" way. That is, do the xaddw against 16 bits, with the head in 8 of those and the tail in the other 8. Then compare the two byte halves of the register returned by the xaddw for the test. Although the xaddw is going to be slower than an xaddb, this way we subsequently avoid the extra load completely, while avoiding the ordering issues.

In the slowpath we would still have to have a token locked op in there (like the current spinlocks do), but this could be taken out iff our inquiries come back positive.

Anyway, I'll try redoing the patch and getting some numbers.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
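The 16-bit head/tail fastpath described above can be sketched roughly as follows. This is a hypothetical illustration using GCC `__atomic` builtins, not the actual patch; it assumes x86-style little-endian layout, so the low byte of the 16-bit lock word is the head and the high byte is the tail.

```c
#include <stdint.h>

/* One 16-bit word: head (current owner) in the low byte,
 * tail (next free ticket) in the high byte.  Little-endian
 * layout assumed, as on x86. */
typedef struct { uint16_t slock; } ticket_lock_t;

static void ticket_lock(ticket_lock_t *lock)
{
    /* The "xaddw": atomically bump the tail byte and get back a
     * snapshot of the whole head/tail pair, so the uncontended
     * case needs no extra load at all. */
    uint16_t prev = __atomic_fetch_add(&lock->slock, 1u << 8,
                                       __ATOMIC_ACQUIRE);
    uint8_t ticket = (uint8_t)(prev >> 8);   /* the tail value we claimed */

    /* Slowpath: spin until the head byte reaches our ticket. */
    while ((uint8_t)(prev & 0xff) != ticket)
        prev = __atomic_load_n(&lock->slock, __ATOMIC_ACQUIRE);
}

static void ticket_unlock(ticket_lock_t *lock)
{
    /* Only the lock holder ever advances the head byte, so no
     * lock prefix is needed for the release. */
    uint8_t *head = (uint8_t *)&lock->slock;   /* low byte on LE */
    __atomic_store_n(head, (uint8_t)(*head + 1), __ATOMIC_RELEASE);
}
```

The point of the single xaddw is visible in the fastpath: the returned snapshot already contains the head byte, so an uncontended acquire needs no second load of the lock word.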
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Linus Torvalds wrote:
> On Wed, 27 Jun 2007, Davide Libenzi wrote:
> > On Wed, 27 Jun 2007, Linus Torvalds wrote:
> > >
> > > Stores never "leak up". They only ever leak down (ie past subsequent
> > > loads or stores), so you don't need to worry about them. That's actually
> > > already documented (although not in those terms), and if it wasn't true,
> > > then we couldn't do the spin unlock with just a regular store anyway.
> >
> > Yes, Intel has never done that. They'll probably never do it since it'll
> > break a lot of system software (unless they use a new mode-bit that
> > allows system software to enable loose ordering). Although I clearly
> > remember having read in one of their P4 optimization manuals not to
> > assume this in the future.
>
> That optimization manual was confused.
>
> The Intel memory ordering documentation *clearly* states that only reads
> pass writes, not the other way around.

Yes, they were stating that clearly. IIWNOC (If I Were Not On Crack) I remember them saying not to assume any ordering besides the data dependency and the CPU self-consistency in future CPUs, and to use *fence instructions when certain semantics were required. But Google did not help me in finding that doc, so maybe I really was on crack :)

- Davide
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Davide Libenzi wrote:
> On Wed, 27 Jun 2007, Linus Torvalds wrote:
> >
> > Stores never "leak up". They only ever leak down (ie past subsequent loads
> > or stores), so you don't need to worry about them. That's actually already
> > documented (although not in those terms), and if it wasn't true, then we
> > couldn't do the spin unlock with just a regular store anyway.
>
> Yes, Intel has never done that. They'll probably never do it since it'll
> break a lot of system software (unless they use a new mode-bit that
> allows system software to enable loose ordering). Although I clearly
> remember having read in one of their P4 optimization manuals not to
> assume this in the future.

That optimization manual was confused.

The Intel memory ordering documentation *clearly* states that only reads pass writes, not the other way around. Some very confused people have thought that "pass" is a two-way thing. It's not. "Passing" in the Intel memory ordering means "go _ahead_ of", exactly the same way it means in traffic. You don't "pass" people by falling behind them.

It's also obvious from reading the manual, because any other reading would be very strange. It says:

 1. Reads can be carried out speculatively and in any order
 2. Reads can pass buffered writes, but the processor is self-consistent
 3. Writes to memory are always carried out in program order [.. and then
    lists exceptions that are not interesting - it's clflush and the
    non-temporal stores, not any normal writes ]
 4. Writes can be buffered
 5. Writes are not performed speculatively; they are only performed for
    instructions that have actually been retired.
 6. Data from buffered writes can be forwarded to waiting reads within the
    processor.
 7. Reads or writes cannot pass (be carried out ahead of) I/O instructions,
    locked instructions or serializing instructions.
 8. Reads cannot pass LFENCE and MFENCE instructions.
 9. Writes cannot pass SFENCE or MFENCE instructions.
The things to note are:

 a) in (1), Intel says that reads can occur in any order, but (2) makes it
    clear that that is only relevant wrt other _reads_

 b) in (2), they say "pass", but then they actually explain that "pass"
    means "be carried out ahead of" in (7).

HOWEVER, it should be obvious in (2) even _without_ the explicit clarification in (7) that "pass" is a one-way thing, because otherwise (2) would be totally _meaningless_. It would be meaningless for two reasons:

 - (1) already said that reads can be done in any order, so if that was
   "any order wrt writes", then (2) would be pointless. So (2) must mean
   something *else* than "any order", and the only sane reading of it that
   isn't "any order" is that "pass" is a one-way thing: you pass somebody
   when you go ahead of them, you do *not* pass somebody when you fall
   behind them!

 - if (2) really meant that reads and writes can just be re-ordered, then
   the choice of words makes no sense. It would be much more sensible to
   say that "reads can be carried out in any order wrt writes", instead of
   talking explicitly about "passing buffered writes".

Anyway, I'm pretty damn sure my reading is correct. And no, it's not a "it happens to work". It's _architecturally_required_ to work, and nobody has ever complained about the use of a simple store to unlock a spinlock (which would only work if "reads can pass" only means "*later* reads can pass *earlier* writes").

And it turns out that I think #1 is going away. Yes, the uarch will internally re-order reads wrt each other, but if it isn't architecturally visible, then from an architectural standpoint #1 simply doesn't happen. I can't guarantee that will happen, of course, but from talking to both AMD and Intel people, I think that they'll just document the stricter rules as the de-facto rules.
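Rule 3 is exactly what makes the plain-store spinlock unlock legal: every store made inside the critical section retires before the store that releases the lock. A minimal sketch in C11 atomics (hypothetical names, not kernel code; `memory_order_release` is the portable spelling of this one-way guarantee):

```c
#include <stdatomic.h>

/* lock word: 0 = free, 1 = held.  protected_data is only
 * written while the lock is held. */
static atomic_int lock_word = 1;
static int protected_data;

static void unlock_with_plain_store(void)
{
    /* A store done inside the critical section ... */
    protected_data = 42;

    /* ... followed by the unlock.  On x86 this compiles to a plain
     * "movl $0, lock_word": no lock prefix, no fence.  Rule 3
     * (writes retire in program order) already guarantees that any
     * CPU observing lock_word == 0 also observes
     * protected_data == 42. */
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}
```

Note that the release store constrains only stores that precede it; later reads are still free to pass it, which is rule 2 in action.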
		Linus
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Linus Torvalds wrote:
> > IOW shouldn't an mfence always be there? Not only loads could leak up
> > into the wait phase, but stores too, if they have no dependency with the
> > "head" and "tail" loads.
>
> Stores never "leak up". They only ever leak down (ie past subsequent loads
> or stores), so you don't need to worry about them. That's actually already
> documented (although not in those terms), and if it wasn't true, then we
> couldn't do the spin unlock with just a regular store anyway.

Yes, Intel has never done that. They'll probably never do it since it'll break a lot of system software (unless they use a new mode-bit that allows system software to enable loose ordering). Although I clearly remember having read in one of their P4 optimization manuals not to assume this in the future.

- Davide
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Davide Libenzi wrote:
> >
> > Now, I have good reason to believe that all Intel and AMD CPU's have a
> > stricter-than-documented memory ordering, and that your spinlock may
> > actually work perfectly well. But it still worries me. As far as I can
> > tell, there's a theoretical problem with your spinlock implementation.
>
> Nice catch ;) But wasn't Intel suggesting not to rely on the old
> "strict" ordering rules?

Actually, both Intel and AMD engineers have been talking about making the documentation _stricter_, rather than looser. They apparently already are pretty damn strict, because not being stricter than the docs imply would just end up exposing too many potential problems in software that didn't really follow the rules.

For example, it's quite possible to do loads out of order, but guarantee that the result is 100% equivalent to a totally in-order machine. One way to do that is to keep track of the cacheline for any speculative loads, and if it gets invalidated before the speculative instruction has completed, you just throw the speculation away. End result: you can do any amount of speculation you damn well please at a micro-architectural level, but if the speculation would ever have been architecturally _visible_, it never happens!

(Yeah, that is just me in my non-professional capacity of hw engineer wanna-be: I'm not saying that that is necessarily what Intel or AMD actually ever do, and they may have other approaches entirely.)

> IOW shouldn't an mfence always be there? Not only loads could leak up
> into the wait phase, but stores too, if they have no dependency with the
> "head" and "tail" loads.

Stores never "leak up". They only ever leak down (ie past subsequent loads or stores), so you don't need to worry about them. That's actually already documented (although not in those terms), and if it wasn't true, then we couldn't do the spin unlock with just a regular store anyway.
(There's basically never any reason to "speculate" stores before other memory ops. It's hard, and pointless. Stores you want to just buffer and move as _late_ as possible; loads you want to speculate and move as _early_ as possible. Anything else doesn't make sense.)

So I'm fairly sure that the only thing you really need to worry about in this thing is the load-load ordering (the load for the spinlock compare vs any loads "inside" the spinlock), and I'm reasonably certain that no existing x86 (and likely no future x86) will make load-load reordering effects architecturally visible, even if the implementation may do so *internally* when it's not possible to see it in the end result.

		Linus
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Linus Torvalds wrote:
> On Tue, 26 Jun 2007, Linus Torvalds wrote:
> >
> > So try it with just a byte counter, and test some stupid micro-benchmark
> > on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make
> > it the normal spinlock sequence just because it isn't noticeably slower.
>
> So I thought about this a bit more, and I like your sequence counter
> approach, but it still worried me.
>
> In the current spinlock code, we have a very simple setup for a
> successful grab of the spinlock:
>
>	CPU#0					CPU#1
>
>	A (= code before the spinlock)
>						lock release
>	lock decb mem	(serializing instruction)
>
>	B (= code after the spinlock)
>
> and there is no question that memory operations in B cannot leak into A.
>
> With the sequence counters, the situation is more complex:
>
>	CPU #0					CPU #1
>
>	A (= code before the spinlock)
>
>	lock xadd mem	(serializing instruction)
>
>	B (= code after xadd, but not inside lock)
>						lock release
>	cmp head, tail
>
>	C (= code inside the lock)
>
> Now, B is basically the empty set, but that's not the issue I worry about.
> The thing is, I can guarantee by the Intel memory ordering rules that
> neither B nor C will ever have memops that leak past the "xadd", but I'm
> not at all as sure that we cannot have memops in C that leak into B!
>
> And B really isn't protected by the lock - it may run while another CPU
> still holds the lock, and we know the other CPU released it only as part
> of the compare. But that compare isn't a serializing instruction!
>
> IOW, I could imagine a load inside C being speculated, and being moved
> *ahead* of the load that compares the spinlock head with the tail! IOW,
> the load that is _inside_ the spinlock has effectively moved to outside
> the protected region, and the spinlock isn't really a reliable mutual
> exclusion barrier any more!
>
> (Yes, there is a data-dependency on the compare, but it is only used for a
> conditional branch, and conditional branches are control dependencies and
> can be speculated, so CPU speculation can easily break that apparent
> dependency chain and do later loads *before* the spinlock load completes!)
>
> Now, I have good reason to believe that all Intel and AMD CPU's have a
> stricter-than-documented memory ordering, and that your spinlock may
> actually work perfectly well. But it still worries me. As far as I can
> tell, there's a theoretical problem with your spinlock implementation.

Nice catch ;) But wasn't Intel suggesting not to rely on the old "strict" ordering rules? IOW, shouldn't an mfence always be there? Not only loads could leak up into the wait phase, but stores too, if they have no dependency with the "head" and "tail" loads.

- Davide
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> With the sequence counters, the situation is more complex:
>
>	CPU #0					CPU #1
>
>	A (= code before the spinlock)
>
>	lock xadd mem	(serializing instruction)
>
>	B (= code after xadd, but not inside lock)
>						lock release
>	cmp head, tail
>
>	C (= code inside the lock)
>
> Now, B is basically the empty set, but that's not the issue I worry
> about. The thing is, I can guarantee by the Intel memory ordering
> rules that neither B nor C will ever have memops that leak past the
> "xadd", but I'm not at all as sure that we cannot have memops in C
> that leak into B!
>
> And B really isn't protected by the lock - it may run while another
> CPU still holds the lock, and we know the other CPU released it only
> as part of the compare. But that compare isn't a serializing
> instruction!
>
> IOW, I could imagine a load inside C being speculated, and being moved
> *ahead* of the load that compares the spinlock head with the tail!
> IOW, the load that is _inside_ the spinlock has effectively moved to
> outside the protected region, and the spinlock isn't really a reliable
> mutual exclusion barrier any more!
>
> (Yes, there is a data-dependency on the compare, but it is only used
> for a conditional branch, and conditional branches are control
> dependencies and can be speculated, so CPU speculation can easily
> break that apparent dependency chain and do later loads *before* the
> spinlock load completes!)
>
> Now, I have good reason to believe that all Intel and AMD CPU's have a
> stricter-than-documented memory ordering, and that your spinlock may
> actually work perfectly well. But it still worries me. As far as I can
> tell, there's a theoretical problem with your spinlock implementation.

hm, i agree with you that this is problematic. Especially on an SMT CPU it would be a big architectural restriction if prefetches couldn't cross cache misses.

(And that's the only way i could see Nick's scheme working: MESI coherency coupled with the speculative use of that cacheline's value never "surviving" a MESI invalidation of that cacheline. That would guarantee that once we have the lock, any speculative result is fully coherent and no other CPU has modified it.)

	Ingo
Re: [BUG] long freezes on thinkpad t60
Nick, call me a worry-wart, but I slept on this, and started worrying..

On Tue, 26 Jun 2007, Linus Torvalds wrote:
>
> So try it with just a byte counter, and test some stupid micro-benchmark
> on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make
> it the normal spinlock sequence just because it isn't noticeably slower.

So I thought about this a bit more, and I like your sequence counter approach, but it still worried me.

In the current spinlock code, we have a very simple setup for a successful grab of the spinlock:

	CPU#0					CPU#1

	A (= code before the spinlock)
						lock release
	lock decb mem	(serializing instruction)

	B (= code after the spinlock)

and there is no question that memory operations in B cannot leak into A.

With the sequence counters, the situation is more complex:

	CPU #0					CPU #1

	A (= code before the spinlock)

	lock xadd mem	(serializing instruction)

	B (= code after xadd, but not inside lock)
						lock release
	cmp head, tail

	C (= code inside the lock)

Now, B is basically the empty set, but that's not the issue I worry about. The thing is, I can guarantee by the Intel memory ordering rules that neither B nor C will ever have memops that leak past the "xadd", but I'm not at all as sure that we cannot have memops in C that leak into B!

And B really isn't protected by the lock - it may run while another CPU still holds the lock, and we know the other CPU released it only as part of the compare. But that compare isn't a serializing instruction!

IOW, I could imagine a load inside C being speculated, and being moved *ahead* of the load that compares the spinlock head with the tail! IOW, the load that is _inside_ the spinlock has effectively moved to outside the protected region, and the spinlock isn't really a reliable mutual exclusion barrier any more!

(Yes, there is a data-dependency on the compare, but it is only used for a conditional branch, and conditional branches are control dependencies and can be speculated, so CPU speculation can easily break that apparent dependency chain and do later loads *before* the spinlock load completes!)

Now, I have good reason to believe that all Intel and AMD CPU's have a stricter-than-documented memory ordering, and that your spinlock may actually work perfectly well. But it still worries me. As far as I can tell, there's a theoretical problem with your spinlock implementation.

So I'd like you to ask around some CPU people, and get people from both Intel and AMD to sign off on your spinlocks as safe. I suspect you already have the required contacts, but if you don't, I can send things off to the appropriate people at least inside Intel.

		Linus
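The A/B/C window can be annotated on a sketch of the xadd fastpath. This is an illustrative reconstruction with hypothetical names and GCC `__atomic` builtins, not the actual patch; the point is that the head/tail compare is an ordinary load, so without acquire semantics (or the stricter de-facto x86 rules) a load from region C could in principle be satisfied while still spinning in region B.

```c
#include <stdint.h>

static uint8_t head, tail;     /* the sequence counters          */
static int shared_data = 7;    /* nominally protected by the lock */

static int locked_read(void)
{
    /* ... region A: code before the spinlock ... */

    /* "lock xadd mem": a serializing instruction, so nothing
     * below can be hoisted above this point. */
    uint8_t ticket = __atomic_fetch_add(&tail, 1, __ATOMIC_SEQ_CST);

    /* --- region B: we hold a ticket but not yet the lock ---
     * "cmp head, tail" is just an ordinary load plus compare,
     * NOT a serializing instruction.  With a relaxed load, the
     * CPU would architecturally be free to speculate the
     * shared_data load below up into this spin loop - exactly
     * the worry above.  An __ATOMIC_ACQUIRE load here is the
     * portable way to forbid that. */
    while (__atomic_load_n(&head, __ATOMIC_RELAXED) != ticket)
        ;

    /* --- region C: code inside the lock --- */
    return shared_data;
}
```

(In the single-threaded case the loop exits immediately, since ticket 0 matches head 0; the comments mark where the cross-CPU hazard would sit.)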
Re: [BUG] long freezes on thinkpad t60
Linus Torvalds wrote:
> On Wed, 27 Jun 2007, Nick Piggin wrote:
> >
> > I don't know why my unlock sequence should be that much slower? Unlocked
> > mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
> > twice as slow (IIRC).
>
> Oh, that releasing "add" can be unlocked, and only the holder of the lock
> ever touches that field?

Right.

> I must not have looked closely enough. In that case, I withdraw that
> objection, and the sequence-number-based spinlock sounds like a perfectly
> fine one.
>
> Yes, the add will be slightly slower than the plain byte move, and the
> locked xadd will be slightly slower than a regular locked add, but
> compared to the serialization cost, that should be small.

For some reason I thought you needed a locked instruction for the unlock too.

> So try it with just a byte counter, and test some stupid micro-benchmark
> on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make
> it the normal spinlock sequence just because it isn't noticeably slower.
>
> In fact, I think a "incb mem" instruction is even a byte shorter than
> "movb $1,mem", and with "unlock" being inlined, that could actually be a
> slight _win_.

OK, I'll try running some tests and get back to you on it.

--
SUSE Labs, Novell Inc.
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Nick Piggin wrote:
>
> I don't know why my unlock sequence should be that much slower? Unlocked
> mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
> twice as slow (IIRC).

Oh, that releasing "add" can be unlocked, and only the holder of the lock ever touches that field?

I must not have looked closely enough. In that case, I withdraw that objection, and the sequence-number-based spinlock sounds like a perfectly fine one.

Yes, the add will be slightly slower than the plain byte move, and the locked xadd will be slightly slower than a regular locked add, but compared to the serialization cost, that should be small. For some reason I thought you needed a locked instruction for the unlock too.

So try it with just a byte counter, and test some stupid micro-benchmark on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make it the normal spinlock sequence just because it isn't noticeably slower.

In fact, I think a "incb mem" instruction is even a byte shorter than "movb $1,mem", and with "unlock" being inlined, that could actually be a slight _win_.

		Linus
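The unlocked incb works as a release precisely because only the holder writes the head byte, and because the counter only ever needs to wrap modulo 256 in lockstep with the tickets. A hypothetical C sketch of the two unlock styles being compared (illustrative helpers, not the kernel's code):

```c
#include <stdint.h>

/* Old-style lock: unlock rewrites the flag byte.
 * (Compiles to "movb $1, mem"; 1 means free in the old encoding.) */
static void unlock_movb(volatile uint8_t *flag)
{
    *flag = 1;
}

/* Sequence-counter lock: unlock advances the head byte.
 * (An "incb mem" with no lock prefix, since only the holder
 * writes it.)  The byte wraps modulo 256, which is harmless
 * because waiters only test it for equality with their own
 * ticket byte, never for ordering. */
static void unlock_incb(volatile uint8_t *headp)
{
    (*headp)++;
}
```

The equality-only test is what makes the wrap-around safe: after 256 lock/unlock cycles the head byte returns to its old value, but so do the tickets being compared against it.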
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Nick Piggin wrote: I don't know why my unlock sequence should be that much slower? Unlocked mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't twice as slow (IIRC). Oh, that releasing add can be unlocked, and only the holder of the lock ever touches that field? I must not have looked closely enough. In that case, I withdraw that objection, and the sequence-number-based spinlock sounds like a perfectly fine one. Yes, the add will be slightly slower than the plain byte move, and the locked xadd will be slightly slower than a regular locked add, but compared to the serialization cost, that should be small. For some reason I thought you needed a locked instruction for the unlock too. So try it with just a byte counter, and test some stupid micro-benchmark on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make it the normal spinlock sequence just because it isn't noticeably slower. In fact, I think a incb mem instruction is even a byte shorter than movb $1,mem, and with unlock being inlined, that could actually be a slight _win_. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] long freezes on thinkpad t60
Linus Torvalds wrote: On Wed, 27 Jun 2007, Nick Piggin wrote: I don't know why my unlock sequence should be that much slower? Unlocked mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't twice as slow (IIRC). Oh, that releasing add can be unlocked, and only the holder of the lock ever touches that field? Right. I must not have looked closely enough. In that case, I withdraw that objection, and the sequence-number-based spinlock sounds like a perfectly fine one. Yes, the add will be slightly slower than the plain byte move, and the locked xadd will be slightly slower than a regular locked add, but compared to the serialization cost, that should be small. For some reason I thought you needed a locked instruction for the unlock too. So try it with just a byte counter, and test some stupid micro-benchmark on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make it the normal spinlock sequence just because it isn't noticeably slower. In fact, I think a incb mem instruction is even a byte shorter than movb $1,mem, and with unlock being inlined, that could actually be a slight _win_. OK, I'll try running some tests and get back to you on it. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] long freezes on thinkpad t60
Nick, call me a worry-wart, but I slept on this, and started worrying.. On Tue, 26 Jun 2007, Linus Torvalds wrote: So try it with just a byte counter, and test some stupid micro-benchmark on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make it the normal spinlock sequence just because it isn't noticeably slower. So I thought about this a bit more, and I like your sequence counter approach, but it still worried me. In the current spinlock code, we have a very simple setup for a successful grab of the spinlock: CPU#0 CPU#1 A (= code before the spinlock) lock release lock decb mem (serializing instruction) B (= code after the spinlock) and there is no question that memory operations in B cannot leak into A. With the sequence counters, the situation is more complex: CPU #0 CPU #1 A (= code before the spinlock) lock xadd mem (serializing instruction) B (= code afte xadd, but not inside lock) lock release cmp head, tail C (= code inside the lock) Now, B is basically the empty set, but that's not the issue I worry about. The thing is, I can guarantee by the Intel memory ordering rules that neither B nor C will ever have memops that leak past the xadd, but I'm not at all as sure that we cannot have memops in C that leak into B! And B really isn't protected by the lock - it may run while another CPU still holds the lock, and we know the other CPU released it only as part of the compare. But that compare isn't a serializing instruction! IOW, I could imagine a load inside C being speculated, and being moved *ahead* of the load that compares the spinlock head with the tail! IOW, the load that is _inside_ the spinlock has effectively moved to outside the protected region, and the spinlock isn't really a reliable mutual exclusion barrier any more! 
(Yes, there is a data-dependency on the compare, but it is only used for a conditional branch, and conditional branches are control dependencies and can be speculated, so CPU speculation can easily break that apparent dependency chain and do later loads *before* the spinlock load completes!) Now, I have good reason to believe that all Intel and AMD CPU's have a stricter-than-documented memory ordering, and that your spinlock may actually work perfectly well. But it still worries me. As far as I can tell, there's a theoretical problem with your spinlock implementation. So I'd like you to ask around some CPU people, and get people from both Intel and AMD to sign off on your spinlocks as safe. I suspect you already have the required contacts, but if you don't, I can send things off to the appropriate people at least inside Intel. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote: With the sequence counters, the situation is more complex: CPU #0 CPU #1 A (= code before the spinlock) lock xadd mem (serializing instruction) B (= code afte xadd, but not inside lock) lock release cmp head, tail C (= code inside the lock) Now, B is basically the empty set, but that's not the issue I worry about. The thing is, I can guarantee by the Intel memory ordering rules that neither B nor C will ever have memops that leak past the xadd, but I'm not at all as sure that we cannot have memops in C that leak into B! And B really isn't protected by the lock - it may run while another CPU still holds the lock, and we know the other CPU released it only as part of the compare. But that compare isn't a serializing instruction! IOW, I could imagine a load inside C being speculated, and being moved *ahead* of the load that compares the spinlock head with the tail! IOW, the load that is _inside_ the spinlock has effectively moved to outside the protected region, and the spinlock isn't really a reliable mutual exclusion barrier any more! (Yes, there is a data-dependency on the compare, but it is only used for a conditional branch, and conditional branches are control dependencies and can be speculated, so CPU speculation can easily break that apparent dependency chain and do later loads *before* the spinlock load completes!) Now, I have good reason to believe that all Intel and AMD CPU's have a stricter-than-documented memory ordering, and that your spinlock may actually work perfectly well. But it still worries me. As far as I can tell, there's a theoretical problem with your spinlock implementation. hm, i agree with you that this is problematic. Especially on an SMT CPU it would be a big architectural restriction if prefetches couldnt cross cache misses. 
(and that's the only way i could see Nick's scheme working: MESI coherency coupled with the speculative use of that cacheline's value never surviving a MESI invalidation of that cacheline. That would guarantee that once we have the lock, any speculative result is fully coherent and no other CPU has modified it.) Ingo
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Linus Torvalds wrote: On Tue, 26 Jun 2007, Linus Torvalds wrote: So try it with just a byte counter, and test some stupid micro-benchmark on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make it the normal spinlock sequence just because it isn't noticeably slower. So I thought about this a bit more, and I like your sequence counter approach, but it still worried me. In the current spinlock code, we have a very simple setup for a successful grab of the spinlock:

    CPU#0                               CPU#1

    A (= code before the spinlock)
                                        lock release
    lock decb mem
    (serializing instruction)
    B (= code after the spinlock)

and there is no question that memory operations in B cannot leak into A. With the sequence counters, the situation is more complex:

    CPU #0                              CPU #1

    A (= code before the spinlock)
    lock xadd mem
    (serializing instruction)
    B (= code after xadd, but not inside lock)
                                        lock release
    cmp head, tail
    C (= code inside the lock)

Now, B is basically the empty set, but that's not the issue I worry about. The thing is, I can guarantee by the Intel memory ordering rules that neither B nor C will ever have memops that leak past the xadd, but I'm not at all as sure that we cannot have memops in C that leak into B! And B really isn't protected by the lock - it may run while another CPU still holds the lock, and we know the other CPU released it only as part of the compare. But that compare isn't a serializing instruction! IOW, I could imagine a load inside C being speculated, and being moved *ahead* of the load that compares the spinlock head with the tail! IOW, the load that is _inside_ the spinlock has effectively moved to outside the protected region, and the spinlock isn't really a reliable mutual exclusion barrier any more!
(Yes, there is a data-dependency on the compare, but it is only used for a conditional branch, and conditional branches are control dependencies and can be speculated, so CPU speculation can easily break that apparent dependency chain and do later loads *before* the spinlock load completes!) Now, I have good reason to believe that all Intel and AMD CPU's have a stricter-than-documented memory ordering, and that your spinlock may actually work perfectly well. But it still worries me. As far as I can tell, there's a theoretical problem with your spinlock implementation. Nice catch ;) But wasn't Intel suggesting not to rely on the old strict ordering rules? IOW shouldn't an mfence always be there? Not only loads could leak up into the wait phase, but stores too, if they have no dependency with the head and tail loads. - Davide
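[Editor's note: the xadd/compare sequence debated above is essentially a ticket lock. Below is a minimal sketch in C11 atomics, not Nick's actual patch (the kernel version would be inline asm doing "lock xadd"); the type and function names are made up, and the acquire ordering on the spin load is exactly the hedge against the load-load reordering Linus worries about.]

```c
#include <stdatomic.h>

/* Hypothetical ticket-lock sketch: "head" is the ticket dispenser bumped
 * by the xadd, "tail" is the ticket currently being served. */
struct ticket_lock {
    atomic_uchar head;
    atomic_uchar tail;
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
    /* the "lock xadd": an atomic RMW, serializing on x86 */
    unsigned char ticket = atomic_fetch_add(&l->head, 1);

    /* the "cmp head, tail": a plain load, NOT serializing -- this is
     * the window under discussion, so hedge with acquire ordering */
    while (atomic_load_explicit(&l->tail, memory_order_acquire) != ticket)
        ;  /* spin */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    /* unlock: a plain store of tail+1, with release ordering */
    atomic_store_explicit(&l->tail,
        (unsigned char)(atomic_load_explicit(&l->tail,
                                             memory_order_relaxed) + 1),
        memory_order_release);
}
```

Note how fairness falls out for free: waiters are served strictly in ticket order, which is the property the queued-spinlock proposal is after.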
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Davide Libenzi wrote: Now, I have good reason to believe that all Intel and AMD CPU's have a stricter-than-documented memory ordering, and that your spinlock may actually work perfectly well. But it still worries me. As far as I can tell, there's a theoretical problem with your spinlock implementation. Nice catch ;) But wasn't Intel suggesting not to rely on the old strict ordering rules? Actually, both Intel and AMD engineers have been talking about making the documentation _stricter_, rather than looser. They apparently already are pretty damn strict, because not being stricter than the docs imply just ends up exposing too many potential problems in software that didn't really follow the rules. For example, it's quite possible to do loads out of order, but guarantee that the result is 100% equivalent with a totally in-order machine. One way you do that is to keep track of the cacheline for any speculative loads, and if it gets invalidated before the speculative instruction has completed, you just throw the speculation away. End result: you can do any amount of speculation you damn well please at a micro-architectural level, but if the speculation would ever have been architecturally _visible_, it never happens! (Yeah, that is just me in my non-professional capacity of hw engineer wanna-be: I'm not saying that that is necessarily what Intel or AMD actually ever do, and they may have other approaches entirely). IOW shouldn't an mfence always be there? Not only loads could leak up into the wait phase, but stores too, if they have no dependency with the head and tail loads. Stores never leak up. They only ever leak down (ie past subsequent loads or stores), so you don't need to worry about them. That's actually already documented (although not in those terms), and if it wasn't true, then we couldn't do the spin unlock with just a regular store anyway. (There's basically never any reason to speculate stores before other mem ops. 
It's hard, and pointless. Stores you want to just buffer and move as _late_ as possible, loads you want to speculate and move as _early_ as possible. Anything else doesn't make sense). So I'm fairly sure that the only thing you really need to worry about in this thing is the load-load ordering (the load for the spinlock compare vs any loads inside the spinlock), and I'm reasonably certain that no existing x86 (and likely no future x86) will make load-load reordering effects architecturally visible, even if the implementation may do so *internally* when it's not possible to see it in the end result. Linus
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Linus Torvalds wrote: IOW shouldn't an mfence always be there? Not only loads could leak up into the wait phase, but stores too, if they have no dependency with the head and tail loads. Stores never leak up. They only ever leak down (ie past subsequent loads or stores), so you don't need to worry about them. That's actually already documented (although not in those terms), and if it wasn't true, then we couldn't do the spin unlock with just a regular store anyway. Yes, Intel has never done that. They'll probably never do it since it'll break a lot of system software (unless they use a new mode-bit that allows system software to enable loose ordering). Although I clearly remember having read in one of their P4 optimization manuals not to assume this in the future. - Davide
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Davide Libenzi wrote: On Wed, 27 Jun 2007, Linus Torvalds wrote: Stores never leak up. They only ever leak down (ie past subsequent loads or stores), so you don't need to worry about them. That's actually already documented (although not in those terms), and if it wasn't true, then we couldn't do the spin unlock with just a regular store anyway. Yes, Intel has never done that. They'll probably never do it since it'll break a lot of system software (unless they use a new mode-bit that allows system software to enable loose ordering). Although I clearly remember having read in one of their P4 optimization manuals not to assume this in the future. That optimization manual was confused. The Intel memory ordering documentation *clearly* states that only reads pass writes, not the other way around. Some very confused people have thought that "pass" is a two-way thing. It's not. "Passing" in the Intel memory ordering means "go _ahead_ of", exactly the same way it means in traffic. You don't "pass" people by falling behind them. It's also obvious from reading the manual, because any other reading would be very strange: it says

 1. Reads can be carried out speculatively and in any order
 2. Reads can pass buffered writes, but the processor is self-consistent
 3. Writes to memory are always carried out in program order [.. and then lists exceptions that are not interesting - it's clflush and the non-temporal stores, not any normal writes ]
 4. Writes can be buffered
 5. Writes are not performed speculatively; they are only performed for instructions that have actually been retired.
 6. Data from buffered writes can be forwarded to waiting reads within the processor.
 7. Reads or writes cannot pass (be carried out ahead of) I/O instructions, locked instructions or serializing instructions.
 8. Reads cannot pass LFENCE and MFENCE instructions.
 9. Writes cannot pass SFENCE or MFENCE instructions.
The thing to note is:

 a) in (1), Intel says that reads can occur in any order, but (2) makes it clear that that is only relevant wrt other _reads_

 b) in (2), they say "pass", but then they actually explain that "pass" means "be carried out ahead of" in (7).

HOWEVER, it should be obvious in (2) even _without_ the explicit clarification in (7) that "pass" is a one-way thing, because otherwise (2) is totally _meaningless_. It would be meaningless for two reasons:

 - (1) already said that reads can be done in any order, so if that was "any order wrt writes", then (2) would be pointless. So (2) must mean something *else* than "any order", and the only sane reading of it that isn't "any order" is that "pass" is a one-way thing: you pass somebody when you go ahead of them, you do *not* pass somebody when you fall behind them!

 - if (2) really meant that reads and writes can just be re-ordered, then the choice of words makes no sense. It would be much more sensible to say that "reads can be carried out in any order wrt writes", instead of talking explicitly about "passing buffered writes"

Anyway, I'm pretty damn sure my reading is correct. And no, it's not a "it happens to work". It's _architecturally_required_ to work, and nobody has ever complained about the use of a simple store to unlock a spinlock (which would only work if "reads can pass" only means "*later* reads can pass *earlier* writes"). And it turns out that I think #1 is going away. Yes, the uarch will internally re-order reads, wrt each other, but if it isn't architecturally visible, then from an architectural standpoint #1 simply doesn't happen. I can't guarantee that will happen, of course, but from talking to both AMD and Intel people, I think that they'll just document the stricter rules as the de-facto rules. 
Linus
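[Editor's note: the one-way "pass" rule above is exactly why a spinlock release can be a plain store. A minimal C11 sketch of that asymmetry, under the stated x86 model (names are illustrative, not the kernel's): the acquire side needs an atomic RMW, the release side only a release-ordered store.]

```c
#include <stdatomic.h>

static atomic_int slock;   /* 0 = free, 1 = held; illustrative only */

static void spin_lock(void)
{
    /* acquire is an atomic RMW ("lock decb"/xchg on x86): a full
     * synchronization point that later memops cannot float above */
    while (atomic_exchange_explicit(&slock, 1, memory_order_acquire))
        ;  /* spin */
}

static void spin_unlock(void)
{
    /* release is just a store: per the rules quoted above, x86 stores
     * never pass (go ahead of) earlier loads or stores, so no locked
     * op or mfence is needed on the way out */
    atomic_store_explicit(&slock, 0, memory_order_release);
}
```

If stores could leak upward past earlier memops, this unlock would let critical-section accesses escape; the whole thread hinges on that never being architecturally visible.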
Re: [BUG] long freezes on thinkpad t60
On Wed, 27 Jun 2007, Linus Torvalds wrote: On Wed, 27 Jun 2007, Davide Libenzi wrote: On Wed, 27 Jun 2007, Linus Torvalds wrote: Stores never leak up. They only ever leak down (ie past subsequent loads or stores), so you don't need to worry about them. That's actually already documented (although not in those terms), and if it wasn't true, then we couldn't do the spin unlock with just a regular store anyway. Yes, Intel has never done that. They'll probably never do it since it'll break a lot of system software (unless they use a new mode-bit that allows system software to enable loose ordering). Although I clearly remember having read in one of their P4 optimization manuals not to assume this in the future. That optimization manual was confused. The Intel memory ordering documentation *clearly* states that only reads pass writes, not the other way around. Yes, they were stating that clearly. IIWNOC (If I Were Not On Crack) I remember them saying to not assume any ordering besides the data dependency and the CPU self-consistency in the future CPUs, and to use *fence instructions when certain semantics were required. But google did not help me in finding that doc, so maybe I were really on crack :) - Davide
Re: [BUG] long freezes on thinkpad t60
Linus Torvalds wrote: On Tue, 26 Jun 2007, Nick Piggin wrote: Hmm, not that I have a strong opinion one way or the other, but I don't know that they would encourage bad code. They are not going to reduce latency under a locked section, but will improve determinism in the contended case. xadd really generally *is* slower than an add. One is often microcoded, the other is not. Oh. I found xadd to be not hugely slower on my P4, but it was a little bit. But the real problem is that your "unlock" sequence is now about two orders of magnitude slower than it used to be. So it used to be that a spinlocked sequence only had a single synchronization point, now it has two. *That* is really bad, and I guarantee that it makes your spinlocks effectively twice as slow for the non-contended parts. I don't know why my unlock sequence should be that much slower? Unlocked mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't twice as slow (IIRC). But your xadd thing might be worth looking at, just to see how expensive it is. As an _alternative_ to spinlocks, it's certainly viable. (Side note: why make it a word? Word operations are slower on many x86 implementations, because they add yet another prefix. You only need a byte) No real reason I guess. I'll change it. -- SUSE Labs, Novell Inc.
Re: [BUG] long freezes on thinkpad t60
On Tue, 26 Jun 2007, Nick Piggin wrote: > > Hmm, not that I have a strong opinion one way or the other, but I > don't know that they would encourage bad code. They are not going to > reduce latency under a locked section, but will improve determinism > in the contended case. xadd really generally *is* slower than an add. One is often microcoded, the other is not. But the real problem is that your "unlock" sequence is now about two orders of magnitude slower than it used to be. So it used to be that a spinlocked sequence only had a single synchronization point, now it has two. *That* is really bad, and I guarantee that it makes your spinlocks effectively twice as slow for the non-contended parts. But your xadd thing might be worth looking at, just to see how expensive it is. As an _alternative_ to spinlocks, it's certainly viable. (Side note: why make it a word? Word operations are slower on many x86 implementations, because they add yet another prefix. You only need a byte) Linus
Re: [BUG] long freezes on thinkpad t60
On Tue, Jun 26, 2007 at 06:42:10PM +1000, Nick Piggin wrote: ... > They should also improve performance in heavily contended case due to > the nature of how they spin, but I know that's not something you want > to hear about. And theoretically there should be no reason why xadd is > any slower than dec and look at the status flags, should there? I never > implemented it in optimised assembly to test though... ... BTW, could you explain why the below diagnose doesn't relate to your solution? On 06/21/2007 12:08 PM, Ingo Molnar wrote: ... > So it seems the problem was that if a core kept _truly_ modifying a > cacheline via atomics in a high enough frequency, it could artificially > starve the other core. (which would keep waiting for the cacheline to be > released one day, and which kept the first core from ever making any > progress) To me that looks like a real problem on the hardware side - > shouldn't cacheline ownership be arbitrated a bit better than that? > Thanks & regards, Jarek P.
Re: [BUG] long freezes on thinkpad t60
Linus Torvalds wrote: On Thu, 21 Jun 2007, Eric Dumazet wrote: This reminds me of Nick's proposal of 'queued spinlocks' 3 months ago Maybe this should be re-considered ? (unlock is still a non atomic op, so we don't pay the serializing cost twice) No. The point is simple: IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG! I don't understand why this is even controversial. Especially since we have a patch for the problem that proves my point: the _proper_ way to fix things is to just not do the bad thing, instead of trying to allow the bad behaviour and try to handle it. Things like queued spinlocks just make excuses for bad code. We don't do nesting locking either, for exactly the same reason. Are nesting locks "easier"? Absolutely. They are also almost always a sign of a *bug*. So making spinlocks and/or mutexes nest by default is just a way to encourage bad programming! Hmm, not that I have a strong opinion one way or the other, but I don't know that they would encourage bad code. They are not going to reduce latency under a locked section, but will improve determinism in the contended case. They should also improve performance in heavily contended case due to the nature of how they spin, but I know that's not something you want to hear about. And theoretically there should be no reason why xadd is any slower than dec and look at the status flags, should there? I never implemented it in optimised assembly to test though... Some hardware seems to have no idea of fair cacheline scheduling, and especially when there are more than 2 CPUs contending for the cacheline, there can be large starvations. And actually some times we have code that really wants to drop the lock and queue behind other contenders. Most of the lockbreak stuff for example. Suppose we could have a completely fair spinlock primitive that has *virtually* no downsides over the unfair version, you'd take the fair one, right? 
Not that I'm saying they'd ever be a good solution to bad code, but I do think fairness is better than none, all else being equal. extract : Implement queued spinlocks for i386. This shouldn't increase the size of the spinlock structure, while still able to handle 2^16 CPUs. Umm. i386 spinlocks could and should be *one*byte*. In fact, I don't even know why they are wasting four bytes right now: the fact that somebody made them an "int" just wastes memory. All the actual code uses "decb", so it's not even a question of safety. I wonder why we have that 32-bit thing and the ugly casts. Anyway, I think the important point is that they can remain within 4 bytes, which is obviously the most important boundary (they could be 2 bytes on i386). -- SUSE Labs, Novell Inc.
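[Editor's note: the one-byte point can be illustrated with a decb-style lock sketched via GCC's __atomic builtins. The type name and helpers here are made up; the real i386 code is inline asm ("lock; decb" to acquire, "movb $1" to release), and this sketch assumes fewer than 128 contending CPUs so the byte cannot wrap.]

```c
#include <assert.h>

typedef struct { unsigned char slock; } byte_spinlock_t;  /* 1 = free */

_Static_assert(sizeof(byte_spinlock_t) == 1, "lock fits in one byte");

static void byte_spin_lock(byte_spinlock_t *l)
{
    for (;;) {
        /* "lock decb": a non-negative result means we took the lock */
        if ((signed char)__atomic_sub_fetch(&l->slock, 1,
                                            __ATOMIC_ACQUIRE) >= 0)
            return;
        /* contended: wait until it looks free, then retry the decb
         * (the real spin loop uses "rep; nop" while waiting) */
        while ((signed char)__atomic_load_n(&l->slock,
                                            __ATOMIC_RELAXED) <= 0)
            ;
    }
}

static void byte_spin_unlock(byte_spinlock_t *l)
{
    /* "movb $1": release is a plain byte store */
    __atomic_store_n(&l->slock, 1, __ATOMIC_RELEASE);
}
```

The static assert is the whole argument: nothing in the algorithm needs more than one byte of state, so the 32-bit "int" really is just wasted memory.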
Re: [BUG] long freezes on thinkpad t60
On Sat, Jun 23, 2007 at 12:36:08PM +0200, Miklos Szeredi wrote: ... > And it's not a NO_HZ kernel. ... BTW, maybe I've missed this and it's unconnected, but I hope the first config has been changed - especially this CONFIG_AGP_AMD64 = y, and this bug from mm/slab.c has gone long ago... Jarek P.
Re: [BUG] long freezes on thinkpad t60
On Sat, 23 Jun 2007, Miklos Szeredi wrote: > > What I notice is that the interrupt distribution between the CPUs is > very asymmetric like this: > >CPU0 CPU1 > 0: 220496 42 IO-APIC-edge timer > 1: 3841 0 IO-APIC-edge i8042 ... > LOC: 220499 220463 > ERR: 0 > > and the freezes don't really change that. And the NMI traces show, > that it's always CPU1 which is spinning in wait_task_inactive(). Well, the LOC thing is for the local apic timer, so while regular interrupts are indeed very skewed, both CPU's is nicely getting the local apic timer thing.. That said, the timer interrupt generally happens just a few hundred times a second, and if there's just a higher likelihood that it happens when the spinlock is taken, then half-a-second pauses could easily be just because even when the interrupt happens, it could be skewed to happen when the lock is held. And that definitely is the case: the most expensive instruction _by_far_ in that loop is the actual locked instruction that acquires the lock (especially with the cache-line bouncing around), so an interrupt would be much more likely to happen right after that one rather than after the store that releases the lock, which can be buffered. It can be quite interesting to look at instruction-level cycle profiling with oprofile, just to see where the costs are.. Linus
Re: [BUG] long freezes on thinkpad t60
> > the freezes that Miklos was seeing were hardirq contexts blocking in task_rq_lock() - that is done with interrupts disabled. (Miklos i think also tried !NOHZ kernels and older kernels, with a similar result.)
> >
> > plus on the ptrace side, the wait_task_inactive() code had most of its overhead in the atomic op, so if any timer IRQ hit _that_ core, it was likely while we were still holding the runqueue lock!
> >
> > i think the only thing that eventually got Miklos' laptop out of the wedge were timer irqs hitting the ptrace CPU in exactly those instructions where it was not holding the runqueue lock. (or perhaps an asynchronous SMM event delaying it for a long time)
> >
> > even considering that the 'LOCK'-ed instruction was the heaviest in the busy-loop, the numbers still just don't add up to 'tens of seconds of lockups', so there must be something else happening too.
> >
> > So here's an addition to the existing theories: the Core2Duo is a 4-issue CPU architecture. Now, why does this matter? It matters to the timing of the delivery of interrupts. For example, on a 3-issue architecture, the instruction level profile of well-cached workloads often looks like this:
> >
> >  c05a3b71:   710   89 d6                 mov    %edx,%esi
> >  c05a3b73:     0   8b 55 c0              mov    0xffc0(%ebp),%edx
> >  c05a3b76:     0   89 c3                 mov    %eax,%ebx
> >  c05a3b78:   775   8b 82 e8 00 00 00     mov    0xe8(%edx),%eax
> >  c05a3b7e:     0   8b 48 18              mov    0x18(%eax),%ecx
> >  c05a3b81:     0   8b 45 c8              mov    0xffc8(%ebp),%eax
> >  c05a3b84:   792   89 1c 24              mov    %ebx,(%esp)
> >  c05a3b87:     0   89 74 24 04           mov    %esi,0x4(%esp)
> >  c05a3b8b:     0   ff d1                 call   *%ecx
> >  c05a3b8d:     0   8b 4d c8              mov    0xffc8(%ebp),%ecx
> >  c05a3b90:   925   8b 41 6c              mov    0x6c(%ecx),%eax
> >  c05a3b93:     0   39 41 10              cmp    %eax,0x10(%ecx)
> >  c05a3b96:     0   0f 85 a8 01 00 00     jne    c05a3d44 <schedule+0x2a4>
> >
> >  c05a3b9c:   949   89 da                 mov    %ebx,%edx
> >  c05a3b9e:     0   89 f1                 mov    %esi,%ecx
> >  c05a3ba0:     0   8b 45 c8              mov    0xffc8(%ebp),%eax
> >
> > the second column is the number of times the profiling interrupt has hit that particular instruction. 
> > Note the many zero entries - this means that for instructions that are well-cached, the issue order _prevents_ interrupts from _ever_ hitting to within a bundle of micro-ops that the decoder will issue! The above workload was a plain lat_ctx, so nothing special, and interrupts and DMA traffic were coming and going. Still the bundling of instructions was very strong.
> >
> > There's no guarantee of 'instruction bundling': a cachemiss can still stall the pipeline and allow an interrupt to hit any instruction [where interrupt delivery is valid], but on a well-cached workload like the above, even a 3-issue architecture can effectively 'merge' instructions to each other, and can make them essentially 'atomic' as far as external interrupts go.
> >
> > [ also note another interesting thing in the profile above: the CALL *%ecx was likely BTB-optimized and hence we have a 'bundling' effect that is even larger than 3 instructions. ]
> >
> > i think that is what might have happened on Miklos's laptop too: the 'movb' of the spin_unlock() done by the wait_task_inactive() got 'bundled' together with the first LOCK instruction that took it again, making it very unlikely for a timer interrupt to ever hit that small window in wait_task_inactive(). The cpu_relax()'s "REP; NOP" was likely a simple NOP, because the Core2Duo is not an SMT platform.
> >
> > to check this theory, adding 3 NOPs to the critical section should make the lockups a lot less prominent too. (While NOPs are not actually 'issued', they do take up decoder bandwidth, so they hopefully are able to break up any 'bundle' of instructions.)
> >
> > Miklos, if you've got some time to test this - could you revert the fa490cfd15d7 commit and apply the patch below - does it have any impact on the lockups you were experiencing?

No. If anything it made the freezes somewhat more frequent. And it's not a NO_HZ kernel. 
What I notice is that the interrupt distribution between the CPUs is very asymmetric like this:

            CPU0       CPU1
  0:      220496         42   IO-APIC-edge      timer
  1:        3841          0   IO-APIC-edge      i8042
  8:           1          0   IO-APIC-edge      rtc
  9:        2756          0   IO-APIC-fasteoi   acpi
 12:        2638          0   IO-APIC-edge      i8042
 14:        7776          0   IO-APIC-edge      ide0
 16:        6083          0   IO-APIC-fasteoi   uhci_hcd:usb2
 17:       34414          3   IO-APIC-fasteoi   uhci_hcd:usb3, HDA Intel
Re: [BUG] long freezes on thinkpad t60
On Sat, 23 Jun 2007, Miklos Szeredi wrote:
>
> What I notice is that the interrupt distribution between the CPUs is
> very asymmetric like this:
>
>             CPU0       CPU1
>   0:      220496         42   IO-APIC-edge      timer
>   1:        3841          0   IO-APIC-edge      i8042
> ...
> LOC:      220499     220463
> ERR: 0
>
> and the freezes don't really change that. And the NMI traces show, that
> it's always CPU1 which is spinning in wait_task_inactive().

Well, the LOC thing is for the local apic timer, so while regular interrupts are indeed very skewed, both CPUs are nicely getting the local apic timer thing..

That said, the timer interrupt generally happens just a few hundred times a second, and if there's just a higher likelihood that it happens when the spinlock is taken, then half-a-second pauses could easily be just because even when the interrupt happens, it could be skewed to happen when the lock is held.

And that definitely is the case: the most expensive instruction _by_far_ in that loop is the actual locked instruction that acquires the lock (especially with the cache-line bouncing around), so an interrupt would be much more likely to happen right after that one rather than after the store that releases the lock, which can be buffered.

It can be quite interesting to look at instruction-level cycle profiling with oprofile, just to see where the costs are..

		Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, Jun 21, 2007 at 09:01:28AM -0700, Linus Torvalds wrote:
>
> On Thu, 21 Jun 2007, Jarek Poplawski wrote:
...
> So I don't see how you could possibly have two different CPU's getting
> into some lock-step in that loop: changing "task_rq()" is a really quite
> heavy operation (it's about migrating between CPU's), and generally
> happens at a fairly low frequency (ie "a couple of times a second" kind of
> thing, not "tight CPU loop").

Yes, I've agreed with Ingo it was only one of my illusions...

> But bugs happen..
>
> > Another possible problem could be a result of some wrong optimization
> > or wrong propagation of change of this task_rq(p) value.
>
> I agree, but that kind of bug would likely not cause temporary hangs, but
> actual "the machine is dead" operations. If you get the totally *wrong*
> value due to some systematic bug, you'd be waiting forever for it to
> match, not get into a loop for a half second and then it clearing up..
>
> But I don't think we can throw the "it's another bug" theory _entirely_
> out the window.

Alas, I'm the last one here who should talk with you or Ingo about hardware, but my point is that until it's 100% proven that this is a spinlocks vs. cpu case, any nearby possibilities should be considered. One of them is this loop, which can ... loop. Of course we can't see any reason for this, but if something is theoretically allowed, it can happen. Here it's enough that task_rq(p) is, for some very improbable reason (maybe buggy hardware, code or compiling), cached or read too late - and maybe it's only on this one notebook in the world? If you're sure this is not the case, let's forget about this. Of course, I don't mean it's a direct optimization. I think about something that could be triggered by something like this tested smp_mb() in an entirely different place. IMHO, it would be very interesting to verify whether this barrier fixed the spinlock or maybe some variable.
There is also another interesting question: if it was only about spinlocks (which may be), why do the watchdog traces mainly show two "fighters": wait_task_inactive() and try_to_wake_up()? It seems more clients of task_rq_lock() should be seen there. So, another unlikely idea of mine: maybe, for some other strange, improbable, only theoretically possible reason, there could be something wrong around this yield() in wait_task_inactive(). The comment above this function talks about the task that *will* unschedule. So, maybe it would not be nice if e.g. during this yield() something wakes it up?

...
> But it only happens with badly coded software: the rule simply is that you
> MUST NOT release and immediately re-acquire the same spinlock on the same
> core, because as far as other cores are concerned, that's basically the
> same as never releasing it in the first place.

I totally agree! Let's only get certainty that there is nothing more here. BTW, is there any reason not to add some test with a warning, under spinlock debugging, against such badly behaving places?

Regards,
Jarek P.
Re: [BUG] long freezes on thinkpad t60
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> the freezes that Miklos was seeing were hardirq contexts blocking in
> task_rq_lock() - that is done with interrupts disabled. (Miklos i
> think also tried !NOHZ kernels and older kernels, with a similar
> result.)
>
> plus on the ptrace side, the wait_task_inactive() code had most of its
> overhead in the atomic op, so if any timer IRQ hit _that_ core, it was
> likely while we were still holding the runqueue lock!
>
> i think the only thing that eventually got Miklos' laptop out of the
> wedge were timer irqs hitting the ptrace CPU in exactly those
> instructions where it was not holding the runqueue lock. (or perhaps
> an asynchronous SMM event delaying it for a long time)

even considering that the 'LOCK'-ed instruction was the heaviest in the busy-loop, the numbers still just don't add up to 'tens of seconds of lockups', so there must be something else happening too.

So here's an addition to the existing theories: the Core2Duo is a 4-issue CPU architecture. Now, why does this matter? It matters to the timing of the delivery of interrupts. For example, on a 3-issue architecture, the instruction level profile of well-cached workloads often looks like this:

 c05a3b71:  710   89 d6                   mov    %edx,%esi
 c05a3b73:    0   8b 55 c0                mov    0xffc0(%ebp),%edx
 c05a3b76:    0   89 c3                   mov    %eax,%ebx
 c05a3b78:  775   8b 82 e8 00 00 00       mov    0xe8(%edx),%eax
 c05a3b7e:    0   8b 48 18                mov    0x18(%eax),%ecx
 c05a3b81:    0   8b 45 c8                mov    0xffc8(%ebp),%eax
 c05a3b84:  792   89 1c 24                mov    %ebx,(%esp)
 c05a3b87:    0   89 74 24 04             mov    %esi,0x4(%esp)
 c05a3b8b:    0   ff d1                   call   *%ecx
 c05a3b8d:    0   8b 4d c8                mov    0xffc8(%ebp),%ecx
 c05a3b90:  925   8b 41 6c                mov    0x6c(%ecx),%eax
 c05a3b93:    0   39 41 10                cmp    %eax,0x10(%ecx)
 c05a3b96:    0   0f 85 a8 01 00 00       jne    c05a3d44 <schedule+0x2a4>

 c05a3b9c:  949   89 da                   mov    %ebx,%edx
 c05a3b9e:    0   89 f1                   mov    %esi,%ecx
 c05a3ba0:    0   8b 45 c8                mov    0xffc8(%ebp),%eax

the second column is the number of times the profiling interrupt has hit that particular instruction.
Note the many zero entries - this means that for instructions that are well-cached, the issue order _prevents_ interrupts from _ever_ hitting to within a bundle of micro-ops that the decoder will issue! The above workload was a plain lat_ctx, so nothing special, and interrupts and DMA traffic were coming and going. Still the bundling of instructions was very strong. There's no guarantee of 'instruction bundling': a cachemiss can still stall the pipeline and allow an interrupt to hit any instruction [where interrupt delivery is valid], but on a well-cached workload like the above, even a 3-issue architecture can effectively 'merge' instructions to each other, and can make them essentially 'atomic' as far as external interrupts go. [ also note another interesting thing in the profile above: the CALL *%ecx was likely BTB-optimized and hence we have a 'bundling' effect that is even larger than 3 instructions. ] i think that is what might have happened on Miklos's laptop too: the 'movb' of the spin_unlock() done by the wait_task_inactive() got 'bundled' together with the first LOCK instruction that took it again, making it very unlikely for a timer interrupt to ever hit that small window in wait_task_inactive(). The cpu_relax()'s "REP; NOP" was likely a simple NOP, because the Core2Duo is not an SMT platform. to check this theory, adding 3 NOPs to the critical section should make the lockups a lot less prominent too. (While NOPs are not actually 'issued', they do take up decoder bandwidth, so they hopefully are able to break up any 'bundle' of instructions.) Miklos, if you've got some time to test this - could you revert the fa490cfd15d7 commit and apply the patch below - does it have any impact on the lockups you were experiencing? 
	Ingo

---
 kernel/sched.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1131,6 +1131,7 @@ repeat:
 		preempted = !task_running(rq, p);
 		task_rq_unlock(rq, &flags);
 		cpu_relax();
+		asm volatile ("nop; nop; nop;");
 		if (preempted)
 			yield();
 		goto repeat;
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> >
> > damn, i first wrote up an explanation about why that ugly __delay(1) is
> > there (it almost hurts my eyes when i look at it!) but then deleted it
> > as superfluous :-/
>
> I'm fine with a delay, but the __delay(1) is simply not "correct". It
> doesn't do anything.

it's a bit trickier than that. Yes, it's a simple 1-entry loop and thus makes little sense to call. But it's a loop that got boot-time calibrated, so we can do this in the spinlock-debug code:

	u64 loops = loops_per_jiffy * HZ;

this guarantees that we will loop for _at least_ 1 second before printing a message. (in practice it's much longer, especially with the current naive trylock approach) Why? Because as part of the activities that the spin-loop does, we also do everything that an mdelay(1000) would do. We do it 'piecemeal-wise', and we do it very inefficiently, but the lower time-bound should be guaranteed. This is done because most of the problems were caused by too-short looping and bogus debug printouts.

So this is basically an open-coded udelay implementation. Furthermore, with the spin_is_locked() fix i just sent, the __delay(1) solution should actually be quite close to a real udelay() thing, without the delay effect. It would probably be more accurate to increase it to loops_per_jiffy*10*HZ/16 and to call a __delay(16) thing, to get closer timings. (but then some people would argue 'why dont you take the lock as soon as it's released'.)

	Ingo
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote:
>
> damn, i first wrote up an explanation about why that ugly __delay(1) is
> there (it almost hurts my eyes when i look at it!) but then deleted it
> as superfluous :-/

I'm fine with a delay, but the __delay(1) is simply not "correct". It doesn't do anything. "udelay()" waits for a certain time. Use that.

> the reason for the __delay(1) was really mundane: to be able to figure
> out when to print a 'we locked up' message to the user.

No it does not. You may think it does, but it does nothing of the sort. Use "udelay()" or something that actually takes a *time*. Just __delay() is nothing but a loop, and calling it with an argument of 1 is stupid and buggy.

The only *possibly* valid use of "__delay()" implies using a counter that is based on the "loops_per_sec" thing, which depends on what the delay function actually is. For example, the delay function may well turn out to be this:

	__asm__ __volatile__(
		"\tjmp 1f\n"
		".align 16\n"
		"1:\tjmp 2f\n"
		".align 16\n"
		"2:\tdecl %0\n\tjns 2b"
		:"=&a" (d0)
		:"0" (loops));

Notice? "Your code, it does nothing!"

When I said that the code was buggy, I meant it. It has nothing to do with spinlocks. And "__delay(1)" is *always* a bug. You might want to replace it with

	smp_rmb();
	udelay(1);

instead, at which point it *does* something: it has that read barrier (which is not actually needed on x86, but whatever), and it has a delay that is *meaningful*. A plain "__delay(1)" is neither.

So let me repeat my statement: "What a piece of crap".

		Linus
Re: [BUG] long freezes on thinkpad t60
Linus Torvalds wrote:
> On Thu, 21 Jun 2007, Linus Torvalds wrote:
> >
> > We don't do nesting locking either, for exactly the same reason. Are
> > nesting locks "easier"? Absolutely. They are also almost always a
> > sign of a *bug*. So making spinlocks and/or mutexes nest by default
> > is just a way to encourage bad programming!
>
> Side note, and as a "truth in advertising" section: I'll have to admit
> that I argued against fair semaphores on the same grounds. I was wrong
> then (and eventually admitted it, and we obviously try to make our
> mutexes and semaphores fair these days!), and maybe I'm wrong now.
>
> If somebody can actually come up with a sequence where we have spinlock
> starvation, and it's not about an example of bad locking, and nobody
> really can come up with any other way to fix it, we may eventually have
> to add the notion of "fair spinlocks".

I tried to find such a sequence, but I think it's more a matter of hardware evolution, and some degenerate cases. In some years (months?), it might be possible to starve, say, the file-struct spinlock of a process in an open()/close() infinite loop. This is because the number of instructions per 'memory cache line transfer between cpus/cores' is rising. But then one can say it's a bug in user code :)

Another way to starve the kernel might be a loop doing settime(), since seqlocks are quite special in serialization: only a seqlock's writers perform atomic ops, so readers could be starved because of some hardware 'optimization'.

> So my arguments are purely pragmatic. It's not that I hate fairness per
> se. I dislike it only when it's used to "solve" (aka hide) other
> problems.
>
> In the end, some situations do need fairness, and the fact that aiming
> for fairness is often harder, slower, and more complicated than not
> doing so at that point turns into a non-argument. If you need it, you
> need it.
Maybe some *big* NUMA machines really want this fairness (even if it costs some cycles, as pointed out by Davide in http://lkml.org/lkml/2007/3/29/246) - I am just guessing, since I cannot test such monsters.

I tested Davide's program on a Dual Opteron and got some perf difference:

	$ ./qspins -n 2
	now testing: TICKLOCK
	timeres=4000 uscycles=1991
	AVG[0]: 2195.25 cycles/loop      SIG[0]: 11.813657
	AVG[1]: 2212.312500 cycles/loop  SIG[1]: 38.038991

	$ ./qspins -n 2 -s
	now testing: SPINLOCK
	timeres=4000 uscycles=1991
	AVG[0]: 2066.00 cycles/loop      SIG[0]: 0.00
	AVG[1]: 2115.687500 cycles/loop  SIG[1]: 63.083000

> I just don't think we need it, and we're better off solving problems
> other ways.
>
> (For example, we might also solve such problems by creating a separate
> "fair_spin_lock" abstraction, and only making the particular users that
> need it actually use it. It would depend a bit on whether the cost of
> implementing the fairness is noticeable enough for it to be worth
> having a separate construct for it).
>
> 		Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote:
>
> yeah. I think Linux is the only OS on the planet that is using the movb
> trick for unlock, it even triggered a hardware erratum ;)

I'm pretty sure others do it too. Maybe not on an OS level (but I actually doubt that - I'd be surprised if Windows doesn't do the exact same thing), but I know for a fact that a lot of people in threaded libraries end up depending very much on the "simple store closes a locked section" thing.

Just googling for "xchg" "mov" "spinlock" "-linux" shows discussion boards for Windows developers with open-coded spinlocks like

	int ResourceFlag = 0; // 0=Free, 1=Inuse
	...
	// Wait until we get the resource
	while(InterlockedExchange(&ResourceFlag, 1) != 0) {
		Sleep(0); } // Wait a tad
	// Have the resource
	... // do your thing
	ResourceFlag = 0; // Release the resource

and that's definitely Windows code, not some Linux person doing it.

And this is from an OS2 forum

	unsigned owned=0;

	void request() {
		while(LockedExchanged(&owned,1)!=0)
			;
	}

	void release() {
		owned = 0;
	}

so it's not even something unusual.

So while arguably these people don't know (and don't care) about subtle issues like memory ordering, I can *guarantee* that a lot of programs depend on them, even if that dependency may often come from a lack of knowledge, rather than actively understanding what we do like in the Linux kernel community.

(And yes, they rely on compilers not reordering either. Tough.)

		Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > 	for (;;) {
> > 		for (i = 0; i < loops; i++) {
> > 			if (__raw_write_trylock(&lock->raw_lock))
> > 				return;
> > 			__delay(1);
> > 		}
>
> What a piece of crap.
>
> Anybody who ever waits for a lock by busy-looping over it is BUGGY,
> dammit!
>
> The only correct way to wait for a lock is:
>
>  (a) try it *once* with an atomic r-m-w
>  (b) loop over just _reading_ it (and something that implies a memory
>      barrier, _not_ "__delay()". Use "cpu_relax()" or "smp_rmb()")
>  (c) rinse and repeat.

damn, i first wrote up an explanation about why that ugly __delay(1) is there (it almost hurts my eyes when i look at it!) but then deleted it as superfluous :-/

really, it's not because i'm stupid (although i might still be stupid for other reasons ;-), it wasn't there in earlier spin-debug versions. We even had an inner spin_is_locked() loop at a stage (and should add it again).

the reason for the __delay(1) was really mundane: to be able to figure out when to print a 'we locked up' message to the user. If it's 1 second, it causes false positives on some systems. If it's 10 minutes, people press reset before we print out any useful data. It used to be just a loop of rep_nop()s, but that was hard to calibrate: on certain newer hardware it was triggering as fast as in 2 seconds, causing many false positives. We cannot use jiffies nor any other clocksource in this debug code.

so i settled for the butt-ugly but working __delay(1) thing, to be able to time the debug messages.

	Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> >
> > I can understand why no data is saved by this change: gcc is aligning
> > the next field to a natural boundary anyway and we dont really have
> > arrays of spinlocks (fortunately).
>
> Actually, some data structures could well shrink.
>
> Look at "struct task_struct", for example. Right now it has two
> spinlocks right next to each other (alloc_lock and pi_lock).

yeah. We've got init_task's task-struct embedded in the vmlinux, but it's aligned to 32 bytes, which probably hides this effect. We'd only see it if the size change just happened to bring a key data structure (which is also embedded in the vmlinux) just below a modulo-32-bytes boundary. The chance for that is around 6:32 per 'merge' event. That means that there cannot be all that many such cases ;-)

anyway, the shorter init sequence is worth it already.

	Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > No, the cache line arbitration doesn't know anything about "locked" vs > "unlocked" instructions (it could, but there really is no point). > > The real issue is that locked instructions on x86 are serializing, > which makes them extremely slow (compared to just a write), and then > by being slow they effectively just make the window for another core > bigger. > > IOW, it doesn't "fix" anything, it just hides the bug with timing. yeah. Linux is i think the only OS on the planet that is using the movb trick for unlock, it even triggered a hardware erratum ;) So it might surprise some hw makers who might rely on the heuristics that each critical section lock and unlock is a LOCK-ed instruction. Ingo
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote: > > * Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > If somebody can actually come up with a sequence where we have > > spinlock starvation, and it's not about an example of bad locking, and > > nobody really can come up with any other way to fix it, we may > > eventually have to add the notion of "fair spinlocks". > > there was one bad case i can remember: the spinlock debugging code had a > trylock open-coded loop and on certain Opterons, CPUs were starving each > other. But this is a perfect example of exactly what I'm talking about: THAT CODE IS HORRIBLY BUGGY! It's not the spinlocks that are broken, it's that damn code.

>	for (;;) {
>		for (i = 0; i < loops; i++) {
>			if (__raw_write_trylock(&lock->raw_lock))
>				return;
>			__delay(1);
>		}

What a piece of crap. Anybody who ever waits for a lock by busy-looping over it is BUGGY, dammit! The only correct way to wait for a lock is: (a) try it *once* with an atomic r-m-w (b) loop over just _reading_ it (and something that implies a memory barrier, _not_ "__delay()". Use "cpu_relax()" or "smp_rmb()") (c) rinse and repeat. and code like the above should just be shot on sight. So don't blame the spinlocks or the hardware for crap code. Linus
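[Annotation: Linus' (a)/(b)/(c) recipe is the classic "test-and-test-and-set" spin loop. A minimal userspace sketch using C11 atomics, with a hypothetical cpu_relax_sketch() standing in for the kernel's cpu_relax() - this is an illustration of the pattern, not the kernel's implementation:]

```c
#include <stdatomic.h>

/* Illustrative names only; not kernel code. */
typedef struct { atomic_int locked; } sketch_lock_t;

static inline void cpu_relax_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();		/* the REP NOP / PAUSE hint */
#endif
}

static void sketch_lock(sketch_lock_t *l)
{
	for (;;) {
		/* (a) try it *once* with an atomic r-m-w */
		if (!atomic_exchange_explicit(&l->locked, 1,
					      memory_order_acquire))
			return;
		/* (b) spin by just _reading_ it - no cacheline bouncing */
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			cpu_relax_sketch();
		/* (c) rinse and repeat */
	}
}

static void sketch_unlock(sketch_lock_t *l)
{
	/* plain store release - the "movb trick" analogue */
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The point of step (b) is that a read-only spin stays inside the local cache in shared state; only when the owner's release store becomes visible does the waiter retry the expensive r-m-w.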
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > It's in fact entirely possible that the long freezes have always been > there, but the NOHZ option meant that we had much longer stretches of > time without things like timer interrupts to jumble up the timing! So > maybe the freezes existed before, but with timer interrupts happening > hundreds of times a second, they weren't noticeable to humans. the freezes that Miklos was seeing were hardirq contexts blocking in task_rq_lock() - that is done with interrupts disabled. (Miklos i think also tried !NOHZ kernels and older kernels, with a similar result.) plus on the ptrace side, the wait_task_inactive() code had most of its overhead in the atomic op, so if any timer IRQ hit _that_ core, it was likely while we were still holding the runqueue lock! i think the only thing that eventually got Miklos' laptop out of the wedge were timer irqs hitting the ptrace CPU in exactly those instructions where it was not holding the runqueue lock. (or perhaps an asynchronous SMM event delaying it for a long time) Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > (And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't > think such an insane machine has ever existed). and if people _really_ want to boot a large-smp 32-bit kernel on some new, tons-of-cpus box, as a workaround they can enable the spinlock debugging code, which has no limitation on the number of CPUs supported. Ingo
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote: > > I can understand why no data is saved by this change: gcc is aligning > the next field to a natural boundary anyway and we dont really have > arrays of spinlocks (fortunately). Actually, some data structures could well shrink. Look at "struct task_struct", for example. Right now it has two spinlocks right next to each other (alloc_lock and pi_lock). Other data structures may have things like bitfields etc. But yeah, I'd not expect that to be very common, and in some cases you might have to re-order data structures to take advantage of better packing, and even then it's probably not all that noticeable. > but this is certainly not something for 2.6.22, it's an early 2.6.23 > matter i suspect. Oh, absolutely. Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > If somebody can actually come up with a sequence where we have > spinlock starvation, and it's not about an example of bad locking, and > nobody really can come up with any other way to fix it, we may > eventually have to add the notion of "fair spinlocks". there was one bad case i can remember: the spinlock debugging code had a trylock open-coded loop and on certain Opterons, CPUs were starving each other. This used to trigger with the ->tree_lock rwlock i think, on heavy MM loads. The starvation got so bad that the NMI watchdog started triggering ... interestingly, this only triggered for certain rwlocks. Thus we, after a few failed attempts to pacify this open-coded loop, currently have that code disabled in lib/spinlock_debug.c:

#if 0		/* This can cause lockups */
static void __write_lock_debug(rwlock_t *lock)
{
	u64 i;
	u64 loops = loops_per_jiffy * HZ;
	int print_once = 1;

	for (;;) {
		for (i = 0; i < loops; i++) {
			if (__raw_write_trylock(&lock->raw_lock))
				return;
			__delay(1);
		}

the weird thing is that we still have the _very same_ construct in __spin_lock_debug():

		for (i = 0; i < loops; i++) {
			if (__raw_spin_trylock(&lock->raw_lock))
				return;
			__delay(1);
		}

if there are any problems with this then people are not complaining loud enough :-) note that because this is a trylock based loop, the acquire+release sequence problem should not apply to this problem. Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > Umm. i386 spinlocks could and should be *one*byte*. > > In fact, I don't even know why they are wasting four bytes right now: > the fact that somebody made them an "int" just wastes memory. All the > actual code uses "decb", so it's not even a question of safety. I > wonder why we have that 32-bit thing and the ugly casts. > > Ingo, any memory of that? no real reason that i can recall - i guess nobody dared to touch it because it used to have that 'volatile', indicating black voodoo ;-) Now that the bad stigma has been removed, we could try the patch below. It boots fine here, and we save 1K of kernel text size:

   text	   data	    bss	    dec	    hex	filename
6236003	 611992	 401408	7249403	 6e9dfb	vmlinux.before
6235075	 611992	 401408	7248475	 6e9a5b	vmlinux.after

I can understand why no data is saved by this change: gcc is aligning the next field to a natural boundary anyway and we dont really have arrays of spinlocks (fortunately). [and we save no data even if using the ((packed)) attribute.] Perhaps some data structure that is never in the kernel image itself still got smaller? Any good way to determine that? But why is the text size different? Ah: i think it's spin_lock_init() getting shorter :-) but this is certainly not something for 2.6.22, it's an early 2.6.23 matter i suspect. Ingo

--->
From: Ingo Molnar <[EMAIL PROTECTED]>
Subject: [patch] spinlocks i386: change them to byte fields

all spinlock ops are on byte operands, so change the spinlock field to be unsigned char.
This saves a bit of kernel text size:

   text	   data	    bss	    dec	    hex	filename
6236003	 611992	 401408	7249403	 6e9dfb	vmlinux.before
6235075	 611992	 401408	7248475	 6e9a5b	vmlinux.after

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/asm-i386/spinlock_types.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/asm-i386/spinlock_types.h
===================================================================
--- linux.orig/include/asm-i386/spinlock_types.h
+++ linux/include/asm-i386/spinlock_types.h
@@ -6,7 +6,7 @@
 #endif
 
 typedef struct {
-	unsigned int slock;
+	unsigned char slock;
 } raw_spinlock_t;
 
 #define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Linus Torvalds wrote: > > We don't do nesting locking either, for exactly the same reason. Are > nesting locks "easier"? Absolutely. They are also almost always a sign of > a *bug*. So making spinlocks and/or mutexes nest by default is just a way > to encourage bad programming! Side note, and as a "truth in advertising" section: I'll have to admit that I argued against fair semaphores on the same grounds. I was wrong then (and eventually admitted it, and we obviously try to make our mutexes and semaphores fair these days!), and maybe I'm wrong now. If somebody can actually come up with a sequence where we have spinlock starvation, and it's not about an example of bad locking, and nobody really can come up with any other way to fix it, we may eventually have to add the notion of "fair spinlocks". So my arguments are purely pragmatic. It's not that I hate fairness per se. I dislike it only when it's used to "solve" (aka hide) other problems. In the end, some situations do need fairness, and the fact that aiming for fairness is often harder, slower, and more complicated than not doing so at that point turns into a non-argument. If you need it, you need it. I just don't think we need it, and we're better off solving problems other ways. (For example, we might also solve such problems by creating a separate "fair_spin_lock" abstraction, and only making the particular users that need it actually use it. It would depend a bit on whether the cost of implementing the fairness is noticeable enough for it to be worth having a separate construct for it). Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Eric Dumazet wrote: > > This reminds me Nick's proposal of 'queued spinlocks' 3 months ago > > Maybe this should be re-considered ? (unlock is still a non atomic op, > so we dont pay the serializing cost twice) No. The point is simple: IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG! I don't understand why this is even controversial. Especially since we have a patch for the problem that proves my point: the _proper_ way to fix things is to just not do the bad thing, instead of trying to allow the bad behaviour and try to handle it. Things like queued spinlocks just make excuses for bad code. We don't do nesting locking either, for exactly the same reason. Are nesting locks "easier"? Absolutely. They are also almost always a sign of a *bug*. So making spinlocks and/or mutexes nest by default is just a way to encourage bad programming! > extract : > > Implement queued spinlocks for i386. This shouldn't increase the size of > the spinlock structure, while still able to handle 2^16 CPUs. Umm. i386 spinlocks could and should be *one*byte*. In fact, I don't even know why they are wasting four bytes right now: the fact that somebody made them an "int" just wastes memory. All the actual code uses "decb", so it's not even a question of safety. I wonder why we have that 32-bit thing and the ugly casts. Ingo, any memory of that? (And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't think such an insane machine has ever existed). Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007 10:31:53 -0700 (PDT) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Thu, 21 Jun 2007, Chuck Ebbert wrote: > > > > A while ago I showed that spinlocks were a lot more fair when doing > > unlock with the xchg instruction on x86. Probably the arbitration is all > > screwed up because we use a mov instruction, which while atomic is not > > locked. > > No, the cache line arbitration doesn't know anything about "locked" vs > "unlocked" instructions (it could, but there really is no point). > > The real issue is that locked instructions on x86 are serializing, which > makes them extremely slow (compared to just a write), and then by being > slow they effectively just make the window for another core bigger. > This reminds me of Nick's proposal of 'queued spinlocks' 3 months ago Maybe this should be re-considered ? (unlock is still a non atomic op, so we don't pay the serializing cost twice) http://lwn.net/Articles/227506/ extract : Implement queued spinlocks for i386. This shouldn't increase the size of the spinlock structure, while still able to handle 2^16 CPUs. The queued spinlock has 2 fields, a head and a tail, which are indexes into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an "atomic_inc_return" on the head index, and keeps the returned value as a ticket. The CPU then spins until the tail index is equal to that ticket. To unlock a spinlock, the tail index is incremented (this can be non atomic, because only the lock owner will modify tail). Implementation inefficiencies aside, this change should have little effect on performance for uncontended locks, but will have quite a large cost for highly contended locks [O(N) cacheline transfers vs O(1) per lock acquisition, where N is the number of CPUs contending]. The benefit is that contended locks will not cause any starvation. Just an idea. Big NUMA hardware seems to have fairness logic that prevents starvation for the regular spinlock logic.
But it might be interesting for -rt kernel or systems with starvation issues.
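[Annotation: the queued ("ticket") spinlock scheme Eric quotes can be sketched in userspace C11. All names are illustrative - this is the ticket idea in miniature, not the posted i386 patch:]

```c
#include <stdatomic.h>

/* Illustrative ticket lock: head hands out tickets, tail is the
 * ticket currently being served. */
typedef struct {
	atomic_uint head;	/* next ticket to hand out */
	atomic_uint tail;	/* ticket now holding the lock */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
	/* the "atomic_inc_return" on head: the old value is our ticket */
	unsigned int ticket = atomic_fetch_add_explicit(&l->head, 1,
							memory_order_relaxed);
	/* spin until the tail index equals our ticket - FIFO order */
	while (atomic_load_explicit(&l->tail, memory_order_acquire) != ticket)
		;
}

static void ticket_unlock(ticket_lock_t *l)
{
	/* non-atomic increment is safe: only the owner modifies tail;
	 * the release store publishes the critical section */
	unsigned int t = atomic_load_explicit(&l->tail, memory_order_relaxed);
	atomic_store_explicit(&l->tail, t + 1, memory_order_release);
}
```

The fairness property falls out directly: a releasing owner cannot re-take the lock ahead of waiters, because its next ticket goes to the back of the queue.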
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Chuck Ebbert wrote: > > A while ago I showed that spinlocks were a lot more fair when doing > unlock with the xchg instruction on x86. Probably the arbitration is all > screwed up because we use a mov instruction, which while atomic is not > locked. No, the cache line arbitration doesn't know anything about "locked" vs "unlocked" instructions (it could, but there really is no point). The real issue is that locked instructions on x86 are serializing, which makes them extremely slow (compared to just a write), and then by being slow they effectively just make the window for another core bigger. IOW, it doesn't "fix" anything, it just hides the bug with timing. You can hide the problem other ways by just increasing the delay between the unlock and the lock (and adding one or more serializing instruction in between is generally the best way of doing that, simply because otherwise micro-architecture may just be re-ordering things on you, so that your "delay" isn't actually in between any more!). But adding delays doesn't really fix anything, of course. It makes things "fairer" by making *both* sides suck more, but especially if both sides are actually the same exact thing, I could well imagine that they'd both just suck equally, and get into some pattern where they are now both slower, but still see exactly the same problem! Of course, as long as interrupts are on, or things like DMA happen etc, it's really *really* hard to be totally unlucky, and after a while you're likely to break out of the steplock on your own, just because the CPU's get interrupted by something else. It's in fact entirely possible that the long freezes have always been there, but the NOHZ option meant that we had much longer stretches of time without things like timer interrupts to jumble up the timing! So maybe the freezes existed before, but with timer interrupts happening hundreds of times a second, they weren't noticeable to humans. (Btw, that's just _one_ theory.
Don't take it _too_ seriously, but it could be one of the reasons why this showed up as a "new" problem, even though I don't think the "wait_for_inactive()" thing has changed lately.) Linus
Re: [BUG] long freezes on thinkpad t60
On 06/21/2007 12:08 PM, Ingo Molnar wrote: > yeah - i'm not at all arguing in favor of the BTRL patch i did: i always > liked the 'nicer' inner loop of spinlocks, which could btw also easily > use MONITOR/MWAIT. The "nice" inner loop is necessary or else it would generate huge amounts of bus traffic while spinning. > So it seems the problem was that if a core kept _truly_ modifying a > cacheline via atomics in a high enough frequency, it could artificially > starve the other core. (which would keep waiting for the cacheline to be > released one day, and which kept the first core from ever making any > progress) To me that looks like a real problem on the hardware side - > shouldnt cacheline ownership be arbitrated a bit better than that? > A while ago I showed that spinlocks were a lot more fair when doing unlock with the xchg instruction on x86. Probably the arbitration is all screwed up because we use a mov instruction, which while atomic is not locked.
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote: > > so the problem was not the trylock based spin_lock() itself (no matter > how it's structured in the assembly), the problem was actually modifying > the lock and re-modifying it again and again in a very tight > high-frequency loop, and hence not giving it to the other core? That's what I think, yes. And it matches the behaviour we've seen before (remember the whole thread about whether Opteron or Core 2 hardware is "more fair" between cores?) It all simply boils down to the fact that releasing and almost immediately re-taking a lock is "invisible" to outside cores, because it will happen entirely within the L1 cache of one core if the cacheline is already in "owned" state. Another core that spins wildly trying to get it ends up using a much slower "beat" (the cache coherency clock), and with just a bit of bad luck the L1 cache would simply never get probed, and the line never stolen, at the point where the lock happens to be released. The fact that writes (as in the store that releases the lock) automatically get delayed by any x86 core by the store buffer, and the fact that atomic read-modify-write cycles do *not* get delayed, just means that if the "spinlock release" was "fairly close" to the "reacquire" in a software sense, the hardware will actually make them *even*closer*. So you could make the spinlock release be a read-modify-write cycle, and it would make the spinlock much slower, but it would also make it much more likely that the other core will *see* the release. For the same reasons, if you add a "smp_mb()" in between the release and the re-acquire, the other core is much more likely to see it: it means that the release won't be delayed, and thus it just won't be as "close" to the re-acquire. So you can *hide* this problem in many ways, but it's still just hiding it. The proper fix is to not do that kind of "release and re-acquire" behaviour in a loop. 
Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > On Thu, 21 Jun 2007, Ingo Molnar wrote: > > > > what worries me a bit though is that my patch that made spinlocks > > equally agressive to that loop didnt solve the hangs! > > Your patch kept doing "spin_trylock()", didn't it? yeah - it changed spin_lock()'s assembly to do a "LOCK BTRL", which is a trylock which tries to dirty the cacheline. There was a "REP NOP" after it and a loop back to the "LOCK BTRL". > That's a read-modify-write thing, and keeps bouncing the cacheline > back and forth, and together with the fact that even *after* you get > the spinlock the "wait_for_inactive()" would actually end up looping > back, releasing it, and re-getting it. > > So the problem was that "wait_for_inactive()" kept the lock (because > it actually *got* it), and looped over getting it, and because it was > an exclusive cacheline ownership, that implies that somebody else is > not getting it, and is kept from ever getting it. ok, it's not completely clear where exactly the other core was spinning, but i took it from Miklos' observations that the other core was hanging in the _very same_ task_rq_lock() - which is a true spinlock as well that acquires it. So on one core the spin_lock() was starving, on another one it was always succeeding. > So trying to use "trylock" doesn't help. It still has all the same bad > sides - it still gets the lock (getting the lock wasn't the problem: > _holding_ the lock was the problem), and it still kept the cache line > for the lock on one core. so the problem was not the trylock based spin_lock() itself (no matter how it's structured in the assembly), the problem was actually modifying the lock and re-modifying it again and again in a very tight high-frequency loop, and hence not giving it to the other core? > The only way to avoid lock contention is to avoid any exclusive use at
yeah - i'm not at all arguing in favor of the BTRL patch i did: i always liked the 'nicer' inner loop of spinlocks, which could btw also easily use MONITOR/MWAIT. (my patch is also quite close to what we did in spinlocks many years ago, so it's more of a step backwards than real progress.) So it seems the problem was that if a core kept _truly_ modifying a cacheline via atomics in a high enough frequency, it could artificially starve the other core. (which would keep waiting for the cacheline to be released one day, and which kept the first core from ever making any progress) To me that looks like a real problem on the hardware side - shouldnt cacheline ownership be arbitrated a bit better than that? Up to the point where some external event (perhaps a periodic SMM related to thermal management) broke the deadlock/livelock scenario? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Jarek Poplawski wrote: > > BTW, I've looked a bit at these NMI watchdog traces, and now I'm not > even sure it's necessarily the spinlock's problem (but I don't exclude > this possibility yet). It seems both processors use task_rq_lock(), so > there could be also a problem with that loop. I agree that that is also this kind of "loop getting a lock" loop, but if you can see it actually happening there, we have some other (and major) bug. That loop is indeed a loop, but realistically speaking it should never be run through more than once. The only reason it's a loop is that there's a small small race where we get the information at the wrong time, get the wrong lock, and try again. So the loop certainly *can* trigger (or it would be pointless), but I'd normally expect it not to, and even if it does end up looping around it should happen maybe *one* more time, absolutely not "get stuck" in the loop for any appreciable number of iterations. So I don't see how you could possibly having two different CPU's getting into some lock-step in that loop: changing "task_rq()" is a really quite heavy operation (it's about migrating between CPU's), and generally happens at a fairly low frequency (ie "a couple of times a second" kind of thing, not "tight CPU loop"). But bugs happen.. > Another possible problem could be a result of some wrong optimization > or wrong propagation of change of this task_rq(p) value. I agree, but that kind of bug would likely not cause temporary hangs, but actual "the machine is dead" operations. If you get the totally *wrong* value due to some systematic bug, you'd be waiting forever for it to match, not get into a loop for a half second and then it clearing up.. But I don't think we can throw the "it's another bug" theory _entirely_ out the window. That said, both Ingo and me have done "fairness" testing of hardware, and yes, if you have a reasonably tight loop with spinlocks, existing hardware really *is* unfair. 
Not by a "small amount" either - I had this program that did statistics (and Ingo improved on it and had some other tests too), and basically if you have two CPU's that try to get the same spinlock, they really *can* get into situations where one of them gets it millions of times in a row, and the other _never_ gets it. But it only happens with badly coded software: the rule simply is that you MUST NOT release and immediately re-acquire the same spinlock on the same core, because as far as other cores are concerned, that's basically the same as never releasing it in the first place. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote: > > what worries me a bit though is that my patch that made spinlocks > equally agressive to that loop didnt solve the hangs! Your patch kept doing "spin_trylock()", didn't it? That's a read-modify-write thing, and keeps bouncing the cacheline back and forth, and together with the fact that even *after* you get the spinlock the "wait_for_inactive()" would actually end up looping back, releasing it, and re-getting it. So the problem was that "wait_for_inactive()" kept the lock (because it actually *got* it), and looped over getting it, and because it was an exclusive cacheline ownership, that implies that somebody else is not getting it, and is kept from ever getting it. So trying to use "trylock" doesn't help. It still has all the same bad sides - it still gets the lock (getting the lock wasn't the problem: _holding_ the lock was the problem), and it still kept the cache line for the lock on one core. The only way to avoid lock contention is to avoid any exclusive use at all. Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, Jun 21, 2007 at 10:39:31AM +0200, Ingo Molnar wrote: > > * Jarek Poplawski <[EMAIL PROTECTED]> wrote: > > > BTW, I've looked a bit at these NMI watchdog traces, and now I'm not > > even sure it's necessarily the spinlock's problem (but I don't exclude > > this possibility yet). It seems both processors use task_rq_lock(), so > > there could be also a problem with that loop. The way the correctness > > of the taken lock is verified is racy: there is a small probability > > that if we have taken the wrong lock the check inside the loop is done > > just before the value is being changed elsewhere under the right > > lock. Another possible problem could be a result of some wrong > > optimization or wrong propagation of change of this task_rq(p) value. > > ok, could you elaborate this in a bit more detail? You say it's racy - > any correctness bug in task_rq_lock() will cause the kernel to blow up > in spectacular ways. It's a fairly straightforward loop:
>
> static inline struct rq *__task_rq_lock(struct task_struct *p)
> 	__acquires(rq->lock)
> {
> 	struct rq *rq;
>
> repeat_lock_task:
> 	rq = task_rq(p);
> 	spin_lock(&rq->lock);
> 	if (unlikely(rq != task_rq(p))) {
> 		spin_unlock(&rq->lock);
> 		goto repeat_lock_task;
> 	}
> 	return rq;
> }
>
> the result of task_rq() depends on p->thread_info->cpu which will only > change if a task has migrated over to another CPU. That is a > fundamentally 'slow' operation, but even if a task does it intentionally > in a high frequency way (for example via repeated calls to > sched_setaffinity) there's no way it could be faster than the spinlock > code here. So ... what problems can you see with it? OK, you are right - I withdraw this "idea". Sorry! Jarek P.
Re: [BUG] long freezes on thinkpad t60
* Jarek Poplawski <[EMAIL PROTECTED]> wrote: > BTW, I've looked a bit at these NMI watchdog traces, and now I'm not > even sure it's necessarily the spinlock's problem (but I don't exclude > this possibility yet). It seems both processors use task_rq_lock(), so > there could be also a problem with that loop. The way the correctness > of the taken lock is verified is racy: there is a small probability > that if we have taken the wrong lock the check inside the loop is done > just before the value is being changed elsewhere under the right > lock. Another possible problem could be a result of some wrong > optimization or wrong propagation of change of this task_rq(p) value. ok, could you elaborate this in a bit more detail? You say it's racy - any correctness bug in task_rq_lock() will cause the kernel to blow up in spectacular ways. It's a fairly straightforward loop:

static inline struct rq *__task_rq_lock(struct task_struct *p)
	__acquires(rq->lock)
{
	struct rq *rq;

repeat_lock_task:
	rq = task_rq(p);
	spin_lock(&rq->lock);
	if (unlikely(rq != task_rq(p))) {
		spin_unlock(&rq->lock);
		goto repeat_lock_task;
	}
	return rq;
}

the result of task_rq() depends on p->thread_info->cpu which will only change if a task has migrated over to another CPU. That is a fundamentally 'slow' operation, but even if a task does it intentionally in a high frequency way (for example via repeated calls to sched_setaffinity) there's no way it could be faster than the spinlock code here. So ... what problems can you see with it? Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> In other words, spinlocks are optimized for *lack* of contention. If a
> spinlock has contention, you don't try to make the spinlock "fair".
> No, you try to fix the contention instead!

yeah, and if there's no easy solution, change it to a mutex. Fastpath
performance of spinlocks and mutexes is essentially the same, and if
there's any measurable contention then the scheduler is pretty good at
sorting things out. Say if the average contention is longer than 10-20
microseconds then likely we could already win by scheduling away to some
other task. (the best is of course to have no contention at all - but
there are cases where it is really hard, and there are cases where it's
outright unmaintainable.)

Hardware makers are currently producing transistors disproportionately
faster than humans are producing parallel code, as a result of which
we've got more CPU cache than ever, even taking natural application
bloat into account. (it just makes no sense to spend those transistors
on parallelism when applications are just not making use of it yet.
Plus caches are a lot less power-intensive than the functional units of
the CPU, and the limit these days is power input.) So scheduling more
frequently and more aggressively makes more sense than ever before, and
that trend will likely not stop for some time to come.

> The patch I sent out was an example of that. You *can* fix contention
> problems. Does it take clever approaches? Yes. It's why we have hashed
> spinlocks, RCU, and code sequences that are entirely lockless and use
> optimistic approaches. And suddenly you get fairness *and*
> performance!

what worries me a bit though is that my patch that made spinlocks
equally aggressive to that loop didn't solve the hangs! So there is some
issue we don't understand yet - why was the wait_inactive_task()
open-coded spin-trylock loop starving the other core, which had ... an
open-coded spin-trylock loop coded up in assembly?

And we've got a handful of other open-coded loops in the kernel
(networking for example), so this issue could come back and haunt us in
a situation where we don't have a gifted hacker like Miklos being able
to spend _weeks_ to track down the problem...

	Ingo
Re: [BUG] long freezes on thinkpad t60
On Wed, Jun 20, 2007 at 10:34:15AM -0700, Linus Torvalds wrote:
>
> On Wed, 20 Jun 2007, Jarek Poplawski wrote:
> >
> > I don't agree with this (+ I know it doesn't matter).
> >
> > The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
> > And here they are simply lawlessly not fair.
>
> Well, that's certainly a valid standpoint. I wouldn't claim you're
> _wrong_.
>
> At least in theory.
>
> In *practice*, fair spinlocks are simply not possible. Not as in "it's
> hard to do", but as in "you simply cannot do it".

I think we can agree it was more about some minimal fairness.

...
> Which gets us to the next level: we can consider hardware that isn't
> "fair" in its cache coherency to be buggy hardware.

IMHO, we shouldn't try to blame the hardware until we know exactly the
source of this bug.

...
> In other words, spinlocks are optimized for *lack* of contention. If a
> spinlock has contention, you don't try to make the spinlock "fair". No,
> you try to fix the contention instead!
>
> That way, you fix many things. You probably speed things up, _and_ you
> make it fairer. It might sometimes take some brains and effort, but it's
> worth it.

I could agree with this, but only when we know exactly what place should
be fixed, if we don't care about speed. Then, of course, the cost could
be estimated. But after Ingo's last patch I'm not sure he, or anybody
else here, has this knowledge. I'd also point out that adding one
smp_mb() also did the work, and it doesn't look like a big performance
hit. We should only better understand why this works.

> The patch I sent out was an example of that. You *can* fix contention
> problems. Does it take clever approaches? Yes. It's why we have hashed
> spinlocks, RCU, and code sequences that are entirely lockless and use
> optimistic approaches. And suddenly you get fairness *and* performance!
>
> It's a win-win situation. It does require a bit of effort, but hey, we're
> good at effort.

Not necessarily so. As long as the exact reason isn't known "for sure",
this one place could be fixed, but the same problem could appear
somewhere else in a more masked form, or be far less repeatable.

BTW, I've looked a bit at these NMI watchdog traces, and now I'm not
even sure it's necessarily the spinlock's problem (but I don't exclude
this possibility yet). It seems both processors use task_rq_lock(), so
there could be also a problem with that loop. The way the correctness of
the taken lock is verified is racy: there is a small probability that if
we have taken the wrong lock, the check inside the loop is done just
before the value is being changed elsewhere under the right lock.
Another possible problem could be a result of some wrong optimization or
wrong propagation of change of this task_rq(p) value.

Thanks for response & best regards,
Jarek P.
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote:
>
> what worries me a bit though is that my patch that made spinlocks
> equally aggressive to that loop didn't solve the hangs!

Your patch kept doing spin_trylock(), didn't it? That's a
read-modify-write thing, and keeps bouncing the cacheline back and
forth, and together with the fact that even *after* you get the spinlock
the wait_for_inactive() would actually end up looping back, releasing
it, and re-getting it.

So the problem was that wait_for_inactive() kept the lock (because it
actually *got* it), and looped over getting it, and because it was an
exclusive cacheline ownership, that implies that somebody else is not
getting it, and is kept from ever getting it.

So trying to use trylock doesn't help. It still has all the same bad
sides - it still gets the lock (getting the lock wasn't the problem:
_holding_ the lock was the problem), and it still kept the cache line
for the lock on one core.

The only way to avoid lock contention is to avoid any exclusive use at
all.

	Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Jarek Poplawski wrote:
>
> BTW, I've looked a bit at these NMI watchdog traces, and now I'm not
> even sure it's necessarily the spinlock's problem (but I don't exclude
> this possibility yet). It seems both processors use task_rq_lock(), so
> there could be also a problem with that loop.

I agree that that is also this kind of "loop getting a lock" loop, but
if you can see it actually happening there, we have some other (and
major) bug. That loop is indeed a loop, but realistically speaking it
should never be run through more than once. The only reason it's a loop
is that there's a small, small race where we get the information at the
wrong time, get the wrong lock, and try again.

So the loop certainly *can* trigger (or it would be pointless), but I'd
normally expect it not to, and even if it does end up looping around, it
should happen maybe *one* more time - absolutely not get stuck in the
loop for any appreciable number of iterations.

So I don't see how you could possibly have two different CPU's getting
into some lock-step in that loop: changing task_rq() is a really quite
heavy operation (it's about migrating between CPU's), and generally
happens at a fairly low frequency (ie a couple of times a second kind of
thing, not a tight CPU loop). But bugs happen..

> Another possible problem could be a result of some wrong optimization
> or wrong propagation of change of this task_rq(p) value.

I agree, but that kind of bug would likely not cause temporary hangs,
but actual "the machine is dead" situations. If you get the totally
*wrong* value due to some systematic bug, you'd be waiting forever for
it to match, not get into a loop for half a second and then have it
clear up.. But I don't think we can throw the "it's another bug" theory
_entirely_ out the window.

That said, both Ingo and me have done fairness testing of hardware, and
yes, if you have a reasonably tight loop with spinlocks, existing
hardware really *is* unfair. Not by a small amount either - I had this
program that did statistics (and Ingo improved on it and had some other
tests too), and basically if you have two CPU's that try to get the same
spinlock, they really *can* get into situations where one of them gets
it millions of times in a row, and the other _never_ gets it.

But it only happens with badly coded software: the rule simply is that
you MUST NOT release and immediately re-acquire the same spinlock on the
same core, because as far as other cores are concerned, that's basically
the same as never releasing it in the first place.

	Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> >
> > what worries me a bit though is that my patch that made spinlocks
> > equally aggressive to that loop didn't solve the hangs!
>
> Your patch kept doing spin_trylock(), didn't it?

yeah - it changed spin_lock()'s assembly to do a LOCK BTRL, which is a
trylock that tries to dirty the cacheline. There was a REP NOP after it
and a loop back to the LOCK BTRL.

> That's a read-modify-write thing, and keeps bouncing the cacheline
> back and forth, and together with the fact that even *after* you get
> the spinlock the wait_for_inactive() would actually end up looping
> back, releasing it, and re-getting it.
>
> So the problem was that wait_for_inactive() kept the lock (because it
> actually *got* it), and looped over getting it, and because it was an
> exclusive cacheline ownership, that implies that somebody else is not
> getting it, and is kept from ever getting it.

ok, it's not completely clear where exactly the other core was spinning,
but i took it from Miklos' observations that the other core was hanging
in the _very same_ task_rq_lock() - which does a true spin_lock()
acquire as well. So on one core the spin_lock() was starving, on another
one it was always succeeding.

> So trying to use trylock doesn't help. It still has all the same bad
> sides - it still gets the lock (getting the lock wasn't the problem:
> _holding_ the lock was the problem), and it still kept the cache line
> for the lock on one core.

so the problem was not the trylock-based spin_lock() itself (no matter
how it's structured in the assembly), the problem was actually modifying
the lock and re-modifying it again and again in a very tight,
high-frequency loop, and hence not giving it to the other core?

> The only way to avoid lock contention is to avoid any exclusive use at
> all.

yeah - i'm not at all arguing in favor of the BTRL patch i did: i always
liked the 'nicer' inner loop of spinlocks, which could btw also easily
use MONITOR/MWAIT. (my patch is also quite close to what we did in
spinlocks many years ago, so it's more of a step backwards than real
progress.)

So it seems the problem was that if a core kept _truly_ modifying a
cacheline via atomics at a high enough frequency, it could artificially
starve the other core. (which would keep waiting for the cacheline to be
released one day, and which kept the first core from ever making any
progress) To me that looks like a real problem on the hardware side -
shouldn't cacheline ownership be arbitrated a bit better than that? Up
to the point where some external event (perhaps a periodic SMM related
to thermal management) broke the deadlock/livelock scenario?

	Ingo
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote:
>
> so the problem was not the trylock-based spin_lock() itself (no matter
> how it's structured in the assembly), the problem was actually
> modifying the lock and re-modifying it again and again in a very
> tight, high-frequency loop, and hence not giving it to the other core?

That's what I think, yes. And it matches the behaviour we've seen before
(remember the whole thread about whether Opteron or Core 2 hardware is
more "fair" between cores?)

It all simply boils down to the fact that releasing and almost
immediately re-taking a lock is invisible to outside cores, because it
will happen entirely within the L1 cache of one core if the cacheline is
already in "owned" state. Another core that spins wildly trying to get
it ends up using a much slower beat (the cache coherency clock), and
with just a bit of bad luck the L1 cache would simply never get probed,
and the line never stolen, at the point where the lock happens to be
released.

The fact that writes (as in the store that releases the lock)
automatically get delayed by any x86 core by the store buffer, and the
fact that atomic read-modify-write cycles do *not* get delayed, just
means that if the spinlock release was fairly close to the re-acquire in
a software sense, the hardware will actually make them *even* closer.

So you could make the spinlock release be a read-modify-write cycle, and
it would make the spinlock much slower, but it would also make it much
more likely that the other core will *see* the release. For the same
reasons, if you add a smp_mb() in between the release and the
re-acquire, the other core is much more likely to see it: it means that
the release won't be delayed, and thus it just won't be as close to the
re-acquire.

So you can *hide* this problem in many ways, but it's still just hiding
it. The proper fix is to not do that kind of release-and-re-acquire
behaviour in a loop.
	Linus
Re: [BUG] long freezes on thinkpad t60
On 06/21/2007 12:08 PM, Ingo Molnar wrote:
>
> yeah - i'm not at all arguing in favor of the BTRL patch i did: i
> always liked the 'nicer' inner loop of spinlocks, which could btw also
> easily use MONITOR/MWAIT.

The "nice" inner loop is necessary, or else it would generate huge
amounts of bus traffic while spinning.

> So it seems the problem was that if a core kept _truly_ modifying a
> cacheline via atomics at a high enough frequency, it could
> artificially starve the other core. (which would keep waiting for the
> cacheline to be released one day, and which kept the first core from
> ever making any progress) To me that looks like a real problem on the
> hardware side - shouldn't cacheline ownership be arbitrated a bit
> better than that?

A while ago I showed that spinlocks were a lot more fair when doing
unlock with the xchg instruction on x86. Probably the arbitration is all
screwed up because we use a mov instruction, which while atomic is not
locked.
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Chuck Ebbert wrote:
>
> A while ago I showed that spinlocks were a lot more fair when doing
> unlock with the xchg instruction on x86. Probably the arbitration is
> all screwed up because we use a mov instruction, which while atomic is
> not locked.

No, the cache line arbitration doesn't know anything about locked vs
unlocked instructions (it could, but there really is no point). The real
issue is that locked instructions on x86 are serializing, which makes
them extremely slow (compared to just a write), and then by being slow
they effectively just make the window for another core bigger.

IOW, it doesn't fix anything, it just hides the bug with timing. You can
hide the problem other ways by just increasing the delay between the
unlock and the lock (and adding one or more serializing instructions in
between is generally the best way of doing that, simply because
otherwise the micro-architecture may just be re-ordering things on you,
so that your delay isn't actually in between any more!).

But adding delays doesn't really fix anything, of course. It makes
things "fairer" by making *both* sides suck more, but especially if both
sides are actually the same exact thing, I could well imagine that
they'd both just suck equally, and get into some pattern where they are
now both slower, but still see exactly the same problem!

Of course, as long as interrupts are on, or things like DMA happen etc,
it's really *really* hard to be totally unlucky, and after a while
you're likely to break out of the lock-step on your own, just because
the CPU's get interrupted by something else.

It's in fact entirely possible that the long freezes have always been
there, but the NOHZ option meant that we had much longer stretches of
time without things like timer interrupts to jumble up the timing! So
maybe the freezes existed before, but with timer interrupts happening
hundreds of times a second, they weren't noticeable to humans.

(Btw, that's just _one_ theory. Don't take it _too_ seriously, but it
could be one of the reasons why this showed up as a new problem, even
though I don't think the wait_for_inactive() thing has changed lately.)

	Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007 10:31:53 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> No, the cache line arbitration doesn't know anything about locked vs
> unlocked instructions (it could, but there really is no point). The
> real issue is that locked instructions on x86 are serializing, which
> makes them extremely slow (compared to just a write), and then by
> being slow they effectively just make the window for another core
> bigger.

This reminds me of Nick's proposal of 'queued spinlocks' 3 months ago.
Maybe this should be re-considered? (unlock is still a non-atomic op, so
we don't pay the serializing cost twice)

http://lwn.net/Articles/227506/

Extract:

Implement queued spinlocks for i386. This shouldn't increase the size of
the spinlock structure, while still being able to handle 2^16 CPUs.

The queued spinlock has 2 fields, a head and a tail, which are indexes
into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
"atomic_inc_return" on the head index, and keeps the returned value as a
ticket. The CPU then spins until the tail index is equal to that ticket.
To unlock a spinlock, the tail index is incremented (this can be non
atomic, because only the lock owner will modify tail).

Implementation inefficiencies aside, this change should have little
effect on performance for uncontended locks, but will have quite a large
cost for highly contended locks [O(N) cacheline transfers vs O(1) per
lock acquisition, where N is the number of CPUs contending]. The benefit
is that contended locks will not cause any starvation.

Just an idea. Big NUMA hardware seems to have fairness logic that
prevents starvation for the regular spinlock logic. But it might be
interesting for the -rt kernel or systems with starvation issues.
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Eric Dumazet wrote:
>
> This reminds me of Nick's proposal of 'queued spinlocks' 3 months ago.
> Maybe this should be re-considered? (unlock is still a non-atomic op,
> so we don't pay the serializing cost twice)

No. The point is simple:

	IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG!

I don't understand why this is even controversial. Especially since we
have a patch for the problem that proves my point: the _proper_ way to
fix things is to just not do the bad thing, instead of trying to allow
the bad behaviour and trying to handle it. Things like queued spinlocks
just make excuses for bad code.

We don't do nesting locking either, for exactly the same reason. Are
nesting locks easier? Absolutely. They are also almost always a sign of
a *bug*. So making spinlocks and/or mutexes nest by default is just a
way to encourage bad programming!

> Extract: Implement queued spinlocks for i386. This shouldn't increase
> the size of the spinlock structure, while still being able to handle
> 2^16 CPUs.

Umm. i386 spinlocks could and should be *one* *byte*. In fact, I don't
even know why they are wasting four bytes right now: the fact that
somebody made them an "int" just wastes memory. All the actual code uses
"decb", so it's not even a question of safety. I wonder why we have that
32-bit thing and the ugly casts. Ingo, any memory of that?

(And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't
think such an insane machine has ever existed.)

	Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Linus Torvalds wrote:
>
> We don't do nesting locking either, for exactly the same reason. Are
> nesting locks easier? Absolutely. They are also almost always a sign
> of a *bug*. So making spinlocks and/or mutexes nest by default is just
> a way to encourage bad programming!

Side note, and as a "truth in advertising" section: I'll have to admit
that I argued against fair semaphores on the same grounds. I was wrong
then (and eventually admitted it, and we obviously try to make our
mutexes and semaphores fair these days!), and maybe I'm wrong now.

If somebody can actually come up with a sequence where we have spinlock
starvation, and it's not about an example of bad locking, and nobody
really can come up with any other way to fix it, we may eventually have
to add the notion of "fair spinlocks".

So my arguments are purely pragmatic. It's not that I hate fairness per
se. I dislike it only when it's used to solve (aka "hide") other
problems.

In the end, some situations do need fairness, and the fact that aiming
for fairness is often harder, slower, and more complicated than not
doing so at that point turns into a non-argument. If you need it, you
need it. I just don't think we need it, and we're better off solving
problems other ways.

(For example, we might also solve such problems by creating a separate
"fair_spin_lock" abstraction, and only making the particular users that
need it actually use it. It would depend a bit on whether the cost of
implementing the fairness is noticeable enough for it to be worth having
a separate construct for it.)

	Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote: Umm. i386 spinlocks could and should be *one*byte*. In fact, I don't even know why they are wasting four bytes right now: the fact that somebody made them an int just wastes memory. All the actual code uses decb, so it's not even a question of safety. I wonder why we have that 32-bit thing and the ugly casts. Ingo, any memory of that? no real reason that i can recall - i guess nobody dared to touch it because it used to have that 'volatile', indicating black voodoo ;-) Now that the bad stigma has been removed, we could try the patch below. It boots fine here, and we save 1K of kernel text size:

   text    data     bss     dec     hex filename
6236003  611992  401408 7249403  6e9dfb vmlinux.before
6235075  611992  401408 7248475  6e9a5b vmlinux.after

I can understand why no data is saved by this change: gcc is aligning the next field to a natural boundary anyway and we don't really have arrays of spinlocks (fortunately). [and we save no data even if using the ((packed)) attribute.] Perhaps some data structure that is never in the kernel image itself still got smaller? Any good way to determine that? But why is the text size different? Ah: i think it's spin_lock_init() getting shorter :-) but this is certainly not something for 2.6.22, it's an early 2.6.23 matter i suspect. Ingo

---
From: Ingo Molnar [EMAIL PROTECTED]
Subject: [patch] spinlocks i386: change them to byte fields

all spinlock ops are on byte operands, so change the spinlock field to be unsigned char.
This saves a bit of kernel text size:

   text    data     bss     dec     hex filename
6236003  611992  401408 7249403  6e9dfb vmlinux.before
6235075  611992  401408 7248475  6e9a5b vmlinux.after

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 include/asm-i386/spinlock_types.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/asm-i386/spinlock_types.h
===================================================================
--- linux.orig/include/asm-i386/spinlock_types.h
+++ linux/include/asm-i386/spinlock_types.h
@@ -6,7 +6,7 @@
 #endif

 typedef struct {
-	unsigned int slock;
+	unsigned char slock;
 } raw_spinlock_t;

 #define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote: If somebody can actually come up with a sequence where we have spinlock starvation, and it's not about an example of bad locking, and nobody really can come up with any other way to fix it, we may eventually have to add the notion of fair spinlocks. there was one bad case i can remember: the spinlock debugging code had a trylock open-coded loop and on certain Opterons CPUs were starving each other. This used to trigger with the ->tree_lock rwlock i think, on heavy MM loads. The starvation got so bad that the NMI watchdog started triggering ... interestingly, this only triggered for certain rwlocks. Thus we, after a few failed attempts to pacify this open-coded loop, currently have that code disabled in lib/spinlock_debug.c:

	#if 0		/* This can cause lockups */
	static void __write_lock_debug(rwlock_t *lock)
	{
		u64 i;
		u64 loops = loops_per_jiffy * HZ;
		int print_once = 1;

		for (;;) {
			for (i = 0; i < loops; i++) {
				if (__raw_write_trylock(&lock->raw_lock))
					return;
				__delay(1);
			}

the weird thing is that we still have the _very same_ construct in __spin_lock_debug():

		for (i = 0; i < loops; i++) {
			if (__raw_spin_trylock(&lock->raw_lock))
				return;
			__delay(1);
		}

if there are any problems with this then people are not complaining loud enough :-) note that because this is a trylock based loop, the acquire+release sequence problem should not apply to this problem. Ingo
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote: I can understand why no data is saved by this change: gcc is aligning the next field to a natural boundary anyway and we dont really have arrays of spinlocks (fortunately). Actually, some data structures could well shrink. Look at struct task_struct, for example. Right now it has two spinlocks right next to each other (alloc_lock and pi_lock). Other data structures may have things like bitfields etc. But yeah, I'd not expect that to be very common, and in some cases you might have to re-order data structures to take advantage of better packing, and even then it's probably not all that noticeable. but this is certainly not something for 2.6.22, it's an early 2.6.23 matter i suspect. Oh, absolutely. Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote: (And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't think such an insane machine has ever existed). and if people _really_ want to boot a large-smp 32-bit kernel on some new, tons-of-cpus box, as a workaround they can enable the spinlock debugging code, which has no limitation on the number of CPUs supported. Ingo
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote: * Linus Torvalds [EMAIL PROTECTED] wrote: If somebody can actually come up with a sequence where we have spinlock starvation, and it's not about an example of bad locking, and nobody really can come up with any other way to fix it, we may eventually have to add the notion of fair spinlocks. there was one bad case i can remember: the spinlock debugging code had a trylock open-coded loop and on certain Opterons CPUs were starving each other. But this is a perfect example of exactly what I'm talking about: THAT CODE IS HORRIBLY BUGGY! It's not the spinlocks that are broken, it's that damn code.

	for (;;) {
		for (i = 0; i < loops; i++) {
			if (__raw_write_trylock(&lock->raw_lock))
				return;
			__delay(1);
		}

What a piece of crap. Anybody who ever waits for a lock by busy-looping over it is BUGGY, dammit! The only correct way to wait for a lock is: (a) try it *once* with an atomic r-m-w (b) loop over just _reading_ it (and something that implies a memory barrier, _not_ __delay(). Use cpu_relax() or smp_rmb()) (c) rinse and repeat. and code like the above should just be shot on sight. So don't blame the spinlocks or the hardware for crap code. Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote: It's in fact entirely possible that the long freezes have always been there, but the NOHZ option meant that we had much longer stretches of time without things like timer interrupts to jumble up the timing! So maybe the freezes existed before, but with timer interrupts happening hundreds of times a second, they weren't noticeable to humans. the freezes that Miklos was seeing were hardirq contexts blocking in task_rq_lock() - that is done with interrupts disabled. (Miklos i think also tried !NOHZ kernels and older kernels, with a similar result.) plus on the ptrace side, the wait_task_inactive() code had most of its overhead in the atomic op, so if any timer IRQ hit _that_ core, it was likely while we were still holding the runqueue lock! i think the only thing that eventually got Miklos' laptop out of the wedge were timer irqs hitting the ptrace CPU in exactly those instructions where it was not holding the runqueue lock. (or perhaps an asynchronous SMM event delaying it for a long time) Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote: No, the cache line arbitration doesn't know anything about locked vs unlocked instructions (it could, but there really is no point). The real issue is that locked instructions on x86 are serializing, which makes them extremely slow (compared to just a write), and then by being slow they effectively just make the window for another core bigger. IOW, it doesn't fix anything, it just hides the bug with timing. yeah. Linux is, i think, the only OS on the planet that is using the movb trick for unlock - it even triggered a hardware erratum ;) So it might surprise some hw makers who might rely on the heuristics that each critical section lock and unlock is a LOCK-ed instruction. Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote: On Thu, 21 Jun 2007, Ingo Molnar wrote: I can understand why no data is saved by this change: gcc is aligning the next field to a natural boundary anyway and we dont really have arrays of spinlocks (fortunately). Actually, some data structures could well shrink. Look at struct task_struct, for example. Right now it has two spinlocks right next to each other (alloc_lock and pi_lock). yeah. We've got init_task's task-struct embedded in the vmlinux, but it's aligned to 32 bytes, which probably hides this effect. We'd only see it if the size change just happened to bring a key data structure (which is also embedded in the vmlinux) just below a modulo 32 bytes boundary. The chance for that is around 6:32 per 'merge' event. That means that there cannot be all that many such cases ;-) anyway, the shorter init sequence is worth it already. Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote:

	for (;;) {
		for (i = 0; i < loops; i++) {
			if (__raw_write_trylock(&lock->raw_lock))
				return;
			__delay(1);
		}

What a piece of crap. Anybody who ever waits for a lock by busy-looping over it is BUGGY, dammit! The only correct way to wait for a lock is: (a) try it *once* with an atomic r-m-w (b) loop over just _reading_ it (and something that implies a memory barrier, _not_ __delay(). Use cpu_relax() or smp_rmb()) (c) rinse and repeat. damn, i first wrote up an explanation about why that ugly __delay(1) is there (it almost hurts my eyes when i look at it!) but then deleted it as superfluous :-/ really, it's not because i'm stupid (although i might still be stupid for other reasons ;-), it wasn't there in earlier spin-debug versions. We even had an inner spin_is_locked() loop at a stage (and should add it again). the reason for the __delay(1) was really mundane: to be able to figure out when to print a 'we locked up' message to the user. If it's 1 second, it causes false positives on some systems. If it's 10 minutes, people press reset before we print out any useful data. It used to be just a loop of rep_nop()s, but that was hard to calibrate: on certain newer hardware it was triggering as fast as in 2 seconds, causing many false positives. We cannot use jiffies nor any other clocksource in this debug code. so i settled for the butt-ugly but working __delay(1) thing, to be able to time the debug messages. Ingo
Re: [BUG] long freezes on thinkpad t60
Linus Torvalds wrote: On Thu, 21 Jun 2007, Linus Torvalds wrote: We don't do nesting locking either, for exactly the same reason. Are nesting locks easier? Absolutely. They are also almost always a sign of a *bug*. So making spinlocks and/or mutexes nest by default is just a way to encourage bad programming! Side note, and as a truth in advertising section: I'll have to admit that I argued against fair semaphores on the same grounds. I was wrong then (and eventually admitted it, and we obviously try to make our mutexes and semaphores fair these days!), and maybe I'm wrong now. If somebody can actually come up with a sequence where we have spinlock starvation, and it's not about an example of bad locking, and nobody really can come up with any other way to fix it, we may eventually have to add the notion of fair spinlocks. I tried to find such a sequence, but I think it's more a matter of hardware evolution, and some degenerate cases. In some years (months?), it might be possible to starve say the file struct spinlock of a process in an open()/close() infinite loop. This is because the number of instructions per 'memory cache line transfer between cpus/cores' is rising. But then one can say it's a bug in user code :) Another way to starve the kernel might be a loop doing settime(), since seqlocks are quite special in serialization: only seqlock writers perform atomic ops, so readers could be starved because of some hardware 'optimization'. So my arguments are purely pragmatic. It's not that I hate fairness per se. I dislike it only when it's used to solve (aka hide) other problems. In the end, some situations do need fairness, and the fact that aiming for fairness is often harder, slower, and more complicated than not doing so at that point turns into a non-argument. If you need it, you need it.
Maybe some *big* NUMA machines really want this fairness (even if it costs some cycles, as pointed out by Davide in http://lkml.org/lkml/2007/3/29/246 ), I am just guessing since I cannot test such monsters. I tested Davide's program on a Dual Opteron and got some perf difference.

$ ./qspins -n 2
now testing: TICKLOCK timeres=4000 uscycles=1991
AVG[0]: 2195.25 cycles/loop SIG[0]: 11.813657
AVG[1]: 2212.312500 cycles/loop SIG[1]: 38.038991

$ ./qspins -n 2 -s
now testing: SPINLOCK timeres=4000 uscycles=1991
AVG[0]: 2066.00 cycles/loop SIG[0]: 0.00
AVG[1]: 2115.687500 cycles/loop SIG[1]: 63.083000

I just don't think we need it, and we're better off solving problems other ways. (For example, we might also solve such problems by creating a separate fair_spin_lock abstraction, and only making the particular users that need it actually use it. It would depend a bit on whether the cost of implementing the fairness is noticeable enough for it to be worth having a separate construct for it). Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote: yeah. Linux is, i think, the only OS on the planet that is using the movb trick for unlock - it even triggered a hardware erratum ;) I'm pretty sure others do it too. Maybe not on an OS level (but I actually doubt that - I'd be surprised if Windows doesn't do the exact same thing), but I know for a fact that a lot of people in threaded libraries end up depending very much on "simple store closes a locked section". Just googling for xchg mov spinlock -linux shows discussion boards for Windows developers with open-coded spinlocks like

	int ResourceFlag = 0; // 0=Free, 1=Inuse
	...
	// Wait until we get the resource
	while(InterlockedExchange(&ResourceFlag, 1) != 0) {
		Sleep(0); } // Wait a tad
	// Have the resource
	... // do your thing
	ResourceFlag = 0; // Release the resource

and that's definitely Windows code, not some Linux person doing it. And this is from an OS2 forum

	unsigned owned=0;

	void request() {
		while(LockedExchanged(&owned,1)!=0)
			;
	}

	void release() {
		owned = 0;
	}

so it's not even something unusual. So while arguably these people don't know (and don't care) about subtle issues like memory ordering, I can *guarantee* that a lot of programs depend on them, even if that dependency may often come from a lack of knowledge, rather than actively understanding what we do like in the Linux kernel community. (And yes, they rely on compilers not reordering either. Tough.) Linus
Re: [BUG] long freezes on thinkpad t60
On Thu, 21 Jun 2007, Ingo Molnar wrote: damn, i first wrote up an explanation about why that ugly __delay(1) is there (it almost hurts my eyes when i look at it!) but then deleted it as superfluous :-/ I'm fine with a delay, but the __delay(1) is simply not correct. It doesn't do anything. udelay() waits for a certain time. Use that. the reason for the __delay(1) was really mundane: to be able to figure out when to print a 'we locked up' message to the user. No it does not. You may think it does, but it does nothing of the sort. Use udelay() or something that actually takes a *time*. Just __delay() is nothing but a loop, and calling it with an argument of 1 is stupid and buggy. The only *possibly* valid use of __delay() implies using a counter that is based on the loops_per_sec thing, which depends on what the delay function actually is. For example, the delay function may well turn out to be this:

	__asm__ __volatile__(
		"\tjmp 1f\n"
		".align 16\n"
		"1:\tjmp 2f\n"
		".align 16\n"
		"2:\tdecl %0\n\tjns 2b"
		:"=&a" (d0)
		:"0" (loops));

Notice? Your code, it does nothing! When I said that the code was buggy, I meant it. It has nothing to do with spinlocks. And __delay(1) is *always* a bug. You might want to replace it with

	smp_rmb();
	udelay(1);

instead, at which point it *does* something: it has that read barrier (which is not actually needed on x86, but whatever), and it has a delay that is *meaningful*. A plain __delay(1) is neither. So let me repeat my statement: What a piece of crap. Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds [EMAIL PROTECTED] wrote: On Thu, 21 Jun 2007, Ingo Molnar wrote: damn, i first wrote up an explanation about why that ugly __delay(1) is there (it almost hurts my eyes when i look at it!) but then deleted it as superfluous :-/ I'm fine with a delay, but the __delay(1) is simply not correct. It doesn't do anything. it's a bit trickier than that. Yes, it's a simple 1-entry loop and thus makes little sense to call. But it's a loop that got boot-time calibrated, so we can do this in the spinlock-debug code:

	u64 loops = loops_per_jiffy * HZ;

this guarantees that we will loop for _at least_ 1 second before printing a message. (in practice it's much longer, especially with the current naive trylock approach) Why? Because as part of the activities that the spin-loop does, we also do everything that an mdelay(1000) would do. We do it 'piecemeal-wise', and we do it very inefficiently, but the lower time-bound should be guaranteed. This is done because most of the problems were caused by too short looping and bogus debug printouts. So this is basically an open-coded udelay implementation. Furthermore, with the spin_is_locked() fix i just sent, the __loop(1) solution should actually be quite close to a real udelay() thing, without the delay effect. It would probably be more accurate to increase it to loops_per_jiffy*10*HZ/16 and to call a __loop(16) thing, to get closer timings. (but then some people would argue 'why dont you take the lock as soon as it's released'.) Ingo
Re: [BUG] long freezes on thinkpad t60
On Wed, 20 Jun 2007, Jarek Poplawski wrote: > > I don't agree with this (+ I know it doesn't matter). > > The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair". > And here they are simply lawlessly not fair. Well, that's certainly a valid standpoint. I wouldn't claim you're _wrong_. At least in theory. In *practice*, fair spinlocks are simply not possible. Not as in "it's hard to do", but as in "you simply cannot do it". The thing is, most spinlocks get held for a rather short time (if that wasn't the case, you'd use something like a mutex), and it's actually a short time not on a "human scale", but on a "CPU core scale". IOW, the cost of doing the cacheline bouncing is often equal to, or bigger than, the actual cost of the operation that happens inside the spinlock! What does this result in? It automatically means that software simply *cannot* do fair spinlocks, because all the timing costs are in parts that software doesn't even have any visibility into, or control over! Yeah, you could add artificial delays, by doing things like cycle counting around the spinlock operation to see if you got the lock when it was _contended_, or whether you got a lock that wasn't, and then adding some statistics etc, but at that point, you don't have a spinlock any more, you have something else. I don't know what to call it. And you could add flags like "this spinlock is getting a lot of contention", try some other algorithm etc. But it's all complicated and against the whole *point* of a spinlock, which is to get in and out as fast as humanly possible. So in practice, spinlock fairness is inherently tied to the hardware behaviour of the cache coherency algorithm. Which gets us to the next level: we can consider hardware that isn't "fair" in its cache coherency to be buggy hardware. That's also a perfectly fine standpoint, and it's actually one that I have a lot of sympathy for. 
I think it's much easier to some degree to do fairness in the cache coherency than at any other level, because it's something where the hardware really *could* do things like counting bouncing etc. However, while it's a perfectly fine standpoint, it's also totally unrealistic. First off, the hardware doesn't even know whether the spinlock "failed" or not on any architecture platform I'm aware of. On x86, the spinlock operation under Linux is actually just an atomic decrement, and it so happens that the rule for failure was that it didn't go negative. But that's just a Linux internal rule - the hardware doesn't know. So you'd have to actually have some specific sequence with specific fairness rules (maybe you could make the rule be that a failed atomic "cmpxchg" counts as a real failure, and if you give higher priority to cores with lots of failures etc). IOW, it's certainly possible *in*theory* to try to make hardware that has fairness guarantees. However, any hardware designer will tell you that (a) they have enough problems as it is (b) it's simply not worth their time, since there are other (simpler) things that are much much more important. So the end result: - hardware won't do it, and they'd be crazy to really even try (apart from maybe some really simplistic stuff that doesn't _guarantee_ anything, but maybe helps fairness a bit) - software cannot do it, without turning spinlocks into something really slow and complicated, at which point they've lost all meaning, and you should just use a higher-level construct like a mutex instead. In other words, spinlocks are optimized for *lack* of contention. If a spinlock has contention, you don't try to make the spinlock "fair". No, you try to fix the contention instead! That way, you fix many things. You probably speed things up, _and_ you make it fairer. It might sometimes take some brains and effort, but it's worth it. The patch I sent out was an example of that. You *can* fix contention problems. 
Does it take clever approaches? Yes. It's why we have hashed spinlocks, RCU, and code sequences that are entirely lockless and use optimistic approaches. And suddenly you get fairness *and* performance! It's a win-win situation. It does require a bit of effort, but hey, we're good at effort. Linus
Re: [BUG] long freezes on thinkpad t60
On 18-06-2007 18:34, Linus Torvalds wrote: > > On Mon, 18 Jun 2007, Ingo Molnar wrote: >> To test this theory, could you try the patch below, does this fix your >> hangs too? > > I really think this is the wrong approach, although *testing* it makes > sense. > > I think we need to handle loops that take, release, and then immediately > re-take differently. > > Such loops are _usually_ of the form where they really just release the > lock for some latency reason, but in this case I think it's actually just > a bug. > > That code does: > > if (unlikely(p->array || task_running(rq, p))) { > > to decide if it needs to just unlock and repeat, but then to decide if it > need to *yield* it only uses *one* of those tests (namely > > preempted = !task_running(rq, p); > .. > if (preempted) > yield(); > > and I think that's just broken. It basically says: > > - if the task is running, I will busy-loop on getting/releasing the >task_rq_lock > > and that is the _real_ bug here. I don't agree with this (+ I know it doesn't matter). The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair". And here they are simply lawlessly not fair. I cannot see any reason why any of the tasks simultaneously doing "busy-loop on getting/releasing" a spinlock should starve almost to death another one doing the same (or simply waiting to get this lock) even without cpu_relax. Of course, lawfulness of such behavior is questionable, and should be fixed like here. > Trying to make the spinlocks do something else than what they do is just > papering over the real bug. The real bug is that anybody who just > busy-loops getting a lock is wasting resources so much that we should not > be at all surprised that some multi-core or NUMA situations will get > starvation. On the other hand it seems spinlocks should be at least a little more immune to such bugs: slowdown is OK but not freezing. 
Current behavior could suggest this unfairness could harm some tasks even without any loops present - but it's not visible enough. So, I'm surprised this thread seems to stop after this patch, and there is no attempt to make the most of this ideal testing case to improve spinlock design btw. Regards, Jarek P.
Re: [BUG] long freezes on thinkpad t60
On 18-06-2007 18:34, Linus Torvalds wrote: On Mon, 18 Jun 2007, Ingo Molnar wrote: To test this theory, could you try the patch below, does this fix your hangs too? I really think this the the wrong approach, although *testing* it makes sense. I think we need to handle loops that take, release, and then immediately re-take differently. Such loops are _usually_ of the form where they really just release the lock for some latency reason, but in this case I think it's actually just a bug. That code does: if (unlikely(p-array || task_running(rq, p))) { to decide if it needs to just unlock and repeat, but then to decide if it need to *yield* it only uses *one* of those tests (namely preempted = !task_running(rq, p); .. if (preempted) yield(); and I think that's just broken. It basically says: - if the task is running, I will busy-loop on getting/releasing the task_rq_lock and that is the _real_ bug here. I don't agree with this (+ I know it doesn't matter). The real bug is what Chuck Ebbert wrote: Spinlocks aren't fair. And here they are simply lawlessly not fair. I cannot see any reason why any of tasks doing simultaneously busy-loop on getting/releasing a spinlock should starve almost to death another one doing the same (or simply waiting to get this lock) even without cpu_relax. Of course, lawfulness of such behavior is questionable, and should be fixed like here. Trying to make the spinlocks do somethign else than what they do is just papering over the real bug. The real bug is that anybody who just busy-loops getting a lock is wasting resources so much that we should not be at all surprised that some multi-core or NUMA situations will get starvation. On the other hand it seems spinlocks should be at least a little more immune to such bugs: slowdown is OK but not freezing. Current behavior could suggest this unfairness could harm some tasks even without any loops present - but it's not visible enough. 
So, I'm surprised this thread seems to stop after this patch, and there is no try to make the most of this ideal testing case to improve spinlock design btw. Regards, Jarek P. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] long freezes on thinkpad t60
On Wed, 20 Jun 2007, Jarek Poplawski wrote: I don't agree with this (+ I know it doesn't matter). The real bug is what Chuck Ebbert wrote: Spinlocks aren't fair. And here they are simply lawlessly not fair. Well, that's certainly a valid standpoint. I wouldn't claim you're _wrong_. At least in theory. In *practice*, fair spinlocks are simply not possible. Not as in it's hard to do, but as in you simply cannot do it. The thing is, most spinlocks get held for a rather short time (if that wasn't the case, you'd use something like a mutex), and it's acually a short time not on a human scale, but on a CPU core scale. IOW, the cost of doing the cacheline bouncing is often equal to, or bigger than, the actual cost of the operation that happens inside the spinlock! What does this result in? It automatically means that software simply *cannot* do fair spinlocks, because all the timing costs are in parts that software doesn't even have any visibility into, or control over! Yeah, you could add artificial delays, by doing things like cycle counting around the spinlock operation to see if you got the lock when it was _contended_, or whether you got a lock that wasn't, and then adding some statistics etc, but at that point, you don't have a spinlock any more, you have something else. I don't know what to call it. And you could add flags like this spinlock is getting a lot of contention, try some other algorithm etc. But it's all complicated and against the whole *point* of a spinlock, which is to get in and out as fast as humanly possible. So in practice, spinlock fairness is inherently tied to the hardware behaviour of the cache coherency algorithm. Which gets us to the next level: we can consider hardware that isn't fair in its cache coherency to be buggy hardware. That's also a perfectly fine standpoint, and it's actually one that I have a lot of sympathy for. 
I think it's much easier to some degree to do fairness in the cache coherency than at any other level, because it's something where the hardware really *could* do things like counting bouncing etc. However, while it's a perfectly fine standpoint, it's also totally unrealistic. First off, the hardware doesn't even know whether the spinlock failed or not on any architecture platform I'm aware of. On x86, the spinlock operation under Linux is actually just an atomic decrement, and it so happens that the rules for failure was that it didn't go negative. But that's just a Linux internal rule - the hardware doesn't know. So you'd have to actualyl have some specific sequence with specific fairness rules (maybe you could make the rule be that a failed atomic cmpxchg counts as a real failure, and if you give higher priority to cores with lots of failures etc). IOW, it's certainly possible *in*theory* to try to make hardware that has fairness guarantees. However, any hardware designer will tell you that (a) they have enough problems as it is (b) it's simply not worth their time, since there are other (simpler) things that are much much more important. So the end result: - hardware won't do it, and they'd be crazy to really even try (apart from maybe some really simplistic stuff that doesn't _guarantee_ anything, but maybe helps fairness a bit) - software cannot do it, without turning spinlocks into something really slow and complicated, at which point they've lost all meaning, and you should just use a higher-level construct like a mutex instead. In other words, spinlocks are optimized for *lack* of contention. If a spinlock has contention, you don't try to make the spinlock fair. No, you try to fix the contention instead! That way, you fix many things. You probably speed things up, _and_ you make it fairer. It might sometimes take some brains and effort, but it's worth it. The patch I sent out was an example of that. You *can* fix contention problems. 
Does it take clever approaches? Yes. It's why we have hashed spinlocks, RCU, and code sequences that are entirely lockless and use optimistic approaches. And suddenly you get fairness *and* performance! It's a win-win situation. It does require a bit of effort, but hey, we're good at effort.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] long freezes on thinkpad t60
On Mon, Jun 18, 2007 at 01:20:55AM -0700, Andrew Morton wrote:
> On Mon, 18 Jun 2007 10:12:04 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> > Subject: [patch] x86: fix spin-loop starvation bug
> > From: Ingo Molnar <[EMAIL PROTECTED]>
> >
> > Miklos Szeredi reported very long pauses (several seconds, sometimes more) on his T60 (with a Core2Duo) which he managed to track down to wait_task_inactive()'s open-coded busy-loop. He observed that an interrupt on one core tries to acquire the runqueue-lock but does not succeed in doing so for a very long time - while wait_task_inactive() on the other core loops waiting for the first core to deschedule a task (which it won't do while spinning in an interrupt handler).
> >
> > The problem is: both the spin_lock() code and the wait_task_inactive() loop use cpu_relax()/rep_nop(), so in theory the CPU should have guaranteed MESI-fairness to the two cores - but that didn't happen: one of the cores was able to monopolize the cacheline that holds the runqueue lock, for extended periods of time.
> >
> > This patch changes the spin-loop to assert an atomic op after every REP NOP instance - this will cause the CPU to express its "MESI interest" in that cacheline after every REP NOP.
>
> Kiran, if you're still able to reproduce that zone->lru_lock starvation problem, this would be a good one to try...

We tried this approach a week back (speaking of coincidences), and it did not help the problem. I'd changed calls to the zone->lru_lock spin_lock to do spin_trylock in a while loop with cpu_relax instead. It did not help. This was on top of 2.6.17 kernels.

But the good news is that 2.6.21, as is, does not have the starvation issue -- that is, zone->lru_lock does not seem to get contended that much under the same workload. However, this was not on the same hardware I reported zone->lru_lock contention on (8 socket dual core opteron).
I don't have access to it anymore :(

Thanks,
Kiran
Re: [BUG] long freezes on thinkpad t60
On Mon, 18 Jun 2007, Ingo Molnar wrote:
>
> ok. Do we have a guarantee that cpu_relax() is also an smp_rmb()?

The common use for cpu_relax() is basically for code that does

	while (*ptr != val)
		cpu_relax();

so yes, an architecture that doesn't notice writes by other CPUs on its own had *better* have an implied read barrier in its "cpu_relax()" implementation.

For example, the irq handling code does

	while (desc->status & IRQ_INPROGRESS)
		cpu_relax();

which is explicitly about waiting for another CPU to get out of their interrupt handler. And one classic use for it in drivers is obviously the

	while (time_before(jiffies, next))
		cpu_relax();

kind of setup (and "jiffies" may well be updated on another CPU: the fact that it is "volatile" is just a *compiler* barrier, just like cpu_relax() itself will also be, not an "smp_rmb()" kind of hardware barrier).

So we could certainly add the smp_rmb() to make it more explicit, and it wouldn't be *wrong*. But quite frankly, I'd personally rather not - if it were to make a difference in some situation, it would just be papering over a bug in cpu_relax() itself. The whole point of cpu_relax() is about busy-looping, after all. And the only thing you really *can* busy-loop on in a CPU is basically a memory value. So the smp_rmb() would, I think, distract from the issue, and at best paper over some totally separate bug.

> hm, this might still go into a non-nice busy loop on SMP: one cpu runs the strace, another one runs two tasks, one of which is runnable but not on the runqueue (the one we are waiting for). In that case we'd call yield() on this CPU in a loop

Sure. I agree - we can get into a loop that calls yield(). But I think a loop that calls yield() had better be ok - we're explicitly giving the scheduler the ability to schedule anything else that is relevant. So I think yield()'ing is fundamentally different from busy-looping any other way.
Would it be better to be able to have a wait-queue, and actually *sleep* on it, and not even busy-loop using yield? Yeah, possibly. I cannot personally bring myself to care about that kind of corner-case situation, though.

Linus
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > Boots and runs fine. Fixes the freezes as well, which is not such a big surprise, since basically any change in that function seems to do that ;)
>
> Yeah, and that code really was *designed* to make all "locking in a loop" go away. So unlike the patches adding random barriers and cpu_relax() calls (which might fix it for some *particular* hw setup), I pretty much guarantee that you can never get that code into a situation where there is some lock being starved on *any* hw setup.
>
> Ingo, an ack for the patch, and I'll just apply it?

yeah. It's definitely better than what we had before. Maybe one day someone frees us from this function =B-) We already had like 2-3 nasty SMP bugs related to it in the last 10 years.

Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

and kudos to Miklos for the patience to debug this bug!

	Ingo
Re: [BUG] long freezes on thinkpad t60
* Linus Torvalds <[EMAIL PROTECTED]> wrote:
> That code does:
>
>	if (unlikely(p->array || task_running(rq, p))) {
>
> to decide if it needs to just unlock and repeat, but then to decide if it needs to *yield* it only uses *one* of those tests (namely
>
>	preempted = !task_running(rq, p);
>	..
>	if (preempted)
>		yield();
>
> and I think that's just broken. It basically says:
>
> - if the task is running, I will busy-loop on getting/releasing the task_rq_lock
>
> and that is the _real_ bug here.
>
> Trying to make the spinlocks do something else than what they do is just papering over the real bug. The real bug is that anybody who just busy-loops getting a lock is wasting resources so much that we should not be at all surprised that some multi-core or NUMA situations will get starvation.
>
> Blaming some random Core 2 hardware implementation issue that just makes it show up is wrong. It's a software bug, plain and simple.

yeah, agreed. wait_task_inactive() is butt-ugly, and Roland i think found a way to get rid of it in utrace (but it's not implemented yet, boggle) - but nevertheless this needs fixing for .22.

> So how about this diff? The diff looks big, but the *code* is actually simpler and shorter, I just added tons of comments, which is what blows it up.
>
> The new *code* looks like this:
>
>	repeat:
>		/* Unlocked, optimistic looping! */
>		rq = task_rq(p);
>		while (task_running(rq, p))
>			cpu_relax();

ok. Do we have a guarantee that cpu_relax() is also an smp_rmb()?

>		/* Get the *real* values */
>		rq = task_rq_lock(p, &flags);
>		running = task_running(rq, p);
>		array = p->array;
>		task_rq_unlock(rq, &flags);
>
>		/* Check them.. */
>		if (unlikely(running)) {
>			cpu_relax();
>			goto repeat;
>		}
>
>		if (unlikely(array)) {
>			yield();
>			goto repeat;
>		}

hm, this might still go into a non-nice busy loop on SMP: one cpu runs the strace, another one runs two tasks, one of which is runnable but not on the runqueue (the one we are waiting for).
In that case we'd call yield() on this CPU in a loop (and likely won't pull that task over from that CPU). And yield() itself is a high-frequency rq-lock touching thing too, just a bit heavier than the other path in the wait function.

> Hmm? Untested, I know. Maybe I overlooked something. But even the generated assembly code looks fine (much better than it looked before!)

it looks certainly better and cleaner than what we had before!

	Ingo
Re: [BUG] long freezes on thinkpad t60
On Mon, 18 Jun 2007, Miklos Szeredi wrote:
> > Hmm? Untested, I know. Maybe I overlooked something. But even the generated assembly code looks fine (much better than it looked before!)
>
> Boots and runs fine. Fixes the freezes as well, which is not such a big surprise, since basically any change in that function seems to do that ;)

Yeah, and that code really was *designed* to make all "locking in a loop" go away. So unlike the patches adding random barriers and cpu_relax() calls (which might fix it for some *particular* hw setup), I pretty much guarantee that you can never get that code into a situation where there is some lock being starved on *any* hw setup.

Ingo, an ack for the patch, and I'll just apply it?

Linus
Re: [BUG] long freezes on thinkpad t60
> Hmm? Untested, I know. Maybe I overlooked something. But even the generated assembly code looks fine (much better than it looked before!)

Boots and runs fine. Fixes the freezes as well, which is not such a big surprise, since basically any change in that function seems to do that ;)

Miklos