Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Fri, 2005-03-18 at 01:58 -0500, Steven Rostedt wrote:
> On Thu, 17 Mar 2005, Lee Revell wrote:
> >
> > OK, no need to cc: me on this one any more. It's really low priority
> > IMO compared to the big latencies I am seeing with ext3 and
> > "data=ordered". Unless you think there is any relation.
> >
> IMO a deadlock is higher priority than a big latency :-)

Of course, if I was hitting the deadlock in normal use.

> I still believe that something to do with the locking in ext3 has to do
> with your latencies, but I'll take you off when I send something to Andrew
> or Ingo next time. Hopefully, they'll do the same.

If you suspect they are related then yes I would like to be copied.

> When this problem is solved on Ingo's side, maybe this will solve your
> latency problem, so I recommend that you keep trying the latest RT
> kernels. BTW what test are you running that causes these latencies?

dbench 16

Lee
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH] remove lame schedule in journal inverted_lock (was: Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks)
On Fri, 18 Mar 2005, Andrew Morton wrote:
> Steven Rostedt <[EMAIL PROTECTED]> wrote:
> >
> > I wanted to keep the wait logic out when it wasn't a problem. Basically,
> > the problem only occurs when bit_spin_trylock is defined as an actual
> > trylock. So I put in a define there to enable the wait queues. I didn't
> > want to waste cycles checking the wait queue in jbd_unlock_bh_state when
> > there would never be anything on it. Heck, I figured why even have the
> > wait queue wasting memory if it wasn't needed. So that added the
> > ifdeffery complexity.
>
> No, that code's just a problem.  For ranking reasons it's essentially
> doing this:
>
> repeat:
> 	cond_resched();
> 	spin_lock(j_list_lock);
>
> 	if (!bit_spin_trylock(bh)) {
> 		spin_unlock(j_list_lock);
> 		schedule();
> 		goto repeat;
> 	}

Yep, that I understand.

> Now imagine that some other CPU holds the bit_spin_lock and is spinning,
> trying to get the spin_lock().  The above code assumes that the schedule()
> and cond_resched() will take "long enough" for the other CPU to get the
> spinlock, do its business then release the locks.
>
> So all the schedule() is really doing is "blow a few cycles so the other
> CPU can get in and grab the spinlock".  That'll work OK on normal SMP but I
> suspect that on NUMA setups with really big latencies we could end up
> starving the other CPU: this CPU would keep on grabbing the lock.  It
> depends on how the interconnect cache and all that goop works.
>
> So what to do?
>
> One approach would be to spin on the bit_spin_trylock after having dropped
> j_list_lock.  That'll tell us when the other CPU has moved on.

This is probably the best for mainline, since, as you mentioned, the
above code is just bad.

> Another approach would be to sleep on a waitqueue somewhere.  But that
> means that jbd_unlock_bh_state() needs to do wakeups all the time - costly.

That's the approach that my patch made.

> Another approach would be to simply whack an msleep(1) in there.  That
> might be OK - it should be very rare.

This approach is not much better than the current implementation.

> Probably the first approach would be the one to use.  That's for mainline.
>
> I don't know what the super-duper-RT fix would be.  Why did we start
> discussing this anyway?
>
> Oh, SCHED_FIFO.  kjournald doesn't run SCHED_FIFO, but someone may decide
> to make it do so.  But even then I don't see a problem for the mainline
> kernel, because this CPU's SCHED_FIFO doesn't stop the other CPU from
> running.

So this comes down to just a problem with Ingo's PREEMPT_RT. This means
that the latency of kjournald, even without SCHED_FIFO, will be large. If
it preempts a process that holds one of these bit spinlocks (Ingo's RT
kernel takes out the preempt_disable in them), then the kjournald thread
will spin until its timeslice runs out, causing problems for other
processes. The same goes for a process with a higher priority than
kjournald if it blocks on one of the other locks that kjournald can hold
while attempting to get the bit locks.

I know Ingo wants to get his patch eventually into the mainline without
too much drag. But this problem needs to be solved in the mainline to
accomplish this. What do you recommend?

-- Steve
Re: [PATCH] remove lame schedule in journal inverted_lock (was: Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks)
Steven Rostedt <[EMAIL PROTECTED]> wrote:
>
> > I really should knock up a script to send out an email when I add a patch
> > to -mm.
>
> I thought you might have had something like that already, which was
> another reason I thought you might have skipped this.

I do now..

> > > I'd figure that you may have
> > > missed this, since the subject didn't change. So I changed the subject to
> > > get your attention, and I've resent this. Here's the patch to get rid of
> > > the lame schedule that was in fs/jbd/commit.c. Let me know if this
> > > patch is appropriate.
> >
> > I'm rather aghast at all the ifdeffery and complexity in this one.  But I
> > haven't looked at it closely yet.
>
> I wanted to keep the wait logic out when it wasn't a problem. Basically,
> the problem only occurs when bit_spin_trylock is defined as an actual
> trylock. So I put in a define there to enable the wait queues. I didn't
> want to waste cycles checking the wait queue in jbd_unlock_bh_state when
> there would never be anything on it. Heck, I figured why even have the
> wait queue wasting memory if it wasn't needed. So that added the
> ifdeffery complexity.

No, that code's just a problem.  For ranking reasons it's essentially
doing this:

repeat:
	cond_resched();
	spin_lock(j_list_lock);

	if (!bit_spin_trylock(bh)) {
		spin_unlock(j_list_lock);
		schedule();
		goto repeat;
	}

Now imagine that some other CPU holds the bit_spin_lock and is spinning,
trying to get the spin_lock().  The above code assumes that the schedule()
and cond_resched() will take "long enough" for the other CPU to get the
spinlock, do its business then release the locks.

So all the schedule() is really doing is "blow a few cycles so the other
CPU can get in and grab the spinlock".  That'll work OK on normal SMP but I
suspect that on NUMA setups with really big latencies we could end up
starving the other CPU: this CPU would keep on grabbing the lock.  It
depends on how the interconnect cache and all that goop works.

So what to do?

One approach would be to spin on the bit_spin_trylock after having dropped
j_list_lock.  That'll tell us when the other CPU has moved on.

Another approach would be to sleep on a waitqueue somewhere.  But that
means that jbd_unlock_bh_state() needs to do wakeups all the time - costly.

Another approach would be to simply whack an msleep(1) in there.  That
might be OK - it should be very rare.

Probably the first approach would be the one to use.  That's for mainline.

I don't know what the super-duper-RT fix would be.  Why did we start
discussing this anyway?

Oh, SCHED_FIFO.  kjournald doesn't run SCHED_FIFO, but someone may decide
to make it do so.  But even then I don't see a problem for the mainline
kernel, because this CPU's SCHED_FIFO doesn't stop the other CPU from
running.
Re: [PATCH] remove lame schedule in journal inverted_lock (was: Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks)
On Fri, 18 Mar 2005, Andrew Morton wrote:
> Steven Rostedt <[EMAIL PROTECTED]> wrote:
> >
> > Andrew,
> >
> > Since I haven't gotten a response from you,
>
> It sometimes takes me half a day to get onto looking at patches.  And if I
> take them I usually don't reply (sorry).  But I don't drop stuff, so if you
> don't hear, please assume the patch stuck.  If others raise objections
> to the patch I'll usually duck it as well, but it's pretty obvious when that
> happens.

Sorry, I didn't mean to be pushy. I understand that you have a lot on
your plate, and I'm sure you don't drop stuff. I just wasn't sure that
you noticed that that was a patch and not just a reply on this thread,
since I didn't flag it as such in the subject. I just didn't want it to
slip under the radar.

> I really should knock up a script to send out an email when I add a patch
> to -mm.

I thought you might have had something like that already, which was
another reason I thought you might have skipped this.

> > I'd figure that you may have
> > missed this, since the subject didn't change. So I changed the subject to
> > get your attention, and I've resent this. Here's the patch to get rid of
> > the lame schedule that was in fs/jbd/commit.c. Let me know if this
> > patch is appropriate.
>
> I'm rather aghast at all the ifdeffery and complexity in this one.  But I
> haven't looked at it closely yet.

I wanted to keep the wait logic out when it wasn't a problem. Basically,
the problem only occurs when bit_spin_trylock is defined as an actual
trylock. So I put in a define there to enable the wait queues. I didn't
want to waste cycles checking the wait queue in jbd_unlock_bh_state when
there would never be anything on it. Heck, I figured why even have the
wait queue wasting memory if it wasn't needed. So that added the
ifdeffery complexity.

Thanks,

-- Steve
Re: [PATCH] remove lame schedule in journal inverted_lock (was: Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks)
Steven Rostedt <[EMAIL PROTECTED]> wrote:
>
> Andrew,
>
> Since I haven't gotten a response from you,

It sometimes takes me half a day to get onto looking at patches.  And if I
take them I usually don't reply (sorry).  But I don't drop stuff, so if you
don't hear, please assume the patch stuck.  If others raise objections
to the patch I'll usually duck it as well, but it's pretty obvious when that
happens.

I really should knock up a script to send out an email when I add a patch
to -mm.

> I'd figure that you may have
> missed this, since the subject didn't change. So I changed the subject to
> get your attention, and I've resent this. Here's the patch to get rid of
> the lame schedule that was in fs/jbd/commit.c. Let me know if this
> patch is appropriate.

I'm rather aghast at all the ifdeffery and complexity in this one.  But I
haven't looked at it closely yet.
[PATCH] remove lame schedule in journal inverted_lock (was: Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks)
Andrew,

Since I haven't gotten a response from you, I'd figure that you may have
missed this, since the subject didn't change. So I changed the subject to
get your attention, and I've resent this. Here's the patch to get rid of
the lame schedule that was in fs/jbd/commit.c. Let me know if this patch
is appropriate.

Thanks,

-- Steve

On Thu, 17 Mar 2005, Steven Rostedt wrote:
> On Wed, 16 Mar 2005, Andrew Morton wrote:
> > > I guess one way to solve this is to add a wait queue here (before
> > > schedule()), and have the one holding the lock to wake up all on the
> > > waitqueue when they release it.
> >
> > yup.  A patch against mainline would be appropriate, please.
>
> Hi Andrew,
>
> Here's the patch against 2.6.11. I tested it by adding (after making the
> patch) global spinlocks for jbd_lock_bh_state and jbd_lock_bh_journalhead.
> That way I have the same scenario as with Ingo's kernel, and I turned on
> NEED_JOURNAL_STATE_WAIT. I'm still running that kernel, so it looks like
> it works. Making those two locks global causes this deadlock on kjournald
> much quicker, and I don't need to run on an SMP machine (since my SMP
> machines are currently being used for other tasks).
>
> Some comments on my patch. I only implement the wait queue when
> bit_spin_trylock is an actual lock (thus creating the problem). I didn't
> want to add this code if it wasn't needed (ie. !(CONFIG_SMP &&
> CONFIG_DEBUG_SPINLOCKS)). So in bit_spin_trylock, I define
> NEED_JOURNAL_STATE_WAIT if bit_spin_trylock is really a lock. When
> NEED_JOURNAL_STATE_WAIT is set, then the wait queue is set up in the
> journal code.
>
> Now the question is, should we make those two locks global? It would help
> Ingo's cause (and mine as well). But I don't know the impact on a large
> SMP configuration. Andrew, since you have a 16xSMP machine, could you (if
> you have time) try out the effect of that. If you do have time, then I'll
> send you a patch that goes on top of this one to change the two locks into
> global spin locks.
>
> Ingo, where do you want to go from here? I guess we need to wait on what
> Andrew decides.
>
> -- Steve

diff -ur linux-2.6.11.orig/fs/jbd/commit.c linux-2.6.11/fs/jbd/commit.c
--- linux-2.6.11.orig/fs/jbd/commit.c	2005-03-02 02:38:25.0 -0500
+++ linux-2.6.11/fs/jbd/commit.c	2005-03-17 03:40:06.0 -0500
@@ -80,15 +80,33 @@
 /*
  * Try to acquire jbd_lock_bh_state() against the buffer, when j_list_lock is
- * held.  For ranking reasons we must trylock.  If we lose, schedule away and
- * return 0.  j_list_lock is dropped in this case.
+ * held.  For ranking reasons we must trylock.  If we lose, put ourselves on a
+ * state wait queue and we'll be woken up when it is unlocked.  Then we return
+ * 0 to try this again.  j_list_lock is dropped in this case.
  */
 static int inverted_lock(journal_t *journal, struct buffer_head *bh)
 {
 	if (!jbd_trylock_bh_state(bh)) {
+		/*
+		 * jbd_trylock_bh_state always returns true unless CONFIG_SMP or
+		 * CONFIG_DEBUG_SPINLOCK, so the wait queue is not needed there.
+		 * The bit_spin_locks in jbd_lock_bh_state need to be removed anyway.
+		 */
+#ifdef NEED_JOURNAL_STATE_WAIT
+		DECLARE_WAITQUEUE(wait, current);
 		spin_unlock(&journal->j_list_lock);
-		schedule();
+		add_wait_queue_exclusive(&journal_state_wait, &wait);
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		/* Check to see if the lock has been unlocked in this short time */
+		if (jbd_is_locked_bh_state(bh))
+			schedule();
+		set_current_state(TASK_RUNNING);
+		remove_wait_queue(&journal_state_wait, &wait);
 		return 0;
+#else
+		/* This should never be hit */
+		BUG();
+#endif
 	}
 	return 1;
 }
diff -ur linux-2.6.11.orig/fs/jbd/journal.c linux-2.6.11/fs/jbd/journal.c
--- linux-2.6.11.orig/fs/jbd/journal.c	2005-03-02 02:37:49.0 -0500
+++ linux-2.6.11/fs/jbd/journal.c	2005-03-17 03:47:40.0 -0500
@@ -80,6 +80,11 @@
 EXPORT_SYMBOL(journal_try_to_free_buffers);
 EXPORT_SYMBOL(journal_force_commit);
 
+#ifdef NEED_JOURNAL_STATE_WAIT
+EXPORT_SYMBOL(journal_state_wait);
+DECLARE_WAIT_QUEUE_HEAD(journal_state_wait);
+#endif
+
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 
 /*
diff -ur linux-2.6.11.orig/include/linux/jbd.h linux-2.6.11/include/linux/jbd.h
--- linux-2.6.11.orig/include/linux/jbd.h	2005-03-02 02:38:19.0 -0500
+++ linux-2.6.11/include/linux/jbd.h	2005-03-17 03:48:18.0 -0500
@@ -324,6 +324,20 @@
 	return bh->b_private;
 }
 
+#ifdef NEED_JOURNAL_STATE_WAIT
+/*
+ * The journal_state_wait is a wait queue that tasks will wait on
+ * if they fail to get the jbd_lock_bh_state while holding the j_list_lock.
+ * Instead of spinning on schedule, the task now adds itself to this wait queue
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Thu, 17 Mar 2005, Lee Revell wrote:
>
> OK, no need to cc: me on this one any more. It's really low priority
> IMO compared to the big latencies I am seeing with ext3 and
> "data=ordered". Unless you think there is any relation.
>

IMO a deadlock is higher priority than a big latency :-)

I still believe that something to do with the locking in ext3 has to do
with your latencies, but I'll take you off when I send something to Andrew
or Ingo next time. Hopefully, they'll do the same.

When this problem is solved on Ingo's side, maybe this will solve your
latency problem, so I recommend that you keep trying the latest RT
kernels. BTW what test are you running that causes these latencies?

-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Thu, 2005-03-17 at 11:23 -0500, Steven Rostedt wrote:

> On Thu, 17 Mar 2005, Lee Revell wrote:
>
> > Sorry, it's hard to follow this thread. Just to make sure we're all on
> > the same page, what exactly is the symptom of this ext3 issue you are
> > working on? Is it a performance regression, or a latency issue, or a
> > lockup - ?
> >
> > Whatever your problem is, I am not seeing it.
>
> The root is a lockup. I think you can get this lockup whether or not it
> is PREEMPT_RT or PREEMPT_DESKTOP. All you need is CONFIG_PREEMPT turned
> on. Then this is what you want to do on a UP machine.

OK, no need to cc: me on this one any more. It's really low priority IMO compared to the big latencies I am seeing with ext3 and "data=ordered". Unless you think there is any relation.

Lee
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Thu, 17 Mar 2005, Lee Revell wrote:

> Sorry, it's hard to follow this thread. Just to make sure we're all on
> the same page, what exactly is the symptom of this ext3 issue you are
> working on? Is it a performance regression, or a latency issue, or a
> lockup - ?
>
> Whatever your problem is, I am not seeing it.

The root is a lockup. I think you can get this lockup whether it is PREEMPT_RT or PREEMPT_DESKTOP. All you need is CONFIG_PREEMPT turned on. Then this is what you want to do on a UP machine.

Set kjournald to FIFO (any realtime priority). Then, from a non-RT task, just do a "make clean; make" on the kernel. It may take a few minutes, but your system will lock up. That's because kjournald will busy-wait on the bit_spin_lock but will never be preempted by the task holding the lock, because kjournald is FIFO and the lock holder (the kernel compile) is not RT. Even if it were, at the same priority as kjournald, it would still lock up, since kjournald is FIFO and will only yield to higher-priority threads.

Now this lockup has uncovered other problems with ext3. Mainly that it uses bit spinlocks, which is bad in and of itself. You don't want a busy wait unless you really need it. A normal spinlock is justified on vanilla SMP systems, since a schedule would take longer than the holder's critical section. Ingo's RT kernel removes most of these and turns them into mutexes. This may slow down overall performance, but it shortens latencies for RT tasks, which is what RT tries to do.

Now the latest problem is also bad, since you should never just call schedule() as a "yield" to let someone else release a lock. Since the ranking order of the locks prevents just grabbing the lock (that would risk a deadlock), ext3 tries to get the lock, and if it fails, it releases the other lock it holds, calls schedule(), then tries again. This is usually bad, since the task would most likely be rescheduled right away, so it is basically worse than a spinlock: it actually goes through the schedule logic again and spins! With Ingo's RT patch, this also becomes a deadlock the same way bit_spin_locks can.

Hope this helps,

-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Thu, 2005-03-17 at 02:15 -0500, Steven Rostedt wrote:

> On Wed, 16 Mar 2005, Lee Revell wrote:
>
> > I am a bit confused, big surprise. Does this thread still have anything
> > to do with this trace from my "Latency regressions" bug report?
>
> Don't worry, I've been in a state of confusion for a long time now ;-)
>
> > http://www.alsa-project.org/~rlrevell/2912us
> >
> > The problem only is apparent with PREEMPT_DESKTOP and "data=ordered".
> >
> > PREEMPT_RT has always worked perfectly.
>
> I'm surprised that PREEMPT_RT does work. I'm no longer sure that this
> affects your latency anymore. It probably does indirectly somehow. I
> still think it has to do with the bitspinlocks. But I'm not sure. Just
> let me know if you want to be taken off this thread and I'll remove you
> from my CC list. Until then, I'll keep you on.

Sorry, it's hard to follow this thread. Just to make sure we're all on the same page, what exactly is the symptom of this ext3 issue you are working on? Is it a performance regression, or a latency issue, or a lockup - ?

Whatever your problem is, I am not seeing it.

Lee
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Steven Rostedt wrote:

> [...] There's a couple of places that
> jbd_trylock_bh_state is used in checkpoint.c, but this is the one place
> that it definitely deadlocks the system. I believe that the
> code in checkpoint.c also has this problem.

I've examined the code in checkpoint.c, and I now believe that it doesn't have this problem. When it fails a lock, it just falls out of the while loops.

-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Andrew Morton wrote:

> > I guess one way to solve this is to add a wait queue here (before
> > schedule()), and have the one holding the lock wake up all on the
> > waitqueue when they release it.
>
> yup.  A patch against mainline would be appropriate, please.

Hi Andrew,

Here's the patch against 2.6.11. I tested it by adding (after making the patch) global spinlocks for jbd_lock_bh_state and jbd_lock_bh_journal_head. That way I have the same scenario as with Ingo's kernel, and I turned on NEED_JOURNAL_STATE_WAIT. I'm still running that kernel, so it looks like it works. Making those two locks global causes this deadlock on kjournald much more quickly, and I don't need to run on an SMP machine (since my SMP machines are currently being used for other tasks).

Some comments on my patch. I only implement the wait queue when bit_spin_trylock is an actual trylock (thus creating the problem). I didn't want to add this code if it wasn't needed (i.e. !(CONFIG_SMP || CONFIG_DEBUG_SPINLOCK)). So in bit_spin_trylock, I define NEED_JOURNAL_STATE_WAIT if bit_spin_trylock is really a lock. When NEED_JOURNAL_STATE_WAIT is set, the wait queue is set up in the journal code.

Now the question is, should we make those two locks global? It would help Ingo's cause (and mine as well). But I don't know the impact on a large SMP configuration. Andrew, since you have a 16x SMP machine, could you (if you have time) try out the effect of that? If you do have time, I'll send you a patch that goes on top of this one to change the two locks into global spinlocks.

Ingo, where do you want to go from here? I guess we need to wait on what Andrew decides.

-- Steve

diff -ur linux-2.6.11.orig/fs/jbd/commit.c linux-2.6.11/fs/jbd/commit.c
--- linux-2.6.11.orig/fs/jbd/commit.c	2005-03-02 02:38:25.000000000 -0500
+++ linux-2.6.11/fs/jbd/commit.c	2005-03-17 03:40:06.000000000 -0500
@@ -80,15 +80,33 @@

 /*
  * Try to acquire jbd_lock_bh_state() against the buffer, when j_list_lock is
- * held.  For ranking reasons we must trylock.  If we lose, schedule away and
- * return 0.  j_list_lock is dropped in this case.
+ * held.  For ranking reasons we must trylock.  If we lose put ourselves on a
+ * state wait queue and we'll be woken up when it is unlocked.  Then we return
+ * 0 to try this again.  j_list_lock is dropped in this case.
  */
 static int inverted_lock(journal_t *journal, struct buffer_head *bh)
 {
 	if (!jbd_trylock_bh_state(bh)) {
+		/*
+		 * jbd_trylock_bh_state always returns true unless CONFIG_SMP or
+		 * CONFIG_DEBUG_SPINLOCK, so the wait queue is not needed there.
+		 * The bit_spin_locks in jbd_lock_bh_state need to be removed anyway.
+		 */
+#ifdef NEED_JOURNAL_STATE_WAIT
+		DECLARE_WAITQUEUE(wait, current);
 		spin_unlock(&journal->j_list_lock);
-		schedule();
+		add_wait_queue_exclusive(&journal_state_wait, &wait);
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		/* Check to see if the lock has been unlocked in this short time */
+		if (jbd_is_locked_bh_state(bh))
+			schedule();
+		set_current_state(TASK_RUNNING);
+		remove_wait_queue(&journal_state_wait, &wait);
 		return 0;
+#else
+		/* This should never be hit */
+		BUG();
+#endif
 	}
 	return 1;
 }
diff -ur linux-2.6.11.orig/fs/jbd/journal.c linux-2.6.11/fs/jbd/journal.c
--- linux-2.6.11.orig/fs/jbd/journal.c	2005-03-02 02:37:49.000000000 -0500
+++ linux-2.6.11/fs/jbd/journal.c	2005-03-17 03:47:40.000000000 -0500
@@ -80,6 +80,11 @@
 EXPORT_SYMBOL(journal_try_to_free_buffers);
 EXPORT_SYMBOL(journal_force_commit);

+#ifdef NEED_JOURNAL_STATE_WAIT
+EXPORT_SYMBOL(journal_state_wait);
+DECLARE_WAIT_QUEUE_HEAD(journal_state_wait);
+#endif
+
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);

 /*
diff -ur linux-2.6.11.orig/include/linux/jbd.h linux-2.6.11/include/linux/jbd.h
--- linux-2.6.11.orig/include/linux/jbd.h	2005-03-02 02:38:19.000000000 -0500
+++ linux-2.6.11/include/linux/jbd.h	2005-03-17 03:48:18.000000000 -0500
@@ -324,6 +324,20 @@
 	return bh->b_private;
 }

+#ifdef NEED_JOURNAL_STATE_WAIT
+/*
+ * The journal_state_wait is a wait queue that tasks will wait on
+ * if they fail to get the jbd_lock_bh_state while holding the j_list_lock.
+ * Instead of spinning on schedule, the task now adds itself to this wait queue
+ * and will be woken up when the jbd_lock_bh_state is released.
+ *
+ * Since the bit_spin_locks are only locks under CONFIG_SMP and
+ * CONFIG_DEBUG_SPINLOCK, this wait queue is only needed in those
+ * cases.
+ */
+extern wait_queue_head_t journal_state_wait;
+#endif
+
 static inline void jbd_lock_bh_state(struct buffer_head *bh)
 {
 	bit_spin_lock(BH_State, &bh->b_state);
@@ -342,6 +356,13 @@
 static inline void
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Lee Revell wrote:

> I am a bit confused, big surprise. Does this thread still have anything
> to do with this trace from my "Latency regressions" bug report?

Don't worry, I've been in a state of confusion for a long time now ;-)

> http://www.alsa-project.org/~rlrevell/2912us
>
> The problem only is apparent with PREEMPT_DESKTOP and "data=ordered".
>
> PREEMPT_RT has always worked perfectly.

I'm surprised that PREEMPT_RT does work. I'm no longer sure that this affects your latency anymore. It probably does indirectly somehow. I still think it has to do with the bit spinlocks. But I'm not sure. Just let me know if you want to be taken off this thread and I'll remove you from my CC list. Until then, I'll keep you on.

-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Steven Rostedt <[EMAIL PROTECTED]> wrote:

> /*
>  * Try to acquire jbd_lock_bh_state() against the buffer, when j_list_lock is
>  * held.  For ranking reasons we must trylock.  If we lose, schedule away and
>  * return 0.  j_list_lock is dropped in this case.
>  */
> static int inverted_lock(journal_t *journal, struct buffer_head *bh)
> {
> 	if (!jbd_trylock_bh_state(bh)) {
> 		spin_unlock(&journal->j_list_lock);
> 		schedule();
> 		return 0;
> 	}
> 	return 1;
> }

That's very lame code, that. The old "I don't know what the heck to do now so I'll schedule" trick. Sorry.

> I guess one way to solve this is to add a wait queue here (before
> schedule()), and have the one holding the lock to wake up all on the
> waitqueue when they release it.

yup.  A patch against mainline would be appropriate, please.
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 2005-03-16 at 12:47 -0500, Steven Rostedt wrote:

> On Wed, 16 Mar 2005, Steven Rostedt wrote:
>
> > Hi Ingo,
> >
> > I just ran this with PREEMPT_RT and it works fine.
>
> Not quite, and I will assume that some of the other patches I sent have
> this same problem. The jbd_trylock_bh_state really scares me. It seems
> that in fs/jbd/commit.c in journal_commit_transaction we have the
> following code:

I am a bit confused, big surprise. Does this thread still have anything to do with this trace from my "Latency regressions" bug report?

http://www.alsa-project.org/~rlrevell/2912us

The problem only is apparent with PREEMPT_DESKTOP and "data=ordered".

PREEMPT_RT has always worked perfectly.

Lee
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Steven Rostedt wrote:

> Hi Ingo,
>
> I just ran this with PREEMPT_RT and it works fine.

Not quite, and I will assume that some of the other patches I sent have this same problem. The jbd_trylock_bh_state really scares me. It seems that in fs/jbd/commit.c, in journal_commit_transaction, we have the following code:

write_out_data:
	cond_resched();
	spin_lock(&journal->j_list_lock);

	while (commit_transaction->t_sync_datalist) {
		struct buffer_head *bh;

		jh = commit_transaction->t_sync_datalist;
		commit_transaction->t_sync_datalist = jh->b_tnext;
		bh = jh2bh(jh);
		if (buffer_locked(bh)) {
			BUFFER_TRACE(bh, "locked");
			if (!inverted_lock(journal, bh))
				goto write_out_data;

where inverted_lock simply is:

/*
 * Try to acquire jbd_lock_bh_state() against the buffer, when j_list_lock is
 * held.  For ranking reasons we must trylock.  If we lose, schedule away and
 * return 0.  j_list_lock is dropped in this case.
 */
static int inverted_lock(journal_t *journal, struct buffer_head *bh)
{
	if (!jbd_trylock_bh_state(bh)) {
		spin_unlock(&journal->j_list_lock);
		schedule();
		return 0;
	}
	return 1;
}

So, with kjournald running as FIFO, it may hit this (as it did in my last test) and not get the lock. All it does is release another lock (for ranking reasons), call schedule, and try again. With kjournald the highest-priority running process on the system (UP), it deadlocks, since whoever holds the lock will never get a chance to run.

There's a couple of places that jbd_trylock_bh_state is used in checkpoint.c, but this is the one place that it definitely deadlocks the system. I believe that the code in checkpoint.c also has this problem.

I guess one way to solve this is to add a wait queue here (before schedule()), and have the one holding the lock wake up all on the waitqueue when they release it.

-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Steven Rostedt wrote:

> Ingo, I still get the following bug because of the added BUFFER_FNS and
> DESKTOP_PREEMPT. I haven't tried this with RT yet. I'll see if this shows
> a deadlock there.

Hi Ingo,

I just ran this with PREEMPT_RT and it works fine. Now is this the best solution, or adding a lock to the buffer head? This works, but I don't have anything more than a 2x CPU to test this on. If either you or Andrew can try this on the 8x or 16x, that would be great.

Also, I only get the BUG with PREEMPT_DESKTOP. I really don't understand why this happens. I sent you a test patch earlier with just adding BUFFER_FNS(JournalHead,journalhead) in jbd.h, and under PREEMPT_DESKTOP that causes this bug as well. No other changes; just adding the BUFFER_FNS call causes this. I can't find any other reference to buffer_journal (besides reiserfs). What do you think, and are you getting the same bug?

-- Steve

> BUG: Unable to handle kernel NULL pointer dereference at virtual address
> printing eip:
> c0214888
> *pde =
> Oops: [#1]
> Modules linked in: ipv6 af_packet tsdev mousedev evdev floppy psmouse
> pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
> snd_timer snd soundcore snd_page_alloc shpchp pci_hotplug ehci_hcd
> intel_agp agpgart uhci_hcd usbcore e100 mii ide_cd cdrom unix
> CPU: 0
> EIP: 0060:[] Not tainted VLI
> EFLAGS: 00010286 (2.6.11-RT-V0.7.40-00)
> EIP is at vt_ioctl+0x18/0x1ab0
> eax: ebx: 5603 ecx: 5603 edx: cec18c80
> esi: c0214870 edi: cb49e000 ebp: cb479f18 esp: cb479e48
> ds: 007b es: 007b ss: 0068 preempt:
> Process XFree86 (pid: 4744, threadinfo=cb478000 task=cb403530)
> Stack: cb403680 cb478000 cb403530 c034594c cb403530 0246 cb479e7c
> c0117217 c0345954 0006 0001 cb479ebc cefa1c04 c13e1000
> ced6b9b8 cb479ed4 c01707f1 ced6b9b8 0007
> Call Trace:
> [] show_stack+0x7f/0xa0 (28)
> [] show_registers+0x165/0x1d0 (56)
> [] die+0xc8/0x150 (64)
> [] do_page_fault+0x356/0x6c4 (216)
> [] error_code+0x2b/0x30 (268)
> [] tty_ioctl+0x34b/0x490 (52)
> [] do_ioctl+0x4f/0x70 (32)
> [] vfs_ioctl+0x62/0x1d0 (40)
> [] sys_ioctl+0x61/0x90 (40)
> [] syscall_call+0x7/0xb (-8124)
> Code: ff ff 8d 05 88 5d 34 c0 e8 f6 60 0a 00 e9 3a ff ff ff 90 55 89 e5 57
> 56 53 81 ec c4 00 00 00 8b 7d 08 8b 5d 10 8b 87 7c 09 00 00 <8b> 30 89 34
> 24 8b 04 b5 e0 b7 3c c0 89 45 8c e8 a4 6a 00 00 85
>
> Here's the patch (on Ingo's -40 kernel).
>
> diff -ur linux-2.6.11-final-V0.7.40-00.orig/fs/jbd/journal.c linux-2.6.11-final-V0.7.40-00/fs/jbd/journal.c
> --- linux-2.6.11-final-V0.7.40-00.orig/fs/jbd/journal.c	2005-03-02 02:37:49.000000000 -0500
> +++ linux-2.6.11-final-V0.7.40-00/fs/jbd/journal.c	2005-03-16 07:47:50.000000000 -0500
> @@ -82,6 +82,9 @@
>
>  static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
>
> +spinlock_t jbd_state_lock = SPIN_LOCK_UNLOCKED;
> +spinlock_t jbd_journal_lock = SPIN_LOCK_UNLOCKED;
> +
>  /*
>   * Helper function used to manage commit timeouts
>   */
> diff -ur linux-2.6.11-final-V0.7.40-00.orig/include/linux/jbd.h linux-2.6.11-final-V0.7.40-00/include/linux/jbd.h
> --- linux-2.6.11-final-V0.7.40-00.orig/include/linux/jbd.h	2005-03-02 02:38:19.000000000 -0500
> +++ linux-2.6.11-final-V0.7.40-00/include/linux/jbd.h	2005-03-16 08:51:27.292105187 -0500
> @@ -313,6 +313,8 @@
>  BUFFER_FNS(RevokeValid, revokevalid)
>  TAS_BUFFER_FNS(RevokeValid, revokevalid)
>  BUFFER_FNS(Freed, freed)
> +BUFFER_FNS(State,state)
> +BUFFER_FNS(JournalHead,journal)
>
>  static inline struct buffer_head *jh2bh(struct journal_head *jh)
>  {
> @@ -324,34 +326,50 @@
>  	return bh->b_private;
>  }
>
> +extern spinlock_t jbd_state_lock;
> +extern spinlock_t jbd_journal_lock;
> +
>  static inline void jbd_lock_bh_state(struct buffer_head *bh)
>  {
> -	bit_spin_lock(BH_State, &bh->b_state);
> +	spin_lock(&jbd_state_lock);
> +	BUG_ON(buffer_state(bh));
> +	set_buffer_state(bh);
>  }
>
>  static inline int jbd_trylock_bh_state(struct buffer_head *bh)
>  {
> -	return bit_spin_trylock(BH_State, &bh->b_state);
> +	if (spin_trylock(&jbd_state_lock)) {
> +		BUG_ON(buffer_state(bh));
> +		set_buffer_state(bh);
> +		return 1;
> +	}
> +	return 0;
>  }
>
>  static inline int jbd_is_locked_bh_state(struct buffer_head *bh)
>  {
> -	return bit_spin_is_locked(BH_State, &bh->b_state);
> +	return buffer_state(bh); //spin_is_locked(&jbd_state_lock);
>  }
>
>  static inline void jbd_unlock_bh_state(struct buffer_head *bh)
>  {
> -	bit_spin_unlock(BH_State, &bh->b_state);
> +	BUG_ON(!buffer_state(bh));
> +	clear_buffer_state(bh);
> +	spin_unlock(&jbd_state_lock);
>  }
>
>  static inline void jbd_lock_bh_journal_head(struct buffer_head *bh)
>  {
> -	bit_spin_lock(BH_JournalHead, &bh->b_state);
> +
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Andrew Morton wrote:

> > But when I made jbd_lock_bh_state a global lock, I believe it
> > deadlocked on me.
>
> That's a worry.

OK, I'm wrong here. I just tried it again and it didn't deadlock (that must have been another lock I was dealing with). But it does test whether the buffer head is locked or not, and asserts if it is. I'm running the following patch with no problems so far. I still use the lock bits to determine if the bh state is locked. Do you and Ingo think that this would have too much contention?

Ingo, I still get the following bug because of the added BUFFER_FNS and DESKTOP_PREEMPT. I haven't tried this with RT yet. I'll see if this shows a deadlock there.

BUG: Unable to handle kernel NULL pointer dereference at virtual address
printing eip:
c0214888
*pde =
Oops: [#1]
Modules linked in: ipv6 af_packet tsdev mousedev evdev floppy psmouse
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc shpchp pci_hotplug ehci_hcd
intel_agp agpgart uhci_hcd usbcore e100 mii ide_cd cdrom unix
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010286 (2.6.11-RT-V0.7.40-00)
EIP is at vt_ioctl+0x18/0x1ab0
eax: ebx: 5603 ecx: 5603 edx: cec18c80
esi: c0214870 edi: cb49e000 ebp: cb479f18 esp: cb479e48
ds: 007b es: 007b ss: 0068 preempt:
Process XFree86 (pid: 4744, threadinfo=cb478000 task=cb403530)
Stack: cb403680 cb478000 cb403530 c034594c cb403530 0246 cb479e7c
c0117217 c0345954 0006 0001 cb479ebc cefa1c04 c13e1000
ced6b9b8 cb479ed4 c01707f1 ced6b9b8 0007
Call Trace:
[] show_stack+0x7f/0xa0 (28)
[] show_registers+0x165/0x1d0 (56)
[] die+0xc8/0x150 (64)
[] do_page_fault+0x356/0x6c4 (216)
[] error_code+0x2b/0x30 (268)
[] tty_ioctl+0x34b/0x490 (52)
[] do_ioctl+0x4f/0x70 (32)
[] vfs_ioctl+0x62/0x1d0 (40)
[] sys_ioctl+0x61/0x90 (40)
[] syscall_call+0x7/0xb (-8124)
Code: ff ff 8d 05 88 5d 34 c0 e8 f6 60 0a 00 e9 3a ff ff ff 90 55 89 e5 57
56 53 81 ec c4 00 00 00 8b 7d 08 8b 5d 10 8b 87 7c 09 00 00 <8b> 30
89 34 24 8b 04 b5 e0 b7 3c c0 89 45 8c e8 a4 6a 00 00 85

Here's the patch (on Ingo's -40 kernel).

diff -ur linux-2.6.11-final-V0.7.40-00.orig/fs/jbd/journal.c linux-2.6.11-final-V0.7.40-00/fs/jbd/journal.c
--- linux-2.6.11-final-V0.7.40-00.orig/fs/jbd/journal.c	2005-03-02 02:37:49.000000000 -0500
+++ linux-2.6.11-final-V0.7.40-00/fs/jbd/journal.c	2005-03-16 07:47:50.000000000 -0500
@@ -82,6 +82,9 @@

 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);

+spinlock_t jbd_state_lock = SPIN_LOCK_UNLOCKED;
+spinlock_t jbd_journal_lock = SPIN_LOCK_UNLOCKED;
+
 /*
  * Helper function used to manage commit timeouts
  */
diff -ur linux-2.6.11-final-V0.7.40-00.orig/include/linux/jbd.h linux-2.6.11-final-V0.7.40-00/include/linux/jbd.h
--- linux-2.6.11-final-V0.7.40-00.orig/include/linux/jbd.h	2005-03-02 02:38:19.000000000 -0500
+++ linux-2.6.11-final-V0.7.40-00/include/linux/jbd.h	2005-03-16 08:51:27.292105187 -0500
@@ -313,6 +313,8 @@
 BUFFER_FNS(RevokeValid, revokevalid)
 TAS_BUFFER_FNS(RevokeValid, revokevalid)
 BUFFER_FNS(Freed, freed)
+BUFFER_FNS(State,state)
+BUFFER_FNS(JournalHead,journal)

 static inline struct buffer_head *jh2bh(struct journal_head *jh)
 {
@@ -324,34 +326,50 @@
 	return bh->b_private;
 }

+extern spinlock_t jbd_state_lock;
+extern spinlock_t jbd_journal_lock;
+
 static inline void jbd_lock_bh_state(struct buffer_head *bh)
 {
-	bit_spin_lock(BH_State, &bh->b_state);
+	spin_lock(&jbd_state_lock);
+	BUG_ON(buffer_state(bh));
+	set_buffer_state(bh);
 }

 static inline int jbd_trylock_bh_state(struct buffer_head *bh)
 {
-	return bit_spin_trylock(BH_State, &bh->b_state);
+	if (spin_trylock(&jbd_state_lock)) {
+		BUG_ON(buffer_state(bh));
+		set_buffer_state(bh);
+		return 1;
+	}
+	return 0;
 }

 static inline int jbd_is_locked_bh_state(struct buffer_head *bh)
 {
-	return bit_spin_is_locked(BH_State, &bh->b_state);
+	return buffer_state(bh); //spin_is_locked(&jbd_state_lock);
 }

 static inline void jbd_unlock_bh_state(struct buffer_head *bh)
 {
-	bit_spin_unlock(BH_State, &bh->b_state);
+	BUG_ON(!buffer_state(bh));
+	clear_buffer_state(bh);
+	spin_unlock(&jbd_state_lock);
 }

 static inline void jbd_lock_bh_journal_head(struct buffer_head *bh)
 {
-	bit_spin_lock(BH_JournalHead, &bh->b_state);
+	spin_lock(&jbd_journal_lock);
+	BUG_ON(buffer_journal(bh));
+	set_buffer_journal(bh);
 }

 static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh)
 {
-	bit_spin_unlock(BH_JournalHead, &bh->b_state);
+	BUG_ON(!buffer_journal(bh));
+	clear_buffer_journal(bh);
+	spin_unlock(&jbd_journal_lock);
 }

 struct jbd_revoke_table_s;
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Steven Rostedt <[EMAIL PROTECTED]> wrote:

> On Wed, 16 Mar 2005, Andrew Morton wrote:
>
> > Those two are in the journal, actually.  You refer to jbd_lock_bh_state()
> > and jbd_lock_bh_journal_head().  I think they both need to be in the
> > buffer_head.  jbd_lock_bh_journal_head() can probably go away (just use
> > caller's jbd_lock_bh_state()).
> >
> > Or make them global, or put them in the journal.
>
> The jbd_lock_bh_journal_head can be one global lock without a problem.

As I say, we can probably eliminate it.

> But when I made jbd_lock_bh_state a global lock, I believe it deadlocked
> on me.

That's a worry.

> So this one has to go into the buffer head. What do you mean with
> "put them in the journal", do you mean the journal_s structure?

Yes.

> Is there a safe way to get to that structure from the buffer head?

No convenient way, iirc.  But there's usually a fairly straightforward way to get at the journal from within JBD code.

> The state lock is used quite a bit and it gets tricky trying to figure
> out how to use other structures wrt buffer_heads at all the locations
> that use jbd_lock_bh_state.

That one should go into the buffer_head, I guess.
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Andrew Morton wrote:
>
> Those two are in the journal, actually.  You refer to jbd_lock_bh_state()
> and jbd_lock_bh_journal_head().  I think they both need to be in the
> buffer_head.  jbd_lock_bh_journal_head() can probably go away (just use
> caller's jbd_lock_bh_state()).
>
> Or make them global, or put them in the journal.

The jbd_lock_bh_journal_head can be one global lock without a problem.
But when I made jbd_lock_bh_state a global lock, I believe it deadlocked
on me.  So this one has to go into the buffer head.  What do you mean by
"put them in the journal", do you mean the journal_s structure?  Is there
a safe way to get to that structure from the buffer head?  The state lock
is used quite a bit and it gets tricky trying to figure out how to use
other structures wrt buffer_heads at all the locations that use
jbd_lock_bh_state.

-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > How much would the +4/+8 bytes size increase in buffer_head [on SMP]
> > be frowned upon?
>
> It wouldn't be the end of the world.  I'm not clear on what bits of
> the rt-super-low-latency stuff is intended for mainline though?

in the long run, most of it.  There are no conceptual barriers so far;
the -RT tree consists of lots of small details plus the PREEMPT_RT
framework itself.  We are trying to solve (and merge) the small details
first (in upstream), so that PREEMPT_RT itself becomes uncontroversial.

(and it's not really the low latency that matters mainly - more valuable
is the fact that under PREEMPT_RT high latencies are statistically much
more unlikely [you need to do some really intentional and easy-to-see
things to introduce high latencies], while in the current upstream
kernel, high latencies are often side-effects of pretty normal kernel
coding activities, so low latency is always a catch-up game that can
never be truly won for sure.  So yes, while a 10 usec worst-case latency
under arbitrary Linux workloads [on the right hardware] is indeed sexy,
more important is that things are much more deterministic and hence much
more trustable from a hard-RT POV.)

	Ingo
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> > I forget how much of the 1000% came from that, but it was quite a lot.
> >
> > Removing the BKL was the first step.  That took the context switch
> > rate under high load from ~10,000/sec up to ~300,000/sec.  Because the
> > first thing a CPU hit on entry to the fs was then a semaphore.
> > Performance rather took a dive.
> >
> > Of course the locks also became much finer-grained, so the contention
> > opportunities lessened.  But j_list_lock and j_state_lock have fs-wide
> > scope, so I'd expect the context switch rate to go up quite a lot
> > again.
> >
> > The hold times are short, and a context switch hurts rather more than
> > a quick spin.
>
> which particular workload was this - dbench?  (I can try PREEMPT_RT on
> an 8-way, such effects will show up tenfold.)

Oh gee, that was back in the days when Martin was being useful.  SDET, I
think.
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> > > > There's a little lock ranking diagram in jbd.h which tells us that
> > > > these locks nest inside j_list_lock and j_state_lock.  So I guess
> > > > you'll need to turn those into semaphores.
> > >
> > > indeed.  I did this (see the three followup patches, against
> > > BK-curr), and it builds/boots/works just fine on an ext3 box.  Do we
> > > want to try this in -mm?
> >
> > ooh, I'd rather not.  I spent an intense three days removing all the
> > sleeping locks from ext3 (and three months debugging the result).
> > Ended up gaining 1000% on 16-way.
> >
> > Putting them back in will really hurt the SMP performance.
>
> seems like turning the bitlocks into spinlocks is the best option then.
> We'd need one lock in buffer_head (j_state_lock, renamed to something
> more sensible like b_private_lock), and one lock in journal_head
> (j_list_lock) i guess.

Those two are in the journal, actually.  You refer to jbd_lock_bh_state()
and jbd_lock_bh_journal_head().  I think they both need to be in the
buffer_head.  jbd_lock_bh_journal_head() can probably go away (just use
caller's jbd_lock_bh_state()).

Or make them global, or put them in the journal.

> How much would the +4/+8 bytes size increase in buffer_head [on SMP] be
> frowned upon?

It wouldn't be the end of the world.  I'm not clear on what bits of the
rt-super-low-latency stuff is intended for mainline though?
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 2005-03-16 at 02:26 -0800, Andrew Morton wrote:
>
> The hold times are short, and a context switch hurts rather more than a
> quick spin.

so we need a spinaphore ;)
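The half-spin, half-sleep lock being joked about here can be sketched in
user space as a bounded trylock loop: spin while the hold time is likely
short, then fall back to a blocking acquire.  Everything below (the name
"spinaphore", SPIN_LIMIT, the use of POSIX threads) is illustrative, not
anything that existed in the kernel at the time:

```c
/* Toy user-space "spinaphore": spin for a bounded number of trylock
 * attempts (cheap when hold times are short, as Andrew describes),
 * then fall back to sleeping on the mutex instead of burning a full
 * scheduling quantum.  Illustrative sketch only. */
#include <pthread.h>

#define SPIN_LIMIT 1000		/* made-up tuning knob */

static void spinaphore_lock(pthread_mutex_t *m)
{
	for (int i = 0; i < SPIN_LIMIT; i++)
		if (pthread_mutex_trylock(m) == 0)
			return;		/* got the lock while spinning */
	pthread_mutex_lock(m);		/* contention persisted: sleep */
}

static void spinaphore_unlock(pthread_mutex_t *m)
{
	pthread_mutex_unlock(m);
}
```

Real adaptive mutexes tune the spin count dynamically (e.g. to the
observed hold time); a fixed limit is just the simplest way to show the
shape of the idea.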
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> I forget how much of the 1000% came from that, but it was quite a lot.
>
> Removing the BKL was the first step.  That took the context switch
> rate under high load from ~10,000/sec up to ~300,000/sec.  Because the
> first thing a CPU hit on entry to the fs was then a semaphore.
> Performance rather took a dive.
>
> Of course the locks also became much finer-grained, so the contention
> opportunities lessened.  But j_list_lock and j_state_lock have fs-wide
> scope, so I'd expect the context switch rate to go up quite a lot
> again.
>
> The hold times are short, and a context switch hurts rather more than a
> quick spin.

which particular workload was this - dbench?  (I can try PREEMPT_RT on an
8-way, such effects will show up tenfold.)

	Ingo
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > ooh, I'd rather not.  I spent an intense three days removing all the
> > sleeping locks from ext3 (and three months debugging the result).
> > Ended up gaining 1000% on 16-way.
> >
> > Putting them back in will really hurt the SMP performance.
>
> ah. Yeah. Sniff.
>
> if we gain 1000% on a 16-way then there's something really wrong about
> semaphores (or scheduling) though.  A semaphore is almost a spinlock, in
> the uncontended case - and even under contention we really (should) just
> spend the cycles that we'd spend spinning.  There will be some
> intermediate contention level where semaphores hurt, but 1000% sounds
> truly excessive.

I forget how much of the 1000% came from that, but it was quite a lot.

Removing the BKL was the first step.  That took the context switch rate
under high load from ~10,000/sec up to ~300,000/sec.  Because the first
thing a CPU hit on entry to the fs was then a semaphore.  Performance
rather took a dive.

Of course the locks also became much finer-grained, so the contention
opportunities lessened.  But j_list_lock and j_state_lock have fs-wide
scope, so I'd expect the context switch rate to go up quite a lot again.

The hold times are short, and a context switch hurts rather more than a
quick spin.
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Steven Rostedt <[EMAIL PROTECTED]> wrote:

> > > ooh, I'd rather not.  I spent an intense three days removing all the
> > > sleeping locks from ext3 (and three months debugging the result).
> > > Ended up gaining 1000% on 16-way.
> > >
> > > Putting them back in will really hurt the SMP performance.
> >
> > ah. Yeah. Sniff.
> >
> > if we gain 1000% on a 16-way then there's something really wrong about
> > semaphores (or scheduling) though.  A semaphore is almost a spinlock,
> > in the uncontended case - and even under contention we really (should)
> > just spend the cycles that we'd spend spinning.  There will be some
> > intermediate contention level where semaphores hurt, but 1000% sounds
> > truly excessive.
>
> Could it possibly be that in the process of removing all the sleeping
> locks from ext3, Andrew also removed a flaw in ext3 itself that is
> responsible for the 1000% improvement?

i think the chances of that are really remote.  I think it must have been
a workload ending up scheduling itself to death, while spinlocks force
atomicity of execution and affinity.

we should be able to see the same scenario with PREEMPT_RT on a
16-way :-)

	Ingo
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Ingo Molnar wrote:
>
> * Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> > ooh, I'd rather not.  I spent an intense three days removing all the
> > sleeping locks from ext3 (and three months debugging the result).
> > Ended up gaining 1000% on 16-way.
> >
> > Putting them back in will really hurt the SMP performance.
>
> ah. Yeah. Sniff.
>
> if we gain 1000% on a 16-way then there's something really wrong about
> semaphores (or scheduling) though.  A semaphore is almost a spinlock, in
> the uncontended case - and even under contention we really (should) just
> spend the cycles that we'd spend spinning.  There will be some
> intermediate contention level where semaphores hurt, but 1000% sounds
> truly excessive.

Could it possibly be that in the process of removing all the sleeping
locks from ext3, Andrew also removed a flaw in ext3 itself that is
responsible for the 1000% improvement?

-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > > There's a little lock ranking diagram in jbd.h which tells us that
> > > these locks nest inside j_list_lock and j_state_lock.  So I guess
> > > you'll need to turn those into semaphores.
> >
> > indeed.  I did this (see the three followup patches, against BK-curr),
> > and it builds/boots/works just fine on an ext3 box.  Do we want to try
> > this in -mm?
>
> ooh, I'd rather not.  I spent an intense three days removing all the
> sleeping locks from ext3 (and three months debugging the result).
> Ended up gaining 1000% on 16-way.
>
> Putting them back in will really hurt the SMP performance.

seems like turning the bitlocks into spinlocks is the best option then.
We'd need one lock in buffer_head (j_state_lock, renamed to something
more sensible like b_private_lock), and one lock in journal_head
(j_list_lock) i guess.  How much would the +4/+8 bytes size increase in
buffer_head [on SMP] be frowned upon?

	Ingo
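For context, the bitlocks being discussed work roughly like this: one bit
of an existing flags word (bh->b_state in JBD's case) doubles as a
test-and-set spinlock, which is why they add zero bytes to the struct but
are opaque to -RT's lock machinery.  A rough user-space approximation
with C11 atomics; the bit number and names here are made up for the
example and are not the kernel's:

```c
/* User-space sketch of a bit spinlock: the lock is a single bit of a
 * larger flags word, taken with an atomic fetch-or and released with an
 * atomic fetch-and.  Other bits in the word remain usable as ordinary
 * flags.  Illustrative only. */
#include <stdatomic.h>

#define BH_STATE_LOCK 5		/* example bit, not a real BH_* value */

static void bit_spin_lock(atomic_ulong *word, int bit)
{
	unsigned long mask = 1UL << bit;

	/* spin until we are the thread that flipped the bit 0 -> 1 */
	while (atomic_fetch_or_explicit(word, mask,
					memory_order_acquire) & mask)
		;
}

static void bit_spin_unlock(atomic_ulong *word, int bit)
{
	atomic_fetch_and_explicit(word, ~(1UL << bit),
				  memory_order_release);
}
```

The space saving is exactly the trade-off in the thread: a real spinlock
costs the +4/+8 bytes Ingo mentions, but is a first-class object that
lock debugging and priority inheritance can see.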
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> > > There's a little lock ranking diagram in jbd.h which tells us that
> > > these locks nest inside j_list_lock and j_state_lock.  So I guess
> > > you'll need to turn those into semaphores.
> >
> > indeed.  I did this (see the three followup patches, against BK-curr),
> > and it builds/boots/works just fine on an ext3 box.  Do we want to try
> > this in -mm?
>
> ooh, I'd rather not.  I spent an intense three days removing all the
> sleeping locks from ext3 (and three months debugging the result).
> Ended up gaining 1000% on 16-way.
>
> Putting them back in will really hurt the SMP performance.

ah. Yeah. Sniff.

if we gain 1000% on a 16-way then there's something really wrong about
semaphores (or scheduling) though.  A semaphore is almost a spinlock, in
the uncontended case - and even under contention we really (should) just
spend the cycles that we'd spend spinning.  There will be some
intermediate contention level where semaphores hurt, but 1000% sounds
truly excessive.

	Ingo
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > There's a little lock ranking diagram in jbd.h which tells us that
> > these locks nest inside j_list_lock and j_state_lock.  So I guess
> > you'll need to turn those into semaphores.
>
> indeed.  I did this (see the three followup patches, against BK-curr),
> and it builds/boots/works just fine on an ext3 box.  Do we want to try
> this in -mm?

ooh, I'd rather not.  I spent an intense three days removing all the
sleeping locks from ext3 (and three months debugging the result).  Ended
up gaining 1000% on 16-way.

Putting them back in will really hurt the SMP performance.
[patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > > Damn! The answer was right there in front of my eyes!  Here's the
> > > cleanest solution.  I forgot about wait_on_bit_lock.  I've converted
> > > all the locks to use this instead. [...]
> >
> > ah, indeed, this looks really nifty.  Andrew?
>
> There's a little lock ranking diagram in jbd.h which tells us that
> these locks nest inside j_list_lock and j_state_lock.  So I guess
> you'll need to turn those into semaphores.

indeed.  I did this (see the three followup patches, against BK-curr),
and it builds/boots/works just fine on an ext3 box.  Do we want to try
this in -mm?

one worry would be that while spinlocks are a NOP on UP, semaphores are
not.  OTOH, this could relax some of the preemptibility constraints
within ext3 and could make it more hackable.  These patches enabled the
removal of some of the lock-break code, for example, and could likely
solve some of the remaining ext3 latencies.

	Ingo
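The lock-ranking constraint Andrew points at is what forces patterns like
JBD's inverted_lock(): code that already holds one lock sometimes needs a
second lock whose documented rank is the other way around.  The standard
deadlock-free recipe for that is trylock plus full backoff.  A generic
pthreads sketch of the shape (names invented for the example; this is not
JBD's actual code):

```c
/* Acquire two locks against their documented ranking without risking
 * deadlock: take the "wrong rank" lock with trylock only, and on
 * failure drop everything and start over, so we never sit holding one
 * lock while blocking on the other.  Illustrative sketch. */
#include <pthread.h>
#include <sched.h>

static void lock_out_of_order(pthread_mutex_t *held_first,
			      pthread_mutex_t *wrong_rank)
{
	for (;;) {
		pthread_mutex_lock(held_first);
		if (pthread_mutex_trylock(wrong_rank) == 0)
			return;			/* both held, no inversion */
		pthread_mutex_unlock(held_first);	/* full backoff */
		sched_yield();	/* give the other lock holder a chance */
	}
}
```

The drawback, as the surrounding thread discusses, is that the retry loop
can livelock or starve the other CPU if the backoff is too cheap, which
is exactly why the schedule()-based variant drew criticism.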
[patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton [EMAIL PROTECTED] wrote: Damn! The answer was right there in front of my eyes! Here's the cleanest solution. I forgot about wait_on_bit_lock. I've converted all the locks to use this instead. [...] ah, indeed, this looks really nifty. Andrew? There's a little lock ranking diagram in jbd.h which tells us that these locks nest inside j_list_lock and j_state_lock. So I guess you'll need to turn those into semaphores. indeed. I did this (see the three followup patches, against BK-curr), and it builds/boots/works just fine on an ext3 box. Do we want to try this in -mm? one worry would be that while spinlocks are NOP on UP, semaphores are not. OTOH, this could relax some of the preemptability constraints within ext3 and could make it more hackable. These patches enabled the removal of some of the lock-break code for example and could likely solve some of the remaining ext3 latencies. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Ingo Molnar [EMAIL PROTECTED] wrote: There's a little lock ranking diagram in jbd.h which tells us that these locks nest inside j_list_lock and j_state_lock. So I guess you'll need to turn those into semaphores. indeed. I did this (see the three followup patches, against BK-curr), and it builds/boots/works just fine on an ext3 box. Do we want to try this in -mm? ooh, I'd rather not. I spent an intense three days removing all the sleeping locks from ext3 (and three months debugging the result). Ended up gaining 1000% on 16-way. Putting them back in will really hurt the SMP performance. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton [EMAIL PROTECTED] wrote: Ingo Molnar [EMAIL PROTECTED] wrote: There's a little lock ranking diagram in jbd.h which tells us that these locks nest inside j_list_lock and j_state_lock. So I guess you'll need to turn those into semaphores. indeed. I did this (see the three followup patches, against BK-curr), and it builds/boots/works just fine on an ext3 box. Do we want to try this in -mm? ooh, I'd rather not. I spent an intense three days removing all the sleeping locks from ext3 (and three months debugging the result). Ended up gaining 1000% on 16-way. Putting them back in will really hurt the SMP performance. ah. Yeah. Sniff. if we gain 1000% on a 16-way then there's something really wrong about semaphores (or scheduling) though. A semaphore is almost a spinlock, in the uncontended case - and even under contention we really (should) just spend the cycles that we'd spend spinning. There will be some intermediate contention level where semaphores hurt, but 1000% sounds truly excessive. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton [EMAIL PROTECTED] wrote: There's a little lock ranking diagram in jbd.h which tells us that these locks nest inside j_list_lock and j_state_lock. So I guess you'll need to turn those into semaphores. indeed. I did this (see the three followup patches, against BK-curr), and it builds/boots/works just fine on an ext3 box. Do we want to try this in -mm? ooh, I'd rather not. I spent an intense three days removing all the sleeping locks from ext3 (and three months debugging the result). Ended up gaining 1000% on 16-way. Putting them back in will really hurt the SMP performance. seems like turning the bitlocks into spinlocks is the best option then. We'd need one lock in buffer_head (j_state_lock, renamed to something more sensible like b_private_lock), and one lock in journal_head (j_list_lock) i guess. How much would the +4/+8 bytes size increase in buffer_head [on SMP] be frowned upon? Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Ingo Molnar wrote: * Andrew Morton [EMAIL PROTECTED] wrote: ooh, I'd rather not. I spent an intense three days removing all the sleeping locks from ext3 (and three months debugging the result). Ended up gaining 1000% on 16-way. Putting them back in will really hurt the SMP performance. ah. Yeah. Sniff. if we gain 1000% on a 16-way then there's something really wrong about semaphores (or scheduling) though. A semaphore is almost a spinlock, in the uncontended case - and even under contention we really (should) just spend the cycles that we'd spend spinning. There will be some intermediate contention level where semaphores hurt, but 1000% sounds truly excessive. Could it possibly be that in the process of removing all the sleeping locks from ext3, that Andrew also removed a flaw in ext3 itself that is responsible for the 1000% improvement? -- Steve - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Steven Rostedt [EMAIL PROTECTED] wrote: ooh, I'd rather not. I spent an intense three days removing all the sleeping locks from ext3 (and three months debugging the result). Ended up gaining 1000% on 16-way. Putting them back in will really hurt the SMP performance. ah. Yeah. Sniff. if we gain 1000% on a 16-way then there's something really wrong about semaphores (or scheduling) though. A semaphore is almost a spinlock, in the uncontended case - and even under contention we really (should) just spend the cycles that we'd spend spinning. There will be some intermediate contention level where semaphores hurt, but 1000% sounds truly excessive. Could it possibly be that in the process of removing all the sleeping locks from ext3, that Andrew also removed a flaw in ext3 itself that is responsible for the 1000% improvement? i think the chances for that are really remote. I think it must have been a workload ending up scheduling itself to death, while spinlocks force atomicity of execution and affinity. we should be able to see the same scenario with PREEMPT_RT on a 16-way :-) Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Ingo Molnar [EMAIL PROTECTED] wrote: ooh, I'd rather not. I spent an intense three days removing all the sleeping locks from ext3 (and three months debugging the result). Ended up gaining 1000% on 16-way. Putting them back in will really hurt the SMP performance. ah. Yeah. Sniff. if we gain 1000% on a 16-way then there's something really wrong about semaphores (or scheduling) though. A semaphore is almost a spinlock, in the uncontended case - and even under contention we really (should) just spend the cycles that we'd spend spinning. There will be some intermediate contention level where semaphores hurt, but 1000% sounds truly excessive. I forget how much of the 1000% came from that, but it was quite a lot. Removing the BKL was the first step. That took the context switch rate under high load from ~10,000/sec up to ~300,000/sec. Because the first thing a CPU hit on entry to the fs was then a semaphore. Performance rather took a dive. Of course the locks also became much finer-grained, so the contention opportunities lessened. But j_list_lock and j_state_lock have fs-wide scope, so I'd expect the context switch rate to go up quite a lot again. The hold times are short, and a context switch hurts rather ore than a quick spin. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton [EMAIL PROTECTED] wrote: I forget how much of the 1000% came from that, but it was quite a lot. Removing the BKL was the first step. That took the context switch rate under high load from ~10,000/sec up to ~300,000/sec. Because the first thing a CPU hit on entry to the fs was then a semaphore. Performance rather took a dive. Of course the locks also became much finer-grained, so the contention opportunities lessened. But j_list_lock and j_state_lock have fs-wide scope, so I'd expect the context switch rate to go up quite a lot again. The hold times are short, and a context switch hurts rather ore than a quick spin. which particular workload was this - dbench? (I can try PREEMPT_RT on an 8-way, such effects will show up tenfold.) Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 2005-03-16 at 02:26 -0800, Andrew Morton wrote: The hold times are short, and a context switch hurts rather ore than a quick spin. so we need a spinaphore ;) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Ingo Molnar [EMAIL PROTECTED] wrote: * Andrew Morton [EMAIL PROTECTED] wrote: There's a little lock ranking diagram in jbd.h which tells us that these locks nest inside j_list_lock and j_state_lock. So I guess you'll need to turn those into semaphores. indeed. I did this (see the three followup patches, against BK-curr), and it builds/boots/works just fine on an ext3 box. Do we want to try this in -mm? ooh, I'd rather not. I spent an intense three days removing all the sleeping locks from ext3 (and three months debugging the result). Ended up gaining 1000% on 16-way. Putting them back in will really hurt the SMP performance. seems like turning the bitlocks into spinlocks is the best option then. We'd need one lock in buffer_head (j_state_lock, renamed to something more sensible like b_private_lock), and one lock in journal_head (j_list_lock) i guess. Those two are in the journal, actually. You refer to jbd_lock_bh_state() and jbd_lock_bh_journal_head(). I think they both need to be in the buffer_head. jbd_lock_bh_journal_head() can probably go away (just use caller's jbd_lock_bh_state()). Or make them global, or put them in the journal. How much would the +4/+8 bytes size increase in buffer_head [on SMP] be frowned upon? It wouldn't be the end of the world. I'm not clear on what bits of the rt-super-low-latency stuff is intended for mainline though? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Ingo Molnar [EMAIL PROTECTED] wrote: * Andrew Morton [EMAIL PROTECTED] wrote: I forget how much of the 1000% came from that, but it was quite a lot. Removing the BKL was the first step. That took the context switch rate under high load from ~10,000/sec up to ~300,000/sec. Because the first thing a CPU hit on entry to the fs was then a semaphore. Performance rather took a dive. Of course the locks also became much finer-grained, so the contention opportunities lessened. But j_list_lock and j_state_lock have fs-wide scope, so I'd expect the context switch rate to go up quite a lot again. The hold times are short, and a context switch hurts rather ore than a quick spin. which particular workload was this - dbench? (I can try PREEMPT_RT on an 8-way, such effects will show up tenfold.) Oh gee, that was back in the days when Martin was being useful. SDET, I think. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
* Andrew Morton [EMAIL PROTECTED] wrote: How much would the +4/+8 bytes size increase in buffer_head [on SMP] be frowned upon? It wouldn't be the end of the world. I'm not clear on what bits of the rt-super-low-latency stuff is intended for mainline though? in the long run, most of it. There are no conceptual barriers so far, the -RT tree consists of lots of small details and the PREEMPT_RT framework itself. We are trying to solve (and merge) the small details first (in upstream), so that PREEMPT_RT itself becomes uncontroversial. (and it's not really the low latency that matters mainly - more valuable is the fact that under PREEMPT_RT high latencies are statistically much more unlikely [you need to do some really intentional and easy to see things to introduce high latencies], while in the current upstream kernel, high latencies are often side-effects of pretty normal kernel coding activities, so low latencies are always a catch-up game that can never be truly won for sure. So yes, while a 10 usec worst-case latency under arbitrary Linux workloads [on the right hardware] is indeed sexy, more important is that things are much more deterministic and hence much more trustable from a hard-RT POV.) Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Andrew Morton wrote: Those two are in the journal, actually. You refer to jbd_lock_bh_state() and jbd_lock_bh_journal_head(). I think they both need to be in the buffer_head. jbd_lock_bh_journal_head() can probably go away (just use caller's jbd_lock_bh_state()). Or make them global, or put them in the journal. The jbd_lock_bh_journal_head can be one global lock without a problem. But when I made jbd_lock_bh_state a global lock, I believe it deadlocked on me. So this one has to go into the buffer head. What do you mean with put them in the journal, do you mean the journal_s structure? Is there a safe way to get to that structure from the buffer head? The state lock is used quite a bit and it gets tricky trying to figure out how to use other structures wrt buffer_heads at all the locations that use jbd_lock_bh_state. -- Steve - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Steven Rostedt <[EMAIL PROTECTED]> wrote:

> On Wed, 16 Mar 2005, Andrew Morton wrote:
> > > Those two are in the journal, actually.
> >
> > You refer to jbd_lock_bh_state() and jbd_lock_bh_journal_head().
> > I think they both need to be in the buffer_head.
> > jbd_lock_bh_journal_head() can probably go away (just use the
> > caller's jbd_lock_bh_state()).  Or make them global, or put them
> > in the journal.
>
> The jbd_lock_bh_journal_head can be one global lock without a
> problem.

As I say, we can probably eliminate it.

> But when I made jbd_lock_bh_state a global lock, I believe it
> deadlocked on me.

That's a worry.

> So this one has to go into the buffer head. What do you mean with
> put them in the journal, do you mean the journal_s structure?

Yes.

> Is there a safe way to get to that structure from the buffer head?

No convenient way, iirc.  But there's usually a fairly straightforward
way to get at the journal from within JBD code.

> The state lock is used quite a bit and it gets tricky trying to
> figure out how to use other structures wrt buffer_heads at all the
> locations that use jbd_lock_bh_state.

That one should go into the buffer_head, I guess.
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Andrew Morton wrote:

> > But when I made jbd_lock_bh_state a global lock, I believe it
> > deadlocked on me.
>
> That's a worry.

OK, I'm wrong here. I just tried it again and it didn't deadlock (that
must have been another lock I was dealing with). But it does test
whether the buffer head is locked or not, and asserts if it is.

I'm running the following patch with no problems so far. I still use
the lock bits to determine if the bh state is locked. Do you and Ingo
think that this would have too much contention?

Ingo, I still get the following bug because of the added BUFFER_FNS
and DESKTOP_PREEMPT. I haven't tried this with RT yet. I'll see if
this shows a deadlock there.

BUG: Unable to handle kernel NULL pointer dereference at virtual address
 printing eip: c0214888
*pde =
Oops: [#1]
Modules linked in: ipv6 af_packet tsdev mousedev evdev floppy psmouse
 pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
 snd_timer snd soundcore snd_page_alloc shpchp pci_hotplug ehci_hcd
 intel_agp agpgart uhci_hcd usbcore e100 mii ide_cd cdrom unix
CPU:    0
EIP:    0060:[<c0214888>]    Not tainted VLI
EFLAGS: 00010286   (2.6.11-RT-V0.7.40-00)
EIP is at vt_ioctl+0x18/0x1ab0
eax:           ebx: 5603   ecx: 5603   edx: cec18c80
esi: c0214870  edi: cb49e000  ebp: cb479f18  esp: cb479e48
ds: 007b   es: 007b   ss: 0068   preempt:
Process XFree86 (pid: 4744, threadinfo=cb478000 task=cb403530)
Stack: cb403680 cb478000 cb403530 c034594c cb403530 0246 cb479e7c
       c0117217 c0345954 0006 0001 cb479ebc cefa1c04 c13e1000
       ced6b9b8 cb479ed4 c01707f1 ced6b9b8 0007
Call Trace:
 [<c0103cdf>] show_stack+0x7f/0xa0 (28)
 [<c0103e95>] show_registers+0x165/0x1d0 (56)
 [<c0104088>] die+0xc8/0x150 (64)
 [<c0115376>] do_page_fault+0x356/0x6c4 (216)
 [<c0103973>] error_code+0x2b/0x30 (268)
 [<c020fd6b>] tty_ioctl+0x34b/0x490 (52)
 [<c016837f>] do_ioctl+0x4f/0x70 (32)
 [<c0168582>] vfs_ioctl+0x62/0x1d0 (40)
 [<c0168751>] sys_ioctl+0x61/0x90 (40)
 [<c0102ec3>] syscall_call+0x7/0xb (-8124)
Code: ff ff 8d 05 88 5d 34 c0 e8 f6 60 0a 00 e9 3a ff ff ff 90 55 89
 e5 57 56 53 81 ec c4 00 00 00 8b 7d 08 8b 5d 10 8b 87 7c 09 00 00 8b
 30 89 34 24 8b 04 b5 e0 b7 3c c0 89 45 8c e8 a4 6a 00 00 85

Here's the patch (on Ingo's -40 kernel).

--- linux-2.6.11-final-V0.7.40-00.orig/fs/jbd/journal.c	2005-03-02 02:37:49 -0500
+++ linux-2.6.11-final-V0.7.40-00/fs/jbd/journal.c	2005-03-16 07:47:50 -0500
@@ -82,6 +82,9 @@
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 
+spinlock_t jbd_state_lock = SPIN_LOCK_UNLOCKED;
+spinlock_t jbd_journal_lock = SPIN_LOCK_UNLOCKED;
+
 /*
  * Helper function used to manage commit timeouts
  */
--- linux-2.6.11-final-V0.7.40-00.orig/include/linux/jbd.h	2005-03-02 02:38:19 -0500
+++ linux-2.6.11-final-V0.7.40-00/include/linux/jbd.h	2005-03-16 08:51:27 -0500
@@ -313,6 +313,8 @@
 BUFFER_FNS(RevokeValid, revokevalid)
 TAS_BUFFER_FNS(RevokeValid, revokevalid)
 BUFFER_FNS(Freed, freed)
+BUFFER_FNS(State, state)
+BUFFER_FNS(JournalHead, journal)
 
 static inline struct buffer_head *jh2bh(struct journal_head *jh)
 {
@@ -324,34 +326,50 @@
 	return bh->b_private;
 }
 
+extern spinlock_t jbd_state_lock;
+extern spinlock_t jbd_journal_lock;
+
 static inline void jbd_lock_bh_state(struct buffer_head *bh)
 {
-	bit_spin_lock(BH_State, &bh->b_state);
+	spin_lock(&jbd_state_lock);
+	BUG_ON(buffer_state(bh));
+	set_buffer_state(bh);
 }
 
 static inline int jbd_trylock_bh_state(struct buffer_head *bh)
 {
-	return bit_spin_trylock(BH_State, &bh->b_state);
+	if (spin_trylock(&jbd_state_lock)) {
+		BUG_ON(buffer_state(bh));
+		set_buffer_state(bh);
+		return 1;
+	}
+	return 0;
 }
 
 static inline int jbd_is_locked_bh_state(struct buffer_head *bh)
 {
-	return bit_spin_is_locked(BH_State, &bh->b_state);
+	return buffer_state(bh); //spin_is_locked(&jbd_state_lock);
 }
 
 static inline void jbd_unlock_bh_state(struct buffer_head *bh)
 {
-	bit_spin_unlock(BH_State, &bh->b_state);
+	BUG_ON(!buffer_state(bh));
+	clear_buffer_state(bh);
+	spin_unlock(&jbd_state_lock);
 }
 
 static inline void jbd_lock_bh_journal_head(struct buffer_head *bh)
 {
-	bit_spin_lock(BH_JournalHead, &bh->b_state);
+	spin_lock(&jbd_journal_lock);
+	BUG_ON(buffer_journal(bh));
+	set_buffer_journal(bh);
 }
 
 static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh)
 {
-	bit_spin_unlock(BH_JournalHead, &bh->b_state);
+	BUG_ON(!buffer_journal(bh));
+	clear_buffer_journal(bh);
+
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Steven Rostedt wrote:

> Ingo, I still get the following bug because of the added BUFFER_FNS
> and DESKTOP_PREEMPT. I haven't tried this with RT yet. I'll see if
> this shows a deadlock there.

Hi Ingo,

I just ran this with PREEMPT_RT and it works fine. Now is this the
best solution, or is adding a lock to the buffer head better? This
works, but I don't have anything more than a 2x CPU to test it on. If
either you or Andrew can try this on the 8x or 16x, that would be
great.

Also, I only get the BUG with PREEMPT_DESKTOP. I really don't
understand why this happens. I sent you a test patch earlier with just
adding BUFFER_FNS(JournalHead,journalhead) in jbd.h, and under
PREEMPT_DESKTOP that causes this bug as well. No other changes; just
adding the BUFFER_FNS call causes this. I can't find any other
reference to buffer_journal (besides reiserfs). What do you think, and
are you getting the same bug?

-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Steven Rostedt wrote:

> Hi Ingo, I just ran this with PREEMPT_RT and it works fine.

Not quite, and I will assume that some of the other patches I sent
have this same problem. The jbd_trylock_bh_state really scares me. It
seems that in fs/jbd/commit.c, in journal_commit_transaction, we have
the following code:

write_out_data:
	cond_resched();
	spin_lock(&journal->j_list_lock);

	while (commit_transaction->t_sync_datalist) {
		struct buffer_head *bh;

		jh = commit_transaction->t_sync_datalist;
		commit_transaction->t_sync_datalist = jh->b_tnext;
		bh = jh2bh(jh);
		if (buffer_locked(bh)) {
			BUFFER_TRACE(bh, "locked");
			if (!inverted_lock(journal, bh))
				goto write_out_data;

where inverted_lock simply is:

/*
 * Try to acquire jbd_lock_bh_state() against the buffer, when
 * j_list_lock is held.  For ranking reasons we must trylock.  If we
 * lose, schedule away and return 0.  j_list_lock is dropped in this
 * case.
 */
static int inverted_lock(journal_t *journal, struct buffer_head *bh)
{
	if (!jbd_trylock_bh_state(bh)) {
		spin_unlock(&journal->j_list_lock);
		schedule();
		return 0;
	}
	return 1;
}

So, with kjournald running as a FIFO task, it may hit this (as it did
with my last test) and not get the lock. All it does is release the
other lock (for ranking reasons), call schedule(), and try again. With
kjournald the highest-priority runnable process on the system (UP), it
deadlocks, since whoever holds the lock never gets a chance to run.

There are a couple of places where jbd_trylock_bh_state is used in
checkpoint.c, but this is the one place where it definitely deadlocks
the system. I believe the code in checkpoint.c also has this problem.

I guess one way to solve this is to add a wait queue here (before
schedule()), and have the one holding the lock wake up everyone on the
waitqueue when it releases the lock.
-- Steve
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 2005-03-16 at 12:47 -0500, Steven Rostedt wrote:

> On Wed, 16 Mar 2005, Steven Rostedt wrote:
> > Hi Ingo, I just ran this with PREEMPT_RT and it works fine.
>
> Not quite, and I will assume that some of the other patches I sent
> have this same problem. The jbd_trylock_bh_state really scares me.
> It seems that in fs/jbd/commit.c in journal_commit_transaction we
> have the following code:

I am a bit confused, big surprise. Does this thread still have
anything to do with this trace from my "Latency regressions" bug
report?

http://www.alsa-project.org/~rlrevell/2912us

The problem is only apparent with PREEMPT_DESKTOP and data=ordered.
PREEMPT_RT has always worked perfectly.

Lee
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
Steven Rostedt <[EMAIL PROTECTED]> wrote:

> /*
>  * Try to acquire jbd_lock_bh_state() against the buffer, when
>  * j_list_lock is held.  For ranking reasons we must trylock.  If we
>  * lose, schedule away and return 0.  j_list_lock is dropped in this
>  * case.
>  */
> static int inverted_lock(journal_t *journal, struct buffer_head *bh)
> {
> 	if (!jbd_trylock_bh_state(bh)) {
> 		spin_unlock(&journal->j_list_lock);
> 		schedule();
> 		return 0;
> 	}
> 	return 1;
> }

That's very lame code, that. The old "I don't know what the heck to do
now, so I'll schedule" trick. Sorry.

> I guess one way to solve this is to add a wait queue here (before
> schedule()), and have the one holding the lock wake up everyone on
> the waitqueue when it releases the lock.

yup.  A patch against mainline would be appropriate, please.
Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks
On Wed, 16 Mar 2005, Lee Revell wrote:

> I am a bit confused, big surprise. Does this thread still have
> anything to do with this trace from my "Latency regressions" bug
> report?

Don't worry, I've been in a state of confusion for a long time now ;-)

> http://www.alsa-project.org/~rlrevell/2912us
>
> The problem is only apparent with PREEMPT_DESKTOP and data=ordered.
> PREEMPT_RT has always worked perfectly.

I'm surprised that PREEMPT_RT does work. I'm no longer sure that this
affects your latency; it probably does indirectly somehow. I still
think it has to do with the bit spinlocks, but I'm not sure. Just let
me know if you want to be taken off this thread and I'll remove you
from my CC list. Until then, I'll keep you on.

-- Steve