Re: [Ext2-devel] Re: ext3 allocate-with-reservation latencies
On Mon, 2005-04-18 at 19:00 +0100, Stephen C. Tweedie wrote:
> > > Note that there _is_ some additional complexity here. It's not entirely
> > > free. Specifically, if we have other tasks waiting on our provisional
> > > window, then we need to be very careful about the life expectancy of the
> > > wait queues involved, so that if the first task fixes its window and
> > > then deletes it before the waiters have woken up, they don't get
> > > confused by the provisional window vanishing while being waited for.
> >
> > This approach (provisional window) sounds interesting to me, but it's a
> > little more complicated than I thought :(
>
> Yep. Once you have other tasks waiting on your window while it's not
> locked, object lifetime becomes a much bigger problem.
>
> > alloc_new_reservation()
> > retry:
> >     lock the tree
> >     search for a new space starting from the given start block
> >     found a new space
> >     if the new space is not a "provisional window"
>
> I was thinking rather that we _start_ with the window, and _then_ look
> for new space.

It seems I am lost here. Could you elaborate, please? What is "the window" you are referring to here? The old (stale) reservation window? I thought we were discussing the algorithm for allocating a new window.

> So we'd start with:
>
>     if we already have a window,
>         mark it provisional;
>     else,
>         do
>             search for a new window;
>             if the immediately preceding window is provisional,
>                 wait for that window;
>                 continue;
>             allocate the window and mark it provisional;
>             break
>
> At this point we have a provisional window, and we know that nobody else
> is going to allocate either in it, or in the empty space following it
> (because if they tried to, they'd bump into our own provisional window
> as their predecessor and would wait for us). So even if the window
> doesn't occupy the _whole_ unreserved area, we can still keep searching
> without fear of multiple tasks trying to do so in the same space at the
> same time.
> --Stephen

Hmm... This thread was to address the latency of holding the global per-fs reservation lock while scanning the block group bitmap. And the whole point of the "provisional window" is to prevent other inodes from being forced into other block groups when one inode holds/reserves the whole block group as a temporary window while it scans the bitmap. I can clearly see it's a win when there are no multiple threads allocating blocks nearby at the same time. However, the whole reservation scheme is there to address the case where multiple allocations happen at the same time near the same place.

With the provisional method, if multiple threads are trying to allocate new windows in the same area, they still have to wait for the other new-window-allocation-and-bitmap-scan to finish. After that they will probably compete for the same window again and again :( I am worried that the latency is not improved compared to before (holding the lock for the bitmap scan), and that we have paid extra CPU time or context switches. Also, the changes to the algorithm are not going to be trivial.

Now we all agree that the right way to fix the latency is to drop the lock and then scan the bitmap. Before that we need to reserve the open window in case someone else is targeting the same window. The question was whether we should reserve the whole free reservable space or just the window size we need. Now that we have explored and discussed so many possible solutions, I think the least evil and least intrusive way is to just lock the small window. We were worried that, in the case where the scanned free bit is not inside the temporarily reserved window, we might have to do many window unlink and link operations on the rb tree; however, this is not avoidable with the provisional window proposal either. Perhaps we could start with that simple proposal first, and if benchmarks show that the CPU usage is really high and a concern, we could address it later?
Thanks,
Mingming
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
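Mingming's "lock just the small window" proposal can be sketched in plain C. This is a userspace toy model, not the actual ext3 code: `scan_for_free` stands in for the kernel's find_next_zero_bit() loop, and `hit_needs_relink` captures the decision that causes the rb-tree unlink/link traffic she mentions. The names are invented for illustration.

```c
#include <stdbool.h>
#include <assert.h>

/* Return the first zero bit in 'bitmap' at or after 'start', or -1.
 * Models the unlocked bitmap scan from the window's start block to the
 * end of the block group. */
static long scan_for_free(const unsigned char *bitmap, long nbits,
                          long start)
{
    for (long i = start; i < nbits; i++)
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            return i;
    return -1;
}

/* After the unlocked scan: if the bit fell outside the small window we
 * temporarily reserved, we must retake the lock and unlink/relink the
 * window in the rb tree; otherwise the window can simply be kept. */
static bool hit_needs_relink(long bit, long win_start, long win_end)
{
    return bit < win_start || bit > win_end;
}
```

The relink case here is exactly the extra rb-tree work that, as noted above, neither proposal avoids.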
Re: ext3 allocate-with-reservation latencies
Hi,

On Fri, 2005-04-15 at 21:32, Mingming Cao wrote:
> Sorry for the delaying. I was not in office these days.

No problem.

> > > Also I am concerned about the possible starvation on writers.
> >
> > In what way?
>
> I was worried about the rw lock case. :)

OK, so we're both on the same track here. :-)

> > Note that there _is_ some additional complexity here. It's not entirely
> > free. Specifically, if we have other tasks waiting on our provisional
> > window, then we need to be very careful about the life expectancy of the
> > wait queues involved, so that if the first task fixes its window and
> > then deletes it before the waiters have woken up, they don't get
> > confused by the provisional window vanishing while being waited for.
>
> This approach (provisional window) sounds interesting to me, but it's a
> little more complicated than I thought :(

Yep. Once you have other tasks waiting on your window while it's not locked, object lifetime becomes a much bigger problem.

> alloc_new_reservation()
> retry:
>     lock the tree
>     search for a new space starting from the given start block
>     found a new space
>     if the new space is not a "provisional window"

I was thinking rather that we _start_ with the window, and _then_ look for new space. So we'd start with:

    if we already have a window,
        mark it provisional;
    else,
        do
            search for a new window;
            if the immediately preceding window is provisional,
                wait for that window;
                continue;
            allocate the window and mark it provisional;
            break

At this point we have a provisional window, and we know that nobody else is going to allocate either in it, or in the empty space following it (because if they tried to, they'd bump into our own provisional window as their predecessor and would wait for us). So even if the window doesn't occupy the _whole_ unreserved area, we can still keep searching without fear of multiple tasks trying to do so in the same space at the same time.
--Stephen
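Stephen's start-with-the-window loop can be modelled in a few lines of C. This is a single-threaded toy, not the ext3 implementation: the rb tree is replaced by a small array, and "sleep on the predecessor's wait queue" is collapsed into returning NULL so the caller retries. All names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

struct window {
    unsigned long start, end;
    bool provisional;       /* owner is off scanning the bitmap */
    bool in_use;
};

#define MAX_WINDOWS 8
static struct window tree[MAX_WINDOWS];

/* Find the window whose start is closest to (and not after) 'goal'. */
static struct window *find_prev(unsigned long goal)
{
    struct window *prev = NULL;
    for (int i = 0; i < MAX_WINDOWS; i++) {
        if (tree[i].in_use && tree[i].start <= goal &&
            (!prev || tree[i].start > prev->start))
            prev = &tree[i];
    }
    return prev;
}

/* Try to claim a provisional window at 'goal'.  Returns NULL if the
 * immediately preceding window is itself provisional -- its owner is
 * still scanning the space after it, so we would have to wait. */
static struct window *try_reserve_provisional(unsigned long goal,
                                              unsigned long size)
{
    struct window *prev = find_prev(goal);

    if (prev && prev->provisional)
        return NULL;            /* would sleep on prev's wait queue */

    for (int i = 0; i < MAX_WINDOWS; i++) {
        if (!tree[i].in_use) {
            tree[i].start = goal;
            tree[i].end = goal + size - 1;
            tree[i].provisional = true;
            tree[i].in_use = true;
            return &tree[i];
        }
    }
    return NULL;                /* tree full (toy-model limit) */
}

/* Once the bitmap scan confirms a free bit, fix the window in place
 * (the real code would also wake any waiters here). */
static void fix_window(struct window *win)
{
    win->provisional = false;
}
```

The key property Stephen describes falls out directly: a second allocator whose predecessor is provisional backs off, so only one task searches the unreserved space after a given window at a time.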
Re: ext3 allocate-with-reservation latencies
Hi,

On Wed, 2005-04-13 at 00:27, Mingming Cao wrote:
> > I wonder if there's not a simple solution for this --- mark the window
> > as "provisional", and if any other task tries to allocate in the space
> > immediately following such a window, it needs to block until that window
> > is released.
>
> Sounds interesting. However that implies we need a write lock to mark
> the window as provisional and block other files looking for windows near
> it: we need to insert the provisional window into the tree and then mark
> it as a temporary window, to really let other files notice this kind of
> "hold".

We need a lock for the tree modification, yes.

> I wonder if the benefit of read/write lock is worth all the hassles now.

The point of the provisional window is that you don't need read/write locks at all. The provisional window lets you unlock the tree completely while you do the bitmap scan, so there's really no reason to have rwlocks for the tree any more.

> If the new window's neighbor stays the same, we only need to roll
> forward once. If not, after a successful scan, we need to grab the
> write lock, and make sure the window is still there.

When we take the provisional window, we can make a note of how much space we have before the next window. And because all future allocators will stall if they try to allocate at this point due to the provisional window, we know that that space will remain outside any other window until we come to fix the provisional window and potentially roll it forward to the space we found.

> If we dropped the lock without holding the new space, we have to search
> the whole tree again to find out if the space is still there

As long as the space is within the area between the provisional window and its successor, we don't need to do that. (It gets more complex if the bitmap search returns a bit _beyond_ the next window, though.)

> Also I am concerned about the possible starvation on writers.

In what way?
> I am thinking, maybe back to the spin_lock is not that bad with the
> "mark provisional" suggest you made?

Right, that was the intent --- sorry if I didn't make it clear.

> It allows us to mark the new space as provisional if we find a new
> space (prevent other window searching run into the same neighborhood).
> We could release the lock and scan the bitmap without worry about the
> new space will be taking by others.

Exactly.

Note that there _is_ some additional complexity here. It's not entirely free. Specifically, if we have other tasks waiting on our provisional window, then we need to be very careful about the life expectancy of the wait queues involved, so that if the first task fixes its window and then deletes it before the waiters have woken up, they don't get confused by the provisional window vanishing while being waited for. The easy solution is a global wait queue for that, but that doesn't scale well. The complex solution is a per-window wait queue and reference count, which is obviously a bit of bloat, though probably worth it for the high-load case.

Cheers,
Stephen
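The lifetime rule behind the per-window wait queue and reference count option is the usual refcounting pattern: a waiter must be able to finish waking up even after the window's owner has deleted the window from the tree. A minimal single-threaded model (invented names; the real kernel analogue would be something like a kref plus a per-window wait queue):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <assert.h>

/* A window whose memory may only be reclaimed once the tree *and*
 * every task sleeping on its wait queue have dropped their refs. */
struct rsv_window {
    int refcount;        /* one for the tree + one per waiter */
    bool provisional;
};

static struct rsv_window *window_new(void)
{
    struct rsv_window *w = malloc(sizeof(*w));
    w->refcount = 1;     /* the tree's reference */
    w->provisional = true;
    return w;
}

/* A task about to wait on the window pins it first. */
static void window_get(struct rsv_window *w) { w->refcount++; }

/* Drop a reference; returns true when this put freed the window. */
static bool window_put(struct rsv_window *w)
{
    if (--w->refcount == 0) {
        free(w);
        return true;
    }
    return false;
}
```

With this shape, the owner's "fix then delete" sequence only drops the tree's reference; the window's memory stays valid until the last sleeping task has woken and done its own put, which is exactly what keeps the waiters from getting confused.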
Re: ext3 allocate-with-reservation latencies
On Tue, 2005-04-12 at 12:18 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
> > > Note that this may improve average case latencies, but it's not likely
> > > to improve worst-case ones. We still need a write lock to install a new
> > > window, and that's going to have to wait for us to finish finding a free
> > > bit even if that operation starts using a read lock.
> >
> > Yes indeed. However nothing is free and there are always trade-offs. :)
> >
> > By worst case you mean multiple writers trying to allocate blocks around
> > the same area?
>
> It doesn't matter where they are; multiple new file opens will all be
> looking for a write lock. You only need one long-held read lock and all
> the writers still block. The worst-case latencies can't be properly
> solved with r/w locks --- those let the readers go more quickly
> (assuming they are in the majority), which helps the average case, but
> writers still have to wait for exclusive access. We only really help
> them by dropping the lock entirely.
>
> > Even if we take out the whole reservation, we could still possibly run
> > into this kind of latency: the bitmap on disk and in the journal can be
> > extremely inconsistent, so we need to keep searching them both until we
> > find a free bit on both maps.
>
> Quite possibly. But as long as that code is running without locks, it's
> much easier to deal with those latencies: they won't impact other CPUs,
> cond_resched() is easier, and there's even CONFIG_PREEMPT.

Okay, I agree.

> > > I'm not really sure what to do about worst-case here. For that, we
> > > really do want to drop the lock entirely while we do the bitmap scan.
> >
> > Hmm... if we drop the lock entirely while scanning the bitmap, assuming
> > you mean drop the read lock, then I am afraid we have to re-check the
> > tree (requiring a read or write lock) to see if the new window space is
> > still there after the scan succeeds.
>
> Sure.
> You basically start off with a provisional window, and then if
> necessary roll it forward just the same way you roll normal windows
> forward when they get to their end. That means you can still drop the
> lock while you search for new space. When you get there, reacquire the
> lock and check that the intervening space is still available.

Please note that the bitmap scan does not scan only the provisional window range; it returns the first free bit in the bitmap, starting from the start block of the provisional window, up to the end of the whole block group. So in the case where the new window's neighbor is the same as the old one (that is, the window is rolled forward), we only need to roll forward once to find it. Either we find a free bit inside the provisional window, or we find a free bit outside it. In the first case we just roll the window forward and we are done. In the second case, it is possible that the free bit is inside someone else's window (which means we cannot take that window), or that it is in the new space after an already-reserved window. Either way we have to lock up the whole tree to remove the old window and insert the new one.

> That's really cheap for the common case. The difficulty is when you
> have many parallel allocations hitting the same bg: they allocate
> provisional windows, find the same free area later on in the bg, and
> then stomp on each other as they try to move their windows there.
>
> I wonder if there's not a simple solution for this --- mark the window
> as "provisional", and if any other task tries to allocate in the space
> immediately following such a window, it needs to block until that window
> is released.

Sounds interesting. However that implies we need a write lock to mark the window as provisional and block other files looking for windows near it: we need to insert the provisional window into the tree and then mark it as a temporary window, to really let other files notice this kind of "hold".
I wonder if the benefit of the read/write lock is worth all the hassle now. If the new window's neighbor stays the same, we only need to roll forward once. If not, after a successful scan, we need to grab the write lock and make sure the window is still there. If we dropped the lock without holding the new space, we would have to search the whole tree again to find out if the space is still there; we cannot use the previous node returned by find_next_reservable_window(), since that node could be gone while we scan the bitmap without the lock. Basically we do the tree search twice (once more in the roll-forward case) and take the lock twice in the normal case. Also I am concerned about possible starvation of writers.

I am thinking, maybe going back to the spin_lock is not that bad with the "mark provisional" suggestion you made? It allows us to mark the new space as provisional when we find it (preventing other window searches from running into the same neighborhood). We could release the lock and scan the bitmap without worrying about the new space being taken by others.
Re: ext3 allocate-with-reservation latencies
Hi,

On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
> > Note that this may improve average case latencies, but it's not likely
> > to improve worst-case ones. We still need a write lock to install a new
> > window, and that's going to have to wait for us to finish finding a free
> > bit even if that operation starts using a read lock.
>
> Yes indeed. However nothing is free and there are always trade-offs. :)
>
> By worse case you mean multiple writes trying to allocate blocks around
> same area?

It doesn't matter where they are; multiple new file opens will all be looking for a write lock. You only need one long-held read lock and all the writers still block. The worst-case latencies can't be properly solved with r/w locks --- those let the readers go more quickly (assuming they are in the majority), which helps the average case, but writers still have to wait for exclusive access. We only really help them by dropping the lock entirely.

> Even if we take out the whole reservation, we still possibility run
> into this kind of latency: the bitmap on disk and on journal are
> extremely inconsistent so we need to keep searching them both until we
> find a free bit on both map.

Quite possibly. But as long as that code is running without locks, it's much easier to deal with those latencies: they won't impact other CPUs, cond_resched() is easier, and there's even CONFIG_PREEMPT.

> > I'm not really sure what to do about worst-case here. For that, we
> > really do want to drop the lock entirely while we do the bitmap scan.
>
> Hmm...if we drop the lock entirely while scan the bitmap, assuming you
> mean drop the read lock, then I am afraid we have to re-check the tree
> (require a read or write lock) to see if the new window space is still
> there after the scan succeed.

Sure. You basically start off with a provisional window, and then if necessary roll it forward just the same way you roll normal windows forward when they get to their end.
That means you can still drop the lock while you search for new space. When you get there, reacquire the lock and check that the intervening space is still available.

That's really cheap for the common case. The difficulty is when you have many parallel allocations hitting the same bg: they allocate provisional windows, find the same free area later on in the bg, and then stomp on each other as they try to move their windows there.

I wonder if there's not a simple solution for this --- mark the window as "provisional", and if any other task tries to allocate in the space immediately following such a window, it needs to block until that window is released.

--Stephen
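The "roll it forward" motion Stephen refers to is simple to state in code. A hedged sketch (illustrative only, not the kernel's actual window-advance routine): when the bitmap scan returns a free bit beyond the window's end, slide the window forward, keeping its size, so it covers the bit.

```c
#include <assert.h>

struct win { long start, end; };

/* If the free bit the bitmap scan returned lies beyond the window,
 * slide the window forward (keeping its size) so that it covers the
 * bit -- the same motion as when allocation reaches a window's end. */
static void roll_forward(struct win *w, long free_bit)
{
    long size = w->end - w->start + 1;
    if (free_bit > w->end) {
        w->start = free_bit;
        w->end   = free_bit + size - 1;
    }
}
```

In the provisional-window scheme this roll is cheap precisely because the space between the provisional window and its successor is known to stay unreserved while we scan.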
Re: ext3 allocate-with-reservation latencies
On Mon, 2005-04-11 at 20:57 +0100, Stephen C. Tweedie wrote:
> Hi,

Hi Stephen,

> On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:
> > Ah.. I see the win with the read lock now: once a reservation window
> > is added, updating it (either winding it forward or searching for an
> > available window) is probably the majority of the operations on the
> > tree, and using a read lock for that should help reduce the latency.
>
> Right. The down side is that for things like a kernel "tar xf", we'll
> be doing lots of small file unpacks, and hopefully most files will be
> just one or two reservations --- so there's little winding forward going
> on.

The searching will still be improved in that case. Just a side note: "tar xf" should know the file size before unpacking a file, so it could set the reservation window size large enough to fit the entire file, via an ioctl command, before doing the file write.

> Note that this may improve average case latencies, but it's not likely
> to improve worst-case ones. We still need a write lock to install a new
> window, and that's going to have to wait for us to finish finding a free
> bit even if that operation starts using a read lock.

Yes indeed. However nothing is free and there are always trade-offs. :)

By worst case you mean multiple writers trying to allocate blocks around the same area? But I wonder if the latency seen by Lee belongs to this worst case: the latency comes mostly from the loop calling find_next_zero_bit() in bitmap_search_next_usable_block(). Even if we took out the whole reservation scheme, we could still possibly run into this kind of latency: the on-disk and journaled copies of the bitmap can be extremely inconsistent, so we need to keep searching them both until we find a bit that is free in both maps.

> I'm not really sure what to do about worst-case here. For that, we
> really do want to drop the lock entirely while we do the bitmap scan.
Hmm... if we drop the lock entirely while scanning the bitmap (assuming you mean drop the read lock), then I am afraid we have to re-check the tree (requiring a read or write lock) to see if the new window space is still there after the scan succeeds. This is probably not very interesting for the window rotating case.

> That leaves two options. Hold a reservation while we do that; or don't.
> Holding one poses the problems we discussed before: either you hold a
> large reservation (bad for disk layout in the presence of concurrent
> allocators), or you hold smaller ones (high cost as you continually
> advance the window, which requires some read lock on the tree to avoid
> bumping into the next window.)

Well, we cannot hold a reservation (which requires updating the tree) without a write lock. I guess if we want to improve the average-case latency by replacing the current spin_lock with a read lock for the new window space search, we don't have much choice here.

> Just how bad would it be if we didn't hold a lock _or_ a window at all
> while doing the search for new window space?

I wonder if this is feasible: walk through the rb tree without a lock? What if some node is being removed by another thread while we are walking the tree and trying to get the next node from it?

Thanks,
Mingming
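The double-bitmap search behind the find_next_zero_bit() latency above can be approximated in userspace C. This is a sketch of the shape of bitmap_search_next_usable_block(), not its actual code: a block is only allocatable if its bit is free both in the on-disk bitmap and in the journal's committed copy, so the loop keeps advancing until the two copies agree.

```c
#include <stdbool.h>
#include <assert.h>

static bool bit_free(const unsigned char *map, long bit)
{
    return !(map[bit / 8] & (1u << (bit % 8)));
}

/* Keep advancing until a bit is free in *both* copies.  When the two
 * bitmaps disagree badly, this loop can run over most of the block
 * group -- which is the worst-case latency being discussed. */
static long next_usable_bit(const unsigned char *disk,
                            const unsigned char *committed,
                            long nbits, long start)
{
    for (long i = start; i < nbits; i++)
        if (bit_free(disk, i) && bit_free(committed, i))
            return i;
    return -1;
}
```

This illustrates why dropping the lock around the scan matters: the scan itself is unbounded by any window size, only by the block group.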
Re: ext3 allocate-with-reservation latencies
Hi,

On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
> > Note that this may improve average case latencies, but it's not likely
> > to improve worst-case ones. We still need a write lock to install a
> > new window, and that's going to have to wait for us to finish finding
> > a free bit even if that operation starts using a read lock.
>
> Yes indeed. However nothing is free and there are always trade-offs. :)
> By worst case do you mean multiple writes trying to allocate blocks
> around the same area?

It doesn't matter where they are; multiple new file opens will all be looking for a write lock. You only need one long-held read lock and all the writers still block. The worst-case latencies can't be properly solved with r/w locks --- those let the readers go more quickly (assuming they are in the majority), which helps the average case, but writers still have to wait for exclusive access. We only really help them by dropping the lock entirely.

> Even if we take out the whole reservation scheme, we could still run
> into this kind of latency: the on-disk and journalled copies of the
> bitmap are extremely inconsistent, so we need to keep searching them
> both until we find a bit that is free in both maps.

Quite possibly. But as long as that code is running without locks, it's much easier to deal with those latencies: they won't impact other CPUs, cond_resched() is easier, and there's even CONFIG_PREEMPT.

> > I'm not really sure what to do about worst-case here. For that, we
> > really do want to drop the lock entirely while we do the bitmap scan.
>
> Hmm... if we drop the lock entirely while scanning the bitmap (assuming
> you mean drop the read lock), then I am afraid we have to re-check the
> tree (requiring a read or write lock) to see if the new window space is
> still there after the scan succeeds.

Sure. You basically start off with a provisional window, and then if necessary roll it forward just the same way you roll normal windows forward when they get to their end. That means you can still drop the lock while you search for new space. When you get there, reacquire the lock and check that the intervening space is still available.

That's really cheap for the common case. The difficulty is when you have many parallel allocations hitting the same bg: they allocate provisional windows, find the same free area later on in the bg, and then stomp on each other as they try to move their windows there.

I wonder if there's not a simple solution for this --- mark the window as provisional, and if any other task tries to allocate in the space immediately following such a window, it needs to block until that window is released.

--Stephen
Re: ext3 allocate-with-reservation latencies
On Tue, 2005-04-12 at 12:18 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
> > > Note that this may improve average case latencies, but it's not
> > > likely to improve worst-case ones. We still need a write lock to
> > > install a new window, and that's going to have to wait for us to
> > > finish finding a free bit even if that operation starts using a
> > > read lock.
> > Yes indeed. However nothing is free and there are always
> > trade-offs. :) By worst case do you mean multiple writes trying to
> > allocate blocks around the same area?
>
> It doesn't matter where they are; multiple new file opens will all be
> looking for a write lock. You only need one long-held read lock and all
> the writers still block. The worst-case latencies can't be properly
> solved with r/w locks --- those let the readers go more quickly
> (assuming they are in the majority), which helps the average case, but
> writers still have to wait for exclusive access. We only really help
> them by dropping the lock entirely.
>
> > Even if we take out the whole reservation scheme, we could still run
> > into this kind of latency: the on-disk and journalled copies of the
> > bitmap are extremely inconsistent, so we need to keep searching them
> > both until we find a bit that is free in both maps.
>
> Quite possibly. But as long as that code is running without locks, it's
> much easier to deal with those latencies: they won't impact other CPUs,
> cond_resched() is easier, and there's even CONFIG_PREEMPT.

Okay, I agree.

> > > I'm not really sure what to do about worst-case here. For that, we
> > > really do want to drop the lock entirely while we do the bitmap scan.
> > Hmm... if we drop the lock entirely while scanning the bitmap
> > (assuming you mean drop the read lock), then I am afraid we have to
> > re-check the tree (requiring a read or write lock) to see if the new
> > window space is still there after the scan succeeds.
>
> Sure. You basically start off with a provisional window, and then if
> necessary roll it forward just the same way you roll normal windows
> forward when they get to their end. That means you can still drop the
> lock while you search for new space. When you get there, reacquire the
> lock and check that the intervening space is still available.

Please note that the bitmap scan does not only scan the provisional window range; it returns the first free bit on the bitmap, starting from the start block of the provisional window and going up to the end of the whole block group. So in the case where the new window's neighbor is the same as the old one (that is, the window is rolled forward), we only need to roll forward once to find it. Either we find a free bit inside the provisional window, or we find a free bit outside it. In the first case we just roll the window forward and we are done. In the second case, it's possible the free bit is inside someone else's window (which means we can't take that window), or it is inside the new space after an already-reserved window. Either way we have to lock the whole tree to remove the old window and insert the new one.

> That's really cheap for the common case. The difficulty is when you
> have many parallel allocations hitting the same bg: they allocate
> provisional windows, find the same free area later on in the bg, and
> then stomp on each other as they try to move their windows there.
>
> I wonder if there's not a simple solution for this --- mark the window
> as provisional, and if any other task tries to allocate in the space
> immediately following such a window, it needs to block until that
> window is released.

Sounds interesting. However, that implies we need a write lock to mark the window as provisional and to block other files looking for windows near it: we need to insert the provisional window into the tree and then mark it as a temporary window, to really let other files notice this kind of hold. I wonder if the benefit of the read/write lock is worth all the hassle now. If the new window's neighbor stays the same, we only need to roll forward once. If not, after a successful scan, we need to grab the write lock and make sure the window space is still there. If we dropped the lock without holding the new space, we have to search the whole tree again to find out whether the space is still there; we cannot reuse the previous node returned by find_next_reservable_window(), since that node could have been removed while we scanned the bitmap without the lock. Basically we do two tree searches (only one in the roll-forward case) and take the lock twice in the normal case. Also I am concerned about possible starvation of writers.

I am thinking, maybe going back to the spin_lock is not that bad, combined with the "mark provisional" suggestion you made? It allows us to mark the new space as provisional when we find it (preventing other window searches from running into the same neighborhood). We could then release the lock and scan the bitmap without worrying about the new space being taken by others. If there is a free bit, we could just clear the provisional bit and simply take the window.
Re: ext3 allocate-with-reservation latencies
Hi,

On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:
> I agree. We should not skip the home block group of the file. I guess
> what I was suggesting is, if allocation from the home group failed and
> we continue the linear search through the rest of the block groups, we
> could probably try to skip the block groups without enough usable free
> blocks to make a reservation.

Fair enough. Once those are the only bgs left, performance is going to drop pretty quickly, but that's not really avoidable on a very full filesystem.

> Ah.. I see the win with the read lock now: once a reservation window
> is added, updating it (either winding it forward and/or searching for
> an available window) is probably the majority of the operations on the
> tree, and using a read lock for that should help reduce the latency.

Right. The down side is that for things like a kernel "tar xf", we'll be doing lots of small file unpacks, and hopefully most files will be just one or two reservations --- so there's little winding forward going on. The searching will still be improved in that case.

Note that this may improve average case latencies, but it's not likely to improve worst-case ones. We still need a write lock to install a new window, and that's going to have to wait for us to finish finding a free bit even if that operation starts using a read lock.

I'm not really sure what to do about worst-case here. For that, we really do want to drop the lock entirely while we do the bitmap scan. That leaves two options. Hold a reservation while we do that; or don't. Holding one poses the problems we discussed before: either you hold a large reservation (bad for disk layout in the presence of concurrent allocators), or you hold smaller ones (high cost as you continually advance the window, which requires some read lock on the tree to avoid bumping into the next window.)

Just how bad would it be if we didn't hold a lock _or_ a window at all while doing the search for new window space? I didn't like that alternative at first because of the problem when you've got multiple tasks trying to allocate in the same space at the same time; but given the locking overhead of the alternatives, I'm wondering if this is actually the lesser evil.

--Stephen
Re: ext3 allocate-with-reservation latencies
On Mon, 2005-04-11 at 15:12 -0400, Lee Revell wrote:
> On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote:
> > I will work on a patch for Lee to try sometime tonight.
>
> Just FYI, this will take a while to test, because this latency seems
> quite rare. I haven't seen it again since my original report.

Never mind, I spoke too soon. As soon as I posted this, I hit it again. Seems like doing a CVS checkout reliably triggers the problem.

Lee
Re: ext3 allocate-with-reservation latencies
On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote:
> I will work on a patch for Lee to try sometime tonight.

Just FYI, this will take a while to test, because this latency seems quite rare. I haven't seen it again since my original report. Hopefully I can reproduce it with enough dbench processes. This seems to agree with your earlier observation that we seem to have been quite unlucky.

Anyway, if I understand the thread, it seems like this could increase performance as well as help latency.

Lee
Re: ext3 allocate-with-reservation latencies
On Mon, 2005-04-11 at 12:48 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:
> > > It still needs to be done under locking to prevent us from expanding
> > > over the next window, though. And having to take and drop a spinlock
> > > a dozen times or more just to find out that there are no usable free
> > > blocks in the current block group is still expensive, even if we're
> > > not actually fully unlinking the window each time.
> > Isn't this a rare case? The whole group is relatively full and the
> > free bits are all reserved by other files.
>
> Well, that's not much different from the case where there _are_ usable
> bits but they are all near the end of the bitmap. And that's common
> enough as you fill up a block group with small files, for example. Once
> the bg becomes nearly full, new file opens-for-write will still try to
> allocate in that bg (because that's where the inode was allocated), but
> as it's a new fd we have no prior knowledge of _where_ in the bg to
> allocate, and we'll have to walk it to the end to find any free space.
> This is the access pattern I'd expect of (for example) "tar xvjf
> linux-2.6.12.tar.bz2", not exactly a rare case.

Okay.

> > Probably we should avoid trying to make a reservation in this block
> > group in the first place
>
> Not in this case, because it's the "home" of the file in question, and
> skipping to another bg would just leave useful space empty --- and that
> leads to fragmentation.

I agree. We should not skip the home block group of the file. I guess what I was suggesting is, if allocation from the home group failed and we continue the linear search through the rest of the block groups, we could probably try to skip the block groups without enough usable free blocks to make a reservation. Checking the number of free blocks left in the queried bg is a good start, but probably not good enough, since those free blocks might already be reserved.

> > You are proposing that we hold the read lock first, do the window
> > search and bitmap scan, then once we confirm there is a free bit in
> > the candidate window, we grab the write lock and update the tree?
>
> No, I'm suggesting that if we need the write lock for tree updates, we
> may still be able to get away with just a read lock when updating an
> individual window. If all we're doing is winding the window forwards a
> bit, that's not actually changing the structure of the tree.
>
> > However I am still worried that the rw lock will allow concurrent
> > files trying to lock the same window at the same time.
>
> Adding a new window will need the write lock, always. But with the read
> lock, we can still check the neighbouring windows (the structure is
> pinned so those remain valid), so advancing a window with just that lock
> can still be done without risking overlapping the next window.

Ah.. I see the win with the read lock now: once a reservation window is added, updating it (either winding it forward and/or searching for an available window) is probably the majority of the operations on the tree, and using a read lock for that should help reduce the latency. I will work on a patch for Lee to try sometime tonight.

Thanks,
Mingming
Re: ext3 allocate-with-reservation latencies
Hi,

On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:
> > It still needs to be done under locking to prevent us from expanding
> > over the next window, though. And having to take and drop a spinlock a
> > dozen times or more just to find out that there are no usable free
> > blocks in the current block group is still expensive, even if we're
> > not actually fully unlinking the window each time.
> Isn't this a rare case? The whole group is relatively full and the free
> bits are all reserved by other files.

Well, that's not much different from the case where there _are_ usable bits but they are all near the end of the bitmap. And that's common enough as you fill up a block group with small files, for example. Once the bg becomes nearly full, new file opens-for-write will still try to allocate in that bg (because that's where the inode was allocated), but as it's a new fd we have no prior knowledge of _where_ in the bg to allocate, and we'll have to walk it to the end to find any free space. This is the access pattern I'd expect of (for example) "tar xvjf linux-2.6.12.tar.bz2", not exactly a rare case.

> Probably we should avoid trying to make a reservation in this block
> group in the first place

Not in this case, because it's the "home" of the file in question, and skipping to another bg would just leave useful space empty --- and that leads to fragmentation.

> You are proposing that we hold the read lock first, do the window search
> and bitmap scan, then once we confirm there is a free bit in the
> candidate window, we grab the write lock and update the tree?

No, I'm suggesting that if we need the write lock for tree updates, we may still be able to get away with just a read lock when updating an individual window. If all we're doing is winding the window forwards a bit, that's not actually changing the structure of the tree.

> However I am still worried that the rw lock will allow concurrent files
> trying to lock the same window at the same time.

Adding a new window will need the write lock, always. But with the read lock, we can still check the neighbouring windows (the structure is pinned so those remain valid), so advancing a window with just that lock can still be done without risking overlapping the next window.

--Stephen
Re: ext3 allocate-with-reservation latencies
On Fri, 2005-04-08 at 11:10 -0700, Mingming Cao wrote:
> However I am still worried that the rw lock will allow concurrent files
> trying to lock the same window at the same time. Only one will succeed,
> though; high CPU usage in the meantime. And also, in the normal case
> the filesystem is not really full, so the rw lock probably becomes
> expensive.

FWIW this was a 125GB filesystem, about 70% full.

Lee
Re: ext3 allocate-with-reservation latencies
On Fri, 2005-04-08 at 15:40 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2005-04-08 at 00:37, Mingming Cao wrote:
> > Actually, we do not have to do an rbtree link and unlink for every
> > window we search. If the reserved window (old) has no free bit and the
> > new reservable window is right after the old one, there is no need to
> > unlink the old window from the rbtree and then link the new window;
> > just update the start and end of the old one.
>
> It still needs to be done under locking to prevent us from expanding
> over the next window, though. And having to take and drop a spinlock a
> dozen times or more just to find out that there are no usable free
> blocks in the current block group is still expensive, even if we're not
> actually fully unlinking the window each time.

Isn't this a rare case? The whole group is relatively full and the free bits are all reserved by other files. Probably we should avoid trying to make a reservation in this block group in the first place, if we could find a way to detect that the number of _usable_ free bits is less than the requested window size.

> I wonder if this can possibly be done safely without locking? It would
> be really good if we could rotate windows forward with no global locks.
> At the very least, a fair rwlock would let us freeze the total layout of
> the tree, while still letting us modify individual windows safely. As
> long as we use wmb() to make sure that we always extend the end of the
> window before we shrink the start of it, I think we could get away with
> such shared locking. And rw locking is much better for concurrency, so
> we might be able to hold it over the whole bitmap search rather than
> taking it and dropping it at each window advance.

You are proposing that we hold the read lock first, do the window search and bitmap scan, then once we confirm there is a free bit in the candidate window, we grab the write lock and update the tree? I think this is a good idea to address the case you are concerned about: when we need to do lots of window searching before settling down. Also, if later we decide (I think we have discussed this before) to always try to reserve a window with at least 8 contiguous free blocks, the search will be more expensive and the read lock will help.

However I am still worried that the rw lock will allow concurrent files trying to lock the same window at the same time. Only one will succeed, though; high CPU usage in the meantime. And also, in the normal case the filesystem is not really full, so the rw lock probably becomes expensive.
Re: ext3 allocate-with-reservation latencies
> And rw locking is much better for concurrency, so we might be able to
> hold it over the whole bitmap search rather than taking it and dropping
> it at each window advance.

rw locks only help if readers are 10x more common than writers, generally speaking. They are quite expensive locks, so they really should be used with the utmost care. (If you have really long hold times the dynamics are different, but if you have really long hold times your latency sucks too, and wasn't that what this thread was trying to fix?)
Re: ext3 allocate-with-reservation latencies
Hi,

On Fri, 2005-04-08 at 00:37, Mingming Cao wrote:

> Actually, we do not have to do an rbtree link and unlink for every
> window we search. If the reserved window (old) has no free bit and the
> new reservable window is right after the old one, there is no need to
> unlink the old window from the rbtree and then link the new window;
> just update the start and end of the old one.

It still needs to be done under locking to prevent us from expanding
over the next window, though. And having to take and drop a spinlock a
dozen times or more just to find out that there are no usable free
blocks in the current block group is still expensive, even if we're not
actually fully unlinking the window each time.

I wonder if this can possibly be done safely without locking? It would
be really good if we could rotate windows forward with no global locks.
At the very least, a fair rwlock would let us freeze the total layout of
the tree, while still letting us modify individual windows safely. As
long as we use wmb() to make sure that we always extend the end of the
window before we shrink the start of it, I think we could get away with
such shared locking. And rw locking is much better for concurrency, so
we might be able to hold it over the whole bitmap search rather than
taking it and dropping it at each window advance.

--Stephen
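The ordering rule Stephen describes ("extend the end before shrinking the start") can be sketched in portable userspace C. This is an illustrative model, not kernel code: C11 release stores stand in for wmb(), and the struct and function names are made up for the example.

```c
#include <stdatomic.h>

/* A reservation window rolled forward while lockless readers may be
 * looking at it (illustrative stand-in for the ext3 structure). */
struct window {
    _Atomic int start;
    _Atomic int end;
};

/* Extend the end FIRST, then shrink the start.  The release store
 * (standing in for wmb()) guarantees that a racing reader which
 * observes the new start also observes the new end, so the window is
 * never seen in a state where it covers blocks it does not own, nor
 * in a transient backwards (start > end) state. */
static void roll_forward(struct window *w, int new_start, int new_end)
{
    atomic_store_explicit(&w->end, new_end, memory_order_release);
    atomic_store_explicit(&w->start, new_start, memory_order_release);
}
```

Doing the two stores in the opposite order would briefly expose a window whose start has advanced past its old end, which is exactly what the barrier is there to prevent.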
Re: ext3 allocate-with-reservation latencies
On Fri, 2005-04-08 at 11:10 -0700, Mingming Cao wrote:

> However I am still worried that the rw lock will allow concurrent files
> to try to lock the same window at the same time. Only one succeeds,
> though, so CPU usage will be high. And also, in the normal case the
> filesystem is not really full, so the rw lock probably becomes
> expensive.

FWIW this was a 125GB filesystem, about 70% full.

Lee
Re: ext3 allocate-with-reservation latencies
On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:
>
> > doesnt the first option also allow searches to be in parallel?
>
> In terms of CPU usage, yes. But either we use large windows, in which
> case we *can't* search remotely near areas of the disk in parallel; or
> we use small windows, in which case failure to find a free bit becomes
> disproportionately expensive (we have to do an rbtree link and unlink
> for each window we search.)

Actually, we do not have to do an rbtree link and unlink for every
window we search. If the reserved window (old) has no free bit and the
new reservable window is right after the old one, there is no need to
unlink the old window from the rbtree and then link the new window; just
update the start and end of the old one. That's in the current design
already, to reduce the cost of rbtree link and unlink.

Mingming
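The in-place advance Mingming describes, reusing the tree node instead of unlinking and relinking it, can be sketched like this. The struct and function names are illustrative, not the actual ext3 identifiers, and the tree itself is elided: the point is only the "rewrite start/end when the new window immediately follows the old one" optimisation.

```c
#include <stdbool.h>

/* Simplified reservation window; in ext3 this lives in an rbtree
 * keyed by start block (name illustrative). */
struct rsv_window { int start, end; };

/* If the new window begins right after the old one ends, the node's
 * position relative to its neighbours is unchanged, so we can reuse
 * it: just rewrite start/end.  Returns false when the window would
 * have to move in the tree, i.e. a real unlink + link is needed. */
static bool advance_in_place(struct rsv_window *w, int new_start, int size)
{
    if (new_start != w->end + 1)
        return false;              /* not contiguous: must relink */
    w->start = new_start;
    w->end   = new_start + size - 1;
    return true;
}
```

This is the cheap path; the expensive rbtree rebalancing is only paid when the search jumps to a non-adjacent region.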
Re: ext3 allocate-with-reservation latencies
On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:
>
> > doesnt the first option also allow searches to be in parallel?
>
> In terms of CPU usage, yes. But either we use large windows, in which
> case we *can't* search remotely near areas of the disk in parallel; or
> we use small windows, in which case failure to find a free bit becomes
> disproportionately expensive (we have to do an rbtree link and unlink
> for each window we search.)

I was thinking that at the end of find_next_reservable_window(), since
we already know the previous node in the rbtree to insert after, we
could just link the window there; that way the rbtree link cost will be
minimized. Then I realized that in an rbtree the previous node is not
necessarily the parent node, so we still have to search part of the
rbtree to find the parent node, which is really expensive when the
insertion operations are very frequent. Hmmm...

> This particular one took over 1 msec,

If you look at the pattern in the latency report:

  cvs-14981 0.n.2 1044us : find_next_zero_bit (bitmap_search_next_usable_block)
  cvs-14981 0.n.2 1045us : ext3_test_allocatable (bitmap_search_next_usable_block)
  cvs-14981 0.n.2 1045us : find_next_zero_bit (bitmap_search_next_usable_block)
  cvs-14981 0.n.2 1046us : find_next_zero_bit (bitmap_search_next_usable_block)
  cvs-14981 0.n.2 1047us : ext3_test_allocatable (bitmap_search_next_usable_block)
  cvs-14981 0.n.2 1048us : find_next_zero_bit (bitmap_search_next_usable_block)
  cvs-14981 0.n.2 1049us : find_next_zero_bit (bitmap_search_next_usable_block)
  cvs-14981 0.n.2 1049us : ext3_test_allocatable (bitmap_search_next_usable_block)

the pattern is: two calls to find_next_zero_bit() followed by an
ext3_test_allocatable(). And we have about 440 iterations in this
report.
That means we are really unlucky: we checked the on-disk bitmap, found a
free bit, but it's not free in the journal copy; then we check the
journal copy, find a free bit, but it's not free on disk. I could
imagine we need to check the journal copy once or twice (to prevent
re-using a just-freed bit before the free is committed), but how could
this many (440) iterations happen?

Mingming
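The iteration pattern in the trace comes from a block being usable only when it is free in *both* the on-disk bitmap and the journalled copy; every disagreement between the two costs another loop. A minimal userspace sketch of that search loop (toy bool arrays instead of real bitmaps, hypothetical names; not the actual bitmap_search_next_usable_block() code):

```c
#include <stdbool.h>

#define NBLOCKS 16

/* true = in use.  A block is allocatable only when it is free in BOTH
 * the on-disk bitmap and the journalled copy. */
static bool on_disk[NBLOCKS];
static bool journal[NBLOCKS];

/* Toy analogue of find_next_zero_bit(). */
static int next_zero(const bool *map, int from)
{
    for (int i = from; i < NBLOCKS; i++)
        if (!map[i])
            return i;
    return -1;
}

/* Mirrors the trace's pattern: find a zero bit on disk, then test the
 * journal copy; every block that is free in one copy but busy in the
 * other costs a further iteration, which is how 440 of them add up. */
static int search_next_usable(int from)
{
    int i = from;
    while ((i = next_zero(on_disk, i)) >= 0) {
        if (!journal[i])
            return i;   /* free in both copies: usable */
        i++;            /* free on disk, busy in journal: keep going */
    }
    return -1;
}
```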
Re: ext3 allocate-with-reservation latencies
Hi,

On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:

> doesnt the first option also allow searches to be in parallel?

In terms of CPU usage, yes. But either we use large windows, in which
case we *can't* search remotely near areas of the disk in parallel; or
we use small windows, in which case failure to find a free bit becomes
disproportionately expensive (we have to do an rbtree link and unlink
for each window we search.)

> This particular one took over 1 msec, so it seems there's a fair room
> for parallellizing multiple writers and thus improving scalability. (or
> is this codepath serialized globally anyway?)

No, the rbtree ops are serialised per-sb, and allocations within any one
inode are serialised, but the bitmap searches need not be serialised
globally. (They are in the case of new window searches right now, which
is what we're trying to fix.)

--Stephen
Re: ext3 allocate-with-reservation latencies
* Mingming Cao <[EMAIL PROTECTED]> wrote:

> It seems we are holding the rsv_lock while searching the bitmap for a
> free bit. In alloc_new_reservation(), we first find an available space
> to create a reservation window, then we check the bitmap to see if it
> contains any free block. If not, we search for the next available
> window, and so on. During the whole process we are holding the global
> rsv_lock. We could, and probably should, avoid that. Just unlock the
> rsv_lock before the bitmap search and re-grab it after it. We need to
> make sure that the available space is still available after we re-grab
> the lock.
>
> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, we will
> remove it from the tree in the next round of searching for the next
> available window.
>
> I prefer the second option, and plan to code it up soon. Any comments?

doesnt the first option also allow searches to be in parallel? This
particular one took over 1 msec, so it seems there's a fair room for
parallellizing multiple writers and thus improving scalability. (or is
this codepath serialized globally anyway?)

	Ingo
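Mingming's preferred second option, book a candidate window while holding rsv_lock, scan the bitmap with the lock dropped, then keep or discard the booking, can be sketched as follows. This is a userspace toy (a pthread mutex standing in for rsv_lock, no real tree), and every identifier is illustrative rather than the actual ext3 code:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t rsv_lock = PTHREAD_MUTEX_INITIALIZER;

struct rsv_window { int start, end; bool booked; };

/* Phase 1: under rsv_lock, "book" the candidate window so that other
 * searchers skip this range while we scan the bitmap lock-free. */
static void book_window(struct rsv_window *w, int start, int size)
{
    pthread_mutex_lock(&rsv_lock);
    w->start  = start;
    w->end    = start + size - 1;
    w->booked = true;
    pthread_mutex_unlock(&rsv_lock);
}

/* Phase 2: after the (lockless) bitmap scan, keep the window only if
 * the scan actually found a free bit inside it; otherwise discard it
 * so the next round of searching can move past it. */
static bool confirm_or_discard(struct rsv_window *w, int free_bit)
{
    bool keep;
    pthread_mutex_lock(&rsv_lock);
    keep = (free_bit >= w->start && free_bit <= w->end);
    if (!keep)
        w->booked = false;
    pthread_mutex_unlock(&rsv_lock);
    return keep;
}
```

The booking is what prevents the stomp-on-each-others'-windows race: concurrent tasks never scan the same candidate range at the same time, at the cost of briefly reserving a window that may turn out to be empty.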
Re: ext3 allocate-with-reservation latencies
On Wed, 2005-04-06 at 19:22 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:
>
> > > Possible, but not necessarily nice. If you've got a nearly-full disk,
> > > most bits will be already allocated. As you scan the bitmaps, it may
> > > take quite a while to find a free bit; do you really want to (a) lock
> > > the whole block group with a temporary window just to do the scan, or
> > > (b) keep allocating multiple smaller windows until you finally find a
> > > free bit? The former is bad for concurrency if you have multiple tasks
> > > trying to allocate nearby on disk --- you'll force them into different
> > > block groups. The latter is high overhead.
>
> > I am not quite sure what you mean by (a). In this proposal, we will
> > drop the lock before the scan.
>
> s/lock/reserve/.
>
> > And for (b), maybe I did not make myself clear: I am not proposing to
> > keep allocating multiple smaller windows until we finally find a free
> > bit. I mean, we book the window (just link the node into the tree)
> > before we drop the lock; if there is no free bit inside that window,
> > we go back and search for another window (call
> > find_next_reserveable_window()), and inside it we remove the temporary
> > window we just created and find the next window. So we only have one
> > temporary window at a time.
>
> And that's the problem. Either we create small temporary windows, in
> which case we may end up thrashing through vast numbers of them before
> we find a bit that's available --- very expensive as the disk gets full
> --- or we use large windows but get worse layout when there are parallel
> allocators going on.

Ah... I see your points. (a) is certainly not a good option.
(b) is not very nice, but not necessarily that bad when the disk is
really full: we are not only scanning the bitmap within the reserved
window range; instead, we scan the bitmap starting from the reserved
window's start block to the last block of the group to find the next
free bit. So in the case where the group is really full, we could reduce
the number of small windows to try.

Mingming
Re: ext3 allocate-with-reservation latencies
Hi,

On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:

> > Possible, but not necessarily nice. If you've got a nearly-full disk,
> > most bits will be already allocated. As you scan the bitmaps, it may
> > take quite a while to find a free bit; do you really want to (a) lock
> > the whole block group with a temporary window just to do the scan, or
> > (b) keep allocating multiple smaller windows until you finally find a
> > free bit? The former is bad for concurrency if you have multiple tasks
> > trying to allocate nearby on disk --- you'll force them into different
> > block groups. The latter is high overhead.
>
> I am not quite sure what you mean by (a). In this proposal, we will
> drop the lock before the scan.

s/lock/reserve/.

> And for (b), maybe I did not make myself clear: I am not proposing to
> keep allocating multiple smaller windows until we finally find a free
> bit. I mean, we book the window (just link the node into the tree)
> before we drop the lock; if there is no free bit inside that window, we
> go back and search for another window (call
> find_next_reserveable_window()), and inside it we remove the temporary
> window we just created and find the next window. So we only have one
> temporary window at a time.

And that's the problem. Either we create small temporary windows, in
which case we may end up thrashing through vast numbers of them before
we find a bit that's available --- very expensive as the disk gets full
--- or we use large windows but get worse layout when there are parallel
allocators going on.

--Stephen
Re: ext3 allocate-with-reservation latencies
On Wed, 2005-04-06 at 10:51 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:
>
> > It seems we are holding the rsv_lock while searching the bitmap for a
> > free bit.
>
> Probably something to avoid!
>
> > In alloc_new_reservation(), we first find an available space to create
> > a reservation window, then we check the bitmap to see if it contains
> > any free block. If not, we search for the next available window, and
> > so on. During the whole process we are holding the global rsv_lock.
> > We could, and probably should, avoid that. Just unlock the rsv_lock
> > before the bitmap search and re-grab it after it. We need to make sure
> > that the available space is still available after we re-grab the lock.
>
> Not necessarily. As long as the windows remain mutually exclusive in
> the rbtree, it doesn't matter too much if we occasionally allocate a
> full one --- as long as that case is rare, the worst that happens is
> that we fail to allocate from the window and have to repeat the window
> reserve.
>
> The difficulty will be in keeping it rare. What we need to avoid is the
> case where multiple tasks need a new window, they all drop the lock,
> find the same bits free in the bitmap, then all try to take that
> window. One will succeed, the others will fail; but as the files in
> that bit of the disk continue to grow, we risk those processes
> *continually* repeating the same stomp-on-each-others'-windows operation
> and raising the allocation overhead significantly.

Agreed. That's why the second option is preferred. When a file finds a
window open to book, it books it, then drops the lock and does the
bitmap search. So other files will not attempt to book the same window.

> > Another option is to hold that available window before we release the
> > rsv_lock, and if there is no free bit inside that window, we will
> > remove it from the tree in the next round of searching for the next
> > available window.
>
> Possible, but not necessarily nice. If you've got a nearly-full disk,
> most bits will be already allocated. As you scan the bitmaps, it may
> take quite a while to find a free bit; do you really want to (a) lock
> the whole block group with a temporary window just to do the scan, or
> (b) keep allocating multiple smaller windows until you finally find a
> free bit? The former is bad for concurrency if you have multiple tasks
> trying to allocate nearby on disk --- you'll force them into different
> block groups. The latter is high overhead.

I am not quite sure what you mean by (a). In this proposal, we will drop
the lock before the scan.

And for (b), maybe I did not make myself clear: I am not proposing to
keep allocating multiple smaller windows until we finally find a free
bit. I mean, we book the window (just link the node into the tree)
before we drop the lock; if there is no free bit inside that window, we
go back and search for another window (call
find_next_reserveable_window()), and inside it we remove the temporary
window we just created and find the next window. So we only have one
temporary window at a time.

In the case where we did find a free block inside the temporarily
created window, we make it official: we do the rbtree coloring/rotate
stuff then. Does it make sense? Let me know, thanks.

Maybe I should post my proposed patch here. (I am testing it now.)

---
 linux-2.6.11-ming/fs/ext3/balloc.c |  107 +++--
 1 files changed, 68 insertions(+), 39 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c
--- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency	2005-04-05 10:48:31.0 -0700
+++ linux-2.6.11-ming/fs/ext3/balloc.c	2005-04-06 09:24:24.203028968 -0700
@@ -241,8 +241,13 @@ void ext3_rsv_window_add(struct super_bl
 		BUG();
 	}
 
+	/*
+	 * we only link the node to the rb tree here;
+	 * the real color balancing will be done
+	 * after we confirm the window has some free
+	 * block.
+	 */
 	rb_link_node(node, parent, p);
-	rb_insert_color(node, root);
 }
 
 static void rsv_window_remove(struct super_block *sb,
@@ -749,24 +754,24 @@ fail_access:
  * to find a free region that is of my size and has not
  * been reserved.
  *
- * on succeed, it returns the reservation window to be appended to.
- * failed, return NULL.
  */
-static struct ext3_reserve_window_node *find_next_reservable_window(
+static int find_next_reservable_window(
 			struct ext3_reserve_window_node *search_head,
-			unsigned long size, int *start_block,
+			struct ext3_reserve_window_node *my_rsv,
+			struct super_block *sb, int start_block,
 			int last_block)
 {
 	struct rb_node *next;
 	struct
Re: ext3 allocate-with-reservation latencies
Hi,

On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:

> It seems we are holding the rsv_lock while searching the bitmap for a
> free bit.

Probably something to avoid!

> In alloc_new_reservation(), we first find an available space to create
> a reservation window, then we check the bitmap to see if it contains
> any free block. If not, we search for the next available window, and so
> on. During the whole process we are holding the global rsv_lock. We
> could, and probably should, avoid that. Just unlock the rsv_lock before
> the bitmap search and re-grab it after it. We need to make sure that
> the available space is still available after we re-grab the lock.

Not necessarily. As long as the windows remain mutually exclusive in
the rbtree, it doesn't matter too much if we occasionally allocate a
full one --- as long as that case is rare, the worst that happens is
that we fail to allocate from the window and have to repeat the window
reserve.

The difficulty will be in keeping it rare. What we need to avoid is the
case where multiple tasks need a new window, they all drop the lock,
find the same bits free in the bitmap, then all try to take that
window. One will succeed, the others will fail; but as the files in
that bit of the disk continue to grow, we risk those processes
*continually* repeating the same stomp-on-each-others'-windows operation
and raising the allocation overhead significantly.

> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, we will
> remove it from the tree in the next round of searching for the next
> available window.

Possible, but not necessarily nice. If you've got a nearly-full disk,
most bits will be already allocated. As you scan the bitmaps, it may
take quite a while to find a free bit; do you really want to (a) lock
the whole block group with a temporary window just to do the scan, or
(b) keep allocating multiple smaller windows until you finally find a
free bit? The former is bad for concurrency if you have multiple tasks
trying to allocate nearby on disk --- you'll force them into different
block groups. The latter is high overhead.

--Stephen
Re: ext3 allocate-with-reservation latencies
Hi,

On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:

> It seems we are holding the rsv_lock while searching the bitmap for a
> free bit.

Probably something to avoid!

> In alloc_new_reservation(), we first find an available spot to create
> a reservation window, then we check the bitmap to see if it contains
> any free block. If not, we search for the next available window, and
> so on. During the whole process we are holding the global rsv_lock.
> We could, and probably should, avoid that: just unlock the rsv_lock
> before the bitmap search and re-grab it afterwards. We need to make
> sure the space is still available after we re-grab the lock.

Not necessarily. As long as the windows remain mutually exclusive in the rbtree, it doesn't matter too much if we occasionally allocate a full one --- as long as that case is rare, the worst that happens is that we fail to allocate from the window and have to repeat the window reserve.

The difficulty will be in keeping it rare. What we need to avoid is the case where multiple tasks need a new window, they all drop the lock, find the same bits free in the bitmap, then all try to take that window. One will succeed, the others will fail; but as the files in that bit of the disk continue to grow, we risk those processes *continually* repeating the same stomp-on-each-others'-windows operation, raising the allocation overhead significantly.

> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, we will
> remove it from the tree in the next round of searching for the next
> available window.

Possible, but not necessarily nice. If you've got a nearly-full disk, most bits will be already allocated. As you scan the bitmaps, it may take quite a while to find a free bit; do you really want to (a) lock the whole block group with a temporary window just to do the scan, or (b) keep allocating multiple smaller windows until you finally find a free bit?

The former is bad for concurrency if you have multiple tasks trying to allocate nearby on disk --- you'll force them into different block groups. The latter is high overhead.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: ext3 allocate-with-reservation latencies
On Wed, 2005-04-06 at 10:51 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:
>
> > It seems we are holding the rsv_lock while searching the bitmap for
> > a free bit.
>
> Probably something to avoid!
>
> > In alloc_new_reservation(), we first find an available spot to
> > create a reservation window, then we check the bitmap to see if it
> > contains any free block. If not, we search for the next available
> > window, and so on. During the whole process we are holding the
> > global rsv_lock. We could, and probably should, avoid that: just
> > unlock the rsv_lock before the bitmap search and re-grab it
> > afterwards. We need to make sure the space is still available after
> > we re-grab the lock.
>
> Not necessarily. As long as the windows remain mutually exclusive in
> the rbtree, it doesn't matter too much if we occasionally allocate a
> full one --- as long as that case is rare, the worst that happens is
> that we fail to allocate from the window and have to repeat the
> window reserve.
>
> The difficulty will be in keeping it rare. What we need to avoid is
> the case where multiple tasks need a new window, they all drop the
> lock, find the same bits free in the bitmap, then all try to take
> that window. One will succeed, the others will fail; but as the files
> in that bit of the disk continue to grow, we risk those processes
> *continually* repeating the same stomp-on-each-others'-windows
> operation and raising the allocation overhead significantly.

Agreed. That's why the second option is preferred: when a file finds a window open to book, it books it, then drops the lock and does the bitmap search. So other files will not attempt to book the same window.

> > Another option is to hold that available window before we release
> > the rsv_lock, and if there is no free bit inside that window, we
> > will remove it from the tree in the next round of searching for the
> > next available window.
>
> Possible, but not necessarily nice. If you've got a nearly-full disk,
> most bits will be already allocated. As you scan the bitmaps, it may
> take quite a while to find a free bit; do you really want to (a) lock
> the whole block group with a temporary window just to do the scan, or
> (b) keep allocating multiple smaller windows until you finally find a
> free bit? The former is bad for concurrency if you have multiple
> tasks trying to allocate nearby on disk --- you'll force them into
> different block groups. The latter is high overhead.

I don't quite understand what you mean by (a): in this proposal, we drop the lock before the scan. And for (b), maybe I did not make myself clear: I am not proposing to keep allocating multiple smaller windows until we finally find a free bit. I mean, we book the window (just link the node into the tree) before we drop the lock; if there is no free bit inside that window, we go back to search for another window (call find_next_reserveable_window()), and inside that we remove the temporary window we just created and find the next one. So we only have one temporary window at a time. In the case where we do find a free block inside the temporarily created window, we make it official: we do the rb tree coloring/rotation stuff then.

Does it make sense? Let me know, thanks. Maybe I should post my proposed patch here. (I am testing it now.)

---

 linux-2.6.11-ming/fs/ext3/balloc.c |  107 +++--
 1 files changed, 68 insertions(+), 39 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c
--- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency	2005-04-05 10:48:31.0 -0700
+++ linux-2.6.11-ming/fs/ext3/balloc.c	2005-04-06 09:24:24.203028968 -0700
@@ -241,8 +241,13 @@ void ext3_rsv_window_add(struct super_bl
 		BUG();
 	}
 
+	/*
+	 * we only link the node to the rb tree here
+	 * the real color balancing will be done
+	 * after we confirm the window has some free
+	 * block.
+	 */
 	rb_link_node(node, parent, p);
-	rb_insert_color(node, root);
 }
 
 static void rsv_window_remove(struct super_block *sb,
@@ -749,24 +754,24 @@ fail_access:
  * to find a free region that is of my size and has not
  * been reserved.
  *
- * on succeed, it returns the reservation window to be appended to.
- * failed, return NULL.
  */
-static struct ext3_reserve_window_node *find_next_reservable_window(
+static int find_next_reservable_window(
 			struct ext3_reserve_window_node *search_head,
-			unsigned long size, int *start_block,
+			struct ext3_reserve_window_node *my_rsv,
+			struct super_block * sb,
+			int start_block,
 			int last_block)
 {
 	struct rb_node *next;
 	struct ext3_reserve_window_node *rsv, *prev;
 	int cur;
+
Re: ext3 allocate-with-reservation latencies
Hi,

On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:

> > Possible, but not necessarily nice. If you've got a nearly-full
> > disk, most bits will be already allocated. As you scan the bitmaps,
> > it may take quite a while to find a free bit; do you really want to
> > (a) lock the whole block group with a temporary window just to do
> > the scan, or (b) keep allocating multiple smaller windows until you
> > finally find a free bit? The former is bad for concurrency if you
> > have multiple tasks trying to allocate nearby on disk --- you'll
> > force them into different block groups. The latter is high overhead.
>
> I don't quite understand what you mean by (a): in this proposal, we
> drop the lock before the scan.

s/lock/reserve/.

> And for (b), maybe I did not make myself clear: I am not proposing to
> keep allocating multiple smaller windows until we finally find a free
> bit. I mean, we book the window (just link the node into the tree)
> before we drop the lock; if there is no free bit inside that window,
> we go back to search for another window (call
> find_next_reserveable_window()), and inside that we remove the
> temporary window we just created and find the next one. So we only
> have one temporary window at a time.

And that's the problem. Either we create small temporary windows, in which case we may end up thrashing through vast numbers of them before we find a bit that's available --- very expensive as the disk gets full --- or we use large windows but get worse layout when there are parallel allocators going on.

--Stephen
Re: ext3 allocate-with-reservation latencies
On Wed, 2005-04-06 at 19:22 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:
>
> > > Possible, but not necessarily nice. If you've got a nearly-full
> > > disk, most bits will be already allocated. As you scan the
> > > bitmaps, it may take quite a while to find a free bit; do you
> > > really want to (a) lock the whole block group with a temporary
> > > window just to do the scan, or (b) keep allocating multiple
> > > smaller windows until you finally find a free bit? The former is
> > > bad for concurrency if you have multiple tasks trying to allocate
> > > nearby on disk --- you'll force them into different block groups.
> > > The latter is high overhead.
> >
> > I don't quite understand what you mean by (a): in this proposal, we
> > drop the lock before the scan.
>
> s/lock/reserve/.
>
> > And for (b), maybe I did not make myself clear: I am not proposing
> > to keep allocating multiple smaller windows until we finally find a
> > free bit. I mean, we book the window (just link the node into the
> > tree) before we drop the lock; if there is no free bit inside that
> > window, we go back to search for another window (call
> > find_next_reserveable_window()), and inside that we remove the
> > temporary window we just created and find the next one. So we only
> > have one temporary window at a time.
>
> And that's the problem. Either we create small temporary windows, in
> which case we may end up thrashing through vast numbers of them
> before we find a bit that's available --- very expensive as the disk
> gets full --- or we use large windows but get worse layout when there
> are parallel allocators going on.

Ah... I see your point. (a) is certainly not a good option. (b) is not very nice, but not necessarily that bad when the disk is really full: we are not only scanning the bitmap within the reserved window range; instead, we scan the bitmap starting from the reserved window's start block up to the last block of the group to find the next free bit. So in the case where the group is really full, we can reduce the number of small windows to try.

Mingming
Re: ext3 allocate-with-reservation latencies
On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> * Lee Revell <[EMAIL PROTECTED]> wrote:
>
> > I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks
> > like inside ext3_try_to_allocate_with_rsv, we spend a long time in
> > this loop:
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
>
> Breaking the lock is not really possible at that point, and it doesnt
> look too easy to make that path preemptable either. (To make it
> preemptable rsv_lock would need to become a semaphore (this could be
> fine, as it's only used when a new reservation window is created).)

It seems we are holding the rsv_lock while searching the bitmap for a free bit. In alloc_new_reservation(), we first find an available spot to create a reservation window, then we check the bitmap to see if it contains any free block. If not, we search for the next available window, and so on. During the whole process we are holding the global rsv_lock.

We could, and probably should, avoid that: just unlock the rsv_lock before the bitmap search and re-grab it afterwards. We need to make sure the space is still available after we re-grab the lock.

Another option is to hold that available window before we release the rsv_lock, and if there is no free bit inside that window, remove it from the tree in the next round of searching for the next available window.

I prefer the second option, and plan to code it up soon. Any comments?

Thanks,
Mingming
Re: ext3 allocate-with-reservation latencies
On Mon, 2005-04-04 at 23:10 -0700, Mingming Cao wrote:
> On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> > * Lee Revell <[EMAIL PROTECTED]> wrote:
> >
> > > I can trigger latencies up to ~1.1 ms with a CVS checkout. It
> > > looks like inside ext3_try_to_allocate_with_rsv, we spend a long
> > > time in this loop:
>
> We have not modified the reservation create algorithm for a long time.
> Sorry, I missed the background here; on which kernel did you see this
> latency? And what tests were you running?

Makes sense, I have been seeing this one for a long time. I get it with dbench, or when doing a CVS checkout of a large module. Kernel is 2.6.12-rc1-RT-V0.7.43-05, with PREEMPT_DESKTOP which AFAIK should be equivalent to mainline.

Lee
Re: ext3 allocate-with-reservation latencies
On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> * Lee Revell <[EMAIL PROTECTED]> wrote:
>
> > I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks
> > like inside ext3_try_to_allocate_with_rsv, we spend a long time in
> > this loop:

We have not modified the reservation create algorithm for a long time. Sorry, I missed the background here; on which kernel did you see this latency? And what tests were you running?

> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)

The loop in bitmap_search_next_usable_block() tries to find the next zero bit on the bitmap; if it finds one, it then needs to check whether that bit is also zero in the journalling's last-committed copy of the bitmap. The double check is there to prevent re-using a just-freed block on disk before the free is committed in the journal. If the bit is not marked free in the journalling copy, the code finds the next free bit on that copy first, then loops back, etc.
The bitmap_search_next_usable_block() code is quite simple:

static int bitmap_search_next_usable_block(int start, struct buffer_head *bh,
					int maxblocks)
{
	int next;
	struct journal_head *jh = bh2jh(bh);

	/*
	 * The bitmap search --- search forward alternately through the
	 * actual bitmap and the last-committed copy until we find a bit
	 * free in both
	 */
	while (start < maxblocks) {
		next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
		if (next >= maxblocks)
			return -1;
		if (ext3_test_allocatable(next, bh))
			return next;
		jbd_lock_bh_state(bh);
		if (jh->b_committed_data)
			start = ext3_find_next_zero_bit(jh->b_committed_data,
							maxblocks, next);
		jbd_unlock_bh_state(bh);
	}
	return -1;
}

The latency report shows the pattern of two calls to ext3_find_next_zero_bit() after a call to ext3_test_allocatable(), which matches what I suspect above. I wonder if re-running the test with the -o noreservation mount option makes any difference?

> Breaking the lock is not really possible at that point, and it doesnt
> look too easy to make that path preemptable either. (To make it
> preemptable rsv_lock would need to become a semaphore (this could be
> fine, as it's only used when a new reservation window is created).)
>
> The hard part is the seqlock - the read side is performance-critical,
> maybe it could be solved via a preemptable but still scalable seqlock
> variant that uses a semaphore for the write side? It all depends on
> what the scalability impact of using a semaphore for the new-window
> code would be.

The seqlock was there to prevent an SMP race, but it actually turned out to be unnecessary, and it has been removed recently. Even if it were still there, it would not cause much latency here: the whole of ext3_try_to_allocate_with_rsv() is guarded by ei->truncate_sem, so there are no multiple readers/writers at the same time.
Re: ext3 allocate-with-reservation latencies
* Lee Revell <[EMAIL PROTECTED]> wrote:

> I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks
> like inside ext3_try_to_allocate_with_rsv, we spend a long time in
> this loop:
>
> ext3_test_allocatable (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
>
> ext3_test_allocatable (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)

Breaking the lock is not really possible at that point, and it doesn't look too easy to make that path preemptable either. (To make it preemptable rsv_lock would need to become a semaphore (this could be fine, as it's only used when a new reservation window is created).)

The hard part is the seqlock - the read side is performance-critical; maybe it could be solved via a preemptable but still scalable seqlock variant that uses a semaphore for the write side? It all depends on what the scalability impact of using a semaphore for the new-window code would be.

The best long-term solution for these types of tradeoffs seems to be to add a locking primitive that is a spinlock on !PREEMPT kernels and a semaphore on PREEMPT kernels. I.e. not as drastic as a full PREEMPT_RT kernel, but good enough to make latency-critical codepaths of ext3 preemptable, without having to hurt scalability on !PREEMPT.

The PREEMPT_RT kernel has all the 'compile-time type-switching' infrastructure for such tricks; all that needs to be changed to switch a lock's type is the spinlock definition - all the spin_lock() uses can remain unchanged. (The same method is used on PREEMPT_RT to have 'dual-type' spinlocks.)

The same thing could then also be used for things like the mm lock, and other longer-held locks that PREEMPT would like to see preemptable. It would also be a good first step towards merging the PREEMPT_RT infrastructure ;-) I'll cook up something.
Ingo
ext3 allocate-with-reservation latencies
I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks like inside ext3_try_to_allocate_with_rsv, we spend a long time in this loop:

ext3_test_allocatable (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)

ext3_test_allocatable (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)

etc.

Lee

ext3_reservation.bz2
Description: application/bzip