Re: [Ext2-devel] Re: ext3 allocate-with-reservation latencies

2005-04-18 Thread Mingming Cao
On Mon, 2005-04-18 at 19:00 +0100, Stephen C. Tweedie wrote:

> > > Note that there _is_ some additional complexity here.  It's not entirely
> > > free.  Specifically, if we have other tasks waiting on our provisional
> > > window, then we need to be very careful about the life expectancy of the
> > > wait queues involved, so that if the first task fixes its window and
> > > then deletes it before the waiters have woken up, they don't get
> > > confused by the provisional window vanishing while being waited for.
> 
> > This approach (provisional window) sounds interesting to me, but it's a
> > little more complicated than I thought :(
> 
> Yep.  Once you have other tasks waiting on your window while it's not
> locked, object lifetime becomes a much bigger problem.
> 
> > alloc_new_reservation()
> > retry:
> > lock the tree
> > search for a new space start from the given startblock
> > found a new space
> > if the new space is not a "provisional window" 

> I was thinking rather that we _start_ with the window, and _then_ look
> for new space.
> 

It seems I am lost here.  Could you elaborate, please?  What is "the
window" you are referring to here? The old (stale) reservation window?
I thought we were discussing the algorithm for how to allocate a new
window.

> So we'd start with:
> 
> if we already have a window,
>     mark it provisional;
> else,
>     do
>         search for a new window;
>         if the immediately preceding window is provisional,
>             wait for that window;
>             continue;
>         allocate the window and mark it provisional;
>         break
> 
> At this point we have a provisional window, and we know that nobody else
> is going to allocate either in it, or in the empty space following it
> (because if they tried to, they'd bump into our own provisional window
> as their predecessor and would wait for us.)  So even if the window
> doesn't occupy the _whole_ unreserved area, we can still keep searching
> without fear of multiple tasks trying to do so in the same space at the
> same time.
> 
> --Stephen
> 
> 
Hmm.. This thread was to address the latency of holding the global
per-fs reservation lock while scanning the block group bitmap.  And the
whole point of the "provisional window" is to prevent other inodes from
being forced into other block groups while one inode holds/reserves the
whole block group as a temporary window to do the bitmap scan. I clearly
see it's a win if there are not multiple threads allocating blocks
nearby at the same time. However the whole reservation scheme is there
to address the case where multiple allocations happen at the same time
near the same place. 

With the provisional method, if multiple threads are trying to allocate
new windows in the same area, they still have to wait for each other's
new-window-allocation-and-bitmap-scan to finish. After that they will
probably compete for the same window again and again :(  I am worried
that the latency is not improved over before (holding the lock for the
bitmap scan), and we have paid extra cpu time or context switches. Also
the changes to the algorithm are not going to be trivial. 

Now we all agree that the right thing to fix the latency is to drop the
lock and then scan the bitmap. Before that we need to reserve the open
window in case someone else is trying to target the same window.
The question was whether we should reserve the whole free reservable
space or just the window size we need.  Now that we have explored and
discussed so many possible solutions, I think the least evil and least
intrusive way is to just lock the small window.  We were worried that,
in the case where the scanned free bit is not inside the temporarily
reserved window, we might have to do many window unlink and link
operations on the rb tree; however, this is not avoidable with the
provisional window proposal either.  Probably we could start with that
simple proposal first, and if benchmarks show the cpu usage is really
high and a concern, we could address that later? 

Thanks,
Mingming



Re: ext3 allocate-with-reservation latencies

2005-04-18 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-04-15 at 21:32, Mingming Cao wrote:

> Sorry for the delay. I was not in the office these days.
No problem.


> > > Also I am concerned about the possible
> > > starvation of writers.
> > In what way?
> I was worried about the rw lock case. :)

OK, so we're both on the same track here. :-)

> > Note that there _is_ some additional complexity here.  It's not entirely
> > free.  Specifically, if we have other tasks waiting on our provisional
> > window, then we need to be very careful about the life expectancy of the
> > wait queues involved, so that if the first task fixes its window and
> > then deletes it before the waiters have woken up, they don't get
> > confused by the provisional window vanishing while being waited for.

> This approach (provisional window) sounds interesting to me, but it's a
> little more complicated than I thought :(

Yep.  Once you have other tasks waiting on your window while it's not
locked, object lifetime becomes a much bigger problem.

> alloc_new_reservation()
> retry:
> lock the tree
>   search for a new space start from the given startblock
>   found a new space
>   if the new space is not a "provisional window" 

I was thinking rather that we _start_ with the window, and _then_ look
for new space.

So we'd start with:

if we already have a window,
    mark it provisional;
else,
    do
        search for a new window;
        if the immediately preceding window is provisional,
            wait for that window;
            continue;
        allocate the window and mark it provisional;
        break

At this point we have a provisional window, and we know that nobody else
is going to allocate either in it, or in the empty space following it
(because if they tried to, they'd bump into our own provisional window
as their predecessor and would wait for us.)  So even if the window
doesn't occupy the _whole_ unreserved area, we can still keep searching
without fear of multiple tasks trying to do so in the same space at the
same time.
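
For concreteness, here is a minimal C-style sketch of that loop.  All
types and helpers (find_prev_window, insert_window_after, the
RSV_NOT_ALLOCATED sentinel) are invented for illustration; this is not
the actual ext3 code, and the waiter-lifetime issue is only flagged in
a comment, not solved:

#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

#define RSV_NOT_ALLOCATED  (~0UL)               /* invented sentinel */

struct rsv_window {
        struct rb_node          node;
        unsigned long           start, end;
        int                     provisional;
        wait_queue_head_t       wait;           /* tasks blocked on us */
};

struct rsv_tree {
        spinlock_t              rsv_lock;
        struct rb_root          root;
};

/* rb-tree helpers, bodies elided: ordinary search and insert */
struct rsv_window *find_prev_window(struct rsv_tree *t, unsigned long goal);
void insert_window_after(struct rsv_tree *t, struct rsv_window *prev,
                         struct rsv_window *w, unsigned long goal);

static void take_provisional_window(struct rsv_tree *tree,
                                    struct rsv_window *my_rsv,
                                    unsigned long goal)
{
        struct rsv_window *prev;

        spin_lock(&tree->rsv_lock);
        if (my_rsv->start != RSV_NOT_ALLOCATED) {
                my_rsv->provisional = 1;        /* reuse the window we own */
                spin_unlock(&tree->rsv_lock);
                return;
        }
        for (;;) {
                prev = find_prev_window(tree, goal);
                if (prev && prev->provisional) {
                        /*
                         * Someone is scanning just ahead of us: wait for
                         * them.  Note that prev may be deleted while we
                         * sleep; a real implementation needs refcounting
                         * here -- exactly the lifetime problem above.
                         */
                        spin_unlock(&tree->rsv_lock);
                        wait_event(prev->wait, !prev->provisional);
                        spin_lock(&tree->rsv_lock);
                        continue;
                }
                insert_window_after(tree, prev, my_rsv, goal);
                my_rsv->provisional = 1;
                break;
        }
        spin_unlock(&tree->rsv_lock);
}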

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-13 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-13 at 00:27, Mingming Cao wrote:

> > I wonder if there's not a simple solution for this --- mark the window
> > as "provisional", and if any other task tries to allocate in the space
> > immediately following such a window, it needs to block until that window
> > is released.

> Sounds interesting. However that implies we need a write lock to mark
> the window as provisional and block other files looking for windows near
> it: we need to insert the provisional window into the tree and then mark
> it as a temporary window, to really let other files notice this kind of
> "hold".

We need a lock for the tree modification, yes.

> I wonder if the benefit of read/write lock is worth all the hassles now.

The point of the provisional window is that you don't need read/write
locks at all.  The provisional window lets you unlock the tree
completely while you do the bitmap scan, so there's really no reason to
have rwlocks for the tree any more.

> If the new window's neighbor stays the same, we only need to roll
> forward once.  If not, after a successful scan, we need to grab the
> write lock, and make sure the window is still there.

When we take the provisional window, we can make a note of how much
space we have before the next window.  And because all future allocators
will stall if they try to allocate at this point due to the provisional
window, we know that that space will remain outside any other window
until we come to fix the provisional window and potentially roll it
forward to the space we found.
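
In code, that note might look something like this (a sketch with
invented names such as gap_end and note_free_gap(), not the actual
implementation):

struct rsv_window {
        unsigned long   start, end;     /* the provisional window itself */
        unsigned long   gap_end;        /* free space known to end here */
};

/*
 * Called under the tree lock, right after the provisional window is
 * inserted.  Nobody can allocate between our end and gap_end in the
 * meantime: they would find us as their provisional predecessor and
 * block.
 */
static void note_free_gap(struct rsv_window *w, struct rsv_window *next,
                          unsigned long group_last_block)
{
        w->gap_end = next ? next->start - 1 : group_last_block;
}

A free bit found at or below gap_end can then be taken by simply
rolling the window forward; only a bit beyond it needs the full
relock-and-recheck.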

>  If we dropped the
> lock without holding the new space, we have to search the whole tree
> again to find out if the space is still there

As long as the space is within the area between the provisional window
and its successor, we don't need to do that.  (It gets more complex if
the bitmap search returns a bit _beyond_ the next window, though.)

> Also I am concerned about the possible
> starvation of writers.

In what way?

> I am thinking, maybe going back to the spin_lock is not that bad with the
> "mark provisional" suggestion you made?

Right, that was the intent --- sorry if I didn't make it clear. 

>  It allows us to mark the new space
> as provisional if we find a new space (preventing other window searches
> from running into the same neighborhood). We could release the lock and
> scan the bitmap without worrying about the new space being taken by
> others.

Exactly.

Note that there _is_ some additional complexity here.  It's not entirely
free.  Specifically, if we have other tasks waiting on our provisional
window, then we need to be very careful about the life expectancy of the
wait queues involved, so that if the first task fixes its window and
then deletes it before the waiters have woken up, they don't get
confused by the provisional window vanishing while being waited for.

The easy solution is a global wait queue for that, but that doesn't
scale well.  The complex solution is a per-window wait queue and
reference count, which is obviously a bit of bloat, though probably
worth it for the high-load case.
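
A rough sketch of that per-window scheme (invented names; illustrative
only, not ext3 code):

#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/wait.h>

struct rsv_window {
        wait_queue_head_t       wait;   /* tasks blocked on this window */
        atomic_t                refs;   /* the owner plus any waiters */
        int                     provisional;
};

static void rsv_put(struct rsv_window *w)
{
        if (atomic_dec_and_test(&w->refs))
                kfree(w);               /* last reference frees it */
}

/*
 * Waiter side: take a reference while still holding the tree lock
 * (so the window cannot yet be freed) before sleeping.  The wait
 * queue then stays valid even if the owner fixes and deletes the
 * window before we wake up.
 */
static void rsv_wait_provisional(struct rsv_window *w)
{
        atomic_inc(&w->refs);
        /* ... drop the tree lock here ... */
        wait_event(w->wait, !w->provisional);
        rsv_put(w);
}

/*
 * Owner side: fixing the window clears the flag and wakes waiters;
 * deleting it unlinks it and drops the owner's reference.  Waiters
 * still hold their own references, so the memory survives until the
 * last of them has woken.
 */
static void rsv_fix_window(struct rsv_window *w)
{
        w->provisional = 0;
        wake_up_all(&w->wait);
}

static void rsv_delete_window(struct rsv_window *w)
{
        /* ... unlink from the tree under the tree lock ... */
        rsv_put(w);
}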

Cheers,
 Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-12 Thread Mingming Cao
On Tue, 2005-04-12 at 12:18 +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
> 
> > > Note that this may improve average case latencies, but it's not likely
> > > to improve worst-case ones.  We still need a write lock to install a new
> > > window, and that's going to have to wait for us to finish finding a free
> > > bit even if that operation starts using a read lock.  
> > > 
> > Yes indeed. However nothing is free and there are always trade-offs. :)
> > 
> > By worst case you mean multiple writes trying to allocate blocks around
> > the same area?
> 
> It doesn't matter where they are; multiple new file opens will all be
> looking for a write lock.  You only need one long-held read lock and all
> the writers still block.  The worst-case latencies can't be properly
> solved with r/w locks --- those let the readers go more quickly
> (assuming they are in the majority), which helps the average case, but
> writers still have to wait for exclusive access.  We only really help
> them by dropping the lock entirely.
> 
> > Even if we take out the whole
> > reservation, we could still run into this kind of latency: the
> > bitmap on disk and the one in the journal are extremely inconsistent
> > so we need to keep searching them both until we find a bit that is
> > free on both maps.
> 
> Quite possibly.  But as long as that code is running without locks, it's
> much easier to deal with those latencies: they won't impact other CPUs,
> cond_resched() is easier, and there's even CONFIG_PREEMPT.
> 

Okay, I agree.

> > > I'm not really sure what to do about worst-case here.  For that, we
> > > really do want to drop the lock entirely while we do the bitmap scan.
> 
> > Hmm...if we drop the lock entirely while scanning the bitmap, assuming
> > you mean drop the read lock, then I am afraid we have to re-check the
> > tree (requiring a read or write lock) to see if the new window space is
> > still there after the scan succeeds.
> 
> Sure.  You basically start off with a provisional window, and then if
> necessary roll it forward just the same way you roll normal windows
> forward when they get to their end.  That means you can still drop the
> lock while you search for new space.  When you get there, reacquire the
> lock and check that the intervening space is still available.
> 
Please note that the bitmap scan does not only scan the provisional
window range; it will return the first free bit on the bitmap, starting
from the start block of the provisional window, up to the end of the
whole block group.

So in the case where the new window's neighbor is the same as the old
one (that means the window is rolled forward), we only need to roll
forward once to find it. Either we find a free bit inside the
provisional window, or we find a free bit outside it. In the first case
we just roll the window forward and we are done. In the second case,
it's possible the free bit is inside someone else's window (which means
we can't take that window) or inside the new space after an already
reserved window. Either way we have to lock up the whole tree to remove
the old window and insert the new window.
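
As a sketch, the decision after the scan could look like this (names
invented for illustration):

#include <linux/spinlock.h>

struct rsv_window {
        unsigned long start, end;
};

struct rsv_tree;                /* per-fs tree, lock and helpers elided */
void window_roll_forward(struct rsv_window *w, unsigned long bit);
void window_remove(struct rsv_tree *t, struct rsv_window *w);
void window_insert_near(struct rsv_tree *t, struct rsv_window *w,
                        unsigned long bit);

static void settle_scan_result(struct rsv_tree *tree, spinlock_t *rsv_lock,
                               struct rsv_window *my_rsv, unsigned long bit)
{
        if (bit <= my_rsv->end) {
                /* free bit inside our provisional window: roll the
                 * window forward to it and we are done */
                window_roll_forward(my_rsv, bit);
        } else {
                /* beyond our window: the bit may sit inside someone
                 * else's window, or in new space after an already
                 * reserved window -- either way relock the whole
                 * tree, unlink the old window and try to insert a
                 * new one around 'bit' */
                spin_lock(rsv_lock);
                window_remove(tree, my_rsv);
                window_insert_near(tree, my_rsv, bit);
                spin_unlock(rsv_lock);
        }
}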

> That's really cheap for the common case.  The difficulty is when you
> have many parallel allocations hitting the same bg: they allocate
> provisional windows, find the same free area later on in the bg, and
> then stomp on each other as they try to move their windows there.
> 

> I wonder if there's not a simple solution for this --- mark the window
> as "provisional", and if any other task tries to allocate in the space
> immediately following such a window, it needs to block until that window
> is released.
> 

Sounds interesting. However that implies we need a write lock to mark
the window as provisional and block other files looking for windows near
it: we need to insert the provisional window into the tree and then mark
it as a temporary window, to really let other files notice this kind of
"hold".

I wonder if the benefit of read/write lock is worth all the hassles now.
If the new window's neighbor stays the same, we only need to roll
forward once.  If not, after a successful scan, we need to grab the
write lock, and make sure the window is still there. If we dropped the
lock without holding the new space, we have to search the whole tree
again to find out if the space is still there; we cannot use the
previous node returned by find_next_reservable_window(), since the
previous node could be gone while we scan the bitmap without the lock.
Basically we do the tree search twice (once for the roll-forward case)
and lock twice in the normal case. Also I am concerned about the
possible starvation of writers.

I am thinking, maybe going back to the spin_lock is not that bad with
the "mark provisional" suggestion you made? It allows us to mark the new
space as provisional if we find a new space (preventing other window
searches from running into the same neighborhood). We could release the
lock and scan the bitmap without worrying about the new space being
taken by others.  If there is a free bit, then we could just clear the
provisional bit and simply 

Re: ext3 allocate-with-reservation latencies

2005-04-12 Thread Stephen C. Tweedie
Hi,

On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:

> > Note that this may improve average case latencies, but it's not likely
> > to improve worst-case ones.  We still need a write lock to install a new
> > window, and that's going to have to wait for us to finish finding a free
> > bit even if that operation starts using a read lock.  
> > 
> Yes indeed. However nothing is free and there are always trade-offs. :)
> 
> By worst case you mean multiple writes trying to allocate blocks around
> the same area?

It doesn't matter where they are; multiple new file opens will all be
looking for a write lock.  You only need one long-held read lock and all
the writers still block.  The worst-case latencies can't be properly
solved with r/w locks --- those let the readers go more quickly
(assuming they are in the majority), which helps the average case, but
writers still have to wait for exclusive access.  We only really help
them by dropping the lock entirely.

> Even if we take out the whole
> reservation, we could still run into this kind of latency: the
> bitmap on disk and the one in the journal are extremely inconsistent
> so we need to keep searching them both until we find a bit that is
> free on both maps.

Quite possibly.  But as long as that code is running without locks, it's
much easier to deal with those latencies: they won't impact other CPUs,
cond_resched() is easier, and there's even CONFIG_PREEMPT.

> > I'm not really sure what to do about worst-case here.  For that, we
> > really do want to drop the lock entirely while we do the bitmap scan.

> Hmm...if we drop the lock entirely while scanning the bitmap, assuming
> you mean drop the read lock, then I am afraid we have to re-check the
> tree (requiring a read or write lock) to see if the new window space is
> still there after the scan succeeds.

Sure.  You basically start off with a provisional window, and then if
necessary roll it forward just the same way you roll normal windows
forward when they get to their end.  That means you can still drop the
lock while you search for new space.  When you get there, reacquire the
lock and check that the intervening space is still available.

That's really cheap for the common case.  The difficulty is when you
have many parallel allocations hitting the same bg: they allocate
provisional windows, find the same free area later on in the bg, and
then stomp on each other as they try to move their windows there.

I wonder if there's not a simple solution for this --- mark the window
as "provisional", and if any other task tries to allocate in the space
immediately following such a window, it needs to block until that window
is released.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-12 Thread Mingming Cao
On Mon, 2005-04-11 at 20:57 +0100, Stephen C. Tweedie wrote:
> Hi,

Hi Stephen, 

> 
> On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:
> > Ah.. I see the win with the read lock now: once a reservation window
> > is added, updating it (either winding it forward or searching for an
> > available window) is probably the majority of the operations on the
> > tree, and using a read lock for that should help reduce the latency.
> 
> Right.  The down side is that for things like a kernel "tar xf", we'll
> be doing lots of small file unpacks, and hopefully most files will be
> just one or two reservations --- so there's little winding forward going
> on.  The searching will still be improved in that case.

Just a side note that "tar xf" should know the file size before
unpacking it.  So it could set the reservation window size, via an ioctl
command, large enough to fit the entire file before doing the file
write.
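
For example (a userspace sketch, assuming the EXT3_IOC_SETRSVSZ ioctl;
its availability and header location vary by kernel version):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/ext3_fs.h>      /* EXT3_IOC_SETRSVSZ, where available */

/* best effort: ask for a reservation window big enough for the whole
 * file before writing it; fall back silently to the default window */
static int open_with_reservation(const char *path, long blocks)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd >= 0)
                ioctl(fd, EXT3_IOC_SETRSVSZ, &blocks);
        return fd;
}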

> Note that this may improve average case latencies, but it's not likely
> to improve worst-case ones.  We still need a write lock to install a new
> window, and that's going to have to wait for us to finish finding a free
> bit even if that operation starts using a read lock.  
> 
Yes indeed. However nothing is free and there are always trade-offs. :)

By worst case you mean multiple writes trying to allocate blocks around
the same area?

But I wonder if the latency seen by Lee belongs to this worst case: the
latency comes mostly from the loop calling find_next_zero_bit() in
bitmap_search_next_usable_block(). Even if we take out the whole
reservation, we could still run into this kind of latency: the bitmap
on disk and the one in the journal are extremely inconsistent, so we
need to keep searching them both until we find a bit that is free on
both maps.
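
The shape of that loop, heavily condensed (ext3_find_next_zero_bit()
is the real primitive; the journal-side helper here is paraphrased,
not the actual function):

#include <linux/buffer_head.h>

/* paraphrased journal-side check: is 'bit' also free in the bitmap's
 * committed copy held by the journal?  (invented helper name) */
int free_in_committed_data(struct buffer_head *bh, int bit);

/*
 * A block is only usable if it is free both in the on-disk bitmap and
 * in the journal's committed copy, so when the two disagree the search
 * can ping-pong between them for a long time.
 */
static int search_next_usable_block(int start, struct buffer_head *bh,
                                    int maxblocks)
{
        int next = start;

        while (next < maxblocks) {
                next = ext3_find_next_zero_bit(bh->b_data, maxblocks, next);
                if (next >= maxblocks)
                        return -1;      /* nothing free on disk */
                if (free_in_committed_data(bh, next))
                        return next;    /* free on both maps: usable */
                next++;                 /* inconsistent bit: keep going */
        }
        return -1;
}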

> I'm not really sure what to do about worst-case here.  For that, we
> really do want to drop the lock entirely while we do the bitmap scan.
> 

Hmm...if we drop the lock entirely while scanning the bitmap, assuming
you mean drop the read lock, then I am afraid we have to re-check the
tree (requiring a read or write lock) to see if the new window space is
still there after the scan succeeds. This is probably not very
interesting for the window rotating case.

> That leaves two options.  Hold a reservation while we do that; or don't.
> Holding one poses the problems we discussed before: either you hold a
> large reservation (bad for disk layout in the presence of concurrent
> allocators), or you hold smaller ones (high cost as you continually
> advance the window, which requires some read lock on the tree to avoid
> bumping into the next window.)
> 

Well, we cannot hold a reservation (which needs to update the tree)
without a write lock. I guess if we want to improve the average case
latency by replacing the current spin_lock with a read lock for the
new window space search, we don't have much choice here.

> Just how bad would it be if we didn't hold a lock _or_ a window at all
> while doing the search for new window space?  

I wonder if this is feasible: Walk through the rb tree without a lock?
What if some node is being removed by another thread while we are
walking through the tree and trying to get the next node from it?


Thanks,

Mingming



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:

> I agree. We should not skip the home block group of the file.  I guess
> what I was suggesting is, if allocation from the home group failed and
> we continue the linear search through the rest of the block groups, we
> could probably try to skip the block groups without enough usable free
> blocks to make a reservation.

Fair enough.  Once those are the only bgs left, performance is going to
drop pretty quickly, but that's not really avoidable on a very full
filesystem.

> Ah.. I see the win with the read lock now: once a reservation window
> is added, updating it (either winding it forward or searching for an
> available window) is probably the majority of the operations on the
> tree, and using a read lock for that should help reduce the latency.

Right.  The down side is that for things like a kernel "tar xf", we'll
be doing lots of small file unpacks, and hopefully most files will be
just one or two reservations --- so there's little winding forward going
on.  The searching will still be improved in that case.

Note that this may improve average case latencies, but it's not likely
to improve worst-case ones.  We still need a write lock to install a new
window, and that's going to have to wait for us to finish finding a free
bit even if that operation starts using a read lock.  

I'm not really sure what to do about worst-case here.  For that, we
really do want to drop the lock entirely while we do the bitmap scan.

That leaves two options.  Hold a reservation while we do that; or don't.
Holding one poses the problems we discussed before: either you hold a
large reservation (bad for disk layout in the presence of concurrent
allocators), or you hold smaller ones (high cost as you continually
advance the window, which requires some read lock on the tree to avoid
bumping into the next window.)

Just how bad would it be if we didn't hold a lock _or_ a window at all
while doing the search for new window space?  I didn't like that
alternative at first because of the problem when you've got multiple
tasks trying to allocate in the same space at the same time; but given
the locking overhead of the alternatives, I'm wondering if this is
actually the lesser evil.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Lee Revell
On Mon, 2005-04-11 at 15:12 -0400, Lee Revell wrote:
> On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote:
> > I will work on a patch for Lee to try sometime tonight.
> > 
> 
> Just FYI, this will take a while to test, because this latency seems
> quite rare.  I haven't seen it again since my original report.

Never mind, I spoke too soon.  As soon as I posted this, I hit it again.
Seems like doing a CVS checkout reliably triggers the problem.

Lee



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Lee Revell
On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote:
> I will work on a patch for Lee to try sometime tonight.
> 

Just FYI, this will take a while to test, because this latency seems
quite rare.  I haven't seen it again since my original report.
Hopefully I can reproduce it with enough dbench processes.

This seems to agree with your earlier observation that we seem to have
been quite unlucky.

Anyway, if I understand the thread it seems like this could increase
performance as well as help latency.

Lee



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Mingming Cao
On Mon, 2005-04-11 at 12:48 +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:
> 
> > > It still needs to be done under locking to prevent us from expanding
> > > over the next window, though.  And having to take and drop a spinlock a
> > > dozen times or more just to find out that there are no usable free
> > > blocks in the current block group is still expensive, even if we're not
> > > actually fully unlinking the window each time.
> 
> > Isn't this a rare case? The whole group is relatively full and the free
> > bits are all reserved by other files.
> 
> Well, that's not much different from the case where there _are_ usable
> bits but they are all near the end of the bitmap.  And that's common
> enough as you fill up a block group with small files, for example.  Once
> the bg becomes nearly full, new file opens-for-write will still try to
> allocate in that bg (because that's where the inode was allocated), but
> as it's a new fd we have no prior knowledge of _where_ in the bh to
> allocate, and we'll have to walk it to the end to find any free space. 
> This is the access pattern I'd expect of (for example) "tar xvjf
> linux-2.6.12.tar.bz2", not exactly a rare case.
> 

Okay.

> >   Probably we should avoid trying
> > to make a reservation in this block group in the first place
> 
> Not in this case, because it's the "home" of the file in question, and
> skipping to another bg would just leave useful space empty --- and that
> leads to fragmentation.
> 
I agree. We should not skip the home block group of the file.  I guess
what I was suggesting is, if allocation from the home group failed and
we continue the linear search through the rest of the block groups, we
could probably try to skip the block groups without enough usable free
blocks to make a reservation. Checking the number of free blocks left in
the query bg is a good way, but probably not good enough, since those
free blocks might already be reserved.

> > You are proposing that we hold the read lock first, do the window search
> > and bitmap scan, then once we confirm there is a free bit in the candidate
> > window, we grab the write lock and update the tree?
> 
> No, I'm suggesting that if we need the write lock for tree updates, we
> may still be able to get away with just a read lock when updating an
> individual window.  If all we're doing is winding the window forwards a
> bit, that's not actually changing the structure of the tree.
> 
> > However I am still worried that the rw lock will allow concurrent files
> > to try to lock the same window at the same time.
> 
> Adding a new window will need the write lock, always.  But with the read
> lock, we can still check the neighbouring windows (the structure is
> pinned so those remain valid), so advancing a window with just that lock
> can still be done without risking overlapping the next window.
> 

Ah.. I see the win with the read lock now: once a reservation window
is added, updating it (either winding it forward or searching for an
available window) is probably the majority of the operations on the
tree, and using a read lock for that should help reduce the latency.

I will work on a patch for Lee to try sometime tonight.

Thanks,

Mingming



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:

> > It still needs to be done under locking to prevent us from expanding
> > over the next window, though.  And having to take and drop a spinlock a
> > dozen times or more just to find out that there are no usable free
> > blocks in the current block group is still expensive, even if we're not
> > actually fully unlinking the window each time.

> Isn't this a rare case? The whole group is relatively full and the free
> bits are all reserved by other files.

Well, that's not much different from the case where there _are_ usable
bits but they are all near the end of the bitmap.  And that's common
enough as you fill up a block group with small files, for example.  Once
the bg becomes nearly full, new file opens-for-write will still try to
allocate in that bg (because that's where the inode was allocated), but
as it's a new fd we have no prior knowledge of _where_ in the bh to
allocate, and we'll have to walk it to the end to find any free space. 
This is the access pattern I'd expect of (for example) "tar xvjf
linux-2.6.12.tar.bz2", not exactly a rare case.

>   Probably we should avoid trying
> to make a reservation in this block group in the first place

Not in this case, because it's the "home" of the file in question, and
skipping to another bg would just leave useful space empty --- and that
leads to fragmentation.

> You are proposing that we hold the read lock first, do the window search
> and bitmap scan, then once we confirm there is a free bit in the candidate
> window, we grab the write lock and update the tree?

No, I'm suggesting that if we need the write lock for tree updates, we
may still be able to get away with just a read lock when updating an
individual window.  If all we're doing is winding the window forwards a
bit, that's not actually changing the structure of the tree.

> However I am still worried that the rw lock will allow concurrent files
> to try to lock the same window at the same time.

Adding a new window will need the write lock, always.  But with the read
lock, we can still check the neighbouring windows (the structure is
pinned so those remain valid), so advancing a window with just that lock
can still be done without risking overlapping the next window.
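
In outline (a sketch with invented names, not the actual patch):

#include <linux/rbtree.h>
#include <linux/spinlock.h>

struct rsv_window {
        struct rb_node  node;
        unsigned long   start, end;
};

/* helpers, bodies elided */
struct rsv_window *window_next(struct rsv_window *w);  /* rb_next() wrapper */
void rsv_tree_insert(struct rb_root *root, struct rsv_window *w);

static DEFINE_RWLOCK(rsv_lock);         /* per-fs in reality */

/*
 * Winding our own window forward: the read lock pins the tree
 * structure, so looking at the successor is safe.  Only the owning
 * task ever moves a given window, and a neighbour can only move
 * further ahead, so no overlap can appear.
 */
static int advance_window(struct rsv_window *my, unsigned long len)
{
        struct rsv_window *next;
        int ok = 0;

        read_lock(&rsv_lock);
        next = window_next(my);
        if (!next || my->end + len < next->start) {
                my->start += len;
                my->end += len;
                ok = 1;
        }
        read_unlock(&rsv_lock);
        return ok;
}

/* Installing a brand-new window changes the tree structure, so it
 * always takes the write lock. */
static void install_window(struct rb_root *root, struct rsv_window *w)
{
        write_lock(&rsv_lock);
        rsv_tree_insert(root, w);
        write_unlock(&rsv_lock);
}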

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:

  It still needs to be done under locking to prevent us from expanding
  over the next window, though.  And having to take and drop a spinlock a
  dozen times or more just to find out that there are no usable free
  blocks in the current block group is still expensive, even if we're not
  actually fully unlinking the window each time.

 Isn't this a rare case? The whole group is relatively full and the free
 bits are all reserved by other files.

Well, that's not much different from the case where there _are_ usable
bits but they are all near the end of the bitmap.  And that's common
enough as you fill up a block group with small files, for example.  Once
the bg becomes nearly full, new file opens-for-write will still try to
allocate in that bg (because that's where the inode was allocated), but
as it's a new fd we have no prior knowledge of _where_ in the bh to
allocate, and we'll have to walk it to the end to find any free space. 
This is the access pattern I'd expect of (for example) tar xvjf
linux-2.6.12.tar.bz2, not exactly a rare case.

   Probably we should avoid trying
 to make reservation in this block group at the first place

Not in this case, because it's the home of the file in question, and
skipping to another bg would just leave useful space empty --- and that
leads to fragmentation.

 You are proposing that we hold the read lock first, do the window search
 and bitmap scan, then once we confirm there is free bit in the candidate
 window, we grab the write lock and update the tree?  

No, I'm suggesting that if we need the write lock for tree updates, we
may still be able to get away with just a read lock when updating an
individual window.  If all we're doing is winding the window forwards a
bit, that's not actually changing the structure of the tree.

 However I am still worried that the rw lock will allow concurrent files
 trying to lock the same window at the same time. 

Adding a new window will need the write lock, always.  But with the read
lock, we can still check the neighbouring windows (the structure is
pinned so those remain valid), so advancing a window with just that lock
can still be done without risking overlapping the next window.

--Stephen

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Mingming Cao
On Mon, 2005-04-11 at 12:48 +0100, Stephen C. Tweedie wrote:
 Hi,
 
 On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:
 
   It still needs to be done under locking to prevent us from expanding
   over the next window, though.  And having to take and drop a spinlock a
   dozen times or more just to find out that there are no usable free
   blocks in the current block group is still expensive, even if we're not
   actually fully unlinking the window each time.
 
  Isn't this a rare case? The whole group is relatively full and the free
  bits are all reserved by other files.
 
 Well, that's not much different from the case where there _are_ usable
 bits but they are all near the end of the bitmap.  And that's common
 enough as you fill up a block group with small files, for example.  Once
 the bg becomes nearly full, new file opens-for-write will still try to
 allocate in that bg (because that's where the inode was allocated), but
 as it's a new fd we have no prior knowledge of _where_ in the bh to
 allocate, and we'll have to walk it to the end to find any free space. 
 This is the access pattern I'd expect of (for example) tar xvjf
 linux-2.6.12.tar.bz2, not exactly a rare case.
 

Okey.

Probably we should avoid trying
  to make reservation in this block group at the first place
 
 Not in this case, because it's the home of the file in question, and
 skipping to another bg would just leave useful space empty --- and that
 leads to fragmentation.
 
I agree. We should not skip the home block group of the file.  I guess
what I was suggesting is, if allocation from the home group failed and
we continuing the linear search the rest of block groups, we could
probably try to skip the block groups without enough usable free blocks
to make a reservation. Checking for the number of free blocks left in
the quary bg is a good way, but probably not good enough, since those
free blocks might already being reserved.

  You are proposing that we hold the read lock first, do the window search
  and bitmap scan, then once we confirm there is free bit in the candidate
  window, we grab the write lock and update the tree?  
 
 No, I'm suggesting that if we need the write lock for tree updates, we
 may still be able to get away with just a read lock when updating an
 individual window.  If all we're doing is winding the window forwards a
 bit, that's not actually changing the structure of the tree.
 
  However I am still worried that the rw lock will allow concurrent files
  trying to lock the same window at the same time. 
 
 Adding a new window will need the write lock, always.  But with the read
 lock, we can still check the neighbouring windows (the structure is
 pinned so those remain valid), so advancing a window with just that lock
 can still be done without risking overlapping the next window.
 

Ah, I see the win with the read lock now: once a reservation window
is added, updating it (either winding it forward or searching for an
available window) is probably the majority of the operations on the
tree, and using a read lock for that should help reduce the latency.
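
To make sure I am reading you right, this is the shape I plan to
prototype, assuming rsv_lock becomes an rwlock_t (the helpers named
here are placeholders, not the real functions):

	/* Sketch only: shared lock for searching the tree and for
	 * winding a window forward in place; exclusive lock solely for
	 * the structural change of inserting a brand-new window node. */
	read_lock(&rsv_lock);
	rsv = search_reserve_window(root, goal);  /* walk the rb-tree */
	/* ... bitmap scan, in-place advance of rsv's start/end ... */
	read_unlock(&rsv_lock);

	if (need_new_window) {
		write_lock(&rsv_lock);
		ext3_rsv_window_add(sb, my_rsv);  /* rb-tree insert */
		write_unlock(&rsv_lock);
	}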

I will work on a patch for Lee to try sometime tonight.

Thanks,

Mingming



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Lee Revell
On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote:
 I will work on a patch for Lee to try sometime tonight.
 

Just FYI, this will take a while to test, because this latency seems
quite rare.  I haven't seen it again since my original report.
Hopefully I can reproduce it with enough dbench processes.

This seems to agree with your earlier observation that we seem to have
been quite unlucky.

Anyway, if I understand the thread it seems like this could increase
performance as well as help latency.

Lee



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Lee Revell
On Mon, 2005-04-11 at 15:12 -0400, Lee Revell wrote:
 On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote:
  I will work on a patch for Lee to try sometime tonight.
  
 
 Just FYI, this will take a while to test, because this latency seems
 quite rare.  I haven't seen it again since my original report.

Never mind, I spoke too soon.  As soon as I posted this, I hit it again.
Seems like doing a CVS checkout reliably triggers the problem.

Lee



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:

 I agree. We should not skip the home block group of the file.  I guess
 what I was suggesting is: if allocation from the home group fails and
 we continue the linear search through the rest of the block groups, we
 could probably skip the block groups without enough usable free blocks
 to make a reservation. 

Fair enough.  Once those are the only bgs left, performance is going to
drop pretty quickly, but that's not really avoidable on a very full
filesystem.

 Ah, I see the win with the read lock now: once a reservation window
 is added, updating it (either winding it forward or searching for an
 available window) is probably the majority of the operations on the
 tree, and using a read lock for that should help reduce the latency.

Right.  The down side is that for things like a kernel tar xf, we'll
be doing lots of small file unpacks, and hopefully most files will be
just one or two reservations --- so there's little winding forward going
on.  The searching will still be improved in that case.

Note that this may improve average case latencies, but it's not likely
to improve worst-case ones.  We still need a write lock to install a new
window, and that's going to have to wait for us to finish finding a free
bit even if that operation starts using a read lock.  

I'm not really sure what to do about worst-case here.  For that, we
really do want to drop the lock entirely while we do the bitmap scan.

That leaves two options.  Hold a reservation while we do that; or don't.
Holding one poses the problems we discussed before: either you hold a
large reservation (bad for disk layout in the presence of concurrent
allocators), or you hold smaller ones (high cost as you continually
advance the window, which requires some read lock on the tree to avoid
bumping into the next window.)

Just how bad would it be if we didn't hold a lock _or_ a window at all
while doing the search for new window space?  I didn't like that
alternative at first because of the problem when you've got multiple
tasks trying to allocate in the same space at the same time; but given
the locking overhead of the alternatives, I'm wondering if this is
actually the lesser evil.
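
To make the trade-off concrete, a rough sketch of that alternative,
still assuming the rwlock form of rsv_lock from above (every helper
named below is invented for the sketch; this is not the current code):

	/* Sketch: search for window space with no lock and no window
	 * held, then re-validate under the write lock before
	 * installing.  A racing task costs us a retry, which is
	 * exactly the overhead being weighed here. */
	static int alloc_window_optimistic(struct super_block *sb,
			struct ext3_reserve_window_node *my_rsv,
			int goal, int size)
	{
		int start;

		for (;;) {
			/* bitmap scan: no locks, no reservation held */
			start = scan_for_free_region(sb, goal, size);
			if (start < 0)
				return -ENOSPC;
			write_lock(&rsv_lock);
			if (region_is_unreserved(sb, start, size)) {
				install_window(sb, my_rsv, start, size);
				write_unlock(&rsv_lock);
				return 0;
			}
			write_unlock(&rsv_lock);
			goal = start;	/* raced; search onward */
		}
	}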

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-08 Thread Lee Revell
On Fri, 2005-04-08 at 11:10 -0700, Mingming Cao wrote:
> However I am still worried that the rw lock will allow concurrent files
> trying to lock the same window at the same time. Only one will succeed
> though, with high cpu usage in the meantime.  And also, in the normal
> case the filesystem is not really full, so the rw lock probably becomes
> expensive.

FWIW this was a 125GB filesystem, about 70% full.

Lee



Re: ext3 allocate-with-reservation latencies

2005-04-08 Thread Mingming Cao
On Fri, 2005-04-08 at 15:40 +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, 2005-04-08 at 00:37, Mingming Cao wrote:
> 
> > Actually, we do not have to do an rbtree link and unlink for every
> > window we search.  If the reserved window (old) has no free bit and the
> > new reservable window is right after the old one, no need to unlink
> > the old window from the rbtree and then link the new window, just update
> > the start and end of the old one.
> 
> It still needs to be done under locking to prevent us from expanding
> over the next window, though.  And having to take and drop a spinlock a
> dozen times or more just to find out that there are no usable free
> blocks in the current block group is still expensive, even if we're not
> actually fully unlinking the window each time.
> 

Isn't this a rare case? The whole group is relatively full and the free
bits are all reserved by other files.  Probably we should avoid trying
to make a reservation in this block group in the first place, if we
could find a way to detect that the number of _usable_ free bits is
less than the requested window size. 


> I wonder if this can possibly be done safely without locking?  It would
> be really good if we could rotate windows forward with no global locks. 
> At the very least, a fair rwlock would let us freeze the total layout of
> the tree, while still letting us modify individual windows safely.  As
> long as we use wmb() to make sure that we always extend the end of the
> window before we shrink the start of it, I think we could get away with
> such shared locking.  And rw locking is much better for concurrency, so
> we might be able to hold it over the whole bitmap search rather than
> taking it and dropping it at each window advance.
> 

You are proposing that we hold the read lock first, do the window search
and bitmap scan, then once we confirm there is a free bit in the
candidate window, we grab the write lock and update the tree?  

I think this is a good idea to address the case you are concerned about:
when we need to do lots of window searches before settling down. Also,
if we later decide (I think we have discussed this before) to always try
to reserve a window with at least 8 contiguous free blocks, the search
will be more expensive and the read lock will help.

However I am still worried that the rw lock will allow concurrent files
trying to lock the same window at the same time. Only one will succeed
though, with high cpu usage in the meantime.  And also, in the normal
case the filesystem is not really full, so the rw lock probably becomes
expensive.





Re: ext3 allocate-with-reservation latencies

2005-04-08 Thread Arjan van de Ven
>  And rw locking is much better for concurrency, so
> we might be able to hold it over the whole bitmap search rather than
> taking it and dropping it at each window advance.

rw locks only help if readers are 10x more common than writers generally
speaking. They are quite expensive locks, so they really should be used
with the utmost care. 
(if you have really long hold times the dynamics are different, but if
you have really long hold times your latency sucks too and wasn't that
what this thread was trying to fix?)



Re: ext3 allocate-with-reservation latencies

2005-04-08 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-04-08 at 00:37, Mingming Cao wrote:

> Actually, we do not have to do an rbtree link and unlink for every
> window we search.  If the reserved window (old) has no free bit and the
> new reservable window is right after the old one, no need to unlink
> the old window from the rbtree and then link the new window, just update
> the start and end of the old one.

It still needs to be done under locking to prevent us from expanding
over the next window, though.  And having to take and drop a spinlock a
dozen times or more just to find out that there are no usable free
blocks in the current block group is still expensive, even if we're not
actually fully unlinking the window each time.

I wonder if this can possibly be done safely without locking?  It would
be really good if we could rotate windows forward with no global locks. 
At the very least, a fair rwlock would let us freeze the total layout of
the tree, while still letting us modify individual windows safely.  As
long as we use wmb() to make sure that we always extend the end of the
window before we shrink the start of it, I think we could get away with
such shared locking.  And rw locking is much better for concurrency, so
we might be able to hold it over the whole bitmap search rather than
taking it and dropping it at each window advance.
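
To spell out the ordering constraint, a minimal sketch of the in-place
advance (assuming the window keeps its bounds in _rsv_start/_rsv_end
and that readers only ever see it through the tree):

	/* Sketch: wind a window forward under shared locking.  Grow
	 * the end first and publish it with wmb() before shrinking the
	 * start, so a concurrent reader never observes a window that
	 * has momentarily shrunk or inverted. */
	static void rsv_window_advance(struct ext3_reserve_window *rsv,
			unsigned long new_start, unsigned long new_end)
	{
		rsv->_rsv_end = new_end;	/* extend first */
		wmb();				/* order the two stores */
		rsv->_rsv_start = new_start;	/* then release old space */
	}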

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-07 Thread Mingming Cao
On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:
> 
> > doesnt the first option also allow searches to be in parallel?
> 
> In terms of CPU usage, yes.  But either we use large windows, in which
> case we *can't* search remotely near areas of the disk in parallel; or
> we use small windows, in which case failure to find a free bit becomes
> disproportionately expensive (we have to do an rbtree link and unlink
> for each window we search.)

Actually, we do not have to do an rbtree link and unlink for every
window we search.  If the reserved window (old) has no free bit and the
new reservable window is right after the old one, there is no need to
unlink the old window from the rbtree and then link the new window; just
update the start and end of the old one. That's in the current design
already, to reduce the cost of rbtree link and unlink.

Mingming




Re: ext3 allocate-with-reservation latencies

2005-04-07 Thread Mingming Cao
On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:
> 
> > doesnt the first option also allow searches to be in parallel?
> 
> In terms of CPU usage, yes.  But either we use large windows, in which
> case we *can't* search remotely near areas of the disk in parallel; or
> we use small windows, in which case failure to find a free bit becomes
> disproportionately expensive (we have to do an rbtree link and unlink
> for each window we search.)
> 

I was thinking that at the end of find_next_reservable_window(), since
we already know the previous node in the rbtree to insert after, we
could just link the window there; that way the rbtree link cost would be
minimized.

Then I realized that in an rbtree the previous node is not necessarily
the parent node; we still have to search part of the rbtree to find the
parent node, which is really expensive in the case where the insertion
operations are very frequent. Hmmm...

> >  This particular one took over 1 msec, 

If you look at the pattern in the latency report:
...
 cvs-14981 0.n.2 1044us : find_next_zero_bit (bitmap_search_next_usable_block)
 cvs-14981 0.n.2 1045us : ext3_test_allocatable (bitmap_search_next_usable_block)
 cvs-14981 0.n.2 1045us : find_next_zero_bit (bitmap_search_next_usable_block)
 cvs-14981 0.n.2 1046us : find_next_zero_bit (bitmap_search_next_usable_block)
 cvs-14981 0.n.2 1047us : ext3_test_allocatable (bitmap_search_next_usable_block)
 cvs-14981 0.n.2 1048us : find_next_zero_bit (bitmap_search_next_usable_block)
 cvs-14981 0.n.2 1049us : find_next_zero_bit (bitmap_search_next_usable_block)
 cvs-14981 0.n.2 1049us : ext3_test_allocatable (bitmap_search_next_usable_block)


the pattern is: two calls to find_next_zero_bit() followed by an
ext3_test_allocatable(). And we have about 440 iterations in this report.

That means we are really unlucky: we checked the on-disk bitmap, found a
free bit, but it's not free in the journal copy; then we check the
journal copy, find a free bit, but it's not free on disk... I could
imagine we need to check the journal copy once or twice (to prevent
re-using a just-freed bit before the free has been committed), but how
could this many (440) iterations happen?


Mingming



Re: ext3 allocate-with-reservation latencies

2005-04-07 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:

> doesnt the first option also allow searches to be in parallel?

In terms of CPU usage, yes.  But either we use large windows, in which
case we *can't* search remotely near areas of the disk in parallel; or
we use small windows, in which case failure to find a free bit becomes
disproportionately expensive (we have to do an rbtree link and unlink
for each window we search.)

>  This 
> particular one took over 1 msec, so it seems there's fair room for 
> parallelizing multiple writers and thus improving scalability. (or is 
> this codepath serialized globally anyway?)

No, the rbtree ops are serialised per-sb, and allocations within any one
inode are serialised, but the bitmap searches need not be serialised
globally.  (They are in the case of new window searches right now, which
is what we're trying to fix.)

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-07 Thread Ingo Molnar

* Mingming Cao <[EMAIL PROTECTED]> wrote:

> It seems we are holding the rsv_block while searching the bitmap for a 
> free bit.  In alloc_new_reservation(), we first find an available space 
> to create a reservation window, then we check the bitmap to see if it 
> contains any free block. If not, we search for the next available 
> window, and so on. During the whole process we are holding the 
> global rsv_lock.  We could, and probably should, avoid that.  Just 
> unlock the rsv_lock before the bitmap search and re-grab it after it.  
> We need to make sure that the space is still available after we 
> re-grab the lock.
> 
> Another option is to hold that available window before we release the 
> rsv_lock, and if there is no free bit inside that window, we will 
> remove it from the tree in the next round of searching for the next 
> available window.
> 
> I prefer the second option, and plan to code it up soon. Any comments?

doesnt the first option also allow searches to be in parallel? This 
particular one took over 1 msec, so it seems there's fair room for 
parallelizing multiple writers and thus improving scalability. (or is 
this codepath serialized globally anyway?)

Ingo


Re: ext3 allocate-with-reservation latencies

2005-04-06 Thread Mingming Cao
On Wed, 2005-04-06 at 19:22 +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:
> 
> > > Possible, but not necessarily nice.  If you've got a nearly-full disk,
> > > most bits will be already allocated.  As you scan the bitmaps, it may
> > > take quite a while to find a free bit; do you really want to (a) lock
> > > the whole block group with a temporary window just to do the scan, or
> > > (b) keep allocating multiple smaller windows until you finally find a
> > > free bit?  The former is bad for concurrency if you have multiple tasks
> > > trying to allocate nearby on disk --- you'll force them into different
> > > block groups.  The latter is high overhead.
> 
> > I do not quite understand what you mean about (a).  In this proposal, we
> > will drop the lock before the scan. 
> 
> s/lock/reserve/.  
> 
> > And for (b), maybe I did not make myself clear: I am not proposing to
> > keep allocating multiple smaller windows until we finally find a free
> > bit. I mean, we book the window (just link the node into the tree) before
> > we drop the lock; if there is no free bit inside that window, we go
> > back to search for another window (call find_next_reservable_window());
> > inside it, we remove the temporary window we just created and find the
> > next window. So we only have one temporary window at a time. 
> 
> And that's the problem.  Either we create small temporary windows, in
> which case we may end up thrashing through vast numbers of them before
> we find a bit that's available --- very expensive as the disk gets full
> --- or we use large windows but get worse layout when there are parallel
> allocators going on.
> 

Ah... I see your points. (a) is certainly not a good option. (b) is not
very nice, but not necessarily that bad when the disk is really full: we
are not only scanning the bitmap within the reserved window range;
instead, we scan the bitmap starting from the reserved window's start
block to the last block of the group to find the next free bit. So in
the case where the group is really full, we could reduce the # of small
windows to try.

Mingming




Re: ext3 allocate-with-reservation latencies

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:

> > Possible, but not necessarily nice.  If you've got a nearly-full disk,
> > most bits will be already allocated.  As you scan the bitmaps, it may
> > take quite a while to find a free bit; do you really want to (a) lock
> > the whole block group with a temporary window just to do the scan, or
> > (b) keep allocating multiple smaller windows until you finally find a
> > free bit?  The former is bad for concurrency if you have multiple tasks
> > trying to allocate nearby on disk --- you'll force them into different
> > block groups.  The latter is high overhead.

> I do not quite understand what you mean about (a).  In this proposal, we
> will drop the lock before the scan. 

s/lock/reserve/.  

> And for (b), maybe I did not make myself clear: I am not proposing to
> keep allocating multiple smaller windows until we finally find a free
> bit. I mean, we book the window (just link the node into the tree) before
> we drop the lock; if there is no free bit inside that window, we go
> back to search for another window (call find_next_reservable_window());
> inside it, we remove the temporary window we just created and find the
> next window. So we only have one temporary window at a time. 

And that's the problem.  Either we create small temporary windows, in
which case we may end up thrashing through vast numbers of them before
we find a bit that's available --- very expensive as the disk gets full
--- or we use large windows but get worse layout when there are parallel
allocators going on.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-06 Thread Mingming Cao
On Wed, 2005-04-06 at 10:51 +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:
> 
> > It seems we are holding the rsv_block while searching the bitmap for a
> > free bit.  
> 
> Probably something to avoid!
> 
> > In alloc_new_reservation(), we first find an available space to
> > create a reservation window, then we check the bitmap to see if it
> > contains any free block. If not, we search for the next available
> > window, and so on. During the whole process we are holding the global
> > rsv_lock.  We could, and probably should, avoid that.  Just unlock the
> > rsv_lock before the bitmap search and re-grab it after it.  We need to
> > make sure that the space is still available after we re-grab the
> > lock. 
> 
> Not necessarily.  As long as the windows remain mutually exclusive in
> the rbtree, it doesn't matter too much if we occasionally allocate a
> full one --- as long as that case is rare, the worst that happens is
> that we fail to allocate from the window and have to repeat the window
> reserve.
> 
> The difficulty will be in keeping it rare.  What we need to avoid is the
> case where multiple tasks need a new window, they all drop the lock,
> find the same bits free in the bitmap, then all try to take that
> window.  One will succeed, the others will fail; but as the files in
> that bit of the disk continue to grow, we risk those processes
> *continually* repeating the same stomp-on-each-others'-windows operation
> and raising the allocation overhead significantly.
> 

Agreed. That's why the second option is preferred. When a file find a
window open to book, it booked it, then drop the lock and do the bitmap
search. So other files will not attempt to book the same window.

> > Another option is to hold that available window before we release the
> > rsv_lock, and if there is no free bit inside that window, we will remove
> > it from the tree in the next round of searching for the next available
> > window.
> 
> Possible, but not necessarily nice.  If you've got a nearly-full disk,
> most bits will be already allocated.  As you scan the bitmaps, it may
> take quite a while to find a free bit; do you really want to (a) lock
> the whole block group with a temporary window just to do the scan, or
> (b) keep allocating multiple smaller windows until you finally find a
> free bit?  The former is bad for concurrency if you have multiple tasks
> trying to allocate nearby on disk --- you'll force them into different
> block groups.  The latter is high overhead.
> 

I do not quite understand what you mean about (a).  In this proposal, we
will drop the lock before the scan. 

And for (b), maybe I did not make myself clear: I am not proposing to
keep allocating multiple smaller windows until we finally find a free
bit. I mean, we book the window (just link the node into the tree) before
we drop the lock; if there is no free bit inside that window, we go
back to search for another window (call find_next_reservable_window()),
and inside it we remove the temporary window we just created and find
the next window. So we only have one temporary window at a time. In the
case where we do find a free block inside the temporarily created
window, we make it official: do the rb-tree coloring/rotation stuff then.

Does it make sense? Let me know, thanks.

Maybe I should post my proposed patch here. (I am testing it now.)


---

 linux-2.6.11-ming/fs/ext3/balloc.c |  107 +++--
 1 files changed, 68 insertions(+), 39 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c
--- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency  2005-04-05 10:48:31.0 -0700
+++ linux-2.6.11-ming/fs/ext3/balloc.c  2005-04-06 09:24:24.203028968 -0700
@@ -241,8 +241,13 @@ void ext3_rsv_window_add(struct super_bl
BUG();
}
 
+   /*
+* we only link the node to the rb tree here
+* the real color balancing will be done
+* after we confirm the window has some free
+* block.
+*/
rb_link_node(node, parent, p);
-   rb_insert_color(node, root);
 }
 
 static void rsv_window_remove(struct super_block *sb,
@@ -749,24 +754,24 @@ fail_access:
  * to find a free region that is of my size and has not
  * been reserved.
  *
- * on succeed, it returns the reservation window to be appended to.
- * failed, return NULL.
  */
-static struct ext3_reserve_window_node *find_next_reservable_window(
+static int find_next_reservable_window(
struct ext3_reserve_window_node *search_head,
-   unsigned long size, int *start_block,
+   struct ext3_reserve_window_node *my_rsv,
+   struct super_block * sb, int start_block,
int last_block)
 {
struct rb_node *next;
struct ext3_reserve_window_node *rsv, *prev;
int cur;
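
(The rest of the patch is truncated in this archive.  The "make it
official" step described above would presumably end with the deferred
rebalance; the exact call below is my assumption, not quoted from the
patch.)

	/* Sketch: once a free bit is confirmed inside the provisionally
	 * linked window, finish the insertion that ext3_rsv_window_add()
	 * above deliberately left half-done. */
	rb_insert_color(&my_rsv->rsv_node, &EXT3_SB(sb)->s_rsv_window_root);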

Re: ext3 allocate-with-reservation latencies

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:

> It seems we are holding the rsv_block while searching the bitmap for a
> free bit.  

Probably something to avoid!

> In alloc_new_reservation(), we first find an available space to
> create a reservation window, then we check the bitmap to see if it
> contains any free block. If not, we search for the next available
> window, and so on. During the whole process we are holding the global
> rsv_lock.  We could, and probably should, avoid that.  Just unlock the
> rsv_lock before the bitmap search and re-grab it after it.  We need to
> make sure that the space is still available after we re-grab the
> lock. 

Not necessarily.  As long as the windows remain mutually exclusive in
the rbtree, it doesn't matter too much if we occasionally allocate a
full one --- as long as that case is rare, the worst that happens is
that we fail to allocate from the window and have to repeat the window
reserve.

The difficulty will be in keeping it rare.  What we need to avoid is the
case where multiple tasks need a new window, they all drop the lock,
find the same bits free in the bitmap, then all try to take that
window.  One will succeed, the others will fail; but as the files in
that bit of the disk continue to grow, we risk those processes
*continually* repeating the same stomp-on-each-others'-windows operation
and raising the allocation overhead significantly.

> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, we will remove
> it from the tree in the next round of searching for the next available
> window.

Possible, but not necessarily nice.  If you've got a nearly-full disk,
most bits will be already allocated.  As you scan the bitmaps, it may
take quite a while to find a free bit; do you really want to (a) lock
the whole block group with a temporary window just to do the scan, or
(b) keep allocating multiple smaller windows until you finally find a
free bit?  The former is bad for concurrency if you have multiple tasks
trying to allocate nearby on disk --- you'll force them into different
block groups.  The latter is high overhead.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-05 Thread Mingming Cao
On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> * Lee Revell <[EMAIL PROTECTED]> wrote:
> 
> > I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
> > like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> > loop:
> > 
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > 
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> 
> Breaking the lock is not really possible at that point, and it doesnt 
> look too easy to make that path preemptable either. (To make it 
> preemptable rsv_lock would need to become a semaphore (this could be 
> fine, as it's only used when a new reservation window is created).)

It seems we are holding the rsv_block while searching the bitmap for a
free bit.  In alloc_new_reservation(), we first find an available space
to create a reservation window, then we check the bitmap to see if it
contains any free block. If not, we search for the next available
window, and so on. During the whole process we are holding the global
rsv_lock.  We could, and probably should, avoid that.  Just unlock the
rsv_lock before the bitmap search and re-grab it after it.  We need to
make sure that the space is still available after we re-grab the lock. 

Another option is to hold that available window before we release the
rsv_lock, and if there is no free bit inside that window, we will remove
it from the tree in the next round of searching for the next available
window.

I prefer the second option, and plan to code it up soon. Any comments?
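
In rough pseudo-C, the flow I have in mind for the second option (the
helpers named here are made up for the sketch):

	/* Sketch of option 2: book a candidate window while holding
	 * the lock so nobody else takes it, scan the bitmap with the
	 * lock dropped, then keep or discard the booking. */
	spin_lock(&rsv_lock);
	book_window(root, my_rsv, start, size);	/* link node into rb-tree */
	spin_unlock(&rsv_lock);

	free_bit = scan_bitmap(bh, my_rsv);	/* no rsv_lock held here */

	spin_lock(&rsv_lock);
	if (free_bit < 0)
		unbook_window(root, my_rsv);	/* empty: drop it, try next */
	spin_unlock(&rsv_lock);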

Thanks,

Mingming



Re: ext3 allocate-with-reservation latencies

2005-04-05 Thread Lee Revell
On Mon, 2005-04-04 at 23:10 -0700, Mingming Cao wrote:
> On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> > * Lee Revell <[EMAIL PROTECTED]> wrote:
> > 
> > > I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
> > > like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> > > loop:
> > > 
> 
> We have not modify the reservation create algorithm for a long time.
> Sorry, I missed the background here, on which kernel did you see this
> latency? And what tests you were running?

Makes sense, I have been seeing this one for a long time.  I get it with
dbench, or when doing a CVS checkout of a large module.

Kernel is 2.6.12-rc1-RT-V0.7.43-05, with PREEMPT_DESKTOP which AFAIK
should be equivalent to mainline.

Lee



Re: ext3 allocate-with-reservation latencies

2005-04-05 Thread Mingming Cao
On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> * Lee Revell <[EMAIL PROTECTED]> wrote:
> 
> > I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
> > like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> > loop:
> > 

We have not modified the reservation-create algorithm for a long time.
Sorry, I missed the background here: on which kernel did you see this
latency? And what tests were you running?

> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > 
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> 

The loop in bitmap_search_next_usable_block() tries to find the next
zero bit in the bitmap; if it finds one, it then needs to check whether
that bit is also zero in the journalling layer's last-committed copy of
the bitmap. The double check is there to prevent re-using a just-freed
block on disk before the transaction that freed it has been committed to
the journal. If the bit is not marked as free in the committed copy, the
code first finds the next free bit in that copy, then loops back,
etc. etc.

The bitmap_search_next_usable_block() code is quite simple:

static int
bitmap_search_next_usable_block(int start, struct buffer_head *bh,
				int maxblocks)
{
	int next;
	struct journal_head *jh = bh2jh(bh);

	/*
	 * The bitmap search --- search forward alternately through the actual
	 * bitmap and the last-committed copy until we find a bit free in
	 * both
	 */
	while (start < maxblocks) {
		next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
		if (next >= maxblocks)
			return -1;
		if (ext3_test_allocatable(next, bh))
			return next;
		jbd_lock_bh_state(bh);
		if (jh->b_committed_data)
			start = ext3_find_next_zero_bit(jh->b_committed_data,
							maxblocks, next);
		jbd_unlock_bh_state(bh);
	}
	return -1;
}

The latency report shows a pattern of two calls to ext3_find_next_zero_bit()
after each call to ext3_test_allocatable(), which matches what I suspected
above.


I wonder whether re-running the test with the -o noreservation mount
option makes any difference?

> Breaking the lock is not really possible at that point, and it doesn't 
> look too easy to make that path preemptable either. (To make it 
> preemptable rsv_lock would need to become a semaphore (this could be 
> fine, as it's only used when a new reservation window is created).)
> 
> The hard part is the seqlock - the read side is performance-critical, 
> maybe it could be solved via a preemptable but still scalable seqlock 
> variant that uses a semaphore for the write side? It all depends on what 
> the scalability impact of using a semaphore for the new-window code 
> would be.
> 
The seqlock was there to prevent an SMP race, but it actually turned out
to be unnecessary and has been removed recently. Even if it were still
there, it would not cause much latency here: the whole of
ext3_try_to_allocate_with_rsv() is guarded by ei->truncate_sem, so there
are never multiple readers/writers at the same time.



Re: ext3 allocate-with-reservation latencies

2005-04-04 Thread Ingo Molnar

* Lee Revell <[EMAIL PROTECTED]> wrote:

> I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
> like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> loop:
> 
> ext3_test_allocatable (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> 
> ext3_test_allocatable (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)

Breaking the lock is not really possible at that point, and it doesn't 
look too easy to make that path preemptable either. (To make it 
preemptable rsv_lock would need to become a semaphore (this could be 
fine, as it's only used when a new reservation window is created).)

The hard part is the seqlock - the read side is performance-critical, 
maybe it could be solved via a preemptable but still scalable seqlock 
variant that uses a semaphore for the write side? It all depends on what 
the scalability impact of using a semaphore for the new-window code 
would be.
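
One possible shape for such a variant, as a minimal sketch (the pseq_*
names are invented here; the barriers and details would need careful
review):

struct pseqlock {
	unsigned sequence;		/* even = stable, odd = writer active */
	struct semaphore sem;		/* serializes writers, may sleep */
};

static inline void pseq_write_lock(struct pseqlock *sl)
{
	down(&sl->sem);			/* preemptable write side */
	sl->sequence++;
	smp_wmb();
}

static inline void pseq_write_unlock(struct pseqlock *sl)
{
	smp_wmb();
	sl->sequence++;
	up(&sl->sem);
}

/* the read side stays lock-free and cheap, as with a normal seqlock */
static inline unsigned pseq_read_begin(struct pseqlock *sl)
{
	unsigned seq = sl->sequence;
	smp_rmb();
	return seq;
}

static inline int pseq_read_retry(struct pseqlock *sl, unsigned seq)
{
	smp_rmb();
	return (seq & 1) || (sl->sequence != seq);
}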

the best longterm solution for these types of tradeoffs seems to be to 
add a locking primitive that is a spinlock on !PREEMPT kernels and a 
semaphore on PREEMPT kernels. I.e. not as drastic as a full PREEMPT_RT 
kernel, but good enough to make latency-critical codepaths of ext3 
preemptable, without having to hurt scalability on !PREEMPT. The 
PREEMPT_RT kernel has all the 'compile-time type-switching' 
infrastructure for such tricks, all that needs to be changed to switch a 
lock's type is to change the spinlock definition - all the 
spin_lock() uses can remain unchanged. (The same method is used on 
PREEMPT_RT to have 'dual-type' spinlocks.)
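
As a sketch of what the compile-time type switch might look like (the
px_* names are hypothetical; the real PREEMPT_RT infrastructure is more
involved):

#ifdef CONFIG_PREEMPT
/* on PREEMPT kernels the lock becomes a sleeping semaphore */
typedef struct semaphore px_lock_t;
#define px_lock_init(l)	sema_init((l), 1)
#define px_lock(l)	down(l)
#define px_unlock(l)	up(l)
#else
/* on !PREEMPT kernels it stays a plain spinlock - no scalability cost */
typedef spinlock_t px_lock_t;
#define px_lock_init(l)	spin_lock_init(l)
#define px_lock(l)	spin_lock(l)
#define px_unlock(l)	spin_unlock(l)
#endif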

the same thing could then also be used for things like the mm lock, and 
other longer-held locks that PREEMPT would like to see preemptable. It 
would also be a good first step towards merging the PREEMPT_RT 
infrastructure ;-) I'll cook up something.

Ingo


ext3 allocate-with-reservation latencies

2005-04-04 Thread Lee Revell
I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
loop:

ext3_test_allocatable (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)

ext3_test_allocatable (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)

etc.

Lee


ext3_reservation.bz2
Description: application/bzip

