Re: [PATCH 00/10][V2] Delayed refs rsv

2018-12-06 Thread David Sterba
On Mon, Dec 03, 2018 at 10:20:28AM -0500, Josef Bacik wrote:
> v1->v2:
> - addressed the comments from the various reviewers.
> - split "introduce delayed_refs_rsv" into 5 patches.  The patches are the same
>   together as they were, just split out more logically.  They can't really be
>   bisected across in that you will likely have fun enospc failures, but they
>   compile properly.  This was done to make it easier for review.

Thanks, that was helpful. The bisectability is IMO not hurt, as it
compiles and runs despite the potential ENOSPC problems. I guess the
point is to be able to reach any commit & compile & run and not be hit
by unrelated bugs. If somebody is bisecting an ENOSPC problem and lands
in the middle of the series, there's a hint in the changelogs that
something is going on.

> -- Original message --
> 
> This patchset changes how we do space reservations for delayed refs.  We were
> hitting probably 20-40 enospc abort's per day in production while running
> delayed refs at transaction commit time.  This means we ran out of space in 
> the
> global reserve and couldn't easily get more space in use_block_rsv().
> 
> The global reserve has grown to cover everything we don't reserve space
> explicitly for, and we've grown a lot of weird ad-hoc hueristics to know if
> we're running short on space and when it's time to force a commit.  A failure
> rate of 20-40 file systems when we run hundreds of thousands of them isn't 
> super
> high, but cleaning up this code will make things less ugly and more 
> predictible.
> 
> Thus the delayed refs rsv.  We always know how many delayed refs we have
> outstanding, and although running them generates more we can use the global
> reserve for that spill over, which fits better into it's desired use than a 
> full
> blown reservation.  This first approach is to simply take how many times we're
> reserving space for and multiply that by 2 in order to save enough space for 
> the
> delayed refs that could be generated.  This is a niave approach and will
> probably evolve, but for now it works.
> 
> With this patchset we've gone down to 2-8 failures per week.  It's not 
> perfect,
> there are some corner cases that still need to be addressed, but is
> significantly better than what we had.  Thanks,

I've merged the series to misc-next, the amount of reviews and testing
is sufficient. Though there are some more comments or suggestions for
improvements, they can be done as followups.

The other patchsets from you are namely fixes, that can be applied at
the -rc time, for that reason I'm glad the main infrastructure change
can be merged before the 4.21 code freeze.


[PATCH 00/10][V2] Delayed refs rsv

2018-12-03 Thread Josef Bacik
v1->v2:
- addressed the comments from the various reviewers.
- split "introduce delayed_refs_rsv" into 5 patches.  The patches are the same
  together as they were, just split out more logically.  They can't really be
  bisected across in that you will likely have fun enospc failures, but they
  compile properly.  This was done to make it easier for review.

-- Original message --

This patchset changes how we do space reservations for delayed refs.  We were
hitting probably 20-40 enospc abort's per day in production while running
delayed refs at transaction commit time.  This means we ran out of space in the
global reserve and couldn't easily get more space in use_block_rsv().

The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc hueristics to know if
we're running short on space and when it's time to force a commit.  A failure
rate of 20-40 file systems when we run hundreds of thousands of them isn't super
high, but cleaning up this code will make things less ugly and more predictible.

Thus the delayed refs rsv.  We always know how many delayed refs we have
outstanding, and although running them generates more we can use the global
reserve for that spill over, which fits better into it's desired use than a full
blown reservation.  This first approach is to simply take how many times we're
reserving space for and multiply that by 2 in order to save enough space for the
delayed refs that could be generated.  This is a niave approach and will
probably evolve, but for now it works.

With this patchset we've gone down to 2-8 failures per week.  It's not perfect,
there are some corner cases that still need to be addressed, but is
significantly better than what we had.  Thanks,

Josef