Re: Fsyncgate: errors on fsync are unrecovarable

2019-07-25 Thread Mike Rhodes
All,

I spent a couple of hours with this last week -- it's well worth the read to 
understand some of the reasons for this behaviour.

> tl;dr: running fsync() after an fsync() that reported EIO clears that error 
> state with no way of recovery on Linux.

The important thing that Jan didn't call out, which was a key point for me, was 
the mechanism. That specifically the dirty bits on the pages in the file cache 
are cleared on Linux after the fsync which returns EIO. This means that the 
kernel is now free to throw away those pages (with unwritten data) from the 
page cache, which results in the writes being lost. I just thought it worth 
expanding on that.

This is why you have to ensure you redo _every_ write to the file between the 
last successful fsync and the failing fsync, such that you end up with the 
pages ending up dirty again so the OS knows it needs to write them to durable 
storage before throwing them out of memory.

>From what I can see, this means that _somewhere_ in the stack needs to keep 
>track of the writes which happened since the last fsync, whether the low level 
>or the higher level. That is, Jan's (2) is a case where a higher level in the 
>system is either doing (1) or reporting failure direct back to the client so 
>they know their commit didn't make it to disk (probably). I think the 
>correctness of the latter might depend on whether we only write one new header 
>per result reported back to the client (i.e., not per request as _bulk_docs 
>has multiple results reported and I don't know if that's one header write and 
>one fsync).

One comment from the PR:

> As for the "fsync will forget about errors" issue... well ain't that a kick 
> in the pants? I agree there's not a whole lot we can do. Although I will have 
> to say that given the length of time multiple popular databases have run on 
> Linux without even a whisper of "fsync is totally broken" makes me think that 
> actually encountering that particular issue in the wild is likely fairly 
> miniscule.

In the chain, the postgres developers do come to the opinion that quite a few 
more common (though not super-common) errors seen by postgres users could be 
traced back down to this behaviour. No one really thought to delve into the 
fsync question until the OP in that thread started poking.

-- 
Mike.

On Sun, 21 Jul 2019, at 21:26, Joan Touzet wrote:
> On 2019-07-21 1:23 p.m., Paul Davis wrote:
> > I’m browsing on my phone but I’m pretty sure we should add an `ok =` to
> > this line so that we force a bad match:
> > 
> > https://github.com/apache/couchdb/blob/9d098787a71d1c7f7f6adea05da15b0da3ecc7ef/src/couch/src/couch_file.erl#L223
> > 
> > Unless I’m missing somewhere else that we’re making that assertion.
> 
> This looks good to me, and I've asked Paul a question in the PR that he 
> answers extremely well...if you're curious about how this would play out 
> should an entire filesystem go read-only or missing from underneath 
> CouchDB, I recommend reading it. (It's also on the 
> notifications@couchdb.a.o list if you prefer.)
> 
> +1 to the fix.
> 
> -Joan
> 
> 
> > 
> > On Sun, Jul 21, 2019 at 4:21 AM Jan Lehnardt  wrote:
> > 
> >> Hey all,
> >>
> >>
> >> Joan sent this along in IRC and it reads bad enough[tm] that we should at
> >> least have a few pairs of eyes looking at if we have to do anything:
> >>
> >>  http://danluu.com/fsyncgate/
> >>
> >> (It is long and dense, you’ll need to read 10-15% of that page to get the
> >> main picture.
> >>
> >> tl;dr: running fsync() after an fsync() that reported EIO clears that
> >> error state with no way of recovery on Linux.
> >>
> >> There are two ways of handling this correctly:
> >>
> >> 1. whatever you wrote() between the last successful fsync() and the
> >> fsync() that raised the error, keep around until after the second fsync(),
> >> so you can write() it again.
> >>
> >> 2. if any one fsync() returns EIO, report this back up immediately, so
> >> whoever calls you can retry.
> >>
> >> * * *
> >>
> >> We seem to be doing 2. as per my reading.
> >>
> >> Erlang looks like it correctly just raises whatever error fsync() might
> >> return:
> >>
> >> 1.
> >> https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809
> >> 2.
> >> https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163
> >>
> >> couch_file too:
> >>
> >> 1.
> >> https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223
> >>
> >> I glanced at a few paths going up this chain and couldn’t spot a catch
> >> where we’d hide that error, but it’d be great to get some confirmation on
> >> this.
> >>
> >> * * *
> >>
> >> Please double-check my understanding of the issue, the correct ways
> >> forward and the findings in Erlang and CouchDB.
> >>
> >> Best
> >> Jan
> >> --
> >> Professional Support for Apache CouchDB:
> >> https://neighbourhood.ie/couchdb-support/
> >>
> >>
> > 
>


Re: Fsyncgate: errors on fsync are unrecovarable

2019-07-21 Thread Joan Touzet

On 2019-07-21 1:23 p.m., Paul Davis wrote:

I’m browsing on my phone but I’m pretty sure we should add an `ok =` to
this line so that we force a bad match:

https://github.com/apache/couchdb/blob/9d098787a71d1c7f7f6adea05da15b0da3ecc7ef/src/couch/src/couch_file.erl#L223

Unless I’m missing somewhere else that we’re making that assertion.


This looks good to me, and I've asked Paul a question in the PR that he 
answers extremely well...if you're curious about how this would play out 
should an entire filesystem go read-only or missing from underneath 
CouchDB, I recommend reading it. (It's also on the 
notifications@couchdb.a.o list if you prefer.)


+1 to the fix.

-Joan




On Sun, Jul 21, 2019 at 4:21 AM Jan Lehnardt  wrote:


Hey all,


Joan sent this along in IRC and it reads bad enough[tm] that we should at
least have a few pairs of eyes looking at if we have to do anything:

 http://danluu.com/fsyncgate/

(It is long and dense, you’ll need to read 10-15% of that page to get the
main picture.

tl;dr: running fsync() after an fsync() that reported EIO clears that
error state with no way of recovery on Linux.

There are two ways of handling this correctly:

1. whatever you wrote() between the last successful fsync() and the
fsync() that raised the error, keep around until after the second fsync(),
so you can write() it again.

2. if any one fsync() returns EIO, report this back up immediately, so
whoever calls you can retry.

* * *

We seem to be doing 2. as per my reading.

Erlang looks like it correctly just raises whatever error fsync() might
return:

1.
https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809
2.
https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163

couch_file too:

1.
https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223

I glanced at a few paths going up this chain and couldn’t spot a catch
where we’d hide that error, but it’d be great to get some confirmation on
this.

* * *

Please double-check my understanding of the issue, the correct ways
forward and the findings in Erlang and CouchDB.

Best
Jan
--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/






Re: Fsyncgate: errors on fsync are unrecovarable

2019-07-21 Thread Paul Davis
I’m browsing on my phone but I’m pretty sure we should add an `ok =` to
this line so that we force a bad match:

https://github.com/apache/couchdb/blob/9d098787a71d1c7f7f6adea05da15b0da3ecc7ef/src/couch/src/couch_file.erl#L223

Unless I’m missing somewhere else that we’re making that assertion.

On Sun, Jul 21, 2019 at 4:21 AM Jan Lehnardt  wrote:

> Hey all,
>
>
> Joan sent this along in IRC and it reads bad enough[tm] that we should at
> least have a few pairs of eyes looking at if we have to do anything:
>
> http://danluu.com/fsyncgate/
>
> (It is long and dense, you’ll need to read 10-15% of that page to get the
> main picture.
>
> tl;dr: running fsync() after an fsync() that reported EIO clears that
> error state with no way of recovery on Linux.
>
> There are two ways of handling this correctly:
>
> 1. whatever you wrote() between the last successful fsync() and the
> fsync() that raised the error, keep around until after the second fsync(),
> so you can write() it again.
>
> 2. if any one fsync() returns EIO, report this back up immediately, so
> whoever calls you can retry.
>
> * * *
>
> We seem to be doing 2. as per my reading.
>
> Erlang looks like it correctly just raises whatever error fsync() might
> return:
>
> 1.
> https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809
> 2.
> https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163
>
> couch_file too:
>
> 1.
> https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223
>
> I glanced at a few paths going up this chain and couldn’t spot a catch
> where we’d hide that error, but it’d be great to get some confirmation on
> this.
>
> * * *
>
> Please double-check my understanding of the issue, the correct ways
> forward and the findings in Erlang and CouchDB.
>
> Best
> Jan
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>
>


Fsyncgate: errors on fsync are unrecovarable

2019-07-21 Thread Jan Lehnardt
Hey all,


Joan sent this along in IRC and it reads bad enough[tm] that we should at least 
have a few pairs of eyes looking at if we have to do anything:

http://danluu.com/fsyncgate/

(It is long and dense, you’ll need to read 10-15% of that page to get the main 
picture.

tl;dr: running fsync() after an fsync() that reported EIO clears that error 
state with no way of recovery on Linux.

There are two ways of handling this correctly:

1. whatever you wrote() between the last successful fsync() and the fsync() 
that raised the error, keep around until after the second fsync(), so you can 
write() it again.

2. if any one fsync() returns EIO, report this back up immediately, so whoever 
calls you can retry.

* * *

We seem to be doing 2. as per my reading.

Erlang looks like it correctly just raises whatever error fsync() might return:

1. 
https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809
2. 
https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163

couch_file too:

1. 
https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223

I glanced at a few paths going up this chain and couldn’t spot a catch where 
we’d hide that error, but it’d be great to get some confirmation on this.

* * *

Please double-check my understanding of the issue, the correct ways forward and 
the findings in Erlang and CouchDB.

Best
Jan
-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/