Re: [Qemu-block] [PATCH for-2.6 v2 0/3] Bug fixes for gluster

2016-04-20 Thread Ric Wheeler

On 04/20/2016 05:24 AM, Kevin Wolf wrote:

On 20.04.2016 at 03:56, Ric Wheeler wrote:

On 04/19/2016 10:09 AM, Jeff Cody wrote:

On Tue, Apr 19, 2016 at 08:18:39AM -0400, Ric Wheeler wrote:

On 04/19/2016 08:07 AM, Jeff Cody wrote:

Bug fixes for gluster; third patch is to prevent
a potential data loss when trying to recover from
a recoverable error (such as ENOSPC).

Hi Jeff,

Just a note, I have been talking to some of the disk drive people
here at LSF (the kernel summit for file and storage people) and got
a non-public confirmation that individual storage devices (s-ata
drives or scsi) can also dump cache state when a synchronize cache
command fails.  Also followed up with Rik van Riel - in the page
cache in general, when we fail to write back dirty pages, they are
simply marked "clean" (which means effectively that they get
dropped).

Long winded way of saying that I think that this scenario is not
unique to gluster - any failed fsync() to a file (or block device)
might be an indication of permanent data loss.


Ric,

Thanks.

I think you are right; we likely do need to address how QEMU handles fsync
failures across the board at some point (2.7?).  Another point to consider
is that QEMU is cross-platform - so not only do we have different protocols
and filesystems, but also different underlying host OSes.
It is likely, like you said, that there are other non-gluster scenarios where
we have non-recoverable data loss on fsync failure.

With Gluster specifically, if we look at just ENOSPC, does this mean that
even if Gluster retains its cache after fsync failure, we still won't know
whether there was permanent data loss?  If we hit ENOSPC during an fsync, I
presume that means Gluster itself may have encountered ENOSPC from an fsync to
the underlying storage.  In that case, does Gluster just pass the error up
the stack?

Jeff

I still worry that in many non-gluster situations we will have
permanent data loss here. Specifically, the way the page cache
works, if we fail to write back cached data *at any time*, a future
fsync() will get a failure.

And this is actually what saves the semantic correctness. If you threw
away data, any following fsync() must fail. This is of course
inconvenient because you won't be able to resume a VM that is configured
to stop on errors, and it means some data loss, but it's safe because we
never tell the guest that the data is on disk when it really isn't.

gluster's behaviour (without resync-failed-syncs-after-fsync set) is
different, if I understand correctly. It will throw away the data and
then happily report success on the next fsync() call. And this is what
causes not only data loss, but corruption.
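
To make that contrast concrete, here is a minimal sketch (plain POSIX, not
code from QEMU or from this thread) of the retry pattern under discussion;
treating ENOSPC/EINTR as the only "recoverable" errors is an assumption made
purely for illustration:

#include <errno.h>
#include <unistd.h>

/* Illustration only: flush with retry on "recoverable" errors.
 * Returns 0 on apparent success, -errno on a fatal error. */
static int flush_with_retry(int fd)
{
    while (fsync(fd) != 0) {
        if (errno != ENOSPC && errno != EINTR) {
            return -errno;      /* fatal: stop using the image */
        }
        /* Recoverable error: wait (e.g. for an admin to add space) and
         * try again.  This retry only helps if the cache kept the dirty
         * data across the failure; if the failed writeback already marked
         * the pages clean (or gluster dropped its write-behind cache), a
         * later return of 0 does NOT mean the data reached the disk. */
        sleep(1);
    }
    return 0;
}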


Yes, that makes sense to me - the kernel will remember that it could not write 
data back from the page cache and the future fsync() will see an error.




[ Hm, or having read what's below... Did I misunderstand and Linux
   returns failure only for a single fsync() and on the next one it
   returns success again? That would be bad. ]


I would need to think through that scenario with the memory management people to 
see if that could happen.



That failure could be because of a thinly provisioned backing store,
but in the interim, the page cache is free to drop the pages that
had failed. In effect, we end up with data loss in part or in whole
without a way to detect which bits got dropped.

Note that this is not a gluster issue, this is for any file system
on top of thinly provisioned storage (i.e., we would see this with
xfs on thin storage or ext4 on thin storage).  In effect, if gluster
has written the data back to xfs and that is on top of a thinly
provisioned target, the kernel might drop that data before you can
try an fsync again. Even if you retry the fsync(), the pages are
marked clean so they will not be pushed back to storage on that
second fsync().

I'm wondering... Marking the page clean means that it can be evicted
from the cache, right? Which happens whenever something more useful can
be done with the memory, i.e. possibly at any time. Does this mean that
two consecutive reads of the same block can return different data even
though no process has written to the file in between?


We should tease this out with a careful review of the behavior, but I think that 
could happen.


Specifically,

Time 0: File has pattern A at offset 0. Any reads at this point see pattern A

Time 1: Write pattern B to offset 0. Reads now see pattern B.

Time 2: Run out of space on the backing store (before the data has been written 
back)


Time 3: Do an fsync() *OR* have the page cache fail to write back that page

Time 4: Under memory pressure, the page which was marked clean, is dropped

Time 5: Read offset 0 again - do we now see pattern A again? Or an IO error?
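
A rough sketch of how that experiment could be driven from user space
(hypothetical test, not something that was run in this thread; the target file
is assumed to be pre-filled with pattern A, and steps at Time 2 and Time 4 are
performed externally, e.g. by filling the thin pool and dropping caches):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[4096];
    int fd;

    if (argc < 2) {
        return 1;
    }
    fd = open(argv[1], O_RDWR);               /* Time 0: file holds pattern A */

    memset(buf, 'B', sizeof(buf));
    pwrite(fd, buf, sizeof(buf), 0);           /* Time 1: reads now see B */

    /* Time 2: exhaust the thin backing store externally, then... */
    if (fsync(fd) != 0) {
        perror("fsync");                       /* Time 3: expect ENOSPC or EIO */
    }

    /* Time 4: apply memory pressure or drop caches
     * (e.g. echo 3 > /proc/sys/vm/drop_caches) so the now-clean page
     * can be evicted */

    pread(fd, buf, sizeof(buf), 0);            /* Time 5: A again? B? an error? */
    printf("offset 0 now reads '%c'\n", buf[0]);
    close(fd);
    return 0;
}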


Also, O_DIRECT bypasses the problem, right? In that already the write
request would fail there, not only the fsync(). We reco

Re: [Qemu-block] [PATCH for-2.6 v2 0/3] Bug fixes for gluster

2016-04-19 Thread Ric Wheeler

On 04/19/2016 10:09 AM, Jeff Cody wrote:

On Tue, Apr 19, 2016 at 08:18:39AM -0400, Ric Wheeler wrote:

On 04/19/2016 08:07 AM, Jeff Cody wrote:

Bug fixes for gluster; third patch is to prevent
a potential data loss when trying to recover from
a recoverable error (such as ENOSPC).

Hi Jeff,

Just a note, I have been talking to some of the disk drive people
here at LSF (the kernel summit for file and storage people) and got
a non-public confirmation that individual storage devices (s-ata
drives or scsi) can also dump cache state when a synchronize cache
command fails.  Also followed up with Rik van Riel - in the page
cache in general, when we fail to write back dirty pages, they are
simply marked "clean" (which means effectively that they get
dropped).

Long winded way of saying that I think that this scenario is not
unique to gluster - any failed fsync() to a file (or block device)
might be an indication of permanent data loss.


Ric,

Thanks.

I think you are right; we likely do need to address how QEMU handles fsync
failures across the board at some point (2.7?).  Another point to consider
is that QEMU is cross-platform - so not only do we have different protocols
and filesystems, but also different underlying host OSes.
It is likely, like you said, that there are other non-gluster scenarios where
we have non-recoverable data loss on fsync failure.

With Gluster specifically, if we look at just ENOSPC, does this mean that
even if Gluster retains its cache after fsync failure, we still won't know
whether there was permanent data loss?  If we hit ENOSPC during an fsync, I
presume that means Gluster itself may have encountered ENOSPC from an fsync to
the underlying storage.  In that case, does Gluster just pass the error up
the stack?

Jeff


I still worry that in many non-gluster situations we will have permanent data 
loss here. Specifically, the way the page cache works, if we fail to write back 
cached data *at any time*, a future fsync() will get a failure.


That failure could be because of a thinly provisioned backing store, but in the 
interim, the page cache is free to drop the pages that had failed. In effect, we 
end up with data loss in part or in whole without a way to detect which bits got 
dropped.


Note that this is not a gluster issue, this is for any file system on top of 
thinly provisioned storage (i.e., we would see this with xfs on thin storage or 
ext4 on thin storage).  In effect, if gluster has written the data back to xfs 
and that is on top of a thinly provisioned target, the kernel might drop that 
data before you can try an fsync again. Even if you retry the fsync(), the pages 
are marked clean so they will not be pushed back to storage on that second fsync().


Same issue with link loss - if we lose connection to a storage target, it is 
likely to take time to detect that, and more time to reconnect. In the interim, any 
page cache data is very likely to get dropped under memory pressure.


In both of these cases, an fsync() failure is effectively a signal that data has 
most likely already been lost. A retry will not save the day.


At LSF/MM today, we discussed an option that would allow the page cache to hang 
on to data - for retryable errors only, for example - so that this would not 
happen. The impact of this is also potentially huge (page cache/physical memory 
could be exhausted while waiting for an admin to fix the issue), so it would have 
to be a non-default option.


I think that we will need some discussions with the kernel memory management 
team (and some storage kernel people) to see what seems reasonable here.


Regards,

Ric




The final patch closes the gluster fd and sets the
protocol drv to NULL on fsync failure in gluster;
we have no way of knowing which gluster versions
support retaining the fsync cache on error, so until
we do, the safest thing is to invalidate the
drive.

Jeff Cody (3):
   block/gluster: return correct error value
   block/gluster: code movement of qemu_gluster_close()
   block/gluster: prevent data loss after i/o error

  block/gluster.c | 66 ++---
  configure   |  8 +++
  2 files changed, 62 insertions(+), 12 deletions(-)






Re: [Qemu-block] [PATCH for-2.6 v2 0/3] Bug fixes for gluster

2016-04-19 Thread Ric Wheeler

On 04/19/2016 08:07 AM, Jeff Cody wrote:

Bug fixes for gluster; third patch is to prevent
a potential data loss when trying to recover from
a recoverable error (such as ENOSPC).


Hi Jeff,

Just a note, I have been talking to some of the disk drive people here at LSF 
(the kernel summit for file and storage people) and got a non-public 
confirmation that individual storage devices (s-ata drives or scsi) can also 
dump cache state when a synchronize cache command fails.  Also followed up with 
Rik van Riel - in the page cache in general, when we fail to write back dirty 
pages, they are simply marked "clean" (which means effectively that they get 
dropped).


Long winded way of saying that I think that this scenario is not unique to 
gluster - any failed fsync() to a file (or block device) might be an indication 
of permanent data loss.


Regards,

Ric



The final patch closes the gluster fd and sets the
protocol drv to NULL on fsync failure in gluster;
we have no way of knowing which gluster versions
support retaining the fsync cache on error, so until
we do, the safest thing is to invalidate the
drive.

Jeff Cody (3):
   block/gluster: return correct error value
   block/gluster: code movement of qemu_gluster_close()
   block/gluster: prevent data loss after i/o error

  block/gluster.c | 66 ++---
  configure   |  8 +++
  2 files changed, 62 insertions(+), 12 deletions(-)






Re: [Qemu-block] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error

2016-04-06 Thread Ric Wheeler


We had a thread discussing this off the upstream list.

My summary of the thread is that I don't understand why gluster should drop 
cached data after a failed fsync() for any open file. For closed files, I think 
it might still happen but this is the same as any file system (and unlikely to 
be the case for qemu?).


I will note that Linux in general had (still has, I think?) the behavior that 
once the process closes a file (or exits), we lose the context to return an error 
to. From that point on, any IO from the page cache that fails to reach the target 
disk will be dropped from the cache. Holding things in the cache would let it fill 
with old data that is not really recoverable, and we have no good way to know 
whether the situation is repairable or how long that might take. Upstream kernel 
people have debated this; the behavior might be tweaked for certain types of errors.


Regards,

Ric


On 04/06/2016 07:02 AM, Kevin Wolf wrote:

[ Adding some CCs ]

On 06.04.2016 at 05:29, Jeff Cody wrote:

Upon receiving an I/O error after an fsync, by default gluster will
dump its cache.  However, QEMU will retry the fsync, which is especially
useful when encountering errors such as ENOSPC when using the werror=stop
option.  When using caching with gluster, however, the last written data
will be lost upon encountering ENOSPC.  Using the cache xlator option of
'resync-failed-syncs-after-fsync' should cause gluster to retain the
cached data after a failed fsync, so that ENOSPC and other transient
errors are recoverable.

Signed-off-by: Jeff Cody 
---
  block/gluster.c | 27 +++
  configure   |  8 
  2 files changed, 35 insertions(+)

diff --git a/block/gluster.c b/block/gluster.c
index 30a827e..b1cf71b 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
         goto out;
     }
 
+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
+    /* Without this, if fsync fails for a recoverable reason (for instance,
+     * ENOSPC), gluster will dump its cache, preventing retries.  This means
+     * almost certain data loss.  Not all gluster versions support the
+     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
+     * discover during runtime if it is supported (this api returns success for
+     * unknown key/value pairs) */

Honestly, this sucks. There is apparently no way to operate gluster so
we can safely recover after a failed fsync. "We hope everything is fine,
but depending on your gluster version, we may now corrupt your image"
isn't very good.

We need to consider very carefully if this is good enough to go on after
an error. I'm currently leaning towards "no". That is, we should only
enable this after Gluster provides us a way to make sure that the option
is really set.


+    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
+                                  "resync-failed-syncs-after-fsync",
+                                  "on");
+    if (ret < 0) {
+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
+        ret = -errno;
+        goto out;
+    }
+#endif

We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
In this case (as well as theoretically in the case that the option
didn't take effect - if only we could know about it), a failed
glfs_fsync_async() is fatal and we need to stop operating on the image,
i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
The guest will see a broken disk that fails all I/O requests, but that's
better than corrupting data.
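
For illustration, a rough sketch of what such a fatal-error path could look
like (the function name and cleanup order are assumptions, not the actual
patch; it assumes the BDRVGlusterState layout from block/gluster.c):

/* Sketch: on a flush failure we cannot safely retry, detach the driver so
 * the guest sees I/O errors instead of silently corrupted data.  Fragment
 * assuming block/gluster.c context (block_int.h, glusterfs/api/glfs.h). */
static void qemu_gluster_mark_broken(BlockDriverState *bs)
{
    BDRVGlusterState *s = bs->opaque;

    if (s->fd) {
        glfs_close(s->fd);          /* drop the gluster file handle */
        s->fd = NULL;
    }
    bs->drv = NULL;                 /* all further requests now fail, as with
                                       detected qcow2 corruption */
}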

Kevin