Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2012-08-15 Thread Bruce Momjian
On Tue, Sep 13, 2011 at 10:32:01AM -0400, Aidan Van Dyk wrote:
 On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug f...@phlo.org wrote:
 
  Personally, I'd think that's ripe for bugs.  If the contract is that
  ret != amount is the error case, then don't return -1 for an error
  *sometimes*.
 
  Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
  will return the number of bytes written, which may be less than the requested
  number if there's not enough free space, or -1 in case of an error like
  an invalid fd being passed.
 
 Looking through the code, it appears as if all the write calls I've
 seen are checking ret != amount, so it's probably not as big a deal
 for PG as I fear...
 
 But the subtle change in semantics (from system write, where ret != amount is
 not necessarily a real error, hence no errno set, to pg_write, where ret !=
 amount only happens after a real error, so errno should be set) is one that
 could yet lead to confusion.

I assume there is no TODO here.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                           http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread Florian Pflug
[CC'ing to the list again - I assume you omitted pgsql-hackers from the
recipient list by accident]

On Sep13, 2011, at 03:00 , George Barnett wrote:
 On 12/09/2011, at 11:39 PM, Florian Pflug wrote:
 Also, non-interruptible IO primitives are by no means "right". At best, they're
 a compromise between complexity and functionality for I/O devices with rather
 short (and bounded) communication timeouts - because in that case, processes
 are only blocked un-interruptibly for a short while.
 
 Just to expand on that - I'm now in the situation where I can run my nfs mounts
 'nointr' and postgres will work, but that means if I lose a storage unit I have
 a number of stuck processes, effectively meaning I need to reboot all my
 frontend servers before I can fail over to backup nfs stores.
 
 However, if I run the mounts with intr, then if a storage unit fails, I can
 fail over to a backup node (taking a minor loss of data hit I'm willing to
 accept) but postgres breaks under a moderate insert load.
 
 With the patch I supplied though, I'm able to have most of my cake and eat it.
 
 I'd be very interested in moving this forward - is there something I can change
 in the patch to make it more acceptable for a merge?

Here are a few comments

Tom already remarked that if we do that for write()s, we ought to do it for
read()s also, which I agree with. All other primitives like lseek, close, ...
should be taken care of by SA_RESTART, but it'd be a good idea to verify that.

Also, I don't think that POSIX mandates that errno be reset to 0 if a function
returns successfully, making that returnCode == 0 && errno == 0 check pretty
dubious. I'm not sure if this was what Tom was getting at with his remark about
the ENOSPC handling being wrong in the retry case.

And I also think that if we do this, we might as well handle EINTR correctly,
even if our use of SA_RESTART should prevent us from ever seeing that. The
rules surrounding EINTR and SA_RESTART for read/write are quite subtle...

If we retry, shouldn't we do CHECK_FOR_INTERRUPTS()? Otherwise, processes
waiting for a vanished NFS server would be killable only with SIGKILL, not
SIGTERM or SIGINT. But I'm not sure if it's safe to put that into a generic
function like pg_write_nointr.

Finally, WriteAll() seems like a poor name for that function. How about 
pg_write_nointr()?

Here's my suggested implementation for pg_write_nointr. pg_read_nointr should
be similar (but obviously without the ENOSPC handling)

int pg_write_nointr(int fd, const void *bytes, Size amount)
{
  int written = 0;

  while (amount > 0)
  {
    int ret;

    ret = write(fd, bytes, amount);
    if ((ret < 0) && (errno == EINTR))
    {
      /* interrupted by signal before first byte was written. Retry */
      /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
      CHECK_FOR_INTERRUPTS();
      continue;
    }
    else if (ret < 0)
    {
      /* error occurred. Abort */
      return -1;
    }
    else if (ret == 0)
    {
      /* out of disk space. Abort */
      return written;
    }

    /* made progress */

    /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
    CHECK_FOR_INTERRUPTS();

    written += ret;
    amount -= ret;
    bytes = (const char *) bytes + ret;
  }

  /* all bytes written */
  return written;
}

best regards,
Florian Pflug





Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread Florian Pflug
On Sep13, 2011, at 13:07 , Florian Pflug wrote:
 Here's my suggested implementation for pg_write_nointr. pg_read_nointr should
 be similar (but obviously without the ENOSPC handling)
 
 <wrong pg_write_nointr implementation snipped>

Sorry for the self-reply. I realized only after hitting send that I
got the ENOSPC handling wrong again - we probably ought to check for
ENOSPC as well as ret == 0. Also, it seems preferable to return the
number of bytes actually written instead of -1 if we hit an error during
retry.

With this version, any return value other than amount signals an
error, the number of actually written bytes is reported even in the
case of an error (to the best of pg_write_nointr's knowledge), and
errno always indicates the kind of error.

int pg_write_nointr(int fd, const void *bytes, Size amount)
{
  int written = 0;

  while (amount > 0)
  {
    int ret;

    ret = write(fd, bytes, amount);

    if ((ret < 0) && (errno == EINTR))
    {
      /* interrupted by signal before first byte was written. Retry */

      /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
      CHECK_FOR_INTERRUPTS();

      continue;
    }
    else if (ret < 1)
    {
      /* error occurred. Abort */

      if (ret == 0)
        /* out of disk space */
        errno = ENOSPC;

      if (written == 0)
        return -1;
      else
        return written;
    }

    /* made progress */
    written += ret;
    amount -= ret;
    bytes = (const char *) bytes + ret;

    /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
    CHECK_FOR_INTERRUPTS();
  }

  /* all bytes written */
  return written;
}

best regards,
Florian Pflug




Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread k...@rice.edu
On Tue, Sep 13, 2011 at 01:30:34PM +0200, Florian Pflug wrote:
 On Sep13, 2011, at 13:07 , Florian Pflug wrote:
  Here's my suggested implementation for pg_write_nointr. pg_read_nointr
  should be similar (but obviously without the ENOSPC handling)
  
  <wrong pg_write_nointr implementation snipped>
 
 Sorry for the self-reply. I realized only after hitting send that I
 got the ENOSPC handling wrong again - we probably ought to check for
 ENOSPC as well as ret == 0. Also, it seems preferable to return the
 number of bytes actually written instead of -1 if we hit an error during
 retry.
 
 With this version, any return value other than amount signals an
 error, the number of actually written bytes is reported even in the
 case of an error (to the best of pg_write_nointr's knowledge), and
 errno always indicates the kind of error.
 
 int pg_write_nointr(int fd, const void *bytes, Size amount)
 {
   int written = 0;
 
   while (amount > 0)
   {
     int ret;
 
     ret = write(fd, bytes, amount);
 
     if ((ret < 0) && (errno == EINTR))
     {
       /* interrupted by signal before first byte was written. Retry */
 
       /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
       CHECK_FOR_INTERRUPTS();
 
       continue;
     }
     else if (ret < 1)
     {
       /* error occurred. Abort */
 
       if (ret == 0)
         /* out of disk space */
         errno = ENOSPC;
 
       if (written == 0)
         return -1;
       else
         return written;
     }
 
     /* made progress */
     written += ret;
     amount -= ret;
     bytes = (const char *) bytes + ret;
 
     /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
     CHECK_FOR_INTERRUPTS();
   }
 
   /* all bytes written */
   return written;
 }
 
 best regards,
 Florian Pflug
 

It will be interesting to see if there are any performance ramifications to
this new write function.

Regards,
Ken



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread Florian Pflug
On Sep13, 2011, at 14:58 , k...@rice.edu wrote:
 It will be interesting to see if there are any performance ramifications to
 this new write function.

What would those be? For non-interruptible reads and writes, the overhead
comes down to an additional function call (if we don't make pg_write_nointr
inlined) and a few conditional jumps (which branch prediction should be
able to take care of). These are bound to disappear in the noise compared
to the cost of the actual syscall.

best regards,
Florian Pflug




Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread k...@rice.edu
On Tue, Sep 13, 2011 at 03:02:57PM +0200, Florian Pflug wrote:
 On Sep13, 2011, at 14:58 , k...@rice.edu wrote:
  It will be interesting to see if there are any performance ramifications to
  this new write function.
 
 What would those be? For non-interruptible reads and writes, the overhead
 comes down to an additional function call (if we don't make pg_write_nointr
 inlined) and a few conditional jumps (which branch prediction should be
 able to take care of). These are bound to disappear in the noise compared
 to the cost of the actual syscall.
 
 best regards,
 Florian Pflug
 
That would be my expectation too. It is just always nice to benchmark changes,
just in case. I have had similar simple changes blow out a cache and have a
much greater impact on performance than might be expected from inspection. :)

Regards,
Ken



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread Aidan Van Dyk
On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug f...@phlo.org wrote:


 Sorry for the self-reply. I realized only after hitting send that I
 got the ENOSPC handling wrong again - we probably ought to check for
 ENOSPC as well as ret == 0. Also, it seems preferable to return the
 number of bytes actually written instead of -1 if we hit an error during
 retry.

 With this version, any return value other than amount signals an
 error, the number of actually written bytes is reported even in the
 case of an error (to the best of pg_write_nointr's knowledge), and
 errno always indicates the kind of error.

Personally, I'd think that's ripe for bugs.  If the contract is that
ret != amount is the error case, then don't return -1 for an error
*sometimes*.

If you sometimes return -1 for an error, even though ret != amount is
the *real* test, I'm going to guess there will be lots of chance for
code to do:
  if (pg_write_no_intr(...) < 0)
   ...

which will only catch some of the errors, and happily continue with the rest...

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread Florian Pflug
On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
 On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug f...@phlo.org wrote:
 Sorry for the self-reply. I realized only after hitting send that I
 got the ENOSPC handling wrong again - we probably ought to check for
 ENOSPC as well as ret == 0. Also, it seems preferable to return the
 number of bytes actually written instead of -1 if we hit an error during
 retry.
 
 With this version, any return value other than amount signals an
 error, the number of actually written bytes is reported even in the
 case of an error (to the best of pg_write_nointr's knowledge), and
 errno always indicates the kind of error.
 
 Personally, I'd think that's ripe for bugs.  If the contract is that
 ret != amount is the error case, then don't return -1 for an error
 *sometimes*.

Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
will return the number of bytes written, which may be less than the requested
number if there's not enough free space, or -1 in case of an error like
an invalid fd being passed.

 If you sometimes return -1 for an error, even though ret != amount is
 the *real* test, I'm going to guess there will be lots of chance for
 code to do:
  if (pg_write_no_intr(...) < 0)
   ...
 
 which will only catch some of the errors, and happily continue with the 
 rest...

Yeah, but that's equally wrong for plain write(), so I'm not sure I share
your concern there. Also, I'm not sure how to improve that. We could always
return -1 in case of an error, and amount in case of success, but that makes
it impossible to determine how many bytes were actually written (and also
feels awkward). Or we could return 0 instead of -1 if there was an error and
zero bytes were written. But that feels awkward also...

One additional possibility would be to make the signature

  bool pg_write_nointr(int fd, const void *bytes, int len, int *written)

and simply return true on success and false on error. Callers who're interested
in the number of bytes actually written (in the case of an error) would need to
pass some non-NULL pointer for written, while all others would just pass NULL.
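
For concreteness, a minimal sketch of that variant - the name pg_write_nointr2
and its layering on the pg_write_nointr() posted upthread are hypothetical,
and it assumes PostgreSQL's c.h for bool/Size:

bool
pg_write_nointr2(int fd, const void *bytes, int len, int *written)
{
    int ret = pg_write_nointr(fd, bytes, len);

    if (written != NULL)
        *written = (ret < 0) ? 0 : ret;

    /* success only if every byte was written; errno is set otherwise */
    return (ret == len);
}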

Thoughts?

best regards,
Florian Pflug





Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread Tom Lane
Florian Pflug f...@phlo.org writes:
 On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
 Personally, I'd think that's ripe for bugs.  If the contract is that
 ret != amount is the error case, then don't return -1 for an error
 *sometimes*.

 Hm, but isn't that how write() works also?

Yeah.  It's not possible to maintain the same error-reporting contract
that bare write() has got, unless you're willing to forget about actual
errors reported by a non-first write attempt.  Which might not be
totally unreasonable, because presumably something similar is going on
under the hood within write() itself.  Most of the errors one might
think are worth reporting would have had to occur on the first write
attempt anyway.

But if you do want to report such errors, I think you have to push the
error reporting logic into the subroutine, which seems a bit messy since
there's quite a variety of error message phrasings out there, all of
which require information that write() itself does not have.  Also, we
do *not* want e.g. gettext() to be invoked unless an error actually
occurs and has to be reported.
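
For concreteness, one hypothetical shape of such a subroutine (the helper name
and message phrasing are illustrative only, not taken from any actual call
site; note that ereport() already defers gettext() until an error is actually
raised) - and even this assumes every call site is happy with one generic
message:

static void
pg_write_or_error(int fd, const void *bytes, Size amount, const char *filename)
{
    /* filename is exactly the kind of context bare write() lacks */
    if (pg_write_nointr(fd, bytes, amount) != (int) amount)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write to file \"%s\": %m", filename)));
}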

regards, tom lane



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread Aidan Van Dyk
On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug f...@phlo.org wrote:

 Personally, I'd think that's ripe for bugs.  If the contract is that
 ret != amount is the error case, then don't return -1 for an error
 *sometimes*.

 Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
 will return the number of bytes written, which may be less than the requested
 number if there's not enough free space, or -1 in case of an error like
 an invalid fd being passed.

Looking through the code, it appears as if all the write calls I've
seen are checking ret != amount, so it's probably not as big a deal
for PG as I fear...

But the subtle change in semantics (from system write, where ret != amount is
not necessarily a real error, hence no errno set, to pg_write, where ret !=
amount only happens after a real error, so errno should be set) is one that
could yet lead to confusion.

a.


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-13 Thread Florian Pflug
On Sep13, 2011, at 16:25 , Tom Lane wrote:
 Florian Pflug f...@phlo.org writes:
 On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
  Personally, I'd think that's ripe for bugs.  If the contract is that
  ret != amount is the error case, then don't return -1 for an error
  *sometimes*.
 
 Hm, but isn't that how write() works also?
 
 Yeah.  It's not possible to maintain the same error-reporting contract
 that bare write() has got, unless you're willing to forget about actual
 errors reported by a non-first write attempt.

Hm, yeah, but we're only replacing the exclusive or in "either sets errno
*or* returns >= 0 and < amount" by a non-exclusive one. Which, in practice,
doesn't make much difference for callers. They can (and should) continue to
check whether the correct amount of bytes has been written, and they may
still use errno to distinguish different kinds of errors. They should just
do so upon any error condition, not upon us returning -1.
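
For instance, a hypothetical call site under that contract (names are
illustrative only):

    int nwritten = pg_write_nointr(fd, buf, len);

    if (nwritten != len)
    {
        /* errno is meaningful whether nwritten is -1 or a partial count */
        elog(ERROR, "write failed: %m");
    }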

The important thing, I believe, is that we don't withhold any information
from callers, which we don't. If write() sets errno, it must return -1,
so we'll abort and hence leave the errno in place to be inspected by the
caller. And we faithfully track the actual number of bytes written.

Or am I missing something?

 But if you do want to report such errors, I think you have to push the
 error reporting logic into the subroutine, which seems a bit messy since
 there's quite a variety of error message phrasings out there, all of
 which require information that write() itself does not have.  Also, we
 do *not* want e.g. gettext() to be invoked unless an error actually
 occurs and has to be reported.

Yeah, I had the same idea (moving the error reporting into the subroutine)
when I first looked at the OP's patch, but then figured it'd just complicate
the API for no good reason.

best regards,
Florian Pflug





Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-12 Thread Florian Pflug
On Sep12, 2011, at 06:30 , George Barnett wrote:
 On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:
 
 --On 9. September 2011 10:27:22 -0400 Tom Lane t...@sss.pgh.pa.us wrote:
 
 On the whole I think you'd be better off lobbying your NFS implementors
 to provide something closer to the behavior of every other filesystem on
 the planet.  Or checking to see if you need to adjust your NFS
 configuration, as the other responders mentioned.
 
 You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, 
 otherwise you are out of luck. Oracle and DB2 guys recommend those settings 
 and without them any millisecond of network glitch could disturb things 
 unreasonably.
 
 My mount options include hard and intr.

If you really meant to say "intr" there (and not "nointr") then that probably
explains the partial writes.

Still, I agree with Noah and Kevin that we ought to deal more gracefully with 
this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing 
to be gained by not doing that, and the increase in code complexity should be 
negligible. If we do that, however, I believe we might as well handle EINTR 
correctly, even if SA_RESTART should prevent us from ever seeing that.

best regards,
Florian Pflug




Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-12 Thread George Barnett
On 12/09/2011, at 3:59 PM, Florian Pflug wrote:

 If you really meant to say "intr" there (and not "nointr") then that probably
 explains the partial writes.
 
 Still, I agree with Noah and Kevin that we ought to deal more gracefully with 
 this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing 
 to be gained by not doing that, and the increase in code complexity should be 
 negligible. If we do that, however, I believe we might as well handle EINTR 
 correctly, even if SA_RESTART should prevent us from ever seeing that.


Hi Florian,

You are indeed correct.  Setting nointr also resolves my issue.  I could swear 
I checked this, but obviously not.

It does still concern me that pgsql did not deal with this as gracefully as 
other software.  I hope the list will consider a patch to resolve that.

Thanks in advance,

George


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-12 Thread Peter Eisentraut
On Mon, 2011-09-12 at 16:46 +1000, George Barnett wrote:
 On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
 
  If you really meant to say "intr" there (and not "nointr") then that
  probably explains the partial writes.
  
  Still, I agree with Noah and Kevin that we ought to deal more gracefully 
  with this, i.e. resubmit after a partial read() or write(). AFAICS there's 
  nothing to be gained by not doing that, and the increase in code complexity 
  should be negligible. If we do that, however, I believe we might as well 
  handle EINTR correctly, even if SA_RESTART should prevent us from ever 
  seeing that.
 
 
 Hi Florian,
 
 You are indeed correct.  Setting nointr also resolves my issue.  I could 
 swear I checked this, but obviously not.
 
 It does still concern me that pgsql did not deal with this as gracefully as 
 other software.  I hope the list will consider a patch to resolve that.

We have signal handling configured so that system calls are not
interrupted.  So there is ordinarily no reason to do anything more
graceful.  The problem is that NFS is in this case not observing that
setting.  It's debatable whether it's worth supporting that; just saying
that the code is correct as it stands.
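
(For reference, "configured so that system calls are not interrupted" comes
down to installing handlers with SA_RESTART, roughly like the minimal sketch
below; this is a standalone illustration, not the actual pqsignal() source:)

#include <signal.h>

static void
install_restarting_handler(int signo, void (*handler)(int))
{
    struct sigaction act;

    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_RESTART;  /* ask the kernel to restart slow syscalls */
    sigaction(signo, &act, NULL);
}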





Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-12 Thread k...@rice.edu
On Mon, Sep 12, 2011 at 04:46:53PM +1000, George Barnett wrote:
 On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
 
  If you really meant to say "intr" there (and not "nointr") then that
  probably explains the partial writes.
  
  Still, I agree with Noah and Kevin that we ought to deal more gracefully 
  with this, i.e. resubmit after a partial read() or write(). AFAICS there's 
  nothing to be gained by not doing that, and the increase in code complexity 
  should be negligible. If we do that, however, I believe we might as well 
  handle EINTR correctly, even if SA_RESTART should prevent us from ever 
  seeing that.
 
 
 Hi Florian,
 
 You are indeed correct.  Setting nointr also resolves my issue.  I could 
 swear I checked this, but obviously not.
 
 It does still concern me that pgsql did not deal with this as gracefully as 
 other software.  I hope the list will consider a patch to resolve that.
 
 Thanks in advance,
 
 George

Hi George,

Many, many, many other software packages expect I/O usage to be the same on
an NFS volume and a local disk volume, including Oracle. Coding every
application, or more likely mis-coding, to handle this gives every application
another chance to get it wrong. If the OS does this, when it gets it right,
all of the apps get it right. I think you should be surprised when other
software actually deals with broken I/O semantics gracefully rather than
concerned when one of a pantheon of programs does not. My two cents.

Regards,
Ken



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-12 Thread Florian Pflug
On Sep12, 2011, at 14:54 , Peter Eisentraut wrote:
 On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
 On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
 Still, I agree with Noah and Kevin that we ought to deal more gracefully 
 with this, i.e. resubmit after a partial read() or write(). AFAICS there's 
 nothing to be gained by not doing that, and the increase in code complexity 
 should be negligible. If we do that, however, I believe we might as well 
 handle EINTR correctly, even if SA_RESTART should prevent us from ever 
 seeing that.
 
 It does still concern me that pgsql did not deal with this as gracefully as 
 other software.  I hope the list will consider a patch to resolve that.
 
 We have signal handling configured so that system calls are not
 interrupted.  So there is ordinarily no reason to do anything more
 graceful.  The problem is that NFS is in this case not observing that
 setting.  It's debatable whether it's worth supporting that; just saying
 that the code is correct as it stands.

SA_RESTART doesn't protect against partial reads/writes due to signal delivery,
it only removes the need to check for EINTR. In other words, it retries until
at least one byte has been written, not until all bytes have been written.

The GNU LibC documentation has this to say on the subject

  There is one situation where resumption never happens no matter which
   choice you make: when a data-transfer function such as read or write is
   interrupted by a signal after transferring part of the data. In this case,
   the function returns the number of bytes already transferred, indicating
   partial success.[1]

While it's true that reads and writes are by tradition non-interruptible, I
personally wouldn't bet that they'll stay that way forever. It all depends on
whether the timeouts involved in the communication with a disk are short
enough to mask the difference - once they get too long (or even infinite like
in the case of hard NFS mounts) you pay for non-interruptible primitives with
un-killable stuck processes. Since the current trend is to move storage
further away from processing, and to put non-deterministic networks like
ethernet between the two, I expect situations in which read/write primitives
are interruptible to increase, not decrease.

And BTW, the GNU LibC documentation doesn't seem to mention anything about
local reads and writes being non-interruptible. In fact, it even says

   A signal can arrive and be handled while an I/O primitive such as open or
    read is waiting for an I/O device. If the signal handler returns, the
    system faces the question: what should happen next?[1]

Had the GNU people faith in local reads and writes being non-interruptible,
they'd probably have said "network device" or "remote device", not "I/O
device".

best regards,
Florian Pflug

[1] http://www.gnu.org/s/hello/manual/libc/Interrupted-Primitives.html#Interrupted-Primitives


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-12 Thread Florian Pflug
On Sep12, 2011, at 14:54 , k...@rice.edu wrote:
 Many, many, many other software packages expect I/O usage to be the same on
 an NFS volume and a local disk volume, including Oracle. Coding every
 application, or more likely mis-coding, to handle this gives every
 application another chance to get it wrong. If the OS does this, when it
 gets it right, all of the apps get it right. I think you should be surprised
 when other software actually deals with broken I/O semantics gracefully
 rather than concerned when one of a pantheon of programs does not. My two
 cents.

I don't buy that. People seem to be perfectly able to code correct networking
applications (correct from a read/write API POV at least), yet those
applications need to deal with partial reads and writes too.

Really, it's not *that* hard to put a retry loop around read and write.

Also, non-interruptible IO primitives are by no means "right". At best,
they're a compromise between complexity and functionality for I/O devices
with rather short (and bounded) communication timeouts - because in that
case, processes are only blocked un-interruptibly for a short while.

best regards,
Florian Pflug




Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-12 Thread Robert Haas
On Mon, Sep 12, 2011 at 9:39 AM, Florian Pflug f...@phlo.org wrote:
 Really, it's not *that* hard to put a retry loop around read and write.

+1.  I don't see what we're gaining by digging in our heels on this
one.  Retry loops around read() and write() are pretty routine, and I
wouldn't like to bet this is the only case where not having them could
cause an unnecessary failure.

Now, that having been said, I *really* think we could use some better
documentation on which mount options we believe to be safe, and not
just for NFS.  Right now we have practically nothing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-11 Thread George Barnett
On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:

 --On 9. September 2011 10:27:22 -0400 Tom Lane t...@sss.pgh.pa.us wrote:
 
 On the whole I think you'd be better off lobbying your NFS implementors
 to provide something closer to the behavior of every other filesystem on
 the planet.  Or checking to see if you need to adjust your NFS
 configuration, as the other responders mentioned.
 
 You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, 
 otherwise you are out of luck. Oracle and DB2 guys recommend those settings 
 and without them any millisecond of network glitch could disturb things 
 unreasonably.

Hi,

My mount options include hard and intr.

George


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Thom Brown
On 9 September 2011 01:04, George Barnett gbarn...@atlassian.com wrote:
 After looking through the code I found that when postgres calls write() it 
 doesn't retry.  In order to address the issue with the PANIC in the WAL 
 writer I set the sync method to o_sync which solved the issue in that part of 
 the code, however I was still seeing failures in other areas of the code 
 (such as the FileWrite function).  Following this, I spoke to an NFS guru who 
 pointed out that writes under linux are not guaranteed to complete unless you 
 open up O_SYNC or similar on the file handle.

Have you run the test with varying wal_sync_method values?  On Linux
the default is fdatasync because historically Linux hasn't supported
O_DSYNC (a wal_sync_method value of open_datasync).  But I believe as
of Kernel 2.6.33 it does support it.  Have you tried modifying this
parameter in your tests?  Are you even using Linux? (you haven't
specified)

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Peter Eisentraut
On fre, 2011-09-09 at 10:04 +1000, George Barnett wrote:
 After looking through the code I found that when postgres calls
 write() it doesn't retry.  In order to address the issue with the
 PANIC in the WAL writer I set the sync method to o_sync which solved
 the issue in that part of the code, however I was still seeing
 failures in other areas of the code (such as the FileWrite function).
 Following this, I spoke to an NFS guru who pointed out that writes
 under linux are not guaranteed to complete unless you open up O_SYNC
 or similar on the file handle.

I've had this problem many years ago.  I recall that changing the mount
options for NFS also fixed it.  Could you post what mount options you
are using?

(We eventually moved away from NFS at that time, so I didn't pursue it
further, but my analysis back then matched yours.)





Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Tom Lane
George Barnett gbarn...@atlassian.com writes:
 [ patch to retry writes on NFS ]

I'm having a hard time getting excited about this idea, because IMO
NFS is insufficiently reliable to run a database on, and no patch like
this can fix that.  However, some concrete points:

1. If writes need to be retried, why not reads?  (No, that's not an
invitation to expand the scope of the patch; it's a question about NFS
implementation.)

2. What is the rationale for supposing that a retry a nanosecond later
will help?  If it will help, why didn't the kernel just do that?

3. What about EINTR?  If you believe that this is necessary, then the
kernel logically has to return EINTR when it would like you to retry but
it hasn't been able to write any bytes yet.  If you get a zero return
you have to assume that means out-of-disk-space.

4. As coded, the patch behaves incorrectly if you get a zero return on a
retry.  If we were going to do this, I think we'd need to absorb the
errno-munging currently done by callers into the writeAll function.

On the whole I think you'd be better off lobbying your NFS implementors
to provide something closer to the behavior of every other filesystem on
the planet.  Or checking to see if you need to adjust your NFS
configuration, as the other responders mentioned.

regards, tom lane



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Bernd Helmle



--On 9. September 2011 10:27:22 -0400 Tom Lane t...@sss.pgh.pa.us wrote:


On the whole I think you'd be better off lobbying your NFS implementors
to provide something closer to the behavior of every other filesystem on
the planet.  Or checking to see if you need to adjust your NFS
configuration, as the other responders mentioned.


You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, 
otherwise you are out of luck. Oracle and DB2 guys recommend those settings and 
without them any millisecond of network glitch could disturb things 
unreasonably.


--
Thanks

Bernd



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Noah Misch
On Fri, Sep 09, 2011 at 10:27:22AM -0400, Tom Lane wrote:
 George Barnett gbarn...@atlassian.com writes:
  [ patch to retry writes on NFS ]
 
 I'm having a hard time getting excited about this idea, because IMO
 NFS is insufficiently reliable to run a database on, and no patch like
 this can fix that.  However, some concrete points:
 
 1. If writes need to be retried, why not reads?  (No, that's not an
 invitation to expand the scope of the patch; it's a question about NFS
 implementation.)

To support all POSIX:2008-conforming read() implementations, we must indeed
retry reads.  I suppose the OP never encountered this in practice, though.
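
(A sketch of the corresponding read-side loop, mirroring the write-side
proposals elsewhere in the thread - the name follows the thread's convention
but is hypothetical; here ret == 0 is genuine EOF rather than ENOSPC:)

#include <errno.h>
#include <unistd.h>

int
pg_read_nointr(int fd, void *bytes, size_t amount)
{
    int nread = 0;

    while (amount > 0)
    {
        ssize_t ret = read(fd, bytes, amount);

        if (ret < 0 && errno == EINTR)
            continue;               /* interrupted before any byte arrived */
        else if (ret < 0)
            return nread > 0 ? nread : -1;  /* real error; errno is set */
        else if (ret == 0)
            break;                  /* EOF */

        nread += ret;
        amount -= ret;
        bytes = (char *) bytes + ret;
    }

    return nread;
}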

 2. What is the rationale for supposing that a retry a nanosecond later
 will help?  If it will help, why didn't the kernel just do that?

POSIX:2008 unconditionally permits[1] partial writes of requests exceeding 512
bytes (_POSIX_PIPE_BUF).  We shouldn't complain when a kernel provides a
conforming write(), even if it appears that the kernel achieved little by
using some freedom afforded it.

 3. What about EINTR?  If you believe that this is necessary, then the
 kernel logically has to return EINTR when it would like you to retry but
 it hasn't been able to write any bytes yet.  If you get a zero return
 you have to assume that means out-of-disk-space.

Assuming conforming SA_RESTART for write()/read(), this will not happen.  The
call will restart and resume blocking until it writes something.

nm

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Kevin Grittner
Noah Misch n...@leadboat.com wrote:
 
 We shouldn't complain when a kernel provides a conforming write(),
 even if it appears that the kernel achieved little by using some
 freedom afforded it.
 
I remember we had some compiler warnings in the logging area because
we were making the assumption that no implementation would ever use
that freedom.  I suggested we replace the simple, unchecked calls to
write with a function which did what we expected through an API
conforming loop:
 
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01719.php
 
The response was that we could ignore the documented API because we
had never seen nor heard of it being true for writes to disk
files.  I'm still uncomfortable with that.  Where I have seen
people code to implementation details rather than the documented
API, it has often not turned out well in the long run.
 
I'd still be willing to put together a patch for that if people buy
into it.
 
-Kevin



Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Bruce Momjian
Tom Lane wrote:
 George Barnett gbarn...@atlassian.com writes:
  [ patch to retry writes on NFS ]
 
 I'm having a hard time getting excited about this idea, because IMO
 NFS is insufficiently reliable to run a database on, and no patch like
 this can fix that.  However, some concrete points:
 
 1. If writes need to be retried, why not reads?  (No, that's not an
 invitation to expand the scope of the patch; it's a question about NFS
 implementation.)
 
 2. What is the rationale for supposing that a retry a nanosecond later
 will help?  If it will help, why didn't the kernel just do that?
 
 3. What about EINTR?  If you believe that this is necessary, then the
 kernel logically has to return EINTR when it would like you to retry but
 it hasn't been able to write any bytes yet.  If you get a zero return
 you have to assume that means out-of-disk-space.
 
 4. As coded, the patch behaves incorrectly if you get a zero return on a
 retry.  If we were going to do this, I think we'd need to absorb the
 errno-munging currently done by callers into the writeAll function.
 
 On the whole I think you'd be better off lobbying your NFS implementors
 to provide something closer to the behavior of every other filesystem on
 the planet.  Or checking to see if you need to adjust your NFS
 configuration, as the other responders mentioned.

Can our NFS documentation be improved (section 17.2.1)?

http://www.postgresql.org/docs/9.1/static/creating-cluster.html

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                           http://enterprisedb.com

  + It's impossible for everything to be true. +



[HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-08 Thread George Barnett
Hi Hackers,


postgresql-writeall.patch
Description: Binary data

I run a number of postgresql installations on NFS and on the whole I find this
to be very reliable.  I have however run into a few issues when there are
concurrent writes from multiple processes.

I see errors such as the following:

2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: user=postgres database=postgres
2011-07-31 22:13:35 EST ERROR:  could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
2011-07-31 22:13:35 EST HINT:  Check free disk space.
2011-07-31 22:13:35 EST CONTEXT:  writing block 1 of relation global/2671
2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: host=[local]

I have also seen similar errors coming out of the WAL writer, however they 
occur at the level PANIC, which is a little more drastic.

After spending some time with debug logging turned on and even more time 
staring at strace, I believe this occurs when one process was writing to a data 
file and it received a SIGINT from another process, eg:
(These logs are from another similar run)

[pid  1804] <... fsync resumed> )       = 0
[pid 10198] kill(1804, SIGINT <unfinished ...>
[pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
[pid  1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
[pid 10198] <... kill resumed> )        = 0
[pid  1804] <... write resumed> )       = 4096
[pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
[pid  1804] rt_sigreturn(0x2)           = 4096
[pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
[pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
[pid  1802] <... select resumed> )      = 1 (in [5], left {0, 999000})
[pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid  1804] tgkill(1804, 1804, SIGABRT) = 0
[pid  1802] read(5,  <unfinished ...>
[pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
Process 1804 detached

After finding this, I came up with the following test case which easily 
replicated our issue:

#!/bin/bash

name=$1
number=1
while true; do
  /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
  /usr/bin/createdb -E UNICODE -O $name$number $name$number
  if `grep -q PANIC /data/postgresql/data/pg_log/*`; then
    exit
  fi
  let number=$number+1
done

When I run a single copy of this script, I have no issues, however when I start
up a few more copies to simultaneously hit the DB, it crashes quite quickly -
usually within 20 or 30 seconds.

After looking through the code I found that when postgres calls write() it 
doesn't retry.  In order to address the issue with the PANIC in the WAL writer 
I set the sync method to o_sync which solved the issue in that part of the 
code, however I was still seeing failures in other areas of the code (such as 
the FileWrite function).  Following this, I spoke to an NFS guru who pointed 
out that writes under linux are not guaranteed to complete unless you open up 
O_SYNC or similar on the file handle.  I had a look in the libc docs and found 
this:

http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html


The write function writes up to size bytes from buffer to the file with 
descriptor filedes. The data in buffer is not necessarily a character string 
and a null character is output like any other character.

The return value is the number of bytes actually written. This may be size, but 
can always be smaller. Your program should always call write in a loop, 
iterating until all the data is written.


After finding this, I checked a number of other pieces of software that we see 
no issues with on NFS (such as the JVM) for their usage of write().  I 
confirmed they write in a while loop and set about patching the postgres source.

I have made this patch against 8.4.8 and confirmed that it fixes the issue we 
see on our systems.  I have also checked that make check still passes. 

As my C is terrible, I would welcome any comments on the implementation of this 
patch.

Best regards,

George





 


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-08 Thread Josh Berkus
George,

I'm quoting you here because in the version of your email which got
posted to the list your whole explanation got put below the patch text,
making it hard to find the justification for the patch.  Follows:

 I run a number of postgresql installations on NFS and on the whole I find
 this to be very reliable.  I have however run into a few issues when there
 are concurrent writes from multiple processes.
 
 I see errors such as the following:
 
 2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: user=postgres database=postgres
 2011-07-31 22:13:35 EST ERROR:  could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
 2011-07-31 22:13:35 EST HINT:  Check free disk space.
 2011-07-31 22:13:35 EST CONTEXT:  writing block 1 of relation global/2671
 2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: host=[local]
 
 I have also seen similar errors coming out of the WAL writer, however they 
 occur at the level PANIC, which is a little more drastic.
 
 After spending some time with debug logging turned on and even more time 
 staring at strace, I believe this occurs when one process was writing to a 
 data file and it received a SIGINT from another process, eg:
 (These logs are from another similar run)
 
 [pid  1804] <... fsync resumed> )       = 0
 [pid 10198] kill(1804, SIGINT <unfinished ...>
 [pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
 [pid  1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
 [pid 10198] <... kill resumed> )        = 0
 [pid  1804] <... write resumed> )       = 4096
 [pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
 [pid  1804] rt_sigreturn(0x2)           = 4096
 [pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
 [pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
 [pid  1802] <... select resumed> )      = 1 (in [5], left {0, 999000})
 [pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
 [pid  1804] tgkill(1804, 1804, SIGABRT) = 0
 [pid  1802] read(5,  <unfinished ...>
 [pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
 Process 1804 detached
 
 After finding this, I came up with the following test case which easily 
 replicated our issue:
 
 #!/bin/bash
 
 name=$1
 number=1
 while true; do
   /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
   /usr/bin/createdb -E UNICODE -O $name$number $name$number
   if `grep -q PANIC /data/postgresql/data/pg_log/*`; then
     exit
   fi
   let number=$number+1
 done
 
 When I run a single copy of this script, I have no issues, however when I
 start up a few more copies to simultaneously hit the DB, it crashes quite
 quickly - usually within 20 or 30 seconds.
 
 After looking through the code I found that when postgres calls write() it 
 doesn't retry.  In order to address the issue with the PANIC in the WAL 
 writer I set the sync method to o_sync which solved the issue in that part of 
 the code, however I was still seeing failures in other areas of the code 
 (such as the FileWrite function).  Following this, I spoke to an NFS guru who 
 pointed out that writes under linux are not guaranteed to complete unless you 
 open up O_SYNC or similar on the file handle.  I had a look in the libc docs 
 and found this:
 
 http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html
 
 
 The write function writes up to size bytes from buffer to the file with 
 descriptor filedes. The data in buffer is not necessarily a character string 
 and a null character is output like any other character.
 
 The return value is the number of bytes actually written. This may be size, 
 but can always be smaller. Your program should always call write in a loop, 
 iterating until all the data is written.
 
 
 After finding this, I checked a number of other pieces of software that we 
 see no issues with on NFS (such as the JVM) for their usage of write().  I 
 confirmed they write in a while loop and set about patching the postgres 
 source.
 
 I have made this patch against 8.4.8 and confirmed that it fixes the issue we 
 see on our systems.  I have also checked that make check still passes. 
 
 As my C is terrible, I would welcome any comments on the implementation of 
 this patch.
 
 Best regards,
 
 George


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
