Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Wed, Aug 29, 2012 at 12:56:26AM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Wed, Aug 29, 2012 at 12:24:26AM -0400, Alvaro Herrera wrote: It's a pretty strange line wrap you got in this version of the patch. Normally we just let the string run past the 78 char limit, without cutting it in any way. And moving the start of the string to the line following errhint( looks very odd to me. OK, updated patch attached. I agree with Alvaro's complaint that moving the whole string literal to the next line isn't conformant to our usual coding style. Definitely nitpicky, but why would you do it like that? I remember now why I added \n. I am used to writing pg_upgrade output strings, which are obviously not sent to log files. Seems I forgot that distinction. As far as moving the string to the next line, I was trying to keep the line from getting too long. Updated patch has everyone on the same line. I am fine with nitpicky. Frankly, I have applied so many patches in the past few weeks, I am glad someone is watching. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c new file mode 100644 index 775d71f..9a0f92c *** a/src/backend/utils/init/miscinit.c --- b/src/backend/utils/init/miscinit.c *** CreateLockFile(const char *filename, boo *** 766,771 --- 766,779 filename))); close(fd); + if (len == 0) + { + ereport(FATAL, + (errcode(ERRCODE_LOCK_FILE_EXISTS), + errmsg(lock file \%s\ is empty, filename), + errhint(Either another server is starting, or the lock file is the remnant of a previous server startup crash.))); + } + buffer[len] = '\0'; encoded_pid = atoi(buffer); diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c new file mode 100644 index af8d8b2..81ba39e *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *** get_pgpid(void) *** 292,299 } if (fscanf(pidf, %ld, pid) != 1) { ! write_stderr(_(%s: invalid data in PID file \%s\\n), ! progname, pid_file); exit(1); } fclose(pidf); --- 292,304 } if (fscanf(pidf, %ld, pid) != 1) { ! /* Is the file empty? */ ! if (ftell(pidf) == 0 feof(pidf)) ! write_stderr(_(%s: the PID file \%s\ is empty\n), ! progname, pid_file); ! else ! write_stderr(_(%s: invalid data in PID file \%s\\n), ! progname, pid_file); exit(1); } fclose(pidf); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Applied. --- On Wed, Aug 29, 2012 at 08:51:40AM -0400, Bruce Momjian wrote: On Wed, Aug 29, 2012 at 12:56:26AM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Wed, Aug 29, 2012 at 12:24:26AM -0400, Alvaro Herrera wrote: It's a pretty strange line wrap you got in this version of the patch. Normally we just let the string run past the 78 char limit, without cutting it in any way. And moving the start of the string to the line following errhint( looks very odd to me. OK, updated patch attached. I agree with Alvaro's complaint that moving the whole string literal to the next line isn't conformant to our usual coding style. Definitely nitpicky, but why would you do it like that? I remember now why I added \n. I am used to writing pg_upgrade output strings, which are obviously not sent to log files. Seems I forgot that distinction. As far as moving the string to the next line, I was trying to keep the line from getting too long. Updated patch has everyone on the same line. I am fine with nitpicky. Frankly, I have applied so many patches in the past few weeks, I am glad someone is watching. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c new file mode 100644 index 775d71f..9a0f92c *** a/src/backend/utils/init/miscinit.c --- b/src/backend/utils/init/miscinit.c *** CreateLockFile(const char *filename, boo *** 766,771 --- 766,779 filename))); close(fd); + if (len == 0) + { + ereport(FATAL, + (errcode(ERRCODE_LOCK_FILE_EXISTS), + errmsg(lock file \%s\ is empty, filename), + errhint(Either another server is starting, or the lock file is the remnant of a previous server startup crash.))); + } + buffer[len] = '\0'; encoded_pid = atoi(buffer); diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c new file mode 100644 index af8d8b2..81ba39e *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *** get_pgpid(void) *** 292,299 } if (fscanf(pidf, %ld, pid) != 1) { ! write_stderr(_(%s: invalid data in PID file \%s\\n), ! progname, pid_file); exit(1); } fclose(pidf); --- 292,304 } if (fscanf(pidf, %ld, pid) != 1) { ! /* Is the file empty? */ ! if (ftell(pidf) == 0 feof(pidf)) ! write_stderr(_(%s: the PID file \%s\ is empty\n), ! progname, pid_file); ! else ! write_stderr(_(%s: invalid data in PID file \%s\\n), ! progname, pid_file); exit(1); } fclose(pidf); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Mon, Aug 27, 2012 at 10:17:43PM -0400, Bruce Momjian wrote: Seems pg_ctl would also need some cleanup if we change the error message and/or timing. I am thinking we should just change the error message in the postmaster and pg_ctl to say the file is empty, and call it done (no hint message). If we do want a hint, say that either the file is stale from a crash or another postmaster is starting up, and let the user diagnose it. Updated patch attached which just reports the file as empty. I assume we don't want the extra text output for pg_ctl like we do for the backend. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c new file mode 100644 index 775d71f..9f0a58a *** a/src/backend/utils/init/miscinit.c --- b/src/backend/utils/init/miscinit.c *** CreateLockFile(const char *filename, boo *** 766,771 --- 766,781 filename))); close(fd); + if (len == 0) + { + ereport(FATAL, + (errcode(ERRCODE_LOCK_FILE_EXISTS), + errmsg(lock file \%s\ is empty, filename), + errhint( + Either another server is starting, or the lock file is the remnant\n + of a previous server startup crash.))); + } + buffer[len] = '\0'; encoded_pid = atoi(buffer); diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c new file mode 100644 index af8d8b2..0be0f2d *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *** get_pgpid(void) *** 292,299 } if (fscanf(pidf, %ld, pid) != 1) { ! write_stderr(_(%s: invalid data in PID file \%s\\n), progname, pid_file); exit(1); } fclose(pidf); --- 292,304 } if (fscanf(pidf, %ld, pid) != 1) { ! /* Is the file empty? */ ! if (feof(pidf)) ! write_stderr(_(%s: the PID file \%s\ is empty\n), progname, pid_file); + else + write_stderr(_(%s: invalid data in PID file \%s\\n), + progname, pid_file); exit(1); } fclose(pidf); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Bruce Momjian br...@momjian.us writes: Updated patch attached which just reports the file as empty. I assume we don't want the extra text output for pg_ctl like we do for the backend. The backend side of this looks mostly sane to me (but drop the \n, messages are not supposed to contain those). But the feof test proposed for pg_ctl is no good: consider a file containing just, say, -. fscanf would eat the -, then hit eof, and this would complain the file is empty. Possibly checking for ftell(pidf) == 0 would do, though I'm not sure whether it's portable to assume fscanf would eat a non-numeric character before complaining. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Tue, Aug 28, 2012 at 04:25:36PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: Updated patch attached which just reports the file as empty. I assume we don't want the extra text output for pg_ctl like we do for the backend. The backend side of this looks mostly sane to me (but drop the \n, messages are not supposed to contain those). But the feof test proposed Removed. I thought we needed to add \n so that strings 80 would wrap properly. How do we handle this? for pg_ctl is no good: consider a file containing just, say, -. fscanf would eat the -, then hit eof, and this would complain the file is empty. Possibly checking for ftell(pidf) == 0 would do, though I'm not sure whether it's portable to assume fscanf would eat a non-numeric character before complaining. ftell() seems to work fine when combined with feof(), so I used that in the attached patch. ftell() alone remains at zero if the file contains A, so feof() is also needed. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c new file mode 100644 index 775d71f..eadfcbf *** a/src/backend/utils/init/miscinit.c --- b/src/backend/utils/init/miscinit.c *** CreateLockFile(const char *filename, boo *** 766,771 --- 766,781 filename))); close(fd); + if (len == 0) + { + ereport(FATAL, + (errcode(ERRCODE_LOCK_FILE_EXISTS), + errmsg(lock file \%s\ is empty, filename), + errhint( + Either another server is starting, or the lock file is the remnant + of a previous server startup crash.))); + } + buffer[len] = '\0'; encoded_pid = atoi(buffer); diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c new file mode 100644 index af8d8b2..81ba39e *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *** get_pgpid(void) *** 292,299 } if (fscanf(pidf, %ld, pid) != 1) { ! write_stderr(_(%s: invalid data in PID file \%s\\n), ! progname, pid_file); exit(1); } fclose(pidf); --- 292,304 } if (fscanf(pidf, %ld, pid) != 1) { ! /* Is the file empty? */ ! if (ftell(pidf) == 0 feof(pidf)) ! write_stderr(_(%s: the PID file \%s\ is empty\n), ! progname, pid_file); ! else ! write_stderr(_(%s: invalid data in PID file \%s\\n), ! progname, pid_file); exit(1); } fclose(pidf); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Bruce Momjian br...@momjian.us writes: On Tue, Aug 28, 2012 at 04:25:36PM -0400, Tom Lane wrote: The backend side of this looks mostly sane to me (but drop the \n, messages are not supposed to contain those). But the feof test proposed Removed. I thought we needed to add \n so that strings 80 would wrap properly. How do we handle this? We don't. Per the message style guidelines, it's the responsibility of a client frontend to line-wrap such text if it feels the need to. The backend has no business assuming that 80 characters (or any other number) is where to wrap. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Excerpts from Bruce Momjian's message of mar ago 28 22:21:27 -0400 2012: On Tue, Aug 28, 2012 at 04:25:36PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: Updated patch attached which just reports the file as empty. I assume we don't want the extra text output for pg_ctl like we do for the backend. The backend side of this looks mostly sane to me (but drop the \n, messages are not supposed to contain those). But the feof test proposed Removed. It's a pretty strange line wrap you got in this version of the patch. Normally we just let the string run past the 78 char limit, without cutting it in any way. And moving the start of the string to the line following errhint( looks very odd to me. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Wed, Aug 29, 2012 at 12:24:26AM -0400, Alvaro Herrera wrote: Excerpts from Bruce Momjian's message of mar ago 28 22:21:27 -0400 2012: On Tue, Aug 28, 2012 at 04:25:36PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: Updated patch attached which just reports the file as empty. I assume we don't want the extra text output for pg_ctl like we do for the backend. The backend side of this looks mostly sane to me (but drop the \n, messages are not supposed to contain those). But the feof test proposed Removed. It's a pretty strange line wrap you got in this version of the patch. Normally we just let the string run past the 78 char limit, without cutting it in any way. And moving the start of the string to the line following errhint( looks very odd to me. OK, updated patch attached. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c new file mode 100644 index 775d71f..efd5152 *** a/src/backend/utils/init/miscinit.c --- b/src/backend/utils/init/miscinit.c *** CreateLockFile(const char *filename, boo *** 766,771 --- 766,780 filename))); close(fd); + if (len == 0) + { + ereport(FATAL, + (errcode(ERRCODE_LOCK_FILE_EXISTS), + errmsg(lock file \%s\ is empty, filename), + errhint( + Either another server is starting, or the lock file is the remnant of a previous server startup crash.))); + } + buffer[len] = '\0'; encoded_pid = atoi(buffer); diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c new file mode 100644 index af8d8b2..81ba39e *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *** get_pgpid(void) *** 292,299 } if (fscanf(pidf, %ld, pid) != 1) { ! write_stderr(_(%s: invalid data in PID file \%s\\n), ! progname, pid_file); exit(1); } fclose(pidf); --- 292,304 } if (fscanf(pidf, %ld, pid) != 1) { ! /* Is the file empty? */ ! if (ftell(pidf) == 0 feof(pidf)) ! write_stderr(_(%s: the PID file \%s\ is empty\n), ! progname, pid_file); ! else ! write_stderr(_(%s: invalid data in PID file \%s\\n), ! progname, pid_file); exit(1); } fclose(pidf); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Bruce Momjian br...@momjian.us writes: On Wed, Aug 29, 2012 at 12:24:26AM -0400, Alvaro Herrera wrote: It's a pretty strange line wrap you got in this version of the patch. Normally we just let the string run past the 78 char limit, without cutting it in any way. And moving the start of the string to the line following errhint( looks very odd to me. OK, updated patch attached. I agree with Alvaro's complaint that moving the whole string literal to the next line isn't conformant to our usual coding style. Definitely nitpicky, but why would you do it like that? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Bruce Momjian br...@momjian.us writes: I have developed the attached patch to report a zero-length file, as you suggested. DIRECTORY_LOCK_FILE is entirely incorrect there. Taking a step back, I don't think this message is much better than the existing behavior of reporting bogus data. Either way, it's not obvious to typical users what the problem is or what to do about it. If we're going to emit a special message I think it should be more user friendly than this. Perhaps something like: FATAL: lock file foo is empty HINT: This may mean that another postmaster was starting at the same time. If not, remove the lock file and try again. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Mon, Aug 27, 2012 at 4:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: I have developed the attached patch to report a zero-length file, as you suggested. DIRECTORY_LOCK_FILE is entirely incorrect there. Taking a step back, I don't think this message is much better than the existing behavior of reporting bogus data. Either way, it's not obvious to typical users what the problem is or what to do about it. If we're going to emit a special message I think it should be more user friendly than this. Perhaps something like: FATAL: lock file foo is empty HINT: This may mean that another postmaster was starting at the same time. If not, remove the lock file and try again. The problem with this is that it gives the customer only one remedy, which they will (if experience is any guide) try whether it is actually correct to do so or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Excerpts from Robert Haas's message of lun ago 27 18:02:06 -0400 2012: On Mon, Aug 27, 2012 at 4:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: I have developed the attached patch to report a zero-length file, as you suggested. DIRECTORY_LOCK_FILE is entirely incorrect there. Taking a step back, I don't think this message is much better than the existing behavior of reporting bogus data. Either way, it's not obvious to typical users what the problem is or what to do about it. If we're going to emit a special message I think it should be more user friendly than this. Perhaps something like: FATAL: lock file foo is empty HINT: This may mean that another postmaster was starting at the same time. If not, remove the lock file and try again. The problem with this is that it gives the customer only one remedy, which they will (if experience is any guide) try whether it is actually correct to do so or not. How about having it sleep for a short while, then try again? I would expect that it would cause the second postmaster to fail during the second try, which is okay because the first one is then operational. The problem, of course, is how long to sleep so that this doesn't fail when load is high enough that the first postmaster still hasn't written the file after the sleep. Maybe LOG: lock file foo is empty, sleeping to retry -- sleep 100ms and recheck LOG: lock file foo is empty, sleeping to retry -- sleep, dunno, 1s, recheck LOG: lock file foo is empty, sleeping to retry -- sleep maybe 5s? recheck FATAL: lock file foo is empty HINT: Is another postmaster running on data directory bar? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Robert Haas robertmh...@gmail.com writes: On Mon, Aug 27, 2012 at 4:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Perhaps something like: FATAL: lock file foo is empty HINT: This may mean that another postmaster was starting at the same time. If not, remove the lock file and try again. The problem with this is that it gives the customer only one remedy, which they will (if experience is any guide) try whether it is actually correct to do so or not. Which other remedy(s) do you think the hint should suggest? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Alvaro Herrera alvhe...@2ndquadrant.com writes: How about having it sleep for a short while, then try again? I could get behind that, but I don't think the delay should be more than 100ms or so. It's important for the postmaster to acquire the lock (or not) pretty quickly, or pg_ctl is going to get confused. If we keep it short, we can also dispense with the log spam you were suggesting. (Actually, I wonder if this type of scenario isn't going to confuse pg_ctl already --- it might think the lockfile belongs to the postmaster *it* started, not some pre-existing one. Does that matter?) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Mon, Aug 27, 2012 at 07:39:35PM -0400, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: How about having it sleep for a short while, then try again? I could get behind that, but I don't think the delay should be more than 100ms or so. It's important for the postmaster to acquire the lock (or not) pretty quickly, or pg_ctl is going to get confused. If we keep it short, we can also dispense with the log spam you were suggesting. (Actually, I wonder if this type of scenario isn't going to confuse pg_ctl already --- it might think the lockfile belongs to the postmaster *it* started, not some pre-existing one. Does that matter?) I took Alvaro's approach of a sleep. The file test was already in a loop that went 100 times. Basically, if the lock file exists, this postmaster isn't going to succeed, so I figured there is no reason to rush in the testing. I gave it 5 tries with one second between attempts. Either the file is being populated, or it is stale and empty. I checked pg_ctl and that has a default wait of 60 second, so 5 seconds to exit out of the postmaster should be fine. Patch attached. FYI, I noticed we have a similar 5-second creation time requirement in pg_ctl: /* * The postmaster should create postmaster.pid very soon after being * started. If it's not there after we've waited 5 or more seconds, * assume startup failed and give up waiting. (Note this covers both * cases where the pidfile was never created, and where it was created * and then removed during postmaster exit.) Also, if there *is* a * file there but it appears stale, issue a suitable warning and give * up waiting. */ if (i = 5) This is for the case where the file has an old pid, rather than it is empty. FYI, I fixed the filename problem Tom found. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c new file mode 100644 index 775d71f..0309494 *** a/src/backend/utils/init/miscinit.c --- b/src/backend/utils/init/miscinit.c *** CreateLockFile(const char *filename, boo *** 766,771 --- 766,793 filename))); close(fd); + if (len == 0) + { + /* + * An empty lock file exits; either is it from another postmaster + * that is still starting up, or left from a crash. Check for + * five seconds, then if it still empty, it must be from a crash, + * so fail and recommend lock file removal. + */ + if (ntries 5) + { + sleep(1); + continue; + } + else + ereport(FATAL, + (errcode(ERRCODE_LOCK_FILE_EXISTS), + errmsg(lock file \%s\ is empty, filename), + errhint( + Empty lock file probably left from operating system crash during\n + database startup; file deletion suggested.))); + } + buffer[len] = '\0'; encoded_pid = atoi(buffer); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Bruce Momjian br...@momjian.us writes: On Mon, Aug 27, 2012 at 07:39:35PM -0400, Tom Lane wrote: I could get behind that, but I don't think the delay should be more than 100ms or so. I took Alvaro's approach of a sleep. The file test was already in a loop that went 100 times. Basically, if the lock file exists, this postmaster isn't going to succeed, so I figured there is no reason to rush in the testing. I gave it 5 tries with one second between attempts. Either the file is being populated, or it is stale and empty. How did 100ms translate to 5 seconds? I checked pg_ctl and that has a default wait of 60 second, so 5 seconds to exit out of the postmaster should be fine. pg_ctl is not the only consideration here. In particular, there are a lot of initscripts out there (all of Red Hat's, for instance) that don't use pg_ctl and expect the postmaster to come up (or not) in a couple of seconds. I don't see a need for more than about one retry with 100ms delay. There is no evidence that the case we're worried about has ever occurred in the real world anyway, so slowing down error failures to make really really really sure there's not a competing postmaster doesn't seem like a good tradeoff. I'm not terribly impressed with that errhint, either. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Mon, Aug 27, 2012 at 09:59:10PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Mon, Aug 27, 2012 at 07:39:35PM -0400, Tom Lane wrote: I could get behind that, but I don't think the delay should be more than 100ms or so. I took Alvaro's approach of a sleep. The file test was already in a loop that went 100 times. Basically, if the lock file exists, this postmaster isn't going to succeed, so I figured there is no reason to rush in the testing. I gave it 5 tries with one second between attempts. Either the file is being populated, or it is stale and empty. How did 100ms translate to 5 seconds? That was the no need to rush, let's just be sure of what we report. I checked pg_ctl and that has a default wait of 60 second, so 5 seconds to exit out of the postmaster should be fine. pg_ctl is not the only consideration here. In particular, there are a lot of initscripts out there (all of Red Hat's, for instance) that don't use pg_ctl and expect the postmaster to come up (or not) in a couple of seconds. I don't see a need for more than about one retry with 100ms delay. There is no evidence that the case we're worried about has ever occurred in the real world anyway, so slowing down error failures to make really really really sure there's not a competing postmaster doesn't seem like a good tradeoff. I'm not terribly impressed with that errhint, either. I am concerned at 100ms that we can't be sure if it is still being created, and if we can't be sure, I am not sure there is much point in trying to clarify the odd error message we omit. FYI, here is what the code does now with a zero-length pid file, with my patch: $ postmaster [ wait 5 seconds ] FATAL: lock file postmaster.pid is empty HINT: Empty lock file probably left from operating system crash during database startup; file deletion suggested. $ pg_ctl start pg_ctl: invalid data in PID file /u/pgsql/data/postmaster.pid $ pg_ctl -w start pg_ctl: invalid data in PID file /u/pgsql/data/postmaster.pid Seems pg_ctl would also need some cleanup if we change the error message and/or timing. I am thinking we should just change the error message in the postmaster and pg_ctl to say the file is empty, and call it done (no hint message). If we do want a hint, say that either the file is stale from a crash or another postmaster is starting up, and let the user diagnose it. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Fri, Jan 6, 2012 at 08:28:48AM -0500, Michael Beattie wrote: On Fri, Jan 6, 2012 at 6:13 AM, Magnus Hagander mag...@hagander.net wrote: On Thu, Jan 5, 2012 at 23:19, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Thu, Jan 5, 2012 at 17:13, Tom Lane t...@sss.pgh.pa.us wrote: I think link(2) would create race conditions of its own. I'd be inclined to suggest that maybe we should just special-case a zero length postmaster.pid file as meaning okay to proceed. In general, garbage That's pretty much what I meant - but with a warning message. Actually ... wait a minute. If we allow that, don't we create a race condition between two postmasters starting at almost the same instant? The second one could see the lock file when the first has created but not yet filled it. Good point, yeah, it should do that. But I still think it's rare enough that just special-casing the error message should be enough... so just something that does like stat(filename, st); size = st.st_size; if (size == 0) elog(ERROR, lock file \%s\ has 0 length., filename); somewhere in CreateLockFile in miscinit.c? I have developed the attached patch to report a zero-length file, as you suggested. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c new file mode 100644 index 775d71f..7b7c141 *** a/src/backend/utils/init/miscinit.c --- b/src/backend/utils/init/miscinit.c *** CreateLockFile(const char *filename, boo *** 765,770 --- 765,772 errmsg(could not read lock file \%s\: %m, filename))); close(fd); + if (len == 0) + elog(FATAL, lock file \%s\ is empty, DIRECTORY_LOCK_FILE); buffer[len] = '\0'; encoded_pid = atoi(buffer); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Thu, Jan 5, 2012 at 23:19, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Thu, Jan 5, 2012 at 17:13, Tom Lane t...@sss.pgh.pa.us wrote: I think link(2) would create race conditions of its own. I'd be inclined to suggest that maybe we should just special-case a zero length postmaster.pid file as meaning okay to proceed. In general, garbage That's pretty much what I meant - but with a warning message. Actually ... wait a minute. If we allow that, don't we create a race condition between two postmasters starting at almost the same instant? The second one could see the lock file when the first has created but not yet filled it. Good point, yeah, it should do that. But I still think it's rare enough that just special-casing the error message should be enough... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Fri, Jan 6, 2012 at 6:13 AM, Magnus Hagander mag...@hagander.net wrote: On Thu, Jan 5, 2012 at 23:19, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Thu, Jan 5, 2012 at 17:13, Tom Lane t...@sss.pgh.pa.us wrote: I think link(2) would create race conditions of its own. I'd be inclined to suggest that maybe we should just special-case a zero length postmaster.pid file as meaning okay to proceed. In general, garbage That's pretty much what I meant - but with a warning message. Actually ... wait a minute. If we allow that, don't we create a race condition between two postmasters starting at almost the same instant? The second one could see the lock file when the first has created but not yet filled it. Good point, yeah, it should do that. But I still think it's rare enough that just special-casing the error message should be enough... so just something that does like stat(filename, st); size = st.st_size; if (size == 0) elog(ERROR, lock file \%s\ has 0 length., filename); somewhere in CreateLockFile in miscinit.c?
[HACKERS] FATAL: bogus data in lock file postmaster.pid:
My laptop ran out of battery and turned itself off while I was just starting up postmaster. After plugging in the charger and rebooting, I got the following error when I tried to restart PostgreSQL: FATAL: bogus data in lock file postmaster.pid: postmaster.pid file was present in the data directory, but had zero length. Looking at the way the file is created and written, that can happen if you crash after the file is created, but before it's written/fsync'd (my laptop might have write-cache enabled, which would make the window larger). I was a bit surprised by that. That's probably not a big deal in practice, but I wonder if there was some easy way to avoid that. First I thought we could create the new postmaster.pid file with a temporary name and rename it in place, but rename(2) will merrily overwrite any existing file which is not what we want. We could use link(2), I guess. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Thu, Jan 5, 2012 at 14:18, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: My laptop ran out of battery and turned itself off while I was just starting up postmaster. After plugging in the charger and rebooting, I got the following error when I tried to restart PostgreSQL: FATAL: bogus data in lock file postmaster.pid: postmaster.pid file was present in the data directory, but had zero length. Looking at the way the file is created and written, that can happen if you crash after the file is created, but before it's written/fsync'd (my laptop might have write-cache enabled, which would make the window larger). I was a bit surprised by that. That's probably not a big deal in practice, but I wonder if there was some easy way to avoid that. First I thought we could create the new postmaster.pid file with a temporary name and rename it in place, but rename(2) will merrily overwrite any existing file which is not what we want. We could use link(2), I guess. Is this really a problem big enough to spend even that much effort on? Perhaps a special-case in the error message instead is enough? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Thu, Jan 5, 2012 at 8:23 AM, Magnus Hagander mag...@hagander.net wrote: On Thu, Jan 5, 2012 at 14:18, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: My laptop ran out of battery and turned itself off while I was just starting up postmaster. After plugging in the charger and rebooting, I got the following error when I tried to restart PostgreSQL: FATAL: bogus data in lock file postmaster.pid: postmaster.pid file was present in the data directory, but had zero length. Looking at the way the file is created and written, that can happen if you crash after the file is created, but before it's written/fsync'd (my laptop might have write-cache enabled, which would make the window larger). I was a bit surprised by that. That's probably not a big deal in practice, but I wonder if there was some easy way to avoid that. First I thought we could create the new postmaster.pid file with a temporary name and rename it in place, but rename(2) will merrily overwrite any existing file which is not what we want. We could use link(2), I guess. Is this really a problem big enough to spend even that much effort on? Perhaps a special-case in the error message instead is enough? On a laptop it's not a big deal, but it sort of sucks if your production database fails to start automatically after an unexpected power outage. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: My laptop ran out of battery and turned itself off while I was just starting up postmaster. After plugging in the charger and rebooting, I got the following error when I tried to restart PostgreSQL: FATAL: bogus data in lock file postmaster.pid: postmaster.pid file was present in the data directory, but had zero length. Looking at the way the file is created and written, that can happen if you crash after the file is created, but before it's written/fsync'd (my laptop might have write-cache enabled, which would make the window larger). I was a bit surprised by that. That's probably not a big deal in practice, but I wonder if there was some easy way to avoid that. First I thought we could create the new postmaster.pid file with a temporary name and rename it in place, but rename(2) will merrily overwrite any existing file which is not what we want. We could use link(2), I guess. I think link(2) would create race conditions of its own. I'd be inclined to suggest that maybe we should just special-case a zero length postmaster.pid file as meaning okay to proceed. In general, garbage data in postmaster.pid is something I'm happy to insist on manual recovery from, but maybe we could safely make an exception for empty files. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Thu, Jan 5, 2012 at 17:13, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: My laptop ran out of battery and turned itself off while I was just starting up postmaster. After plugging in the charger and rebooting, I got the following error when I tried to restart PostgreSQL: FATAL: bogus data in lock file postmaster.pid: postmaster.pid file was present in the data directory, but had zero length. Looking at the way the file is created and written, that can happen if you crash after the file is created, but before it's written/fsync'd (my laptop might have write-cache enabled, which would make the window larger). I was a bit surprised by that. That's probably not a big deal in practice, but I wonder if there was some easy way to avoid that. First I thought we could create the new postmaster.pid file with a temporary name and rename it in place, but rename(2) will merrily overwrite any existing file which is not what we want. We could use link(2), I guess. I think link(2) would create race conditions of its own. I'd be inclined to suggest that maybe we should just special-case a zero length postmaster.pid file as meaning okay to proceed. In general, garbage That's pretty much what I meant - but with a warning message. data in postmaster.pid is something I'm happy to insist on manual recovery from, but maybe we could safely make an exception for empty files. +1. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
Magnus Hagander mag...@hagander.net writes: On Thu, Jan 5, 2012 at 17:13, Tom Lane t...@sss.pgh.pa.us wrote: I think link(2) would create race conditions of its own. I'd be inclined to suggest that maybe we should just special-case a zero length postmaster.pid file as meaning okay to proceed. In general, garbage That's pretty much what I meant - but with a warning message. Actually ... wait a minute. If we allow that, don't we create a race condition between two postmasters starting at almost the same instant? The second one could see the lock file when the first has created but not yet filled it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers