Re: RFC: Add 'taint' field to pg_control.
On Wed, Feb 28, 2018 at 04:16:53PM -0600, Justin Pryzby wrote:
> On Wed, Feb 28, 2018 at 01:43:11PM -0800, Andres Freund wrote:
> > a significant number of times during investigations of bugs I wondered
> > whether running the cluster with various settings, or various tools
> > could've caused the issue at hand. Therefore I'd like to propose adding
> > a 'tainted' field to pg_control, that contains some of the "history" of
> > the cluster. Individual bits inside that field that I can think of right
> > now are:
> > - pg_resetxlog was used non-passively on cluster
> > - ran with fsync=off
> > - ran with full_page_writes=off
> > - pg_upgrade was used
> + cluster was downgraded (between minor releases)

Dredging up a years-old thread since I thought that was worth adding..

-- Justin
Re: RFC: Add 'taint' field to pg_control.
On 2018-03-07 23:34:37 -0500, Tom Lane wrote:
> Craig Ringer writes:
> > As I understand it, because we allow multiple Pg instances on a system, we
> > identify the small sysv shmem segment we use by the postmaster's pid. If
> > you remove the DirLockFile (postmaster.pid) you remove the interlock
> > against starting a new postmaster. It'll think it's a new independent
> > instance on the same host, make a new shmem segment and go merrily on its
> > way mangling data horribly.
>
> Yeah. If we realized that the old shmem segment was associated with this
> data directory, we could check for processes still attached to it ... but
> the lock file is exactly where that association is kept.

I'd somehow remembered that we just took the path as the identifier, but that's wrong...

Greetings, Andres Freund
Re: RFC: Add 'taint' field to pg_control.
On 8 March 2018 at 12:34, Tom Lane wrote:
> Craig Ringer writes:
> > As I understand it, because we allow multiple Pg instances on a system, we
> > identify the small sysv shmem segment we use by the postmaster's pid. If
> > you remove the DirLockFile (postmaster.pid) you remove the interlock
> > against starting a new postmaster. It'll think it's a new independent
> > instance on the same host, make a new shmem segment and go merrily on its
> > way mangling data horribly.
>
> Yeah. If we realized that the old shmem segment was associated with this
> data directory, we could check for processes still attached to it ... but
> the lock file is exactly where that association is kept.
>
> > It'd be nice if the OS offered us some support here. Something like opening
> > a lockfile in exclusive lock mode, then inheriting the FD and lock on all
> > children, with each child inheriting the lock. So the exclusive lock
> > wouldn't get released until all FDs associated with it are released. But
> > AFAIK nothing like that is present, let alone portable.
>
> I've wondered about that too. An inheritable lock on the data directory
> itself would be ideal, but that doesn't exist anywhere AFAIK. We could
> imagine taking a BSD-style flock(2) lock on some lock file, on systems that
> have that; but not all do, so it couldn't be our only protection. Another
> problem is that since it'd have to be an ordinary file, there'd still be a
> risk of cowboy DBAs removing the lock file. Maybe we could use pg_control
> as the lock file?

"The error mentioned that there was another active database using this pg_control file, so I deleted it, and now my database won't start."

Like our discussion about pg_resetxlog, there's a limit to how much operator error we can guard against. More importantly, we'd have to be _very_ sure it was correct, or our attempts to protect the user would risk creating a major outage.

That said, flock(2) with LOCK_NB seems to have just the semantics we want:

"A single file may not simultaneously have both shared and exclusive locks.

Locks created by flock() are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lock, and this lock may be modified or released using any of these file descriptors. Furthermore, the lock is released either by an explicit LOCK_UN operation on any of these duplicate file descriptors, or when all such file descriptors have been closed."

Worth looking into, but really important not to make a mistake and make things worse.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: RFC: Add 'taint' field to pg_control.
Craig Ringer writes:
> As I understand it, because we allow multiple Pg instances on a system, we
> identify the small sysv shmem segment we use by the postmaster's pid. If
> you remove the DirLockFile (postmaster.pid) you remove the interlock
> against starting a new postmaster. It'll think it's a new independent
> instance on the same host, make a new shmem segment and go merrily on its
> way mangling data horribly.

Yeah. If we realized that the old shmem segment was associated with this data directory, we could check for processes still attached to it ... but the lock file is exactly where that association is kept.

> It'd be nice if the OS offered us some support here. Something like opening
> a lockfile in exclusive lock mode, then inheriting the FD and lock on all
> children, with each child inheriting the lock. So the exclusive lock
> wouldn't get released until all FDs associated with it are released. But
> AFAIK nothing like that is present, let alone portable.

I've wondered about that too. An inheritable lock on the data directory itself would be ideal, but that doesn't exist anywhere AFAIK. We could imagine taking a BSD-style flock(2) lock on some lock file, on systems that have that; but not all do, so it couldn't be our only protection. Another problem is that since it'd have to be an ordinary file, there'd still be a risk of cowboy DBAs removing the lock file. Maybe we could use pg_control as the lock file?

regards, tom lane
Re: RFC: Add 'taint' field to pg_control.
On 8 March 2018 at 10:18, Andres Freund wrote:
> On March 7, 2018 5:51:29 PM PST, Craig Ringer wrote:
> >My favourite remains an organisation that kept "fixing" an issue by kill
> >-9'ing the postmaster and removing postmaster.pid to make it start up
> >again. Without killing all the leftover backends. Of course, the system
> >kept getting more unstable and broken, so they did it more and more often.
> >They were working on scripting it when they gave up and asked for help.
>
> Maybe I'm missing something, but that ought to not work. The shmem segment
> that we keep around would be a conflict, no?

As I understand it, because we allow multiple Pg instances on a system, we identify the small sysv shmem segment we use by the postmaster's pid. If you remove the DirLockFile (postmaster.pid) you remove the interlock against starting a new postmaster. It'll think it's a new independent instance on the same host, make a new shmem segment and go merrily on its way mangling data horribly.

See CreateLockFile(). Also 7e2a18a9161. In particular src/backend/utils/init/miscinit.c +938:

    if (isDDLock)
    {
        if (PGSharedMemoryIsInUse(id1, id2))
            ereport(FATAL,
                    (errcode(ERRCODE_LOCK_FILE_EXISTS),
                     errmsg("pre-existing shared memory block "
                            "(key %lu, ID %lu) is still in use",
                            id1, id2),
                     errhint("If you're sure there are no old "
                             "server processes still running, remove "
                             "the shared memory block "
                             "or just delete the file \"%s\".",
                             filename)));
    }

I still think that error is a bit optimistic, and should really say "make very sure there are no 'postgres' processes associated with this data directory, then ...".

It'd be nice if the OS offered us some support here. Something like opening a lockfile in exclusive lock mode, then inheriting the FD and lock on all children, with each child inheriting the lock. So the exclusive lock wouldn't get released until all FDs associated with it are released. But AFAIK nothing like that is present, let alone portable.
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: RFC: Add 'taint' field to pg_control.
On March 7, 2018 5:51:29 PM PST, Craig Ringer wrote:
>My favourite remains an organisation that kept "fixing" an issue by kill
>-9'ing the postmaster and removing postmaster.pid to make it start up
>again. Without killing all the leftover backends. Of course, the system
>kept getting more unstable and broken, so they did it more and more often.
>They were working on scripting it when they gave up and asked for help.

Maybe I'm missing something, but that ought to not work. The shmem segment that we keep around would be a conflict, no?

Andres

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: RFC: Add 'taint' field to pg_control.
On 8 March 2018 at 04:58, Robert Haas wrote:
> On Wed, Feb 28, 2018 at 8:03 PM, Craig Ringer wrote:
> > A huge +1 from me for the idea. I can't even count the number of black box
> > "WTF did you DO?!?" servers I've looked at, where bizarre behaviour has
> > turned out to be down to the user doing something very silly and not saying
> > anything about it.
>
> +1 from me, too.

My favourite remains an organisation that kept "fixing" an issue by kill -9'ing the postmaster and removing postmaster.pid to make it start up again. Without killing all the leftover backends. Of course, the system kept getting more unstable and broken, so they did it more and more often. They were working on scripting it when they gave up and asked for help.

The data recovery effort on that one was truly exciting. I remember looking at bash history and having to take a short break to figure out how on earth to communicate what was going on.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: RFC: Add 'taint' field to pg_control.
On Wed, Feb 28, 2018 at 8:03 PM, Craig Ringer wrote:
> A huge +1 from me for the idea. I can't even count the number of black box
> "WTF did you DO?!?" servers I've looked at, where bizarre behaviour has
> turned out to be down to the user doing something very silly and not saying
> anything about it.

+1 from me, too.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: RFC: Add 'taint' field to pg_control.
On Thu, Mar 01, 2018 at 09:12:18AM +0800, Craig Ringer wrote:
> On 1 March 2018 at 06:28, Justin Pryzby wrote:
> > The more fine grained these are the more useful they can be:
> >
> > Running with fsync=off is common advice while loading, so reporting that
> > "fsync=off at some point" is much less useful than reporting "having to run
> > recovery from an fsync=off database".
> >
> > I propose there could be two bits:
> >
> > 1. fsync was off at some point in history when the DB needed recovery;
> > 2. fsync was off during the current instance; I assume the bit would be set
> >    whenever fsync=off was set, but could be cleared when the server was
> >    cleanly shut down. If the bit is set during recovery, the first bit would
> >    be permanently set. Not sure, but if this bit is set and then fsync is
> >    re-enabled, maybe this could be cleared after any checkpoint and not just
> >    at shutdown?
>
> I think that's a very useful way to eliminate false positives and make this
> diagnostic field more useful. But we'd need to guarantee that when you've
> switched from fsync=off to fsync=on, we've actually fsync'd every extent of
> every table and index, whether or not we think they're currently dirty and
> whether or not the smgr currently has them open. Do something like initdb
> does to flush everything. Without that, we could have WAL-vs-heap ordering
> issues and so on *even after a Pg restart* if the system has lots of dirty
> writeback to do.

I think a similar "two bit" scheme *could* be applied to full_page_writes, too, and the second "soft" bit for FPW could also be cleared after a checkpoint. I'd think that for FPW, previous checkpoints which had fsync=off wouldn't need special handling - so long as FPW=on, the "soft" bit can be cleared after a successful checkpoint (as a micro-optimization, don't clear the "soft" bit if the "hard" bit is set).

BTW, this scheme has the amusing consequence that setting fsync=off should involve first writing+fsyncing these bits to pg_control.

Justin
Re: RFC: Add 'taint' field to pg_control.
On 1 March 2018 at 09:00, Justin Pryzby wrote:
> > > - started in single user mode or with system indices disabled?
> > why?
>
> Some of these I suggested just as a datapoint (or other brainstorms I couldn't
> immediately reject). A cluster where someone has UPDATEd pg_* (even
> pg_statistic) or otherwise hacked on I would tend to think about differently
> than a "pristine" cluster that's never seen anything more interesting than a
> join.

How about "has run in a state where system catalog modifications are permitted"? Because I've had one seriously frustrating case where that would've been extremely pertinent information.

That said, I'm not convinced it's worth the effort, and I think this is going off into the weeds a little. Let's focus on Andres's core suggestion (and maybe refine the fsync part to get rid of false positives due to bulk loading) and build on from there in subsequent work.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: RFC: Add 'taint' field to pg_control.
On 1 March 2018 at 06:28, Justin Pryzby wrote:
> On Wed, Feb 28, 2018 at 02:18:12PM -0800, Andres Freund wrote:
> > On 2018-02-28 17:14:18 -0500, Peter Eisentraut wrote:
> > > I can see why you'd want that, but as a DBA, I don't necessarily want
> > > all of that recorded, especially in a quasi-permanent way.
> >
> > Huh? You're arguing that we should make it easier for DBAs to hide
> > potential causes of corruption? I fail to see when that's necessary /
> > desirable? That a cluster ran with fsync=off isn't an embarrassing thing,
> > nor does it contain private data...
>
> The more fine grained these are the more useful they can be:
>
> Running with fsync=off is common advice while loading, so reporting that
> "fsync=off at some point" is much less useful than reporting "having to run
> recovery from an fsync=off database".
>
> I propose there could be two bits:
>
> 1. fsync was off at some point in history when the DB needed recovery;
> 2. fsync was off during the current instance; I assume the bit would be set
>    whenever fsync=off was set, but could be cleared when the server was
>    cleanly shut down. If the bit is set during recovery, the first bit would
>    be permanently set. Not sure, but if this bit is set and then fsync is
>    re-enabled, maybe this could be cleared after any checkpoint and not just
>    at shutdown?

I think that's a very useful way to eliminate false positives and make this diagnostic field more useful. But we'd need to guarantee that when you've switched from fsync=off to fsync=on, we've actually fsync'd every extent of every table and index, whether or not we think they're currently dirty and whether or not the smgr currently has them open. Do something like initdb does to flush everything. Without that, we could have WAL-vs-heap ordering issues and so on *even after a Pg restart* if the system has lots of dirty writeback to do.
A quick look at CheckPointGuts and CheckPointBuffers suggests we don't presently do that but I admit I haven't read in depth. We force a fsync of the current WAL segment when switching fsync on, but don't do anything else AFAICS. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: RFC: Add 'taint' field to pg_control.
On 1 March 2018 at 05:43, Andres Freund wrote:
> Hi,
>
> a significant number of times during investigations of bugs I wondered
> whether running the cluster with various settings, or various tools
> could've caused the issue at hand. Therefore I'd like to propose adding
> a 'tainted' field to pg_control, that contains some of the "history" of
> the cluster. Individual bits inside that field that I can think of right
> now are:
> - pg_resetxlog was used non-passively on cluster
> - ran with fsync=off
> - ran with full_page_writes=off
> - pg_upgrade was used
>
> What do others think?

A huge +1 from me for the idea. I can't even count the number of black box "WTF did you DO?!?" servers I've looked at, where bizarre behaviour has turned out to be down to the user doing something very silly and not saying anything about it.

It's only some flags, so putting it in pg_control is arguably somewhat wasteful, but so minor as to be of no real concern. And that's probably the best way to make sure it follows the cluster around no matter what backup/restore/copy mechanisms are used and how "clever" they try to be.

What I'd _really_ love would be to blow the scope of this up a bit and turn it into a key-events cluster journal, recording key param switches, recoveries (and lsn ranges), pg_upgrades, etc. But then we'd run into people with weird workloads who managed to make it some massive file, we'd have to make sure we had a way to stop it getting left out of copies/backups, and it'd generally be irritating. So let's not do that. Proper support for class-based logging and multiple outputs would be a good solution for this at some future point. What you propose is simple enough to be quick to implement, adds no admin overhead, and will be plenty useful enough.

I'd like to add "postmaster.pid was absent when the cluster started" to this list, please. Sure, it's not conclusive, and there are legit reasons why that might be the case, but so often it's somebody kill -9'ing the postmaster, then removing the postmaster.pid and starting up again without killing the workers.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: RFC: Add 'taint' field to pg_control.
On Wed, Feb 28, 2018 at 02:23:19PM -0800, Andres Freund wrote:
> Hi,
>
> On 2018-02-28 16:16:53 -0600, Justin Pryzby wrote:
> > - did recovery (you could use "needed recovery" instead, but then there's the
> >   question of how reliable that field would be);
> >   + or: timestamp of most recent recovery (attempt?)
> What'd that be useful for?

Theoretically nothing, but conceivably useful if there's an issue with recovery. I recall various things that historically weren't WAL-logged but should have been.

> > - local_preload_libraries?
> Hm?

Not sure; but in any case I meant *_preload_libraries.

> > - started in single user mode or with system indices disabled?
> why?

Some of these I suggested just as a datapoint (or other brainstorms I couldn't immediately reject). A cluster where someone has UPDATEd pg_* (even pg_statistic) or otherwise hacked on I would tend to think about differently than a "pristine" cluster that's never seen anything more interesting than a join.

Justin
Re: RFC: Add 'taint' field to pg_control.
On Wed, Feb 28, 2018 at 02:18:12PM -0800, Andres Freund wrote:
> On 2018-02-28 17:14:18 -0500, Peter Eisentraut wrote:
> > I can see why you'd want that, but as a DBA, I don't necessarily want
> > all of that recorded, especially in a quasi-permanent way.
>
> Huh? You're arguing that we should make it easier for DBAs to hide
> potential causes of corruption? I fail to see when that's necessary /
> desirable? That a cluster ran with fsync=off isn't an embarrassing thing,
> nor does it contain private data...

The more fine grained these are the more useful they can be:

Running with fsync=off is common advice while loading, so reporting that "fsync=off at some point" is much less useful than reporting "having to run recovery from an fsync=off database".

I propose there could be two bits:

1. fsync was off at some point in history when the DB needed recovery;
2. fsync was off during the current instance; I assume the bit would be set
   whenever fsync=off was set, but could be cleared when the server was
   cleanly shut down. If the bit is set during recovery, the first bit would
   be permanently set. Not sure, but if this bit is set and then fsync is
   re-enabled, maybe this could be cleared after any checkpoint and not just
   at shutdown?

Justin
Re: RFC: Add 'taint' field to pg_control.
Hi,

On 2018-02-28 16:16:53 -0600, Justin Pryzby wrote:

Unfortunately your list seems to raise the bar to a place I don't see us going soon :(

> - pg_control versions used on this cluster (hopefully a full list..obviously
>   not going back before PG11);

That needs arbitrarily much space, which isn't feasible for pg_control. It needs to stay <= 512 bytes.

> - did recovery (you could use "needed recovery" instead, but then there's the
>   question of how reliable that field would be);
>   + or: timestamp of most recent recovery (attempt?)

What'd that be useful for?

> - index corruption detected (on system table?); Note that "CREATE IF NOT
>   EXIST" doesn't avoid unique key errors on system tables, so it's not useful
>   to log every ERROR on system tables.

I don't think we have an easy way to diagnose this specifically enough.

> - page %u is uninitialized --- fixing

Doesn't really indicate a bug / problem.

> - here's one I just dug up: ERROR: missing chunk number 0 for toast value
>   10270298 in pg_toast_2619

Hm.

> - XID wraparound?

why?

> - autovacuum disabled?

why?

> - checksum failure (in system relation?)

Hm.

> - local_preload_libraries?

Hm?

> - started in single user mode or with system indices disabled?

why?

> - hit assertion or PANIC ??

We certainly don't write to persistent data in those cases.

> - UPDATE/DELETE/INSERT on system table ? (or maintenance command like
>   vacuum/analyze/cluster/reindex?)

Hm, the former I could see some use for, but it might end up being pretty ugly locking- and layering-wise.

Greetings, Andres Freund
Re: RFC: Add 'taint' field to pg_control.
Hi,

On 2018-02-28 17:14:18 -0500, Peter Eisentraut wrote:
> I can see why you'd want that, but as a DBA, I don't necessarily want
> all of that recorded, especially in a quasi-permanent way.

Huh? You're arguing that we should make it easier for DBAs to hide potential causes of corruption? I fail to see when that's necessary / desirable? That a cluster ran with fsync=off isn't an embarrassing thing, nor does it contain private data...

Greetings, Andres Freund
Re: RFC: Add 'taint' field to pg_control.
On Wed, Feb 28, 2018 at 01:43:11PM -0800, Andres Freund wrote:
> a significant number of times during investigations of bugs I wondered
> whether running the cluster with various settings, or various tools
> could've caused the issue at hand. Therefore I'd like to propose adding
> a 'tainted' field to pg_control, that contains some of the "history" of
> the cluster. Individual bits inside that field that I can think of right
> now are:
> - pg_resetxlog was used non-passively on cluster
> - ran with fsync=off
> - ran with full_page_writes=off
> - pg_upgrade was used

What about:

- pg_control versions used on this cluster (hopefully a full list..obviously
  not going back before PG11);
- did recovery (you could use "needed recovery" instead, but then there's the
  question of how reliable that field would be);
  + or: timestamp of most recent recovery (attempt?)
- index corruption detected (on system table?); Note that "CREATE IF NOT
  EXIST" doesn't avoid unique key errors on system tables, so it's not useful
  to log every ERROR on system tables.
- page %u is uninitialized --- fixing
- here's one I just dug up: ERROR: missing chunk number 0 for toast value
  10270298 in pg_toast_2619
- XID wraparound?
- autovacuum disabled?
- checksum failure (in system relation?)
- local_preload_libraries?
- started in single user mode or with system indices disabled?
- hit assertion or PANIC ??
- UPDATE/DELETE/INSERT on system table ? (or maintenance command like
  vacuum/analyze/cluster/reindex?)

Justin
Re: RFC: Add 'taint' field to pg_control.
On 2018-02-28 23:13:44 +0100, Tomas Vondra wrote:
> On 02/28/2018 10:43 PM, Andres Freund wrote:
> > Hi,
> >
> > a significant number of times during investigations of bugs I wondered
> > whether running the cluster with various settings, or various tools
> > could've caused the issue at hand. Therefore I'd like to propose adding
> > a 'tainted' field to pg_control, that contains some of the "history" of
> > the cluster. Individual bits inside that field that I can think of right
> > now are:
> > - pg_resetxlog was used non-passively on cluster
> > - ran with fsync=off
> > - ran with full_page_writes=off
> > - pg_upgrade was used
> >
> > What do others think?
>
> Isn't the uncertainty usually that you know one of these things happened
> on the cluster, but you don't know if that's the actual root cause?
> That's my experience, at least.

My experience is that you get a bug report and you have no clue what happened with the cluster. Often the "owner" doesn't have one either. Then you go through various theories and end up blaming it on one of these, even though it quite possibly never happened.

Of course this does *not* solve the issue when these actually occurred. It's just a tool to narrow down possible causes / exclude others.

Greetings, Andres Freund
Re: RFC: Add 'taint' field to pg_control.
On 2/28/18 16:43, Andres Freund wrote:
> a significant number of times during investigations of bugs I wondered
> whether running the cluster with various settings, or various tools
> could've caused the issue at hand. Therefore I'd like to propose adding
> a 'tainted' field to pg_control, that contains some of the "history" of
> the cluster. Individual bits inside that field that I can think of right
> now are:
> - pg_resetxlog was used non-passively on cluster
> - ran with fsync=off
> - ran with full_page_writes=off
> - pg_upgrade was used

I can see why you'd want that, but as a DBA, I don't necessarily want all of that recorded, especially in a quasi-permanent way.

Maybe this could be stored in a separate location, like a small text/log-like file in the data directory. Then DBAs can save or keep or remove that information as they wish.

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: RFC: Add 'taint' field to pg_control.
On 02/28/2018 10:43 PM, Andres Freund wrote:
> Hi,
>
> a significant number of times during investigations of bugs I wondered
> whether running the cluster with various settings, or various tools
> could've caused the issue at hand. Therefore I'd like to propose adding
> a 'tainted' field to pg_control, that contains some of the "history" of
> the cluster. Individual bits inside that field that I can think of right
> now are:
> - pg_resetxlog was used non-passively on cluster
> - ran with fsync=off
> - ran with full_page_writes=off
> - pg_upgrade was used
>
> What do others think?

Isn't the uncertainty usually that you know one of these things happened on the cluster, but you don't know if that's the actual root cause? That's my experience, at least.

That being said, I'm not strongly opposed to adding such a field.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services