Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Mar 9, 2017 at 7:12 PM, Peter Geoghegan wrote: >> Hm - I think it's fair to export RecentGlobalXmin, given that we >> obviously use it across modules in core code. I see very little reason >> not to export it. > > Well, the assertion is completely useless as anything but documentation... I came up with the attached. -- Peter Geoghegan From 05a8c0de7da7cd9ee4d8436abdf8368c1cf8ac19 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Thu, 9 Mar 2017 19:17:07 -0800 Subject: [PATCH] Fix amcheck build on Windows. Remove documenting assertion around RecentGlobalXmin, rather than marking it PGDLLIMPORT. In passing, update some comments that did not reflect the state of amcheck as-committed. --- contrib/amcheck/verify_nbtree.c | 74 +++-- 1 file changed, 35 insertions(+), 39 deletions(-) diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index 5724aa6..d21b209 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -26,7 +26,6 @@ #include "miscadmin.h" #include "storage/lmgr.h" #include "utils/memutils.h" -#include "utils/snapmgr.h" PG_MODULE_MAGIC; @@ -55,8 +54,8 @@ typedef struct BtreeCheckState /* B-Tree Index Relation */ Relation rel; - /* ExclusiveLock held? */ - bool exclusivelylocked; + /* ShareLock held on heap/index, rather than AccessShareLock? */ + bool strongerlock; /* Per-page context */ MemoryContext targetcontext; /* Buffer access strategy */ @@ -94,7 +93,7 @@ PG_FUNCTION_INFO_V1(bt_index_parent_check); static void bt_index_check_internal(Oid indrelid, bool parentcheck); static inline void btree_index_checkable(Relation rel); -static void bt_check_every_level(Relation rel, bool exclusivelylocked); +static void bt_check_every_level(Relation rel, bool strongerlock); static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level); static void bt_target_page_check(BtreeCheckState *state); @@ -256,23 +255,26 @@ btree_index_checkable(Relation rel) * logical order, verifying invariants as it goes. * * It is the caller's responsibility to acquire appropriate heavyweight lock on - * the index relation, and advise us if extra checks are safe when an - * ExclusiveLock is held. + * the index relation, and advise us if extra checks are safe when a ShareLock + * is held. * - * An ExclusiveLock is generally assumed to prevent any kind of physical + * A ShareLock is generally assumed to prevent any kind of physical * modification to the index structure, including modifications that VACUUM may * make. This does not include setting of the LP_DEAD bit by concurrent index * scans, although that is just metadata that is not able to directly affect * any check performed here. Any concurrent process that might act on the * LP_DEAD bit being set (recycle space) requires a heavyweight lock that - * cannot be held while we hold an ExclusiveLock. (Besides, even if that could + * cannot be held while we hold a ShareLock. (Besides, even if that could * happen, the ad-hoc recycling when a page might otherwise split is performed * per-page, and requires an exclusive buffer lock, which wouldn't cause us * trouble. _bt_delitems_vacuum() may only delete leaf items, and so the extra * parent/child check cannot be affected.) + * + * Note that we rely on RecentGlobalXmin interlock against recycling, just like + * standard index scans. See note on RecentGlobalXmin/B-Tree page deletion. */ static void -bt_check_every_level(Relation rel, bool exclusivelylocked) +bt_check_every_level(Relation rel, bool strongerlock) { BtreeCheckState *state; Page metapage; @@ -281,17 +283,11 @@ bt_check_every_level(Relation rel, bool exclusivelylocked) BtreeLevel current; /* - * RecentGlobalXmin assertion matches index_getnext_tid(). See note on - * RecentGlobalXmin/B-Tree page deletion. - */ - Assert(TransactionIdIsValid(RecentGlobalXmin)); - - /* * Initialize state for entire verification operation */ state = palloc(sizeof(BtreeCheckState)); state->rel = rel; - state->exclusivelylocked = exclusivelylocked; + state->strongerlock = strongerlock; /* Create context for page */ state->targetcontext = AllocSetContextCreate(CurrentMemoryContext, "amcheck context", @@ -353,7 +349,7 @@ bt_check_every_level(Relation rel, bool exclusivelylocked) /* * Given a left-most block at some level, move right, verifying each page - * individually (with more verification across pages for "exclusivelylocked" + * individually (with more verification across pages for "strongerlock" * callers). Caller should pass the true root page as the leftmost initially, * working their way down by passing what is returned for the last call here * until level 0 (leaf page level) was reached. @@ -428,7 +424,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level) * locking, check that the first valid page meet
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Mar 9, 2017 at 7:10 PM, Andres Freund wrote: >> verify_nbtree.obj : error LNK2001: unresolved external symbol >> RecentGlobalXmin >> [C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj] >> >> Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just >> remove the documenting assertion and leave that comment as-is. I'll >> work on a patch for this soon. > > Hm - I think it's fair to export RecentGlobalXmin, given that we > obviously use it across modules in core code. I see very little reason > not to export it. Well, the assertion is completely useless as anything but documentation... -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
Hi, On 2017-03-09 19:04:46 -0800, Peter Geoghegan wrote: > On Thu, Mar 9, 2017 at 4:41 PM, Andres Freund wrote: > > I don't really expect buildfarm fallout, but .. > > Unfortunately, there is some on Windows: Thanks for monitoring. > verify_nbtree.obj : error LNK2001: unresolved external symbol > RecentGlobalXmin > [C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj] > > Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just > remove the documenting assertion and leave that comment as-is. I'll > work on a patch for this soon. Hm - I think it's fair to export RecentGlobalXmin, given that we obviously use it across modules in core code. I see very little reason not to export it. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Mar 9, 2017 at 4:41 PM, Andres Freund wrote: > And pushed with these. Thanks. > I don't really expect buildfarm fallout, but .. Unfortunately, there is some on Windows: verify_nbtree.obj : error LNK2001: unresolved external symbol RecentGlobalXmin [C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj] Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just remove the documenting assertion and leave that comment as-is. I'll work on a patch for this soon. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 2017-03-09 16:29:24 -0800, Peter Geoghegan wrote: > I am generally in favor of more inclusive Reviewed-By lists. I suggest > that we expand it to: > > "Reviewed-By: Andres Freund, Thomas Vondra, Thomas Munro, Anastasia > Lubennikova, Robert Haas, Amit Langote" And pushed with these. I don't really expect buildfarm fallout, but .. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Mar 9, 2017 at 3:52 PM, Andres Freund wrote: > > But I also think it's more important to get some initial verification > functionality in, than resolving this conflict. I do, also quite > strongly, think we'll be better served with having something like what > you're proposing than nothing, and I don't have time/energy to turn your > patch into what I'm envisioning, nor necessarily the buyin. Hence I'm > planning to commit the current interface. > > Attached is your original patch, and a patch editorializing things. I do > plan to merge those before pushing. Your revisions all seem fine. No problems with any of it. > Who would you like to see added to Reviewed-By? I am generally in favor of more inclusive Reviewed-By lists. I suggest that we expand it to: "Reviewed-By: Andres Freund, Thomas Vondra, Thomas Munro, Anastasia Lubennikova, Robert Haas, Amit Langote" Thanks for your help with this! -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 2017-03-06 20:40:59 -0800, Peter Geoghegan wrote: > On Mon, Mar 6, 2017 at 3:57 PM, Andres Freund wrote: > > I'm ok with not immediately doing so, but I think Peter's design isn't > > in line with achieving something like this. > > I would be okay with doing this if we had a grab-bag of expensive > checks, that all pretty much require some combination of > ExclusiveLocks and/or ShareLocks anyway, and are amenable to being > combined. I could see that happening, but we're a long way off from > that. When it does happen, we might have other things that are a bit > like bt_index_parent_check(), in terms of locking requirements and > degree of thoroughness, that might themselves require other parameters > specific to the verification that they perform that we cannot > anticipate. For example, maybe bt_index_parent_check() could have a > parameter that is a probability that any given IndexTuple will be > verified against its heap tuple. Then you could have something like a > PRNG seed argument, so you can check different tuples every day. There > are many possibilities, and I hope that eventually amcheck has this > kind of very rich functionality. > When you use the interface you describe to "stack" several checks at > once, less important parameters, like the ones I suggest are going to > be an awkward fit, and so will probably be excluded. I'm not opposed > to having what amounts to a second interface to what bt_index_check() > does, to make this kind of thing work where it makes sense. Really, > bt_index_parent_check() is almost an alternative interface to the same > functionality today. There can be another one in the future, if it > serves a purpose, and the locking requirements are roughly the same > for all the checks. I'd be fine with that. Let's just get the basic > feature in for now, though. I disagree quite strongly with pretty much all of that. But I also think it's more important to get some initial verification functionality in, than resolving this conflict. I do, also quite strongly, think we'll be better served with having something like what you're proposing than nothing, and I don't have time/energy to turn your patch into what I'm envisioning, nor necessarily the buyin. Hence I'm planning to commit the current interface. Attached is your original patch, and a patch editorializing things. I do plan to merge those before pushing. Who would you like to see added to Reviewed-By? Regards, Andres >From 60c670d251dd4244a09215e4f6c22e444d2bff7c Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 9 Mar 2017 15:50:40 -0800 Subject: [PATCH 1/2] Add amcheck extension to contrib. This is the beginning of a collection of SQL-callable functions to verify the integrity of data files. For now it only contains code to verify B-Tree indexes. This adds two SQL-callable functions, validating B-Tree consistency to a varying degree. Check the, extensive, docs for details. The goal is to later extend the coverage of the module to further access methods, possibly including the heap. Once checks for additional access methods exist, we'll likely add some "dispatch" functions that cover multiple access methods. Author: Peter Geoghegan, editorialized by Andres Freund Reviewed-By: Andres Freund, Thomas Vondra Discussion: cam3swzqzlmhmwmbqjzk+prkxrnuz4w90wymuwfkev8mz3de...@mail.gmail.com --- contrib/Makefile |1 + contrib/amcheck/Makefile | 19 + contrib/amcheck/amcheck--1.0.sql | 24 + contrib/amcheck/amcheck.control |5 + contrib/amcheck/verify_nbtree.c | 1238 ++ doc/src/sgml/amcheck.sgml| 275 + doc/src/sgml/contrib.sgml|1 + doc/src/sgml/filelist.sgml |1 + src/tools/pgindent/typedefs.list |2 + 9 files changed, 1566 insertions(+) create mode 100644 contrib/amcheck/Makefile create mode 100644 contrib/amcheck/amcheck--1.0.sql create mode 100644 contrib/amcheck/amcheck.control create mode 100644 contrib/amcheck/verify_nbtree.c create mode 100644 doc/src/sgml/amcheck.sgml diff --git a/contrib/Makefile b/contrib/Makefile index 9a74e1b99f..e84eb67008 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global SUBDIRS = \ adminpack \ + amcheck \ auth_delay \ auto_explain \ bloom \ diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile new file mode 100644 index 00..f14c7aa58b --- /dev/null +++ b/contrib/amcheck/Makefile @@ -0,0 +1,19 @@ +# contrib/amcheck/Makefile + +MODULE_big = amcheck +OBJS = verify_nbtree.o $(WIN32RES) + +EXTENSION = amcheck +DATA = amcheck--1.0.sql +PGFILEDESC = "amcheck - functions to verify access method invariants" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/amcheck +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif di
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Mon, Mar 6, 2017 at 3:57 PM, Andres Freund wrote: > I'm ok with not immediately doing so, but I think Peter's design isn't > in line with achieving something like this. I would be okay with doing this if we had a grab-bag of expensive checks, that all pretty much require some combination of ExclusiveLocks and/or ShareLocks anyway, and are amenable to being combined. I could see that happening, but we're a long way off from that. When it does happen, we might have other things that are a bit like bt_index_parent_check(), in terms of locking requirements and degree of thoroughness, that might themselves require other parameters specific to the verification that they perform that we cannot anticipate. For example, maybe bt_index_parent_check() could have a parameter that is a probability that any given IndexTuple will be verified against its heap tuple. Then you could have something like a PRNG seed argument, so you can check different tuples every day. There are many possibilities, and I hope that eventually amcheck has this kind of very rich functionality. When you use the interface you describe to "stack" several checks at once, less important parameters, like the ones I suggest are going to be an awkward fit, and so will probably be excluded. I'm not opposed to having what amounts to a second interface to what bt_index_check() does, to make this kind of thing work where it makes sense. Really, bt_index_parent_check() is almost an alternative interface to the same functionality today. There can be another one in the future, if it serves a purpose, and the locking requirements are roughly the same for all the checks. I'd be fine with that. Let's just get the basic feature in for now, though. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 2017-02-13 12:05:21 -0500, Robert Haas wrote: > On Fri, Feb 10, 2017 at 8:39 PM, Peter Geoghegan wrote: > > On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund wrote: > >> Except that the proposed names aren't remotely like that... ;). > > > > Revision attached -- V5. We now REVOKE ALL on both functions, as > > Robert suggested, instead of the previous approach of having a > > hard-coded superuser check with enforcement. > > > >> And I proposed documenting named parameters, and > >> check_btree(performing_check_requiring_exclusive_locks => true) is just > >> about as expressive. > > > > I have not done this, nor have I renamed the functions. I still think > > that this is something that we can fix by adding a boolean argument to > > each function in the future, or something along those lines. I > > *really* hate the idea of having one function with non-obvious, > > variable requirements on locking, with locking implications that are > > not knowable when we PREPARE an SQL statement calling the function. It > > also removes a useful way of have superusers discriminate against the > > stronger locking variant bt_index_parent_check() by not granting > > execute on it (as an anti-footgun measure). > > I think Andres is more or less correct that > "performing_check_requiring_exclusive_locks => true" is just about as > expressive as calling a different function, but I think that your > point that the superuser might want to grant access to one function > but not the other is a good one. On the other hand, I think Andres > has a concern that we might have more modes in the future and we don't > want to end up with 2^n entrypoints. That also seems valid. Hmm. That's part of my concern. The second part is that I really want to be able to have a check_relation() (and check_database())function that I can pass a bunch of arguments determining how expensive checks are going to be performed. E.g. I'd like to be able to do something like SELECT * FROM check_relation( 'my_table'::regclass, test_btree => 'true', test_btree_heap_interlock => 'true', test_gin => 'true'); SELECT * FROM check_current_database( test_heap_update_chains => 'true', test_heap_clog_interlock => 'true', test_btree => 'true', test_gin => 'false'); etc. You can't really trivially replace these with a larger query and/or function, because of the locking considerations (consider what happens if somebody concurrently drops a table/index - your whole query errors out, wasting hours of work). I'm ok with not immediately doing so, but I think Peter's design isn't in line with achieving something like this. Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
Hi, On 2017-02-13 11:52:53 -0500, Robert Haas wrote: > On Thu, Feb 9, 2017 at 5:32 PM, Andres Freund wrote: > >> > Again, some parts of the code doing something bad isn't a good argument > >> > for doing again. Releasing locks early is a bad pattern, because backend > >> > code isn't generally safe against it, we have special code-paths for > >> > catalog tables to support it for those. > >> > >> I don't agree with that. The reason we keep relation locks until the > >> end of the transaction is to avoid surprising application code, not > >> because the system can't tolerate it. But here it's more likely that > >> retaining the lock will surprise the user. > > > > It's both. A bunch of code paths rely on early release ony happening on > > catalogs. E.g., and that's just the first thing that comes to mind, > > cache invalidations which really only are guaranteed to be delivered and > > visibile for other cache accesses, if the lock is only released after > > the end of the transaction. Object accesses like relation_open() accept > > invalidations *after* acquiring the lock. Combined with lock releases > > only happening *after* transaction commits that guarantees an up2date > > view of the cache; but if others release the lock earlier and have cache > > invalidations, that doesn't work anymore. Now you can argue that it's > > possibly harmless here because there's presumably no way this can ever > > trigger cache invals - and you're probably right - but this isn't the > > only place with such assumption and you'd definitely have to argue *why* > > it's right. > > I agree that code which issues cache invalidations needs to hold a > blocking lock until commit IF it's critical for other backends to see > the update. Right. And I think that needs to be the default assumption, because otherwise you need to be careful about all the relevant code paths. I'm fine with adding something along the lines of /* * Release lock on relation early, this function doesn't generate any * cache invalidations, and it's convenient for callers to be able to * check multiple functions in one go. */ (after checking that's true of course) and then skipping the lock release. What I'm absolutely not ok with is doing so without consideration and documentation. > I don't really understand your demand for "an argument why it's > right". My argument is simple: I can't think of a single thing that > it would break, and I don't believe there is any such thing. That's pretty much an argument. I just want documentation that calls attention that we're doing something that requires a modicum of care. I fail to see why that's unreasonable. I don't think it's that unlikely will come around adding something like "hey we can update the free page statistic here, because we just read the whole damn thing", suddenly invalidating the logic. Or we add a heap check that then might trigger pruning, which quite conceivably could trigger an invalidation. Or ... > I think you're trying to invent a coding rule that doesn't exist, but > I can't prove a negative. We had bugs around this in the past, so ... > Your argument so far is that "yeah, well, maybe inval doesn't emit > sinval traffic (duh?) but it might do something else that breaks, > prove to me that there is no such thing". I have no idea what you're trying to say with this. The rest of that paragraph just seems like bunch of odd hyperbole ("impossible to exceed the speed of light..."). > From my point of view, it's likely to be common to want to run amcheck > in series on a bunch of indexes, like by writing a SELECT query > against pg_class or pg_index and passing the OIDs to amcheck. With > the retain-locks design, that query accumulates locks on a large > number of objects, possibly blowing out the lock table or blocking > DDL. On the other hand, retaining the locks actually increases the chance to run a check to completion. If you e.g. write a query checking the indexes for a table, you suddenly can't be sure that subsequent indexes for the same table can be checked, because the relation lock has been released inbetween. Which then'd lead to erroring out after doing part of the expensive work. I don't think that refutes your argument, but I also don't think it's as clear cut as you make it out to be. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Feb 10, 2017 at 8:39 PM, Peter Geoghegan wrote: > On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund wrote: >> Except that the proposed names aren't remotely like that... ;). > > Revision attached -- V5. We now REVOKE ALL on both functions, as > Robert suggested, instead of the previous approach of having a > hard-coded superuser check with enforcement. > >> And I proposed documenting named parameters, and >> check_btree(performing_check_requiring_exclusive_locks => true) is just >> about as expressive. > > I have not done this, nor have I renamed the functions. I still think > that this is something that we can fix by adding a boolean argument to > each function in the future, or something along those lines. I > *really* hate the idea of having one function with non-obvious, > variable requirements on locking, with locking implications that are > not knowable when we PREPARE an SQL statement calling the function. It > also removes a useful way of have superusers discriminate against the > stronger locking variant bt_index_parent_check() by not granting > execute on it (as an anti-footgun measure). I think Andres is more or less correct that "performing_check_requiring_exclusive_locks => true" is just about as expressive as calling a different function, but I think that your point that the superuser might want to grant access to one function but not the other is a good one. On the other hand, I think Andres has a concern that we might have more modes in the future and we don't want to end up with 2^n entrypoints. That also seems valid. Hmm. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Feb 9, 2017 at 5:32 PM, Andres Freund wrote: >> > Again, some parts of the code doing something bad isn't a good argument >> > for doing again. Releasing locks early is a bad pattern, because backend >> > code isn't generally safe against it, we have special code-paths for >> > catalog tables to support it for those. >> >> I don't agree with that. The reason we keep relation locks until the >> end of the transaction is to avoid surprising application code, not >> because the system can't tolerate it. But here it's more likely that >> retaining the lock will surprise the user. > > It's both. A bunch of code paths rely on early release ony happening on > catalogs. E.g., and that's just the first thing that comes to mind, > cache invalidations which really only are guaranteed to be delivered and > visibile for other cache accesses, if the lock is only released after > the end of the transaction. Object accesses like relation_open() accept > invalidations *after* acquiring the lock. Combined with lock releases > only happening *after* transaction commits that guarantees an up2date > view of the cache; but if others release the lock earlier and have cache > invalidations, that doesn't work anymore. Now you can argue that it's > possibly harmless here because there's presumably no way this can ever > trigger cache invals - and you're probably right - but this isn't the > only place with such assumption and you'd definitely have to argue *why* > it's right. I agree that code which issues cache invalidations needs to hold a blocking lock until commit IF it's critical for other backends to see the update. That's not always the case; for example, if an ALTER TABLE statement only changes the fillfactor the worst thing that happens if some other backend fails to see the invalidations is that some transactions continue to use the old value for a while. That's why we now allow fillfactor and some other things using only ShareUpdateExclusiveLock. That's not quite the same thing but the effect is the same: we issue invalidations that other backends aren't guaranteed to see in a timely fashion, and it works OK. Yeah, some other session might not see the change for a while, but we don't view that as a big problem. But I think in this case that argument is basically a straw man. A utility that's called amcheck is obviously not issuing any cache invalidations. It's probably not even changing anything at all; if it is, maybe it should be called amrepair. So the fact that CLUSTER has to hold AEL until commit time really doesn't have anything to do with whether amcheck needs to hold AES until commit time. I don't really understand your demand for "an argument why it's right". My argument is simple: I can't think of a single thing that it would break, and I don't believe there is any such thing. Furthermore, as Peter points out, we do this kind of thing all the time in other places and it indeed does not break things. I think you're trying to invent a coding rule that doesn't exist, but I can't prove a negative. Your argument so far is that "yeah, well, maybe inval doesn't emit sinval traffic (duh?) but it might do something else that breaks, prove to me that there is no such thing". But that's just like asking me to prove that it's impossible to exceed the speed of light. On information and belief, nobody currently knows how to do that and there is good scientific evidence that it cannot be done, but there's no way to prove that someone won't in the future discover a refinement in the physical laws of the universe as we understand them today which sheds a different light on the situation. >From my point of view, it's likely to be common to want to run amcheck in series on a bunch of indexes, like by writing a SELECT query against pg_class or pg_index and passing the OIDs to amcheck. With the retain-locks design, that query accumulates locks on a large number of objects, possibly blowing out the lock table or blocking DDL. I think that readily-forseeable problem ought to trump concerns about purely hypothetical issues caused by releasing locks early. If you can come up with an actual issue with releasing locks early that applies to this particular case, then that's different, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Feb 9, 2017 at 8:15 PM, Peter Geoghegan wrote: > BTW, aren't there cases where a PARALLEL SAFE function that needs to > acquire locks on some arbitrary relation not referenced in the query > can get locks on its own, which may only last as long as the parallel > worker's lifetime? This could happen *despite* the fact that the > author of the function may have imagined that callers could not > release relation level locks early (i.e., by not releasing them > directly, and so correctly following this "badly documented > assumption"). Yes. > It seems like the existence of PARALLEL RESTRICTED is primarily > motivated by making stuff that definitely needs the locks to last > until xact end being sure that that happens -- the docs say so. This > seems wholly inconsistent with the idea that you're not supposed to > let that happen under any circumstances. I find all this highly > confusing. Have I missed some further subtlety that applies with > PARALLEL RESTRICTED? That's by no means the only thing that could cause you to mark something PARALLEL RESTRICTED. I think you might have missed this bit of the documentation, a few paragraphs up: Similarly, functions must be marked PARALLEL RESTRICTED if they access temporary tables, client connection state, cursors, prepared statements, or miscellaneous backend-local state which the system cannot synchronize across workers. For example, setseed and random are parallel restricted for this last reason. That stuff is the main motivation. The early lock release stuff could come up for user-written functions, but the things listed in the paragraph I've quoted here come up for plenty of built-in functions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Feb 10, 2017 at 6:45 AM, Peter Geoghegan wrote: > On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan wrote: >>> which isn't an issue here, but reinforces my point about the (badly >>> documented) assumption that we don't release locks on user relations >>> early. >> I think even if we don't need to retain locks till transaction end in the present case, but Andres's point has a merit that it is good to follow an undocumented rule of releasing locks at transaction end for user relations unless there is some benefit in releasing the locks early in some situation. I think that can avoid unnecessary worry that whether it safe to do in some situation. >> You are right about the substantive issue, I assume, but I have a hard >> time agreeing with the idea that it's even badly documented when there >> are at least 10 counter-examples of blithely doing this. In any case, >> I find it very confusing when you talk about things as established >> practice/coding standards when they're very clearly not consistently >> adhered to at all. You should at least say "let's not make a bad >> situation any worse", or something, so that I don't need to spend 10 >> minutes pulling my hair out. > > BTW, aren't there cases where a PARALLEL SAFE function that needs to > acquire locks on some arbitrary relation not referenced in the query > can get locks on its own, which may only last as long as the parallel > worker's lifetime? This could happen *despite* the fact that the > author of the function may have imagined that callers could not > release relation level locks early (i.e., by not releasing them > directly, and so correctly following this "badly documented > assumption"). > > It seems like the existence of PARALLEL RESTRICTED is primarily > motivated by making stuff that definitely needs the locks to last > until xact end being sure that that happens -- the docs say so. > I don't think so. I think one of the primary reason of releasing locks in workers is that we don't have any way to transfer them to the leader which can release them at transaction end. > This > seems wholly inconsistent with the idea that you're not supposed to > let that happen under any circumstances. I find all this highly > confusing. Have I missed some further subtlety that applies with > PARALLEL RESTRICTED? > The basic idea of parallel restricted is that any expression will be considered parallel restricted if that can't be executed in the worker, for example, a reference to initplans in a query. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund wrote: > Except that the proposed names aren't remotely like that... ;). Revision attached -- V5. We now REVOKE ALL on both functions, as Robert suggested, instead of the previous approach of having a hard-coded superuser check with enforcement. > And I proposed documenting named parameters, and > check_btree(performing_check_requiring_exclusive_locks => true) is just > about as expressive. I have not done this, nor have I renamed the functions. I still think that this is something that we can fix by adding a boolean argument to each function in the future, or something along those lines. I *really* hate the idea of having one function with non-obvious, variable requirements on locking, with locking implications that are not knowable when we PREPARE an SQL statement calling the function. It also removes a useful way of have superusers discriminate against the stronger locking variant bt_index_parent_check() by not granting execute on it (as an anti-footgun measure). I don't think that it's too much to ask for you to concede this one point to me (and to Robert). I have acted on every other item you raised, without exception, including retaining the locks for the duration of the transaction. I also made the functions PARALLEL RESTRICTED, which seemed like a good idea. Other notes: * A nice thing about the additional level check that you suggested is that it's a cross-level check that is perfectly safe to perform with only an AccessShareLock, unlike the extra bt_index_parent_check() check, since the number of levels only ever increases, no matter what. There is no possible reason why we should ever find that a child page does not agree with its self-described parent about the level it's on. I'm glad you thought of this. * I kept ereport() DEBUG1 calls, but now use non-error errcode for those. I thought that preserving the various ereport() fields was worth it, since these traces are rather information dense in a way that is harder to deal with when using raw elog()s. If you hate this, you can still change it. * I have not added code to check the relation permissions, which might matter if and when the superuser grants execute privs on amcheck functions. Anyone want to argue that I should have added relation-level permissions checks? I tend to think it's pointless, since the superuser almost certainly won't think to consider that additional factor. I cannot imagine how either function could be used to leak information, in any case. No error message dumps the contents of pages (only item pointers). * pgindent was run against this revision. You were right to want to get that out of the way now, since it needed to be fixed up by hand to look reasonable. typedef list also updated. -- Peter Geoghegan 0001-Add-amcheck-extension-to-contrib.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan wrote: >> which isn't an issue here, but reinforces my point about the (badly >> documented) assumption that we don't release locks on user relations >> early. > > You are right about the substantive issue, I assume, but I have a hard > time agreeing with the idea that it's even badly documented when there > are at least 10 counter-examples of blithely doing this. In any case, > I find it very confusing when you talk about things as established > practice/coding standards when they're very clearly not consistently > adhered to at all. You should at least say "let's not make a bad > situation any worse", or something, so that I don't need to spend 10 > minutes pulling my hair out. BTW, aren't there cases where a PARALLEL SAFE function that needs to acquire locks on some arbitrary relation not referenced in the query can get locks on its own, which may only last as long as the parallel worker's lifetime? This could happen *despite* the fact that the author of the function may have imagined that callers could not release relation level locks early (i.e., by not releasing them directly, and so correctly following this "badly documented assumption"). It seems like the existence of PARALLEL RESTRICTED is primarily motivated by making stuff that definitely needs the locks to last until xact end being sure that that happens -- the docs say so. This seems wholly inconsistent with the idea that you're not supposed to let that happen under any circumstances. I find all this highly confusing. Have I missed some further subtlety that applies with PARALLEL RESTRICTED? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund wrote: > Except that the proposed names aren't remotely like that... ;). > > And I proposed documenting named parameters, and > check_btree(performing_check_requiring_exclusive_locks => true) is just > about as expressive. I cannot think of an example, offhand, of where it isn't self-evident what relation level locks will be acquired when a statement is prepared. I guess you could say that VACUUM truncation is a bit like that, but it's not the same IMV. Maybe there is an easy compromise here. We can add an argument with a default, "level". It's an "enum" of the style of pg_prewarm (i.e., it's actually a text argument). The default is 'medium' or something like that. Maybe we don't add any other flags for now, or maybe we add some generic ones that don't actually changed the behavior. Existing names are retained. Doesn't that address your concern about an inflexible API? I really dislike the idea of having the lock strength be determined on the basis of a function arguments value. >> I don't agree with that. The reason we keep relation locks until the >> end of the transaction is to avoid surprising application code, not >> because the system can't tolerate it. But here it's more likely that >> retaining the lock will surprise the user. > There were other issues than this, but right of the top of my head I > only remember: > double > IndexBuildHeapRangeScan(Relation heapRelation, > .. > /* > * Since caller should hold ShareLock > or better, normally > * the only way to see this is if it > was inserted earlier > * in our own transaction. However, > it can happen in > * system catalogs, since we tend to > release write lock > * before commit there. Give a > warning if neither case > * applies. > */ > xwait = > HeapTupleHeaderGetXmin(heapTuple->t_data); > if > (!TransactionIdIsCurrentTransactionId(xwait)) > { > if (!is_system_catalog) > elog(WARNING, > "concurrent insert in progress within table \"%s\"", > > RelationGetRelationName(heapRelation)); > > which isn't an issue here, but reinforces my point about the (badly > documented) assumption that we don't release locks on user relations > early. You are right about the substantive issue, I assume, but I have a hard time agreeing with the idea that it's even badly documented when there are at least 10 counter-examples of blithely doing this. In any case, I find it very confusing when you talk about things as established practice/coding standards when they're very clearly not consistently adhered to at all. You should at least say "let's not make a bad situation any worse", or something, so that I don't need to spend 10 minutes pulling my hair out. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 2017-02-09 17:12:58 -0500, Robert Haas wrote: > On Thu, Feb 9, 2017 at 4:57 PM, Andres Freund wrote: > > Meh. I don't think that's a serious problem, nor is without precedent > > (say same toplevel DDL with differing lock levels depending on > > options). Nor do the function names confer that any more efficiently > > than parameters. > > Really? check_btree(blah, true) vs. check_btree(blah, false) is no > more or less clear than check_btree_for_plausible_errors(blah) vs. > check_btree_in_unreasonable_detail(blah)? Except that the proposed names aren't remotely like that... ;). And I proposed documenting named parameters, and check_btree(performing_check_requiring_exclusive_locks => true) is just about as expressive. > > Again, some parts of the code doing something bad isn't a good argument > > for doing again. Releasing locks early is a bad pattern, because backend > > code isn't generally safe against it, we have special code-paths for > > catalog tables to support it for those. > > I don't agree with that. The reason we keep relation locks until the > end of the transaction is to avoid surprising application code, not > because the system can't tolerate it. But here it's more likely that > retaining the lock will surprise the user. It's both. A bunch of code paths rely on early release ony happening on catalogs. E.g., and that's just the first thing that comes to mind, cache invalidations which really only are guaranteed to be delivered and visibile for other cache accesses, if the lock is only released after the end of the transaction. Object accesses like relation_open() accept invalidations *after* acquiring the lock. Combined with lock releases only happening *after* transaction commits that guarantees an up2date view of the cache; but if others release the lock earlier and have cache invalidations, that doesn't work anymore. Now you can argue that it's possibly harmless here because there's presumably no way this can ever trigger cache invals - and you're probably right - but this isn't the only place with such assumption and you'd definitely have to argue *why* it's right. There were other issues than this, but right of the top of my head I only remember: double IndexBuildHeapRangeScan(Relation heapRelation, .. /* * Since caller should hold ShareLock or better, normally * the only way to see this is if it was inserted earlier * in our own transaction. However, it can happen in * system catalogs, since we tend to release write lock * before commit there. Give a warning if neither case * applies. */ xwait = HeapTupleHeaderGetXmin(heapTuple->t_data); if (!TransactionIdIsCurrentTransactionId(xwait)) { if (!is_system_catalog) elog(WARNING, "concurrent insert in progress within table \"%s\"", RelationGetRelationName(heapRelation)); which isn't an issue here, but reinforces my point about the (badly documented) assumption that we don't release locks on user relations early. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Feb 9, 2017 at 4:57 PM, Andres Freund wrote: > On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote: >> > I'd really like to have something like relation_check(), index_check() >> > which calls the correct functions for the relevan index types (and >> > possibly warns about !nbtree indexes not being checked?). >> >> I feel that that's something for an in-core facility that exposes this >> through DDL. One good reason to not do that just yet is that I'm not >> completely comfortable with what a uniform interface looks like here. >> And frankly, I really just want to get this out of the way, to prove >> the general concept of verification tooling. amcheck v1 is a good >> start, but much work remains. I think that no one will take the idea >> completely seriously until it is usable as a core extension. Let's get >> the ball rolling. > > I'm not convinced at all by that argument. If we want something in > core, having this in contrib doesn't help much. Nor do I see much > point in explicit syntax. I don't see any need for this to be in core, nor do I think we need explicit syntax. But given that, I also don't think that we need the generalized infrastructure you proposed in the quoted section above. >> I strongly disagree. I would agree if the lock strength were the same >> in both cases, but it isn't. Varying the strength of heavyweight lock >> taken based on an argument to a function is a massive foot-gun IMV. > > Meh. I don't think that's a serious problem, nor is without precedent > (say same toplevel DDL with differing lock levels depending on > options). Nor do the function names confer that any more efficiently > than parameters. Really? check_btree(blah, true) vs. check_btree(blah, false) is no more or less clear than check_btree_for_plausible_errors(blah) vs. check_btree_in_unreasonable_detail(blah)? >> I do think that that's the logical way to split out functionality. I >> don't think that there is likely to be a problem with adding stuff in >> the future. It'll either be something that we can do in passing >> anyway, in which case we can just add it to the existing checks >> harmlessly. Or, it will be something like a tool that verifies the >> heap is consistent with a given index, which is vastly more expensive >> in terms of CPU and I/O costs, and yet is probably no worse than >> bt_index_check() in terms of lock strength (AccessShareLock). > > But you quite possibly want to do all three in one pass - and that'll be > ugly with your API. With your API I'll not be able to do the parent > check and the heap check at once. Lest we add fourth function that does > all of it. This however seems like a good point. > Again, some parts of the code doing something bad isn't a good argument > for doing again. Releasing locks early is a bad pattern, because backend > code isn't generally safe against it, we have special code-paths for > catalog tables to support it for those. I don't agree with that. The reason we keep relation locks until the end of the transaction is to avoid surprising application code, not because the system can't tolerate it. But here it's more likely that retaining the lock will surprise the user. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote: > > I'd really like to have something like relation_check(), index_check() > > which calls the correct functions for the relevan index types (and > > possibly warns about !nbtree indexes not being checked?). > > I feel that that's something for an in-core facility that exposes this > through DDL. One good reason to not do that just yet is that I'm not > completely comfortable with what a uniform interface looks like here. > And frankly, I really just want to get this out of the way, to prove > the general concept of verification tooling. amcheck v1 is a good > start, but much work remains. I think that no one will take the idea > completely seriously until it is usable as a core extension. Let's get > the ball rolling. I'm not convinced at all by that argument. If we want something in core, having this in contrib doesn't help much. Nor do I see much point in explicit syntax. > >> +#define CHECK_SUPERUSER() { \ > >> + if (!superuser()) \ > >> + ereport(ERROR, \ > >> + > >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ > >> + (errmsg("must be superuser to use > >> verification functions"; } > > > > Please make this either a function or just copy - the macro doesn't buy > > us anything except weirder looking syntax. > > It's actually from pageinspect, but okay. We have a lot of dodgy stuff in places ;) > >> +PG_FUNCTION_INFO_V1(bt_index_check); > >> +PG_FUNCTION_INFO_V1(bt_index_parent_check); > > > > I think it'd be good to avoid adding a lot of functions for each > > additional type of check we can think of. I think adding a named 'bool > > check_parents = false' argument or such would be better. Then we can > > simply add more and more checks subsequently, and we can have > > 'bool all_checks = false' so people can trivially check everything, > > instead of having to add new checks in each new function. > > I strongly disagree. I would agree if the lock strength were the same > in both cases, but it isn't. Varying the strength of heavyweight lock > taken based on an argument to a function is a massive foot-gun IMV. Meh. I don't think that's a serious problem, nor is without precedent (say same toplevel DDL with differing lock levels depending on options). Nor do the function names confer that any more efficiently than parameters. > I do think that that's the logical way to split out functionality. I > don't think that there is likely to be a problem with adding stuff in > the future. It'll either be something that we can do in passing > anyway, in which case we can just add it to the existing checks > harmlessly. Or, it will be something like a tool that verifies the > heap is consistent with a given index, which is vastly more expensive > in terms of CPU and I/O costs, and yet is probably no worse than > bt_index_check() in terms of lock strength (AccessShareLock). But you quite possibly want to do all three in one pass - and that'll be ugly with your API. With your API I'll not be able to do the parent check and the heap check at once. Lest we add fourth function that does all of it. > My thinking here is that users really ought to use bt_index_check() in > almost all cases, not bt_index_parent_check(), which is more of a tool > for hackers that are developing new B-Tree features. The distinction > is blurry, in fact, but the lock strength requirements of > bt_index_parent_check() make it vastly less useful. A quite reasonable patter is to do this verification on standbys that you can either bring up separately and/or just temporarily pause. Or just backups before declaring them valid. > >> +/* > >> + * bt_index_check(index regclass) > >> + * > >> + * Verify integrity of B-Tree index. > >> + * > >> + * Only acquires AccessShareLock on index relation. Does not consider > >> + * invariants that exist between parent/child pages. > >> + */ > > > I'm *VERY* strongly against releasing locks on user relations early. > > Why? pageinspect does this. Again, some parts of the code doing something bad isn't a good argument for doing again. Releasing locks early is a bad pattern, because backend code isn't generally safe against it, we have special code-paths for catalog tables to support it for those. > >> + /* > >> + * We must lock table before index to avoid deadlocks. However, if > >> the > >> + * passed indrelid isn't an index then IndexGetRelation() will fail. > >> + * Rather than emitting a not-very-helpful error message, postpone > >> + * complaining, expecting that the is-it-an-index test below will > >> fail. > >> + */ > >> + heapid = IndexGetRelation(indrelid, true); > >> + if (OidIsValid(heapid)) > >> + heaprel = heap_open(heapid, ShareLock); > >> + else > >> + heaprel = NULL; > >> + > >> + /* > >> + * Open the target index relations separ
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Feb 9, 2017 at 12:05 AM, Andres Freund wrote: > reindex_index isn't necessarily comparable, because > RangeVarCallbackForReindexIndex() is, on first blush at least, careful > enough about the locking, and reindex_relation only gets the index list > after building the index: > /* > * Find and lock index, and check permissions on table; use callback > to > * obtain lock on table first, to avoid deadlock hazard. The lock > level > * used here must match the index lock obtained in reindex_index(). > */ > > I think you ought to do something similar. Okay. Looking into it, but see my response to your later remarks below. > I'd really like to have something like relation_check(), index_check() > which calls the correct functions for the relevan index types (and > possibly warns about !nbtree indexes not being checked?). I feel that that's something for an in-core facility that exposes this through DDL. One good reason to not do that just yet is that I'm not completely comfortable with what a uniform interface looks like here. And frankly, I really just want to get this out of the way, to prove the general concept of verification tooling. amcheck v1 is a good start, but much work remains. I think that no one will take the idea completely seriously until it is usable as a core extension. Let's get the ball rolling. >> +#define CHECK_SUPERUSER() { \ >> + if (!superuser()) \ >> + ereport(ERROR, \ >> + >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ >> + (errmsg("must be superuser to use >> verification functions"; } > > Please make this either a function or just copy - the macro doesn't buy > us anything except weirder looking syntax. It's actually from pageinspect, but okay. >> +#define CORRUPTION ERROR >> +#define CONCERN DEBUG1 >> +#define POSITION DEBUG2 >> +#endif > > Please remove this block and use the normal elevels. Okay. >> +/* >> + * State associated with verifying a B-Tree index >> + */ >> +typedef struct BtreeCheckState >> +{ >> + /* >> + * Unchanging state, established at start of verification: >> + * >> + * rel: B-Tree Index Relation >> + * exclusivelock: ExclusiveLock held on rel; else AccessShareLock >> + * checkstrategy: Buffer access strategy >> + * targetcontext: Target page memory context >> + */ >> + Relationrel; >> + boolexclusivelock; >> + BufferAccessStrategycheckstrategy; >> + MemoryContext targetcontext; > > Can you move the field comments to the individual fields? Separating > them makes it more likely to get out of date (obviously keep the > header). I'd also rename 'exclusivelock' to 'exclusivelylocked' or > such. Okay. >> + /* >> + * Mutable state, for verification of particular page: >> + * >> + * target: Main target page >> + * targetblock: Main target page's block number >> + * targetlsn: Main target page's LSN > > Same here. Okay. >> +PG_FUNCTION_INFO_V1(bt_index_check); >> +PG_FUNCTION_INFO_V1(bt_index_parent_check); > > I think it'd be good to avoid adding a lot of functions for each > additional type of check we can think of. I think adding a named 'bool > check_parents = false' argument or such would be better. Then we can > simply add more and more checks subsequently, and we can have > 'bool all_checks = false' so people can trivially check everything, > instead of having to add new checks in each new function. I strongly disagree. I would agree if the lock strength were the same in both cases, but it isn't. Varying the strength of heavyweight lock taken based on an argument to a function is a massive foot-gun IMV. I do think that that's the logical way to split out functionality. I don't think that there is likely to be a problem with adding stuff in the future. It'll either be something that we can do in passing anyway, in which case we can just add it to the existing checks harmlessly. Or, it will be something like a tool that verifies the heap is consistent with a given index, which is vastly more expensive in terms of CPU and I/O costs, and yet is probably no worse than bt_index_check() in terms of lock strength (AccessShareLock). My thinking here is that users really ought to use bt_index_check() in almost all cases, not bt_index_parent_check(), which is more of a tool for hackers that are developing new B-Tree features. The distinction is blurry, in fact, but the lock strength requirements of bt_index_parent_check() make it vastly less useful. Especially considering that it seems fairly unlikely that it will ever catch a problem that bt_index_check() would not have. bt_index_parent_check() will become a lot more i
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 2017-02-09 14:48:18 -0500, Robert Haas wrote: > On Thu, Feb 9, 2017 at 3:05 AM, Andres Freund wrote: > >> +#define CHECK_SUPERUSER() { \ > >> + if (!superuser()) \ > >> + ereport(ERROR, \ > >> + > >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ > >> + (errmsg("must be superuser to use > >> verification functions"; } > > > > Please make this either a function or just copy - the macro doesn't buy > > us anything except weirder looking syntax. > > Better yet, remove the check altogether and just revoke access from > public. Then the default policy is superuser-only, but the DBA can > grant access to who they wish. That works for me as well. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Feb 9, 2017 at 3:05 AM, Andres Freund wrote: >> +#define CHECK_SUPERUSER() { \ >> + if (!superuser()) \ >> + ereport(ERROR, \ >> + >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ >> + (errmsg("must be superuser to use >> verification functions"; } > > Please make this either a function or just copy - the macro doesn't buy > us anything except weirder looking syntax. Better yet, remove the check altogether and just revoke access from public. Then the default policy is superuser-only, but the DBA can grant access to who they wish. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
Hi, On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote: > Andres said: "Theoretically we should recheck that the oids still > match, theoretically the locking here allows the index->table mapping > to change". I don't know how to act on this feedback, since comparable > index + heap locking code should have the same problem, but doesn't > consider it. How is this any different to reindex_index()? reindex_index isn't necessarily comparable, because RangeVarCallbackForReindexIndex() is, on first blush at least, careful enough about the locking, and reindex_relation only gets the index list after building the index: /* * Find and lock index, and check permissions on table; use callback to * obtain lock on table first, to avoid deadlock hazard. The lock level * used here must match the index lock obtained in reindex_index(). */ I think you ought to do something similar. I'd really like to have something like relation_check(), index_check() which calls the correct functions for the relevan index types (and possibly warns about !nbtree indexes not being checked?). > @@ -0,0 +1,1227 @@ > +/*- > + * > + * verify_nbtree.c > + * Verifies the integrity of nbtree indexes based on invariants. > + * > + * For B-Tree indexes, verification includes the invariant that each page in > + * the target index has items in logical order as reported by an insertion > + * scankey (the insertion scankey sort-wise NULL semantics are needed for > + * verification). I'd appreciate some rephrasing of this paragraph. "includes the invariant that" isn't that easy to understand for an ESL person. > + * > + * Copyright (c) 2016, PostgreSQL Global Development Group ... ;) > +#define CHECK_SUPERUSER() { \ > + if (!superuser()) \ > + ereport(ERROR, \ > + > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ > + (errmsg("must be superuser to use > verification functions"; } Please make this either a function or just copy - the macro doesn't buy us anything except weirder looking syntax. > +/* > + * Callers to verification functions should never receive a false positive > + * indication of corruption. Therefore, when using amcheck functions for > + * stress testing, it may be useful to temporally change the CORRUPTION > elevel > + * to PANIC, to immediately halt the server in the event of detecting an > + * invariant condition violation. This may preserve more information about > the > + * nature of the underlying problem. Note that modifying the CORRUPTION > + * constant to be an elevel < ERROR is not well tested. > + */ > +#ifdef NOT_USED > +#define CORRUPTION PANIC > +#define CONCERN WARNING > +#define POSITION NOTICE > +#else > +#define CORRUPTION ERROR > +#define CONCERN DEBUG1 > +#define POSITION DEBUG2 > +#endif Please remove this block and use the normal elevels. > +/* > + * State associated with verifying a B-Tree index > + */ > +typedef struct BtreeCheckState > +{ > + /* > + * Unchanging state, established at start of verification: > + * > + * rel: B-Tree Index Relation > + * exclusivelock: ExclusiveLock held on rel; else AccessShareLock > + * checkstrategy: Buffer access strategy > + * targetcontext: Target page memory context > + */ > + Relationrel; > + boolexclusivelock; > + BufferAccessStrategycheckstrategy; > + MemoryContext targetcontext; Can you move the field comments to the individual fields? Separating them makes it more likely to get out of date (obviously keep the header). I'd also rename 'exclusivelock' to 'exclusivelylocked' or such. > + /* > + * Mutable state, for verification of particular page: > + * > + * target: Main target page > + * targetblock: Main target page's block number > + * targetlsn: Main target page's LSN Same here. > +PG_FUNCTION_INFO_V1(bt_index_check); > +PG_FUNCTION_INFO_V1(bt_index_parent_check); I think it'd be good to avoid adding a lot of functions for each additional type of check we can think of. I think adding a named 'bool check_parents = false' argument or such would be better. Then we can simply add more and more checks subsequently, and we can have 'bool all_checks = false' so people can trivially check everything, instead of having to add new checks in each new function. > +/* > + * bt_index_check(index regclass) > + * > + * Verify integrity of B-Tree index. > + * > + * Only acquires AccessShareLock on index relation. Does not consider > + * invariants that exist between parent/child pages. > + */ > +Datum
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
Hi Heikki, On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote: > I attach V4 of amcheck. Is there any chance you could look at the concurrency related parts of amcheck? I'd like to push this to be mergeable, but that area is a bit outside of my area of expertise... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Mon, Dec 5, 2016 at 2:15 PM, Haribabu Kommi wrote: > Moved to next CF with " needs review" status. Same, to CF 2017-03 this time. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Wed, Nov 30, 2016 at 9:06 AM, Peter Geoghegan wrote: > On Wed, Nov 23, 2016 at 2:47 PM, Thomas Munro > wrote: > > Actually I meant that at T2 the btree is exactly same as it was at T1, > > but the order of the values has changed according to the comparator > > (for example, because a collation definition changed when you updated > > your operating system), so that the btree is now corrupted. Based on > > Goetz Graefe's paper I was wondering if amcheck would be unable to > > detect this due to the cousin problem. > > I love that paper (it's more like a book, really). I'm glad that at > least one other person in the community has read some of it, because I > think it has a lot of clues as to how we can begin to address some of > our problems around bloat without novel redesign. Most of the > techniques are quite complementary. > > I wrote a prototype of suffix truncation for text in two days last > week. The regression tests pass, and the technique works well for > certain cases, particularly with many column indexes without > correlation (the index-only-scan case). Let's see if that becomes a > real patch. > > > Thank you for your patient explanation! > > Your questions were excellent, and so deserved my careful consideration. > > > Please see my humble attempt > > to test this scenario and learn something about btrees in the process, > > attached. amcheck does detect the violation of the high key > > invariant. > > This is a nice pattern for testing out if the keyspace is sane, and > how things work. I might incorporate it into my own testing in the > future. Moved to next CF with " needs review" status. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Wed, Nov 23, 2016 at 2:47 PM, Thomas Munro wrote: > Actually I meant that at T2 the btree is exactly same as it was at T1, > but the order of the values has changed according to the comparator > (for example, because a collation definition changed when you updated > your operating system), so that the btree is now corrupted. Based on > Goetz Graefe's paper I was wondering if amcheck would be unable to > detect this due to the cousin problem. I love that paper (it's more like a book, really). I'm glad that at least one other person in the community has read some of it, because I think it has a lot of clues as to how we can begin to address some of our problems around bloat without novel redesign. Most of the techniques are quite complementary. I wrote a prototype of suffix truncation for text in two days last week. The regression tests pass, and the technique works well for certain cases, particularly with many column indexes without correlation (the index-only-scan case). Let's see if that becomes a real patch. > Thank you for your patient explanation! Your questions were excellent, and so deserved my careful consideration. > Please see my humble attempt > to test this scenario and learn something about btrees in the process, > attached. amcheck does detect the violation of the high key > invariant. This is a nice pattern for testing out if the keyspace is sane, and how things work. I might incorporate it into my own testing in the future. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Nov 18, 2016 at 12:17 PM, Peter Geoghegan wrote: > So, the Btree you show is supposed to be good? You've left it up to me > to draw T2, the corrupt version? If so, here is a drawing of T2 for > the sake of clarity: > > Btree =[a|b] >/ \ > / \ > / \ >[a]---[b] > | | > | | >[c]---[d] Actually I meant that at T2 the btree is exactly same as it was at T1, but the order of the values has changed according to the comparator (for example, because a collation definition changed when you updated your operating system), so that the btree is now corrupted. Based on Goetz Graefe's paper I was wondering if amcheck would be unable to detect this due to the cousin problem. >> Now c and b have been switched around. Won't amcheck still pass since >> we only verify immediate downlinks and siblings? Yet searches for b >> will take a wrong turn at the root. Do I have this right? I wonder >> if there is a way to verify that each page's high key < the 'next' key >> for all ancestors. > > I don't think you have it right, because your argument is predicated > on classic L&Y, not the Postgres variant. The high key is literally > guaranteed to be *equal* to the initial first value of the right half > of a split page post-split (at least initially), which is where the > new downlink for the new right half comes from in turn, so it will be > exactly equal as well. There is a comment which says all this within > _bt_insert_parent(), actually. > > In general, I think it's pretty bad that we're so loose about > representing the key space in downlinks compared to classic L&Y, > because it probably makes index bloat a lot worse in internal pages, > messing up the key space long term. As long as we're doing that, > though, we clearly cannot usefully "verify that each page's high key < > the 'next' key for all ancestors", because we deviate from L&Y in a > way that makes that check always fail. It's okay that it always fails > for us (it doesn't actually indicate corruption), because of our > duplicate-aware index scan logic (change to invariant). > > In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the > cousin problem a non-issue, because we don't follow downlinks that are > equal to our scankey -- we go to their immediate left and start > *there*, in order to support duplicates in leaf pages. Index scans for > 'b' work for both T1 and T2 in Postgres, because we refuse to follow > the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I > would actually like to change things to make the invariant the classic > L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make > suffix truncation in internal pages work. > > So, we don't have the cousin problem, but since I wish for us to adopt > the stricter classic L&Y invariant at some point, you could say that I > also wished that we had the cousin problem. Thank you for your patient explanation! Please see my humble attempt to test this scenario and learn something about btrees in the process, attached. amcheck does detect the violation of the high key invariant. > Confused yet? :-) Yes. But it's getting better slowly :-) -- Thomas Munro http://www.enterprisedb.com create extension if not exists amcheck; create extension if not exists pageinspect; drop table if exists my_table; drop type if exists my_type; -- define an enum ordered a, b, c, d, but do it in a funky way that results in -- odd OIDs that we can later mess around with create type my_type as enum ('d'); alter type my_type add value 'c' before 'd'; alter type my_type add value 'b' before 'c'; alter type my_type add value 'a' before 'b'; create table my_table (my_column my_type, counter serial); create index my_index on my_table(my_column); create or replace function pretty_print_btree_recurse(pageno int, height int, level int) returns void language plpgsql as $$ declare v_spaces text := rpad('', (height - level) * 2, ' '); v_ctid text; v_subpageno int; v_key_hex text; v_key_oid oid; v_key_label text; v_page_description text; v_item int; begin if height = level then v_page_description := 'root'; elsif level = 0 then v_page_description := 'leaf'; else v_page_description := 'intl'; end if; for v_ctid, v_key_hex, v_item in select ctid::text, data, itemoffset from bt_page_items('my_index', pageno) loop if v_ctid like '(4294967295,%' then continue; end if; v_subpageno := (v_ctid::point)[0]; v_key_oid := ('x' || substring(v_key_hex, 1, 2))::bit(8)::int + (('x' || substring(v_key_hex, 4, 2))::bit(8)::int << 8); v_key_label := coalesce((select enumlabel from pg_enum where oid = v_key_
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Sun, Nov 20, 2016 at 3:42 PM, Robert Haas wrote: > OK. If it's not reasonable to continue checking after an ERROR, then > I think ERROR is the way to go. If somebody really doesn't like that > lack of flexibility (in either direction), they can propose a change > later for separate consideration. That limitation is not, in my view, > a sufficient reason to hold up the patch on the table. I attach V4 of amcheck. I decided to leave things as they are, as far as forcing a different elevel goes (you still need to modify a macro). I didn't change the elevel macro from CONCERN in a couple of cases, as Andres suggested, because in fact that's generally the same as DEBUG1, not WARNING (I agreed with Andres that it was WARNING before, but it isn't unless you #define NOT_USED). Andres said: "Theoretically we should recheck that the oids still match, theoretically the locking here allows the index->table mapping to change". I don't know how to act on this feedback, since comparable index + heap locking code should have the same problem, but doesn't consider it. How is this any different to reindex_index()? Other than that, all feedback items were worked through. I made the functions PARALLEL SAFE, too, since I noticed that that wasn't the case in passing. -- Peter Geoghegan From 21c843ae0193ed17b2e5234d67f2e73f7015b3cd Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Tue, 10 Jun 2014 22:20:40 -0700 Subject: [PATCH] Add amcheck extension to contrib The new extension makes available two SQL-callable functions, each of which accept a single argument (a regclass/nbtree index relation argument) and return void. The SQL-callable function bt_index_check() checks invariants of the target index by walking the tree, performing various checks of the sanity of the page space using insertion scankeys to compare items. The SQL-callable function bt_index_parent_check() performs a superset of the checks performed by bt_index_check(): the same checks, as well as another check verifying that each downlink is actually respected as a lower bound on its child page. bt_index_check() requires only an AccessShareLock on the target. bt_index_parent_check() requires an ExclusiveLock on the target, and ShareLock on its heap relation. --- contrib/Makefile |1 + contrib/amcheck/Makefile | 19 + contrib/amcheck/amcheck--1.0.sql | 20 + contrib/amcheck/amcheck.control |5 + contrib/amcheck/verify_nbtree.c | 1227 ++ doc/src/sgml/amcheck.sgml| 275 + doc/src/sgml/contrib.sgml|1 + doc/src/sgml/filelist.sgml |1 + 8 files changed, 1549 insertions(+) create mode 100644 contrib/amcheck/Makefile create mode 100644 contrib/amcheck/amcheck--1.0.sql create mode 100644 contrib/amcheck/amcheck.control create mode 100644 contrib/amcheck/verify_nbtree.c create mode 100644 doc/src/sgml/amcheck.sgml diff --git a/contrib/Makefile b/contrib/Makefile index 25263c0..4acd50e 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global SUBDIRS = \ adminpack \ + amcheck \ auth_delay \ auto_explain \ bloom \ diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile new file mode 100644 index 000..f14c7aa --- /dev/null +++ b/contrib/amcheck/Makefile @@ -0,0 +1,19 @@ +# contrib/amcheck/Makefile + +MODULE_big = amcheck +OBJS = verify_nbtree.o $(WIN32RES) + +EXTENSION = amcheck +DATA = amcheck--1.0.sql +PGFILEDESC = "amcheck - functions to verify access method invariants" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/amcheck +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql new file mode 100644 index 000..b415f0e --- /dev/null +++ b/contrib/amcheck/amcheck--1.0.sql @@ -0,0 +1,20 @@ +/* contrib/amcheck/amcheck--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit + +-- +-- bt_index_check() +-- +CREATE FUNCTION bt_index_check(index regclass) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_check' +LANGUAGE C STRICT PARALLEL SAFE; + +-- +-- bt_index_parent_check() +-- +CREATE FUNCTION bt_index_parent_check(index regclass) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_parent_check' +LANGUAGE C STRICT PARALLEL SAFE; diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control new file mode 100644 index 000..180f150 --- /dev/null +++ b/contrib/amcheck/amcheck.control @@ -0,0 +1,5 @@ +# amcheck extension +comment = 'verify access method invariants' +default_version = '1.0' +module_pathname = '$libdir/amcheck' +relocatable = true diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c new file
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Sat, Nov 19, 2016 at 11:38 PM, Peter Geoghegan wrote: > On Sat, Nov 19, 2016 at 6:45 PM, Robert Haas wrote: >>> What do you think about new argument with default vs. GUC? I guess >>> that the GUC might be a lot less of a foot-gun. We might even give it >>> a suitably scary name, to indicate that it will make the server PANIC. >>> (I gather that you don't care about other aspects of verbosity -- just >>> about the ability to make amcheck PANIC in the event of an invariant >>> violation without recompiling it.) >> >> Yikes. I don't think I want to expose any kind of API that lets the >> user PANIC the server. A value < ERROR sounds far more reasonable >> than a value > ERROR. > > In general, I don't want to get into the business of reasoning about > how well we can limp along when there is a would-be error condition > within amcheck. Once "the impossible" has actually occurred, it's very > difficult to reason about what still works. Also, I actually agree > that making it possible for the tool to force a PANIC through a > user-visible interface is a bad idea. > > Maybe we should just leave it as it is -- experts can recompile the > tool after modifying it to use an elevel that is != ERROR (the thing I > mention about elevel < ERROR is already documented in code comments). > If that breaks, they get to keep both halves. OK. If it's not reasonable to continue checking after an ERROR, then I think ERROR is the way to go. If somebody really doesn't like that lack of flexibility (in either direction), they can propose a change later for separate consideration. That limitation is not, in my view, a sufficient reason to hold up the patch on the table. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Sat, Nov 19, 2016 at 6:45 PM, Robert Haas wrote: >> What do you think about new argument with default vs. GUC? I guess >> that the GUC might be a lot less of a foot-gun. We might even give it >> a suitably scary name, to indicate that it will make the server PANIC. >> (I gather that you don't care about other aspects of verbosity -- just >> about the ability to make amcheck PANIC in the event of an invariant >> violation without recompiling it.) > > Yikes. I don't think I want to expose any kind of API that lets the > user PANIC the server. A value < ERROR sounds far more reasonable > than a value > ERROR. In general, I don't want to get into the business of reasoning about how well we can limp along when there is a would-be error condition within amcheck. Once "the impossible" has actually occurred, it's very difficult to reason about what still works. Also, I actually agree that making it possible for the tool to force a PANIC through a user-visible interface is a bad idea. Maybe we should just leave it as it is -- experts can recompile the tool after modifying it to use an elevel that is != ERROR (the thing I mention about elevel < ERROR is already documented in code comments). If that breaks, they get to keep both halves. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Nov 18, 2016 at 6:51 PM, Peter Geoghegan wrote: > On Thu, Nov 17, 2016 at 12:04 PM, Peter Geoghegan wrote: >>> Hm, if we want that - and it doesn't seem like a bad idea - I think we >>> should be make it available without recompiling. >> >> I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That >> might be broken if it was allowed. > > What do you think about new argument with default vs. GUC? I guess > that the GUC might be a lot less of a foot-gun. We might even give it > a suitably scary name, to indicate that it will make the server PANIC. > (I gather that you don't care about other aspects of verbosity -- just > about the ability to make amcheck PANIC in the event of an invariant > violation without recompiling it.) Yikes. I don't think I want to expose any kind of API that lets the user PANIC the server. A value < ERROR sounds far more reasonable than a value > ERROR. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Nov 17, 2016 at 12:04 PM, Peter Geoghegan wrote: >> Hm, if we want that - and it doesn't seem like a bad idea - I think we >> should be make it available without recompiling. > > I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That > might be broken if it was allowed. What do you think about new argument with default vs. GUC? I guess that the GUC might be a lot less of a foot-gun. We might even give it a suitably scary name, to indicate that it will make the server PANIC. (I gather that you don't care about other aspects of verbosity -- just about the ability to make amcheck PANIC in the event of an invariant violation without recompiling it.) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Wed, Nov 16, 2016 at 5:06 PM, Thomas Munro wrote: > + * Ideally, we'd compare every item in the index against every other > + * item in the index, and not trust opclass obedience of the transitive > + * law to bridge the gap between children and their grandparents (as > + * well as great-grandparents, and so on). We don't go to those > + * lengths because that would be prohibitively expensive, and probably > + * not markedly more effective in practice. > + */ > > I skimmed the Modern B-Tree Techniques paper recently, and there was a > section on "the cousin problem" when verifying btrees, which this > comment reminded me of. I'm impressed that you made the connection. This comment reminded you of that section because it was directly influenced by it. I actually use the term "cousin page" elsewhere, because it's really useful to distinguish cousins from "true siblings" when talking about page deletion in particular...and I have a lot to say about page deletion. > I tried to come up with an example of a pair > of characters being switched in a collation that would introduce > undetectable corruption: > > T1. Order = a < b < c < d > > Btree =[a|c] > / \ > / \ > / \ > [a]---[c] >| | >| | > [b]---[d] > > T2. Order = a < c < b < d So, the Btree you show is supposed to be good? You've left it up to me to draw T2, the corrupt version? If so, here is a drawing of T2 for the sake of clarity: Btree =[a|b] / \ / \ / \ [a]---[b] | | | | [c]---[d] > Now c and b have been switched around. Won't amcheck still pass since > we only verify immediate downlinks and siblings? Yet searches for b > will take a wrong turn at the root. Do I have this right? I wonder > if there is a way to verify that each page's high key < the 'next' key > for all ancestors. I don't think you have it right, because your argument is predicated on classic L&Y, not the Postgres variant. The high key is literally guaranteed to be *equal* to the initial first value of the right half of a split page post-split (at least initially), which is where the new downlink for the new right half comes from in turn, so it will be exactly equal as well. There is a comment which says all this within _bt_insert_parent(), actually. In general, I think it's pretty bad that we're so loose about representing the key space in downlinks compared to classic L&Y, because it probably makes index bloat a lot worse in internal pages, messing up the key space long term. As long as we're doing that, though, we clearly cannot usefully "verify that each page's high key < the 'next' key for all ancestors", because we deviate from L&Y in a way that makes that check always fail. It's okay that it always fails for us (it doesn't actually indicate corruption), because of our duplicate-aware index scan logic (change to invariant). In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the cousin problem a non-issue, because we don't follow downlinks that are equal to our scankey -- we go to their immediate left and start *there*, in order to support duplicates in leaf pages. Index scans for 'b' work for both T1 and T2 in Postgres, because we refuse to follow the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I would actually like to change things to make the invariant the classic L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make suffix truncation in internal pages work. So, we don't have the cousin problem, but since I wish for us to adopt the stricter classic L&Y invariant at some point, you could say that I also wished that we had the cousin problem. Confused yet? :-) [1] https://archive.org/stream/symmetricconcurr00lani#page/6/mode/2up -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Wed, Nov 16, 2016 at 1:03 PM, Andres Freund wrote: > I think the patch could use a pgindent run. Okay. > I'd really want a function that runs all check on a table. Why not just use a custom SQL query? The docs have an example of one such query, that verifies all user indexes, but there could be another example. >> +#define CHECK_SUPERUSER() { \ >> + if (!superuser()) \ >> + ereport(ERROR, \ >> + >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ >> + (errmsg("must be superuser to use >> verification functions"; } > > Why is this a macro? This was just copied from pageinspect, TBH. What would you prefer? >> +#ifdef NOT_USED >> +#define CORRUPTION PANIC >> +#define CONCERN WARNING >> +#define POSITION NOTICE >> +#else >> +#define CORRUPTION ERROR >> +#define CONCERN DEBUG1 >> +#define POSITION DEBUG2 >> +#endif > > Hm, if we want that - and it doesn't seem like a bad idea - I think we > should be make it available without recompiling. I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That might be broken if it was allowed. > Hm, I think it'd be, for the long term, better if we'd move the btree > check code to amcheck_nbtree.c or such. Agreed. >> +Datum >> +bt_index_parent_check(PG_FUNCTION_ARGS) >> +{ >> + Oid indrelid = PG_GETARG_OID(0); > Theoretically we should recheck that the oids still match, theoretically > the locking here allows the index->table mapping to change. It's not a > huge window tho... >> + /* Check for active uses of the index in the current transaction */ >> + CheckTableNotInUse(indrel, "bt_index_parent_check"); > > Why do we actually care? This was left-over from duplicating REINDEX code. Indeed, we have no reason to care at all. >> + if (!rel->rd_index->indisready) >> + ereport(ERROR, >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), >> + errmsg("cannot check index \"%s\"", >> + RelationGetRelationName(rel)), >> + errdetail("Index is not yet ready for >> insertions"))); > > Why can't we check this? Seems valuable to allow checking after a > partial CONCURRENTLY index build? Okay. > Why don't you use IndexIsReady et al? Not sure if the patch compiles > against an earlier release (there weren't valid/ready/live dead fields, > but instead it was compressed into fewer)? This version is almost identical to the Github version of amcheck that we've been using with 9.4+. Agreed on IndexIsReady(). >> +/* >> + * bt_check_every_level() >> + * >> + * Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in >> + * logical order, verifying invariants as it goes. >> + * >> + * It is the caller's responsibility to acquire appropriate heavyweight >> lock on >> + * the index relation, and advise us if extra checks are safe when an >> + * ExclusiveLock is held. An ExclusiveLock is generally assumed to prevent >> any >> + * kind of physical modification to the index structure, including >> + * modifications that VACUUM may make. >> + */ > > Hm, what kind of locking does the kill_prior_tuple stuff use? The killing itself, or the setting of LP_DEAD bit? Ordinary index scans could concurrently set LP_DEAD with only an AccessShareLock. But, that's just metadata that amcheck doesn't care about. Anything that needs to actually recyle space based on finding an LP_DEAD bit set is a write operation, that is precluded from running when bt_index_parent_check() is called by its ExclusiveLock. _bt_vacuum_one_page() is only called by one nbtinsert.c code path, when it looks like a page split might be needed. >> +static void >> +bt_check_every_level(Relation rel, bool exclusivelock) >> +{ > >> + /* >> + * Certain deletion patterns can result in "skinny" B-Tree indexes, >> where >> + * the fast root and true root differ. >> + * >> + * Start from the true root, not the fast root, unlike conventional >> index >> + * scans. This approach is more thorough, and removes the risk of >> + * following a stale fast root from the meta page. >> + */ >> + if (metad->btm_fastroot != metad->btm_root) >> + ereport(CONCERN, >> + (errcode(ERRCODE_DUPLICATE_OBJECT), >> + errmsg("fast root mismatch in index %s", >> + RelationGetRelationName(rel)), >> + errdetail_internal("Fast block %u (level %u) " >> + >> "differs from true root " >> + "block >> %u (level %u).", >> +
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Sep 8, 2016 at 6:44 AM, Peter Geoghegan wrote: > On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner wrote: >> IMV the process is to post a patch to this list to certify that it >> is yours to contribute and free of IP encumbrances that would >> prevent us from using it. I will wait for that post. > > I attach my V3 + * Ideally, we'd compare every item in the index against every other + * item in the index, and not trust opclass obedience of the transitive + * law to bridge the gap between children and their grandparents (as + * well as great-grandparents, and so on). We don't go to those + * lengths because that would be prohibitively expensive, and probably + * not markedly more effective in practice. + */ I skimmed the Modern B-Tree Techniques paper recently, and there was a section on "the cousin problem" when verifying btrees, which this comment reminded me of. I tried to come up with an example of a pair of characters being switched in a collation that would introduce undetectable corruption: T1. Order = a < b < c < d Btree =[a|c] / \ / \ / \ [a]---[c] | | | | [b]---[d] T2. Order = a < c < b < d Now c and b have been switched around. Won't amcheck still pass since we only verify immediate downlinks and siblings? Yet searches for b will take a wrong turn at the root. Do I have this right? I wonder if there is a way to verify that each page's high key < the 'next' key for all ancestors. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
Hi, I think the patch could use a pgindent run. On 2016-09-07 11:44:01 -0700, Peter Geoghegan wrote: > diff --git a/contrib/amcheck/amcheck--1.0.sql > b/contrib/amcheck/amcheck--1.0.sql > new file mode 100644 > index 000..ebbd6ac > --- /dev/null > +++ b/contrib/amcheck/amcheck--1.0.sql > @@ -0,0 +1,20 @@ > +/* contrib/amcheck/amcheck--1.0.sql */ > + > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit > + > +-- > +-- bt_index_check() > +-- > +CREATE FUNCTION bt_index_check(index regclass) > +RETURNS VOID > +AS 'MODULE_PATHNAME', 'bt_index_check' > +LANGUAGE C STRICT; > + > +-- > +-- bt_index_parent_check() > +-- > +CREATE FUNCTION bt_index_parent_check(index regclass) > +RETURNS VOID > +AS 'MODULE_PATHNAME', 'bt_index_parent_check' > +LANGUAGE C STRICT; I'd really want a function that runs all check on a table. > +#define CHECK_SUPERUSER() { \ > + if (!superuser()) \ > + ereport(ERROR, \ > + > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ > + (errmsg("must be superuser to use > verification functions"; } Why is this a macro? > +/* > + * Callers to verification functions should never receive a false positive > + * indication of corruption. Therefore, when using amcheck functions for > + * stress testing, it may be useful to temporally change the CORRUPTION > elevel > + * to PANIC, to immediately halt the server in the event of detecting an > + * invariant condition violation. This may preserve more information about > the > + * nature of the underlying problem. Note that modifying the CORRUPTION > + * constant to be an elevel < ERROR is not well tested. > + */ > +#ifdef NOT_USED > +#define CORRUPTION PANIC > +#define CONCERN WARNING > +#define POSITION NOTICE > +#else > +#define CORRUPTION ERROR > +#define CONCERN DEBUG1 > +#define POSITION DEBUG2 > +#endif Hm, if we want that - and it doesn't seem like a bad idea - I think we should be make it available without recompiling. > +/* > + * A B-Tree cannot possibly have this many levels, since there must be one > + * block per level, which is bound by the range of BlockNumber: > + */ > +#define InvalidBtreeLevel((uint32) InvalidBlockNumber) Hm, I think it'd be, for the long term, better if we'd move the btree check code to amcheck_nbtree.c or such. > +Datum > +bt_index_parent_check(PG_FUNCTION_ARGS) > +{ > + Oid indrelid = PG_GETARG_OID(0); > + Oid heapid; > + Relationindrel; > + Relationheaprel; > + > + CHECK_SUPERUSER(); > + > + /* > + * We must lock table before index to avoid deadlocks. However, if the > + * passed indrelid isn't an index then IndexGetRelation() will fail. > + * Rather than emitting a not-very-helpful error message, postpone > + * complaining, expecting that the is-it-an-index test below will fail. > + */ > + heapid = IndexGetRelation(indrelid, true); > + if (OidIsValid(heapid)) > + heaprel = heap_open(heapid, ShareLock); > + else > + heaprel = NULL; > + > + /* > + * Open the target index relations separately (like relation_openrv(), > but > + * with heap relation locked first to prevent deadlocking). In hot > standby > + * mode this will raise an error. > + */ > + indrel = index_open(indrelid, ExclusiveLock); Theoretically we should recheck that the oids still match, theoretically the locking here allows the index->table mapping to change. It's not a huge window tho... > + /* Check for active uses of the index in the current transaction */ > + CheckTableNotInUse(indrel, "bt_index_parent_check"); Why do we actually care? > +static void > +btree_index_checkable(Relation rel) > +{ > + if (rel->rd_rel->relkind != RELKIND_INDEX || > + rel->rd_rel->relam != BTREE_AM_OID) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("only nbtree access method indexes are > supported"), > + errdetail("Relation \"%s\" is not a B-Tree > index.", > + > RelationGetRelationName(rel; > + > + if (RELATION_IS_OTHER_TEMP(rel)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot access temporary tables of > other sessions"), > + errdetail("Index \"%s\" is associated with > temporary relation.", > + > RelationGetRelationName(rel; > + > + if (!rel->rd_index->indisready) > + ereport(ERROR, >
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Mon, Oct 17, 2016 at 10:46 AM, Noah Misch wrote: > - Detect impossible conditions in the hint bits. A tuple should not have both > HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID. Every tuple bearing > HEAP_ONLY_TUPLE should bear HEAP_UPDATED. HEAP_HASVARWIDTH should be true > if and only if the tuple has a non-NULL value in a negative-typlen column, > possibly a dropped column. A tuple should not have both HEAP_KEYS_UPDATED > and HEAP_XMAX_LOCK_ONLY. > > - Report evidence of upgrades from older versions. If the tool sees > HEAP_MOVED_IN or HEAP_MOVED_OFF, it can report that the cluster was > binary-upgraded from 8.3 or 8.4. If the user did not upgrade from such a > version, the user should assume corruption. > > - Check VARSIZE() of each variable-length datum. Corrupt lengths might direct > you to seek past the end of the tuple, or they might imply excess free space > at the end of the tuple. > > - Verify agreement between CLOG, MULTIXACT, and hint bits. If the hint bits > include HEAP_XMAX_LOCK_ONLY, the multixact should not contain a > MultiXactStatusUpdate member. README.tuplock documents other invariants. > If the tool sees a tuple passing HEAP_LOCKED_UPGRADED, it can report that > the cluster was binary-upgraded from a version in [8.3, 9.1]. > > - Verify that TOAST pointers (in non-dropped columns of visible tuples) point > to valid data in the TOAST relation. This is much more expensive than the > other checks I've named, so it should be optional. There is a lot of value in that actually. I got bitten lately by an issue where this got corrupted because of an incorrect failover flow. > - If VM_ALL_VISIBLE() or VM_ALL_FROZEN() passes for a particular page, verify > that the visibility data stored in the page is compatible with that claim. > > - Examine PageHeaderData values. If pd_checksum is non-zero in a cluster with > checksums disabled, the cluster was binary-upgraded from [8.3, 9.2]. Agreed basically on all those items, though the first version does not need to do everything :) Honestly just btree checks have enough value to validate the integration of this module. You can get that as "I'll look at your patch and provide a review soon because I have a hell lot of cases where this will be useful". -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Oct 14, 2016 at 04:56:39PM -0700, Peter Geoghegan wrote: > On Mon, Feb 29, 2016 at 4:09 PM, Peter Geoghegan wrote: > > To recap, the extension adds some SQL-callable functions that verify > > certain invariant conditions hold within some particular B-Tree index. > > These are the conditions that index scans rely on always being true. > > The tool's scope may eventually cover other AMs, including heapam, but > > nbtree seems like the best place to start. > > Noah and I discussed possible future directions for amcheck in person > recently. I would like to get Noah's thoughts again here on how a tool > like amcheck might reasonably target other access methods for > verification. In particular, the heapam. MultiXacts were mentioned as > a structure that could receive verification in a future iteration of > this tool, but I lack expertise there. Yes, a heap checker could examine plenty of things. Incomplete list: - Detect impossible conditions in the hint bits. A tuple should not have both HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID. Every tuple bearing HEAP_ONLY_TUPLE should bear HEAP_UPDATED. HEAP_HASVARWIDTH should be true if and only if the tuple has a non-NULL value in a negative-typlen column, possibly a dropped column. A tuple should not have both HEAP_KEYS_UPDATED and HEAP_XMAX_LOCK_ONLY. - Report evidence of upgrades from older versions. If the tool sees HEAP_MOVED_IN or HEAP_MOVED_OFF, it can report that the cluster was binary-upgraded from 8.3 or 8.4. If the user did not upgrade from such a version, the user should assume corruption. - Check VARSIZE() of each variable-length datum. Corrupt lengths might direct you to seek past the end of the tuple, or they might imply excess free space at the end of the tuple. - Verify agreement between CLOG, MULTIXACT, and hint bits. If the hint bits include HEAP_XMAX_LOCK_ONLY, the multixact should not contain a MultiXactStatusUpdate member. README.tuplock documents other invariants. If the tool sees a tuple passing HEAP_LOCKED_UPGRADED, it can report that the cluster was binary-upgraded from a version in [8.3, 9.1]. - Verify that TOAST pointers (in non-dropped columns of visible tuples) point to valid data in the TOAST relation. This is much more expensive than the other checks I've named, so it should be optional. - If VM_ALL_VISIBLE() or VM_ALL_FROZEN() passes for a particular page, verify that the visibility data stored in the page is compatible with that claim. - Examine PageHeaderData values. If pd_checksum is non-zero in a cluster with checksums disabled, the cluster was binary-upgraded from [8.3, 9.2]. > I've placed a lot of emphasis on the importance of having a > low-overhead verification process, particularly in terms of the > strength of heavyweight lock that the verification process requires. > Ideally, it would be possible to run any new verification process in a > fairly indiscriminate way with only limited impact on live production > systems. I suspect you could keep heap checker overhead similar to the cost of "SELECT count(*) FROM table_name". -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Mon, Feb 29, 2016 at 4:09 PM, Peter Geoghegan wrote: > To recap, the extension adds some SQL-callable functions that verify > certain invariant conditions hold within some particular B-Tree index. > These are the conditions that index scans rely on always being true. > The tool's scope may eventually cover other AMs, including heapam, but > nbtree seems like the best place to start. Noah and I discussed possible future directions for amcheck in person recently. I would like to get Noah's thoughts again here on how a tool like amcheck might reasonably target other access methods for verification. In particular, the heapam. MultiXacts were mentioned as a structure that could receive verification in a future iteration of this tool, but I lack expertise there. I've placed a lot of emphasis on the importance of having a low-overhead verification process, particularly in terms of the strength of heavyweight lock that the verification process requires. Ideally, it would be possible to run any new verification process in a fairly indiscriminate way with only limited impact on live production systems. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Oct 7, 2016 at 8:05 AM, Peter Geoghegan wrote: > Your customer > databases might feature far more use of Japanese collations, for > example, which might be an important factor. Not really :) -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Sun, Oct 2, 2016 at 6:48 PM, Michael Paquier wrote: > Okay, moved to next CF. I may look at it finally I got some use-cases > for it, similar to yours I guess.. Let me know how that goes. One thing I've definitely noticed is that the tool is good at finding corner-case bugs, so even if you can only test a small fraction of the number of databases that I've been able to test, there could still be significant value in your performing your own exercise. Your customer databases might feature far more use of Japanese collations, for example, which might be an important factor. (Well, the use of Arabic code points turned out to be an important factor in the question of whether or not en_US.utf8 could have been affected by the Glibc + abbreviated keys bug.) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Sep 8, 2016 at 3:44 AM, Peter Geoghegan wrote: > On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner wrote: >> IMV the process is to post a patch to this list to certify that it >> is yours to contribute and free of IP encumbrances that would >> prevent us from using it. I will wait for that post. > > I attach my V3. There are only very minor code changes here, such as > breaking out a separate elevel constant for DEBUG2 traces that show > information of how the B-Tree is accessed (e.g. current level, block > number). Tiny tweaks to about 3 comments were also performed. These > changes are trivial. > > I've also removed the tests added, since external sort regression test > coverage is in a much better place now. Okay, moved to next CF. I may look at it finally I got some use-cases for it, similar to yours I guess.. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Wed, Sep 7, 2016 at 11:44 AM, Peter Geoghegan wrote: > I attach my V3. I should point out that I'm leaving to go on Vacation this weekend. I'll be away for a week, and will not be answering mail during that period. If anyone has any questions on this for me, it might be preferable to ask them before I leave. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner wrote: > IMV the process is to post a patch to this list to certify that it > is yours to contribute and free of IP encumbrances that would > prevent us from using it. I will wait for that post. I attach my V3. There are only very minor code changes here, such as breaking out a separate elevel constant for DEBUG2 traces that show information of how the B-Tree is accessed (e.g. current level, block number). Tiny tweaks to about 3 comments were also performed. These changes are trivial. I've also removed the tests added, since external sort regression test coverage is in a much better place now. Finally, the documentation was updated, to make it closer to the Github project's documentation. I expanded what was started as the original amcheck sgml documentation based on the experience of using amcheck on production databases at Heroku. This isn't a big change, but it's the biggest in V3. The documentation now emphasizes the likelihood of amcheck finding software errors, which is all we found. Jim Gray predicted that fault tolerant systems will tend to have almost all problems arise from operator error and software faults in practice, so perhaps I should have predicted that that would be the general trend. -- Peter Geoghegan From 03046b49c893c7695bbae365c33e3d7b27a1f9a3 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Tue, 10 Jun 2014 22:20:40 -0700 Subject: [PATCH] Add amcheck extension to contrib The new extension makes available two SQL-callable functions, each of which accept a single argument (a regclass/nbtree index relation argument) and return void. The SQL-callable function bt_index_check() checks invariants of the target index by walking the tree, performing various checks of the sanity of the page space using insertion scankeys to compare items. The SQL-callable function bt_index_parent_check() performs a superset of the checks performed by bt_index_check(): the same checks, as well as another check verifying that each downlink is actually respected as a lower bound on its child page. bt_index_check() requires only an AccessShareLock on the target. bt_index_parent_check() requires an ExclusiveLock on the target, and ShareLock on its heap relation. --- contrib/Makefile |1 + contrib/amcheck/Makefile | 19 + contrib/amcheck/amcheck--1.0.sql | 20 + contrib/amcheck/amcheck.c| 1267 ++ contrib/amcheck/amcheck.control |5 + doc/src/sgml/amcheck.sgml| 278 + doc/src/sgml/contrib.sgml|1 + doc/src/sgml/filelist.sgml |1 + 8 files changed, 1592 insertions(+) create mode 100644 contrib/amcheck/Makefile create mode 100644 contrib/amcheck/amcheck--1.0.sql create mode 100644 contrib/amcheck/amcheck.c create mode 100644 contrib/amcheck/amcheck.control create mode 100644 doc/src/sgml/amcheck.sgml diff --git a/contrib/Makefile b/contrib/Makefile index 25263c0..4acd50e 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global SUBDIRS = \ adminpack \ + amcheck \ auth_delay \ auto_explain \ bloom \ diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile new file mode 100644 index 000..7bb29aa --- /dev/null +++ b/contrib/amcheck/Makefile @@ -0,0 +1,19 @@ +# contrib/amcheck/Makefile + +MODULE_big = amcheck +OBJS = amcheck.o $(WIN32RES) + +EXTENSION = amcheck +DATA = amcheck--1.0.sql +PGFILEDESC = "amcheck - functions to verify access method invariants" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/amcheck +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql new file mode 100644 index 000..ebbd6ac --- /dev/null +++ b/contrib/amcheck/amcheck--1.0.sql @@ -0,0 +1,20 @@ +/* contrib/amcheck/amcheck--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit + +-- +-- bt_index_check() +-- +CREATE FUNCTION bt_index_check(index regclass) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_check' +LANGUAGE C STRICT; + +-- +-- bt_index_parent_check() +-- +CREATE FUNCTION bt_index_parent_check(index regclass) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_parent_check' +LANGUAGE C STRICT; diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c new file mode 100644 index 000..5c4d02d --- /dev/null +++ b/contrib/amcheck/amcheck.c @@ -0,0 +1,1267 @@ +/*- + * + * amcheck.c + * Verifies the integrity of access methods based on invariants. + * + * Provides SQL-callable functions for verifying that various logical + * invariants in the structure of index access method
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Sep 2, 2016 at 11:16 AM, Peter Geoghegan wrote: > There are only tiny differences, which in any case you can see in the > commit log on Github. There is no reason why code review needs to > block on this V3, IMV. Also, I don't think that we need to include V2's tests now that there is independent test coverage for external sorts, which is a recent development. I imagine that the V2 test run duration might be problematic on some buildfarm animals. I'll omit that in V3. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Sep 2, 2016 at 1:16 PM, Peter Geoghegan wrote: > On Fri, Sep 2, 2016 at 11:11 AM, Kevin Grittner wrote: >> I changed the CF entry to "Waiting on Author" pending that. (I was >> starting to review the patch for commit, so it's none to early to >> mention that the last posted version is not what I should be >> looking at.) > > There are only tiny differences, which in any case you can see in the > commit log on Github. There is no reason why code review needs to > block on this V3, IMV. IMV the process is to post a patch to this list to certify that it is yours to contribute and free of IP encumbrances that would prevent us from using it. I will wait for that post. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Sep 2, 2016 at 11:11 AM, Kevin Grittner wrote: > I changed the CF entry to "Waiting on Author" pending that. (I was > starting to review the patch for commit, so it's none to early to > mention that the last posted version is not what I should be > looking at.) There are only tiny differences, which in any case you can see in the commit log on Github. There is no reason why code review needs to block on this V3, IMV. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Sep 2, 2016 at 11:31 AM, Peter Geoghegan wrote: > On Thu, Aug 18, 2016 at 5:14 PM, Peter Geoghegan wrote: >> I'd certainly welcome that. There are Debian packages available from >> the Github version of amcheck, which is otherwise practically >> identical to the most recent version of the patch posted here: >> >> https://github.com/petergeoghegan/amcheck >> >> (This version also targets Postgres 9.4+). > > There were some extremely minor changes made here compared to my V2 > patch, which, ISTM, means that a V3 patch is necessary. I'll get to > that next week. I changed the CF entry to "Waiting on Author" pending that. (I was starting to review the patch for commit, so it's none to early to mention that the last posted version is not what I should be looking at.) -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Aug 18, 2016 at 5:14 PM, Peter Geoghegan wrote: > I'd certainly welcome that. There are Debian packages available from > the Github version of amcheck, which is otherwise practically > identical to the most recent version of the patch posted here: > > https://github.com/petergeoghegan/amcheck > > (This version also targets Postgres 9.4+). There were some extremely minor changes made here compared to my V2 patch, which, ISTM, means that a V3 patch is necessary. I'll get to that next week. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Aug 18, 2016 at 5:35 PM, Michael Paquier wrote: > This would be packaged from source in my case, but that's no big deal > :) At least I can see that it is added in the next CF, and that's > marked as ready for committer for a couple of months now... If you consider how the code is written, you'll see that it's very unlikely to destabilize production systems. Every page is copied into local memory once before being operated on, and only one buffer lock/pin is held at once (for long enough to do that copying). If there were bugs in amcheck, it's much more likely that they'd be "false positive" bugs. In any case, I haven't seen any issue with the tool itself yet, having now run the tool on thousands of servers. I think I'll have a lot more information in about a week, when I've had time to work through more data. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Aug 19, 2016 at 9:14 AM, Peter Geoghegan wrote: > On Thu, Aug 18, 2016 at 5:06 PM, Michael Paquier > wrote: >> Cool. I have been honestly wondering about deploying this tool as well >> to allow some of the QE tests to perform live checks of btree indexes >> as we use a bunch of them. > > I'd certainly welcome that. There are Debian packages available from > the Github version of amcheck, which is otherwise practically > identical to the most recent version of the patch posted here: > > https://github.com/petergeoghegan/amcheck This would be packaged from source in my case, but that's no big deal :) At least I can see that it is added in the next CF, and that's marked as ready for committer for a couple of months now... -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Aug 18, 2016 at 5:06 PM, Michael Paquier wrote: > Cool. I have been honestly wondering about deploying this tool as well > to allow some of the QE tests to perform live checks of btree indexes > as we use a bunch of them. I'd certainly welcome that. There are Debian packages available from the Github version of amcheck, which is otherwise practically identical to the most recent version of the patch posted here: https://github.com/petergeoghegan/amcheck (This version also targets Postgres 9.4+). > By the way, I have not looked at the patch, > but this supports just btree, right? Wouldn't btree_check be a better > name, or do you think that the interface you provide is generic enough > that it could be extended in the future for gin, gist, etc.? Yes, the idea of calling it amcheck was support for other AMs could be added later. Personally, I would like to make amcheck verifying the heap through a B-Tree index as a next step. There is also a good case for the tool directly verifying heap relations, without involving any index, but that can come later. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Aug 19, 2016 at 2:40 AM, Peter Geoghegan wrote: > Heroku began a selective roll-out of amcheck yesterday. amcheck > already found a bug in the PostGiS Geography B-Tree opclass: > [...] > I'll go report this to the PostGiS people. Cool. I have been honestly wondering about deploying this tool as well to allow some of the QE tests to perform live checks of btree indexes as we use a bunch of them. By the way, I have not looked at the patch, but this supports just btree, right? Wouldn't btree_check be a better name, or do you think that the interface you provide is generic enough that it could be extended in the future for gin, gist, etc.? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Sat, Mar 12, 2016 at 12:38 PM, Peter Geoghegan wrote: > Only insofar as it helps diagnose the underlying issue, when it is a > more subtle issue. Actually fixing the index is almost certainly a > REINDEX. Once you're into the messy business of diagnosing a > problematic opclass, you have to be an expert, and tweaking amcheck > for your requirements (i.e. rebuilding from source) becomes > reasonable. Part of the reason that the code is so heavily commented > is to make it hackable, because I do not feel optimistic that I can > get an expert-orientated interface right, but I still want to make the > tool as useful as possible to experts. Heroku began a selective roll-out of amcheck yesterday. amcheck already found a bug in the PostGiS Geography B-Tree opclass: https://github.com/postgis/postgis/blob/svn-trunk/postgis/geography_btree.c#L260 The issue is not that PG_RETURN_BOOL() is used here (that's just a problem of style). Rather, the issue is that the Geography opclass support function 1 considers an "empty geometry" equal to all other possible values (values not limited to other empty geometries). This breaks the transitive law, which nbtree requires and relies on. I'll go report this to the PostGiS people. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
29.03.2016 06:13, Peter Geoghegan: On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan wrote: That's right - I have a small number of feedback items to work through. I also determined myself that there could be a very low probability race condition when checking the key space across sibling pages, and will work to address that. I attach a V2 revision. Behavioral changes: * Fixes very low probability race condition when verifying ordering across sibling pages with bt_index_check()/AccessShareLock. Notably, the fix does not involve using the high key from the next page instead of the first item, despite my recent suggestion that it would; I changed my mind. In this revision, bt_right_page_check_scankey() continues to return the first item on the next/right page (as a scankey), just as in the first version. The locking considerations here are more complicated than before, and hopefully correct. I abandoned the high key idea when I realized that a concurrent page split could happen in the right sibling, making the high key generally no more safe in the face of concurrent target page deletion than continuing to use the first data item. Using the first data item from the right page is better than using the high key item from the right page verification-wise anyway, so that's good. I found a new way of locking down and recovering from any race conditions (the low probability bug in V1 for bt_index_check()/AccessShareLock calls to the C function bt_right_page_check_scankey()). I won't go into the gory details here, but I will mention that my new approach is inspired by how backwards scans work already. * Adds DEBUG1 traces that indicate the level and page currently being checked. Advanced users will probably find these traces useful on occasion. It's nice to be able to see how amcheck walks the B-Tree in practice. Non-behavioral changes: * Explains why an ExclusiveLock is needed on the target index by bt_index_parent_check() (and a ShareLock on its heap relation). This new explanation talks about why an ExclusiveLock *is* necessary when checking downlinks against child pages, per Tomas Vondra's request. (Note that nothing changes about child/downlink verification itself). Before now, I focused on why they were not necessary, but Tomas was right to point out that that isn't enough information. * No longer uses the word "comport" in an error message, per Amit Langote. Similarly, minor typos fixed, per Tomas. * Adds a variety of pg_regress-based tests for external sorting. (Via various CREATE INDEX statements involving a variety of datatypes. Data is generated pseudo-randomly.) Notably, the regression tests *never* test external sorting today. ISTM that amcheck is a convenient, reliable, and portable way of testing the sorting of collatable types like text, so this is the best way of testing external sorting. External sorting uses abbreviated keys for runs, but discards them as runs are written out, and so the merging of runs does not use abbreviated keys. Abbreviated comparisons and authoritative comparisons are therefore reliably used within the same sort. One particular goal here is that something like the recent strxfrm() debacle might be caught by this testing on some platforms (if we ever get back to trusting strxfrm() at all, that is). The style of these new tests is unorthodox, because a variety of collations are tested, which varies from system to system; we should discuss all this. The tests can take a minute or two to run on my laptop, but that could vary significantly. And so, whether or not that needs to be targeted by something other than "check" and "installcheck" is a discussion that needs to happen. I imagine a lot of buildfarm animals have slow storage, but on the other hand I don't think hackers run the contrib regression tests so often. I prefer to not follow the example of test_decoding if possible; test_decoding's Makefile only runs tests when wal_level=logical is expected (that's how the relevant Makefile targets are scoped), but it's a complicated special case. Review -- As already noted after the first bullet point, there are some tricky considerations on fixing the race when the user calls bt_index_check(). Reviewers should pay close attention to this; could I be wrong about that being race-safe even now? This seems to me to be by far the best place for reviewers to look for problems in this patch, as it's the only bt_index_check()/AccessShareLock case that compares anything *across* pages. The measures taken to prevent false positives described within bt_target_page_check() require expert review, especially because this is the one case where a faulty rationale may result in under-protection from false positives, and not just harmless over-protection. The rationale for why we can reliably recover from a race described within the C function bt_right_page_check_scankey() is *complicated*. I may have failed to make it as simple as possible despite significant effort. It's
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan wrote: > That's right - I have a small number of feedback items to work > through. I also determined myself that there could be a very low > probability race condition when checking the key space across sibling > pages, and will work to address that. I attach a V2 revision. Behavioral changes: * Fixes very low probability race condition when verifying ordering across sibling pages with bt_index_check()/AccessShareLock. Notably, the fix does not involve using the high key from the next page instead of the first item, despite my recent suggestion that it would; I changed my mind. In this revision, bt_right_page_check_scankey() continues to return the first item on the next/right page (as a scankey), just as in the first version. The locking considerations here are more complicated than before, and hopefully correct. I abandoned the high key idea when I realized that a concurrent page split could happen in the right sibling, making the high key generally no more safe in the face of concurrent target page deletion than continuing to use the first data item. Using the first data item from the right page is better than using the high key item from the right page verification-wise anyway, so that's good. I found a new way of locking down and recovering from any race conditions (the low probability bug in V1 for bt_index_check()/AccessShareLock calls to the C function bt_right_page_check_scankey()). I won't go into the gory details here, but I will mention that my new approach is inspired by how backwards scans work already. * Adds DEBUG1 traces that indicate the level and page currently being checked. Advanced users will probably find these traces useful on occasion. It's nice to be able to see how amcheck walks the B-Tree in practice. Non-behavioral changes: * Explains why an ExclusiveLock is needed on the target index by bt_index_parent_check() (and a ShareLock on its heap relation). This new explanation talks about why an ExclusiveLock *is* necessary when checking downlinks against child pages, per Tomas Vondra's request. (Note that nothing changes about child/downlink verification itself). Before now, I focused on why they were not necessary, but Tomas was right to point out that that isn't enough information. * No longer uses the word "comport" in an error message, per Amit Langote. Similarly, minor typos fixed, per Tomas. * Adds a variety of pg_regress-based tests for external sorting. (Via various CREATE INDEX statements involving a variety of datatypes. Data is generated pseudo-randomly.) Notably, the regression tests *never* test external sorting today. ISTM that amcheck is a convenient, reliable, and portable way of testing the sorting of collatable types like text, so this is the best way of testing external sorting. External sorting uses abbreviated keys for runs, but discards them as runs are written out, and so the merging of runs does not use abbreviated keys. Abbreviated comparisons and authoritative comparisons are therefore reliably used within the same sort. One particular goal here is that something like the recent strxfrm() debacle might be caught by this testing on some platforms (if we ever get back to trusting strxfrm() at all, that is). The style of these new tests is unorthodox, because a variety of collations are tested, which varies from system to system; we should discuss all this. The tests can take a minute or two to run on my laptop, but that could vary significantly. And so, whether or not that needs to be targeted by something other than "check" and "installcheck" is a discussion that needs to happen. I imagine a lot of buildfarm animals have slow storage, but on the other hand I don't think hackers run the contrib regression tests so often. I prefer to not follow the example of test_decoding if possible; test_decoding's Makefile only runs tests when wal_level=logical is expected (that's how the relevant Makefile targets are scoped), but it's a complicated special case. Review -- As already noted after the first bullet point, there are some tricky considerations on fixing the race when the user calls bt_index_check(). Reviewers should pay close attention to this; could I be wrong about that being race-safe even now? This seems to me to be by far the best place for reviewers to look for problems in this patch, as it's the only bt_index_check()/AccessShareLock case that compares anything *across* pages. The measures taken to prevent false positives described within bt_target_page_check() require expert review, especially because this is the one case where a faulty rationale may result in under-protection from false positives, and not just harmless over-protection. The rationale for why we can reliably recover from a race described within the C function bt_right_page_check_scankey() is *complicated*. I may have failed to make it as simple as possible despite significant effort. It's hard to describe these things
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan wrote: > That's right - I have a small number of feedback items to work > through. I also determined myself that there could be a very low > probability race condition when checking the key space across sibling > pages, and will work to address that. If I'm right about that then > it's not a lot of work to fix; I'm probably just going to use the > right page's high key rather than its first data item. I also want to use amcheck to test sorting, especially external sorting which is currently totally untested. It would be nice to see testing of strxfrm() on the buildfarm, too -- amcheck provides a nice way to make sure strxfrm() and strcoll() are in agreement for at least those cases that are tested, without having to worry about portability in the same way as a simple pg_regress approach would require. Note that there are reportedly systems in mainstream use where strxfrm() is broken; it doesn't agree with strcoll(). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Tue, Mar 22, 2016 at 9:33 AM, David Steele wrote: > It looks like an updated patch is expected here, though it seems that > the only requests are for updates to comments. That's right - I have a small number of feedback items to work through. I also determined myself that there could be a very low probability race condition when checking the key space across sibling pages, and will work to address that. If I'm right about that then it's not a lot of work to fix; I'm probably just going to use the right page's high key rather than its first data item. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 3/15/16 3:42 AM, Peter Geoghegan wrote: > On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote > wrote: >> Ah, I see the nuance. Thanks for the explanation. Maybe, >> bt_index_check() and bt_index_parent_child_check() / >> bt_index_check_parent_child(). IMHO, the latter more clearly highlights >> the fact that parent/child relationships in the form of down-links are >> checked. > > Well, the downlink is in the parent, because there is no such thing as > an "uplink". So I prefer bt_index_parent_check(), since it usefully > hints at starting from the parent. It's also more concise. > >> By the way, one request (as a non-native speaker of English language, who >> ends up looking up quite a few words regularly) - >> >> Could we use "conform" or "correspond" instead of "comport" in the >> following error message: >> >> "left link/right link pair in index \"%s\" don't comport" > > OK. I'll do something about that. It looks like an updated patch is expected here, though it seems that the only requests are for updates to comments. -- -David da...@pgmasters.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 2016/03/15 16:42, Peter Geoghegan wrote: > On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote > wrote: >> By the way, one request (as a non-native speaker of English language, who >> ends up looking up quite a few words regularly) - >> >> Could we use "conform" or "correspond" instead of "comport" in the >> following error message: >> >> "left link/right link pair in index \"%s\" don't comport" > > OK. I'll do something about that. Thanks a lot. - Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote wrote: > Ah, I see the nuance. Thanks for the explanation. Maybe, > bt_index_check() and bt_index_parent_child_check() / > bt_index_check_parent_child(). IMHO, the latter more clearly highlights > the fact that parent/child relationships in the form of down-links are > checked. Well, the downlink is in the parent, because there is no such thing as an "uplink". So I prefer bt_index_parent_check(), since it usefully hints at starting from the parent. It's also more concise. > By the way, one request (as a non-native speaker of English language, who > ends up looking up quite a few words regularly) - > > Could we use "conform" or "correspond" instead of "comport" in the > following error message: > > "left link/right link pair in index \"%s\" don't comport" OK. I'll do something about that. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
Hi Peter, On 2016/03/15 16:11, Peter Geoghegan wrote: > On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote > wrote: >>> Dunno about that. It's defining characteristic is that it checks child >>> pages against their parent IMV. Things are not often defined in terms >>> of their locking requirements. >> >> At the risk of sounding a bit verbose, do bt_check_level() for a check >> that inspects a level at a time and bt_check_multi_level() for a check >> that spans levels sound descriptive? > > Hmm. But all functions verify multiple levels. What distinguishes > bt_index_parent_check()'s verification is that the downlinks in > internal pages are checked against actual child pages (every item in > the child page, in fact). It's the parent/child relationship that is > verified in addition to the standard checks of every page on and > across (not between) every level. Ah, I see the nuance. Thanks for the explanation. Maybe, bt_index_check() and bt_index_parent_child_check() / bt_index_check_parent_child(). IMHO, the latter more clearly highlights the fact that parent/child relationships in the form of down-links are checked. By the way, one request (as a non-native speaker of English language, who ends up looking up quite a few words regularly) - Could we use "conform" or "correspond" instead of "comport" in the following error message: "left link/right link pair in index \"%s\" don't comport" Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote wrote: >> Dunno about that. It's defining characteristic is that it checks child >> pages against their parent IMV. Things are not often defined in terms >> of their locking requirements. > > At the risk of sounding a bit verbose, do bt_check_level() for a check > that inspects a level at a time and bt_check_multi_level() for a check > that spans levels sound descriptive? Hmm. But all functions verify multiple levels. What distinguishes bt_index_parent_check()'s verification is that the downlinks in internal pages are checked against actual child pages (every item in the child page, in fact). It's the parent/child relationship that is verified in addition to the standard checks of every page on and across (not between) every level. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 2016/03/12 6:31, Peter Geoghegan wrote: > On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra > wrote: >> I've looked at this patch today, mostly to educate myself, so this >> probably should not count as a full review. Anyway, the patch seems in >> excellent shape - it'd be great if all patches (including those written >> by me) had this level of comments/docs. +1 >> Can we come up with names that more clearly identify the difference >> between those two functions? I mean, _parent_ does not make it >> particularly obvious that the second function acquires exclusive lock >> and performs more thorough checks. > > Dunno about that. It's defining characteristic is that it checks child > pages against their parent IMV. Things are not often defined in terms > of their locking requirements. At the risk of sounding a bit verbose, do bt_check_level() for a check that inspects a level at a time and bt_check_multi_level() for a check that spans levels sound descriptive? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova wrote: > But I have some concerns about compatibility with my patches. > I've tried to call bt_index_check() over my "including" patch [1] and caught > a segfault. > > LOG: server process (PID 31794) was terminated by signal 11: Segmentation > fault > DETAIL: Failed process was running: select bt_index_check('idx'); > > I do hope that my patch will be accepted in 9.6, so this conflict looks > really bad. > I think that error is caused by changes in pages layout. To save some space > nonkey attributes are truncated > [1] https://commitfest.postgresql.org/9/433/ I posted a review of your "Covering + unique indexes" patch, where I made an educated guess about what the problem is here (I sort of hinted at what I thought it was already, in this thread, actually). I haven't actually tested this theory of mine myself just yet, but let me know what you think of it on the thread for your patch. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Mon, Mar 14, 2016 at 10:39 AM, Robert Haas wrote: > I don't particularly like that interface. I also suggest that it > would be better to leave throttling to a future commit, and focus on > getting the basic feature in first. Works for me. I don't think throttling is an especially compelling feature. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 7:45 PM, Peter Geoghegan wrote: > On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby wrote: >> Right, but you still have the option to enable them if you don't want to >> swamp your IO system. That's why CIC obeys it too. If I was running a >> consistency check on a production system I'd certainly want the option to >> throttle it. Without that option, I don't see running this on production >> systems as being an option. If that's not a goal then fine, but if it is a >> goal I think it needs to be there. >> >> Isn't it just a few extra lines of code to support it? > > I see your point. > > I'll add that if people like the interface you propose. (Overloading > the VACUUM cost delay functions to cause a delay for amcheck > functions, too). Note that the functions already use an appropriate > buffer access strategy (it avoids blowing shared_buffers, much like > VACUUM itself). I don't particularly like that interface. I also suggest that it would be better to leave throttling to a future commit, and focus on getting the basic feature in first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 3/11/16 6:45 PM, Peter Geoghegan wrote: I'll add that if people like the interface you propose. (Overloading the VACUUM cost delay functions to cause a delay for amcheck functions, too). I thought that had already been overloaded by CIC, but I'm not finding reference to it... ANALYZE does use it though, so the ship has already sorta sailed. I'm actually a bit surprised cost delay isn't used anywhere else. As more background operations are added I suspect users will want it at some point. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Sat, Mar 12, 2016 at 7:46 PM, Matt Kelly wrote: > You can actually pretty easily produce a test case by setting up streaming > replication between servers running two different version of glibc. > > I actually wrote a tool that spins up a pair of VMs using vagrant and then > sets them up as streaming replica's using ansible. It provides a nice one > liner to get a streaming replica test environment going and it will easily > provide the cross glibc test case. Technically, though it belongs to Trip > because I wrote it on company time. Let me see if I can open source a > version of it later this week that way you can use it for testing. That could be interesting. The earlier prototypes of this tool are known to have detected glibc collation incompatibilities in real production systems. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
> > On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote: > > On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote: > > > > > > BTW, if you know a good way to corrupt index (and do it > > > reproducible) I'd be > > > very glad to see it. > > You can use for example dd in non-truncate mode to corrupt on-disk > > page data, say that for example: > > dd if=/dev/random bs=8192 count=1 \ > > seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \ > > conv=notrunc > I guess /dev/random is not very compatible with the "reproducible" > requirement. I mean, it will reproducibly break the page, but pretty > much completely, which is mostly what checksums are for. You can actually pretty easily produce a test case by setting up streaming replication between servers running two different version of glibc. I actually wrote a tool that spins up a pair of VMs using vagrant and then sets them up as streaming replica's using ansible. It provides a nice one liner to get a streaming replica test environment going and it will easily provide the cross glibc test case. Technically, though it belongs to Trip because I wrote it on company time. Let me see if I can open source a version of it later this week that way you can use it for testing. - Matt K.
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 6:33 PM, Tomas Vondra wrote: > Right, but isn't there a difference between the two functions in this > respect? Once you find corruption involving relationship between > multiple pages, then I agree it's complicated to do any reasoning about > what additional checks are safe. > > But does that problem also apply to bt_index_check, which checks pages > independently? I think so, yes. > Admittedly, this also depends on the use case. If all we need to do is > answering a question "Is the index corrupted?" then sure, bailing out > on the first error is perfectly appropriate. > > But perhaps this would be useful for some recovery/forensics tasks? Maybe, but I feel like making it possible to change the CORRUPTION elevel macro was the right trade-off. I don't want to have to reason about the universe of possible problems that could occur when the tool must limp on in the event of corruption. For example, I don't want to have to deal with infinite loops. In practice, an expert would probably be fine to change the constant themselves if they needed to. Indexes can always be rebuilt. The tool is for identifying and diagnosing corruption, but if you want to diagnose a faulty opclass or something, then I think you need to get out pageinspect. You need human judgement for that. > From time to time we need to investigate corruption in a database, i.e. > see how much of the data is actually corrupted, list pages that need to > be zeroed to get the cluster up to salvage as much as possible, etc. > Currently this is tedious because we essentially find/fix the pages one > by one. It'd be very useful to list all broken pages in one go and then > fix all of them. > > Obviously, that's about heapam checks, but perhaps it would be useful > for an index too? Only insofar as it helps diagnose the underlying issue, when it is a more subtle issue. Actually fixing the index is almost certainly a REINDEX. Once you're into the messy business of diagnosing a problematic opclass, you have to be an expert, and tweaking amcheck for your requirements (i.e. rebuilding from source) becomes reasonable. Part of the reason that the code is so heavily commented is to make it hackable, because I do not feel optimistic that I can get an expert-orientated interface right, but I still want to make the tool as useful as possible to experts. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, 2016-03-11 at 16:40 -0800, Peter Geoghegan wrote: > On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan > wrote: > > > > If you want the tool to limp on when it finds an error, that can be > > done by changing the constant for the CORRUPTION macro in > > amcheck.c. > > But having that be dynamically configurable is not really > > compatible > > with the goal of having amcheck be run routinely. > Also, it's just really hard to reason about what remains OK to check > after the first problem is encountered in the general case. It's > "unreasonable" for any of the checks to ever fail. So, by that > standard, assuming that they might fail at all could be called > paranoid. Who can say what "paranoid enough" should be? I think it's > useful to have a low-impact, generic check function for B-Tree > indexes, but I don't think we need to hold back on being exhaustive > elsewhere. Right, but isn't there a difference between the two functions in this respect? Once you find corruption involving relationship between multiple pages, then I agree it's complicated to do any reasoning about what additional checks are safe. But does that problem also apply to bt_index_check, which checks pages independently? Admittedly, this also depends on the use case. If all we need to do is answering a question "Is the index corrupted?" then sure, bailing out on the first error is perfectly appropriate. But perhaps this would be useful for some recovery/forensics tasks? >From time to time we need to investigate corruption in a database, i.e. see how much of the data is actually corrupted, list pages that need to be zeroed to get the cluster up to salvage as much as possible, etc. Currently this is tedious because we essentially find/fix the pages one by one. It'd be very useful to list all broken pages in one go and then fix all of them. Obviously, that's about heapam checks, but perhaps it would be useful for an index too? regard -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby wrote: > Right, but you still have the option to enable them if you don't want to > swamp your IO system. That's why CIC obeys it too. If I was running a > consistency check on a production system I'd certainly want the option to > throttle it. Without that option, I don't see running this on production > systems as being an option. If that's not a goal then fine, but if it is a > goal I think it needs to be there. > > Isn't it just a few extra lines of code to support it? I see your point. I'll add that if people like the interface you propose. (Overloading the VACUUM cost delay functions to cause a delay for amcheck functions, too). Note that the functions already use an appropriate buffer access strategy (it avoids blowing shared_buffers, much like VACUUM itself). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan wrote: > If you want the tool to limp on when it finds an error, that can be > done by changing the constant for the CORRUPTION macro in amcheck.c. > But having that be dynamically configurable is not really compatible > with the goal of having amcheck be run routinely. Also, it's just really hard to reason about what remains OK to check after the first problem is encountered in the general case. It's "unreasonable" for any of the checks to ever fail. So, by that standard, assuming that they might fail at all could be called paranoid. Who can say what "paranoid enough" should be? I think it's useful to have a low-impact, generic check function for B-Tree indexes, but I don't think we need to hold back on being exhaustive elsewhere. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 3/11/16 6:17 PM, Peter Geoghegan wrote: Not sure about the cost delay thing. Delays are disabled by default for manually issued VACUUM, so have doubts that that's useful. Right, but you still have the option to enable them if you don't want to swamp your IO system. That's why CIC obeys it too. If I was running a consistency check on a production system I'd certainly want the option to throttle it. Without that option, I don't see running this on production systems as being an option. If that's not a goal then fine, but if it is a goal I think it needs to be there. Isn't it just a few extra lines of code to support it? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 3:50 PM, Jim Nasby wrote: > I also agree that the nmodule name isn't very clear. If this is meant to be > the start of a generic consistency checker, lets call it that. Otherwise, it > should be marked as being specific to btrees, because presumably we might > eventually want similar tools for GIN, etc. (FWIW I'd vote for a general > consistency checker). It's a generic consistency checker -- that's the intent. Robert wanted to go that way about a year ago, and I think it makes sense (this module started out life as btreecheck). As I said, I don't really care what the name ends up being. amcheck seems fine to me, but I'll go with any reasonable suggestion that reflects the scope of the tool as a generic consistency checker for AMs, including heapam. I don't know that I understand the concern about the naming of bt_index_parent_check(). I called it something that accurately reflects what it does. That particular SQL-callable function is more orientated towards helping hackers than towards helping users detect routine problems, as the docs say, but it could help users working with hackers, and it might even help users acting on their own. I don't want to make it sound scary, because it isn't. I don't know what problems will be detected by this tool most often, and I don't think it would be wise to try to predict how that will work out. And so, concealing the internal workings of each function/check by giving them all totally non-technical names seems like a bad idea. It's not that hard to understand that a B-Tree has multiple levels, and checking the levels against each other requires more restrictive/stronger heavyweight locks. I've only added 2 SQL-callable functions, each of which has one argument, and the first of which has a generic name and very light locking requirements, like a SELECT. It's not that hard to get the general idea, as an ordinary user. > I know the vacuum race condition would be very rare, but I don't think it > can be ignored. Last thing you want out of a consistency checker is false > negatives/positives. That's what I said. And the docs also say that there should never be a false positive. That's definitely an important design goal here, because it enables routine testing. > I do think it would be reasonable to just wholesale > block against concurrent vacuums, but I don't think there's any reasonable > way to do that. Actually, that's exactly what bt_index_parent_check() does already. VACUUM requires a SHARE UPDATE EXCLUSIVE lock on the heap relation, which cannot be concurrently acquired due to bt_index_parent_check()'s acquisition of a SHARE lock. The locking for bt_index_parent_check() is almost the same as REINDEX, except that that acquires an ACCESS EXCLUSIVE lock on the index relation; bt_index_parent_check() only requires an EXCLUSIVE lock on the index relation. Not sure about the cost delay thing. Delays are disabled by default for manually issued VACUUM, so have doubts that that's useful. If you want the tool to limp on when it finds an error, that can be done by changing the constant for the CORRUPTION macro in amcheck.c. But having that be dynamically configurable is not really compatible with the goal of having amcheck be run routinely. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On 3/11/16 3:31 PM, Peter Geoghegan wrote: Can we come up with names that more clearly identify the difference >between those two functions? I mean,_parent_ does not make it >particularly obvious that the second function acquires exclusive lock >and performs more thorough checks. Dunno about that. It's defining characteristic is that it checks child pages against their parent IMV. Things are not often defined in terms of their locking requirements. First, thanks for your work on this. I've wanted it in the past. I agree the name isn't very clear. Perhaps _recurse? I also agree that the nmodule name isn't very clear. If this is meant to be the start of a generic consistency checker, lets call it that. Otherwise, it should be marked as being specific to btrees, because presumably we might eventually want similar tools for GIN, etc. (FWIW I'd vote for a general consistency checker). I know the vacuum race condition would be very rare, but I don't think it can be ignored. Last thing you want out of a consistency checker is false negatives/positives. I do think it would be reasonable to just wholesale block against concurrent vacuums, but I don't think there's any reasonable way to do that. I would prefer the ability to do something other than raising an error when corruption is found, so that you could find all corruption in an index. Obviously could log to a different level. Another option would be SRFs that return info about all the corruption found, but that's probably overkill. It'd be nice if you had the option to obey vacuum_cost_delay when running this, but that's clearly just a nice-to-have (or maybe just obey it all the time, since it defaults to 0). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova wrote: > I do hope that my patch will be accepted in 9.6, so this conflict looks > really bad. I hope so too. I'll have to look into this issue. > I think that error is caused by changes in pages layout. To save some space > nonkey attributes are truncated > when btree copies the indexed value into inner pages or into high key. You > can look at index_reform_tuple() calls. That seems like the kind of problem that would be expected when things like that change. I think it's going to be hard to add new B-Tree features without a tool like this, which was a big reason to work on it; to make these new projects possible to test and review. I see many opportunities to improve the B-Tree code, just as I imagine you do. These projects are quite strategically important, because B-Trees are so frequently used. I think that Postgres B-Trees produce excessive random I/O for checkpoints for various reasons, and this disadvantages Postgres when it is compared to other major systems. As things get better in other areas of Postgres, these hard problems become more important to solve. > I wonder, whether you can add some check of catalog version to your module > or this case requires more changes? I think that it's just going to be tied to the Postgres version. So, if your B-Tree patches are committed first, it's on me to make sure they're handled correctly. Or vice-versa. Not worried that that will be a problem. We already take special steps to avoid the "minus infinity" item on internal pages. I think that in the future, if Postgres B-Trees get suffix truncation for internal page items, there are new problems for amcheck (suffix truncation remove unneeded information from internal page items, sometimes greatly increasing B-Tree fan-out. Internal page items need only be sufficient to guide index scans correctly.). Specially, with suffix truncation there might be "minus infinity" *attributes*, too (these could make it safe to completely remove attributes/columns past the first distinguishing/distinct attribute on each item on internal pages). That's a case that amcheck then needs to care about, just like it currently cares about the existing concept of minus infinity items. That's how it goes for amcheck. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 1:31 PM, Peter Geoghegan wrote: > You could have a race, where > there was a concurrent page deletion of the left sibling of the child > page, then a concurrent insertion into the newly expanded keyspace of > the parent. Therefore, the downlink in the parent (which is the > "target", to use the patch's terminology) would not be a lower bound > on items in the page. Excuse me: I meant the newly expanded keyspace of the *child*. (The parent's keyspace would have covered everything. It's naturally far larger than either child's keyspace, since it typically has several hundred pages.) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra wrote: > I've looked at this patch today, mostly to educate myself, so this > probably should not count as a full review. Anyway, the patch seems in > excellent shape - it'd be great if all patches (including those written > by me) had this level of comments/docs. Thanks. I try. This patch is something I've been working on slowly and steadily for over 2 years. It was time to see it through. >> Note that no function currently checks that the index is consistent >> with the heap, which would be very useful (that's probably how I'd >> eventually target the heapam, actually). > > I'm afraid I don't understand what "target the heapam" means. Could you > elaborate? Just that that's how I'd make amcheck verify that the heap was sane, if I was going to undertake that project. Or, I'd start there. > I agree. It'd be nice to have a tool that also checks for data > corruption the a lower level (e.g. that varlena headers are not > corrupted etc.) but that seems like a task for some other tool. I'm not sure that that's a task for another tool. I think that this tool is scoped at detecting corruption, and that could work well for the heap, too. There might be fewer interesting things to test about the heap when indexes aren't involved. Following HOT chains through an index, and verifying things like that about the heap as we go could detect a lot of problematic cases. > Can we come up with names that more clearly identify the difference > between those two functions? I mean, _parent_ does not make it > particularly obvious that the second function acquires exclusive lock > and performs more thorough checks. Dunno about that. It's defining characteristic is that it checks child pages against their parent IMV. Things are not often defined in terms of their locking requirements. > This also applies to the name of the contrib module - it's not > particularly obvious what "amcheck" unless you're familiar with the > concept of access methods. Which is quite unlikely for regular users. > Maybe something like "check_index" would be more illustrative? I think that given the heap may be targeted in the future, amcheck is more future-proof. I think Robert might have liked that name a year ago or something. In general, I'm not too worried about what the contrib module is called in the end. > Well, I wouldn't count myself as an expert here, but do we actually need > such protection? I mean, we only do such checks when holding an > exclusive lock on the parent relation, no? And even if we don't vacuum > can only remove entries from the pages - why should that cause > violations of any invariants? I think that a more worked out explanation for why the ExclusiveLock is needed is appropriate. I meant to do that. Basically, a heavier lock is needed because of page deletion by VACUUM, which is the major source of complexity (much more than page splits, I'd say). In general, the key space can be consolidated by VACUUM in a way that breaks child page checking because the downlink we followed from our target page is no longer the current downlink. Page deletion deletes the right sibling's downlink, and the deleted page's downlink is used for its sibling. You could have a race, where there was a concurrent page deletion of the left sibling of the child page, then a concurrent insertion into the newly expanded keyspace of the parent. Therefore, the downlink in the parent (which is the "target", to use the patch's terminology) would not be a lower bound on items in the page. That's a very low probability race, because it involves deletion a page representing a range in the keyspace that has no live items, but then immediately getting one just in time for our check. But, I'm pretty determined that false positives like that need to be impossible (or else it's a bug). I have a paranoid feeling that there is a similar very low probability race with the left-right keyspace checks, which don't have relation ExclusiveLock protection (IOW, I think that that might be buggy). I need to think about that some more, but current thinking is that it would hardly matter if we used the highkey from right page rather than the first data item, which definitely would be correct. And it would be simpler. > A few minor comments: Thanks for catching those issues. Will fix. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 1:09 PM, Peter Geoghegan wrote: > Or, you could add code like this to comparetup_index_btree(), to > simulate a broken opclass: > > diff --git a/src/backend/utils/sort/tuplesort.c > b/src/backend/utils/sort/tuplesort.c > index 67d86ed..23712ff 100644 > --- a/src/backend/utils/sort/tuplesort.c > +++ b/src/backend/utils/sort/tuplesort.c > @@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const > SortTuple *b, > compare = ApplySortComparator(a->datum1, a->isnull1, > > b->datum1, b->isnull1, > sortKey); > + > + if (random() <= (MAX_RANDOM_VALUE / 1000)) > + compare = -compare; > if (compare != 0) > return compare; Note that this patch that I sketched would make CREATE INDEX produce corrupt indexes, but the tool's verification itself would not be affected. Although even if it was (even if _bt_compare() gave the same wrong answers), it would still very likely detect corruption. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 8:24 AM, Michael Paquier wrote: > You can use for example dd in non-truncate mode to corrupt on-disk > page data, say that for example: > dd if=/dev/random bs=8192 count=1 \ > seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \ > conv=notrunc Sure, but that would probably fail at the first hurdle -- the page header would be corrupt. Which is a valid test, but not all that interesting. One testing workflow I tried is overwriting some page in a B-Tree relfilenode with some other page in the same file: $:~/pgdata/base/12413$ dd if=somefile of=somefile conv=notrunc bs=8192 count=1 skip=2 seek=3 That should fail due to the key space not being in order across pages, which is slightly interesting. Or, you could selectively change one item with a hex editor, as Anastasia did. Or, you could add code like this to comparetup_index_btree(), to simulate a broken opclass: diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 67d86ed..23712ff 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b, compare = ApplySortComparator(a->datum1, a->isnull1, b->datum1, b->isnull1, sortKey); + + if (random() <= (MAX_RANDOM_VALUE / 1000)) + compare = -compare; if (compare != 0) return compare; There are many options when you want to produce a corrupt B-Tree index! -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
Hi, On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote: > On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote: > > > > BTW, if you know a good way to corrupt index (and do it > > reproducible) I'd be > > very glad to see it. > You can use for example dd in non-truncate mode to corrupt on-disk > page data, say that for example: > dd if=/dev/random bs=8192 count=1 \ > seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \ > conv=notrunc I guess /dev/random is not very compatible with the "reproducible" requirement. I mean, it will reproducibly break the page, but pretty much completely, which is mostly what checksums are for. I think the main goal of the amcheck module is to identify issues that are much more subtle - perhaps a coding error or unhandled cornercase which results in page with a valid structure and checksum, but broken page. Or perhaps issues coming from outside of PostgreSQL, like a subtle change in the locales. Simulating those issues requires a tool that allows minor tweaks to the page, dd is a bit too dull for that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote: > BTW, if you know a good way to corrupt index (and do it reproducible) I'd be > very glad to see it. You can use for example dd in non-truncate mode to corrupt on-disk page data, say that for example: dd if=/dev/random bs=8192 count=1 \ seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \ conv=notrunc -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
01.03.2016 03:09, Peter Geoghegan: I was assigned an "action point" during the FOSDEM developer meeting: "Post new version of btree consistency checker patch". I attach a new WIP version of my consistency checker tool, amcheck. This patch is proposed for 9.6, as an extension in contrib -- maybe we can still get it in. This is the first time I've added any version of this to a commitfest, although I've posted a couple of rough versions of this in the past couple of years. The attached version has received a major overhaul, and is primarily aimed at finding corruption in production systems, although I think it will still have significant value for debugging too. Maybe it can help with some of the B-Tree patches in the final commitfest, for example. I also have some hope that it will become a learning tool for people interested in how B-Tree indexes work. To recap, the extension adds some SQL-callable functions that verify certain invariant conditions hold within some particular B-Tree index. These are the conditions that index scans rely on always being true. The tool's scope may eventually cover other AMs, including heapam, but nbtree seems like the best place to start. Note that no function currently checks that the index is consistent with the heap, which would be very useful (that's probably how I'd eventually target the heapam, actually). Hi, thank you for the contrib module. The patch itself looks great. It was really useful to have such a tool while working on b-tree patches. Invariants nbtree invariants that the tool verifies with just an AccessShareLock on the relation are: * That all items are in the correct, opclass order on each page. * That the page "high key", if any, actually bounds the items on the page. * That the last item on a page is less than or equal to the first item on the next page (the page to its right). The idea here is that the key space spans multiple pages, not just one page, so it make sense to check the last item where we can. With an ExclusiveLock + ShareLock, some addition invariants are verified: * That child pages actually have their parent's downlink as a lower bound. * Sane right links and left links at each level. Obviously, this tool is all about distrusting the structure of a B-Tree index. That being the case, it's not always totally clear where to draw the line. I think I have the balance about right, though. Interface === There are only 2 SQL callable functions in the extension, which are very similar: bt_index_check(index regclass) bt_index_parent_check(index regclass) The latter is more thorough than the former -- it performs all checks, including those checks that I mentioned required an ExclusiveLock. So, bt_index_check() only requires an AccessShareLock. bt_index_parent_check() requires an ExclusiveLock on the index relation, and a ShareLock on its heap relation, almost like REINDEX. bt_index_parent_check() performs verification that is a superset of the verification performed by bt_index_check() -- mostly, the extra verification/work is that it verifies downlinks against child pages. Both functions raise an error in the event of observing that an invariant in a B-Tree was violated, such as items being out of order on a page. I've written extensive documentation, which goes into practical aspects of using amcheck effectively. It doesn't go into significant detail about every invariant that is checked, but gives a good idea of roughly what checks are performed. I could almost justify only having one function with an argument about the downlink/child verification, but that would be a significant footgun given the varying locking requirements that such a function would have. I've done some testing. And I can say that both announced functions work well. BTW, if you know a good way to corrupt index (and do it reproducible) I'd be very glad to see it. I created an index, then opened the file and changed the order of elements. And bt_index_check() found the error successfully. /* Ensure that the data is changed. */ SELECT * FROM bt_page_items('idx', 1); itemoffset | ctid | itemlen | nulls | vars | data +---+-+---+--+- 1 | (0,3) | 16 | f | f| 03 00 00 00 00 00 00 00 2 | (0,2) | 16 | f | f| 02 00 00 00 00 00 00 00 3 | (0,1) | 16 | f | f| 01 00 00 00 00 00 00 00 (3 rows) /* Great, amcheck found an error. */ select bt_index_check('idx'); ERROR: page order invariant violated for index "idx" DETAIL: Lower index tid=(1,1) (points to heap tid=(0,3)) higher index tid=(1,2) (points to heap tid=(0,2)) page lsn=0/17800D0. Locking == We never rely on something like holding on to a buffer pin as an interlock for correctness (the vacuum interlock thing isn't generally necessary, because we don't look at the heap at all). We simply pin + BT_READ lock a buffer, copy it into local memory allocated by pal
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
Hi, I've looked at this patch today, mostly to educate myself, so this probably should not count as a full review. Anyway, the patch seems in excellent shape - it'd be great if all patches (including those written by me) had this level of comments/docs. On Mon, 2016-02-29 at 16:09 -0800, Peter Geoghegan wrote: > I was assigned an "action point" during the FOSDEM developer meeting: > "Post new version of btree consistency checker patch". I attach a new > WIP version of my consistency checker tool, amcheck. ... > To recap, the extension adds some SQL-callable functions that verify > certain invariant conditions hold within some particular B-Tree index. > These are the conditions that index scans rely on always being true. > The tool's scope may eventually cover other AMs, including heapam, but > nbtree seems like the best place to start. > > Note that no function currently checks that the index is consistent > with the heap, which would be very useful (that's probably how I'd > eventually target the heapam, actually). I'm afraid I don't understand what "target the heapam" means. Could you elaborate? > Invariants > ... > Obviously, this tool is all about distrusting the structure of a > B-Tree index. That being the case, it's not always totally clear where > to draw the line. I think I have the balance about right, though. I agree. It'd be nice to have a tool that also checks for data corruption the a lower level (e.g. that varlena headers are not corrupted etc.) but that seems like a task for some other tool. > Interface > === > > There are only 2 SQL callable functions in the extension, which are > very similar: > > bt_index_check(index regclass) > > bt_index_parent_check(index regclass) > Can we come up with names that more clearly identify the difference between those two functions? I mean, _parent_ does not make it particularly obvious that the second function acquires exclusive lock and performs more thorough checks. This also applies to the name of the contrib module - it's not particularly obvious what "amcheck" unless you're familiar with the concept of access methods. Which is quite unlikely for regular users. Maybe something like "check_index" would be more illustrative? > Locking > == ... > > We do, on the other hand, have a detailed rationale for why it's okay > that we don't have an ExclusiveLock on the index relation for checks > that span the key space of more than one page by following right links > to compare items across sibling pages. This isn't the same thing as > having an explicit interlock like a pin -- our interlock is one > against *recycling* by vacuum, which is based on recentGlobalXmin. > This rationale requires expert review. Well, I wouldn't count myself as an expert here, but do we actually need such protection? I mean, we only do such checks when holding an exclusive lock on the parent relation, no? And even if we don't vacuum can only remove entries from the pages - why should that cause violations of any invariants? A few minor comments: 1) This should probably say "one block per level": /* * A B-Tree cannot possibly have this many levels, since there must be * one block per page, and that is bound by the range of BlockNumber: */ 2) comment before bt_check_every_level() says: ... assumed to prevent any kind of physically modification ... probably should be "physical modification" instead. 3) s/targecontext/targetcontext/ in BtreeCheckState comment regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] amcheck (B-Tree integrity checking tool)
I was assigned an "action point" during the FOSDEM developer meeting: "Post new version of btree consistency checker patch". I attach a new WIP version of my consistency checker tool, amcheck. This patch is proposed for 9.6, as an extension in contrib -- maybe we can still get it in. This is the first time I've added any version of this to a commitfest, although I've posted a couple of rough versions of this in the past couple of years. The attached version has received a major overhaul, and is primarily aimed at finding corruption in production systems, although I think it will still have significant value for debugging too. Maybe it can help with some of the B-Tree patches in the final commitfest, for example. I also have some hope that it will become a learning tool for people interested in how B-Tree indexes work. To recap, the extension adds some SQL-callable functions that verify certain invariant conditions hold within some particular B-Tree index. These are the conditions that index scans rely on always being true. The tool's scope may eventually cover other AMs, including heapam, but nbtree seems like the best place to start. Note that no function currently checks that the index is consistent with the heap, which would be very useful (that's probably how I'd eventually target the heapam, actually). Invariants nbtree invariants that the tool verifies with just an AccessShareLock on the relation are: * That all items are in the correct, opclass order on each page. * That the page "high key", if any, actually bounds the items on the page. * That the last item on a page is less than or equal to the first item on the next page (the page to its right). The idea here is that the key space spans multiple pages, not just one page, so it make sense to check the last item where we can. With an ExclusiveLock + ShareLock, some addition invariants are verified: * That child pages actually have their parent's downlink as a lower bound. * Sane right links and left links at each level. Obviously, this tool is all about distrusting the structure of a B-Tree index. That being the case, it's not always totally clear where to draw the line. I think I have the balance about right, though. Interface === There are only 2 SQL callable functions in the extension, which are very similar: bt_index_check(index regclass) bt_index_parent_check(index regclass) The latter is more thorough than the former -- it performs all checks, including those checks that I mentioned required an ExclusiveLock. So, bt_index_check() only requires an AccessShareLock. bt_index_parent_check() requires an ExclusiveLock on the index relation, and a ShareLock on its heap relation, almost like REINDEX. bt_index_parent_check() performs verification that is a superset of the verification performed by bt_index_check() -- mostly, the extra verification/work is that it verifies downlinks against child pages. Both functions raise an error in the event of observing that an invariant in a B-Tree was violated, such as items being out of order on a page. I've written extensive documentation, which goes into practical aspects of using amcheck effectively. It doesn't go into significant detail about every invariant that is checked, but gives a good idea of roughly what checks are performed. I could almost justify only having one function with an argument about the downlink/child verification, but that would be a significant footgun given the varying locking requirements that such a function would have. Locking == We never rely on something like holding on to a buffer pin as an interlock for correctness (the vacuum interlock thing isn't generally necessary, because we don't look at the heap at all). We simply pin + BT_READ lock a buffer, copy it into local memory allocated by palloc(), and then immediately release the buffer lock and drop the pin. This is the same in all instances. There is never more than one buffer lock or pin held at a time. We do, on the other hand, have a detailed rationale for why it's okay that we don't have an ExclusiveLock on the index relation for checks that span the key space of more than one page by following right links to compare items across sibling pages. This isn't the same thing as having an explicit interlock like a pin -- our interlock is one against *recycling* by vacuum, which is based on recentGlobalXmin. This rationale requires expert review. Performance == Trying to keep the tool as simple as possible, while still making it do verification that is as useful as possible was my priority here, not performance. Still, verification completes fairly quickly. Certainly, it takes far less time than having to REINDEX the index, and doesn't need too much memory. I think that in practice most problems that can be detected by the B-Tree checker functions will be detected with the lighter variant. -- Peter Geoghegan From 573810d8d3c994ce1a16ecffb2f5d208c0ff93e3 Mon Sep 17 00:00:00 2001