subject:"\[HACKERS\] amcheck \(B\-Tree integrity checking tool\)"

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-09 Thread Peter Geoghegan

On Thu, Mar 9, 2017 at 7:12 PM, Peter Geoghegan  wrote:
>> Hm - I think it's fair to export RecentGlobalXmin, given that we
>> obviously use it across modules in core code.  I see very little reason
>> not to export it.
>
> Well, the assertion is completely useless as anything but documentation...

I came up with the attached.

-- 
Peter Geoghegan
From 05a8c0de7da7cd9ee4d8436abdf8368c1cf8ac19 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Thu, 9 Mar 2017 19:17:07 -0800
Subject: [PATCH] Fix amcheck build on Windows.

Remove documenting assertion around RecentGlobalXmin, rather than
marking it PGDLLIMPORT.

In passing, update some comments that did not reflect the state of
amcheck as-committed.
---
 contrib/amcheck/verify_nbtree.c | 74 +++--
 1 file changed, 35 insertions(+), 39 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 5724aa6..d21b209 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -26,7 +26,6 @@
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "utils/memutils.h"
-#include "utils/snapmgr.h"
 
 
 PG_MODULE_MAGIC;
@@ -55,8 +54,8 @@ typedef struct BtreeCheckState
 
 	/* B-Tree Index Relation */
 	Relation	rel;
-	/* ExclusiveLock held? */
-	bool		exclusivelylocked;
+	/* ShareLock held on heap/index, rather than AccessShareLock? */
+	bool		strongerlock;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -94,7 +93,7 @@ PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck);
 static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool exclusivelylocked);
+static void bt_check_every_level(Relation rel, bool strongerlock);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 			 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -256,23 +255,26 @@ btree_index_checkable(Relation rel)
  * logical order, verifying invariants as it goes.
  *
  * It is the caller's responsibility to acquire appropriate heavyweight lock on
- * the index relation, and advise us if extra checks are safe when an
- * ExclusiveLock is held.
+ * the index relation, and advise us if extra checks are safe when a ShareLock
+ * is held.
  *
- * An ExclusiveLock is generally assumed to prevent any kind of physical
+ * A ShareLock is generally assumed to prevent any kind of physical
  * modification to the index structure, including modifications that VACUUM may
  * make.  This does not include setting of the LP_DEAD bit by concurrent index
  * scans, although that is just metadata that is not able to directly affect
  * any check performed here.  Any concurrent process that might act on the
  * LP_DEAD bit being set (recycle space) requires a heavyweight lock that
- * cannot be held while we hold an ExclusiveLock.  (Besides, even if that could
+ * cannot be held while we hold a ShareLock.  (Besides, even if that could
  * happen, the ad-hoc recycling when a page might otherwise split is performed
  * per-page, and requires an exclusive buffer lock, which wouldn't cause us
  * trouble.  _bt_delitems_vacuum() may only delete leaf items, and so the extra
  * parent/child check cannot be affected.)
+ *
+ * Note that we rely on RecentGlobalXmin interlock against recycling, just like
+ * standard index scans.  See note on RecentGlobalXmin/B-Tree page deletion.
  */
 static void
-bt_check_every_level(Relation rel, bool exclusivelylocked)
+bt_check_every_level(Relation rel, bool strongerlock)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -281,17 +283,11 @@ bt_check_every_level(Relation rel, bool exclusivelylocked)
 	BtreeLevel	current;
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
-	 * RecentGlobalXmin/B-Tree page deletion.
-	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
-
-	/*
 	 * Initialize state for entire verification operation
 	 */
 	state = palloc(sizeof(BtreeCheckState));
 	state->rel = rel;
-	state->exclusivelylocked = exclusivelylocked;
+	state->strongerlock = strongerlock;
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
  "amcheck context",
@@ -353,7 +349,7 @@ bt_check_every_level(Relation rel, bool exclusivelylocked)
 
 /*
  * Given a left-most block at some level, move right, verifying each page
- * individually (with more verification across pages for "exclusivelylocked"
+ * individually (with more verification across pages for "strongerlock"
  * callers).  Caller should pass the true root page as the leftmost initially,
  * working their way down by passing what is returned for the last call here
  * until level 0 (leaf page level) was reached.
@@ -428,7 +424,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
 			 * locking, check that the first valid page meet

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-09 Thread Peter Geoghegan

On Thu, Mar 9, 2017 at 7:10 PM, Andres Freund  wrote:
>>   verify_nbtree.obj : error LNK2001: unresolved external symbol
>> RecentGlobalXmin
>> [C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]
>>
>> Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
>> remove the documenting assertion and leave that comment as-is. I'll
>> work on a patch for this soon.
>
> Hm - I think it's fair to export RecentGlobalXmin, given that we
> obviously use it across modules in core code.  I see very little reason
> not to export it.

Well, the assertion is completely useless as anything but documentation...


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-09 Thread Andres Freund

Hi,

On 2017-03-09 19:04:46 -0800, Peter Geoghegan wrote:
> On Thu, Mar 9, 2017 at 4:41 PM, Andres Freund  wrote:
> > I don't really expect buildfarm fallout, but ..
> 
> Unfortunately, there is some on Windows:

Thanks for monitoring.


>   verify_nbtree.obj : error LNK2001: unresolved external symbol
> RecentGlobalXmin
> [C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]
> 
> Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
> remove the documenting assertion and leave that comment as-is. I'll
> work on a patch for this soon.

Hm - I think it's fair to export RecentGlobalXmin, given that we
obviously use it across modules in core code.  I see very little reason
not to export it.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-09 Thread Peter Geoghegan

On Thu, Mar 9, 2017 at 4:41 PM, Andres Freund  wrote:
> And pushed with these.

Thanks.

> I don't really expect buildfarm fallout, but ..

Unfortunately, there is some on Windows:

  verify_nbtree.obj : error LNK2001: unresolved external symbol
RecentGlobalXmin
[C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]

Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
remove the documenting assertion and leave that comment as-is. I'll
work on a patch for this soon.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-09 Thread Andres Freund

On 2017-03-09 16:29:24 -0800, Peter Geoghegan wrote:
> I am generally in favor of more inclusive Reviewed-By lists. I suggest
> that we expand it to:
> 
> "Reviewed-By: Andres Freund, Thomas Vondra, Thomas Munro, Anastasia
> Lubennikova, Robert Haas, Amit Langote"

And pushed with these.

I don't really expect buildfarm fallout, but ..

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-09 Thread Peter Geoghegan

On Thu, Mar 9, 2017 at 3:52 PM, Andres Freund  wrote:
>
> But I also think it's more important to get some initial verification
> functionality in, than resolving this conflict. I do, also quite
> strongly, think we'll be better served with having something like what
> you're proposing than nothing, and I don't have time/energy to turn your
> patch into what I'm envisioning, nor necessarily the buyin.  Hence I'm
> planning to commit the current interface.
>
> Attached is your original patch, and a patch editorializing things. I do
> plan to merge those before pushing.

Your revisions all seem fine. No problems with any of it.

> Who would you like to see added to Reviewed-By?

I am generally in favor of more inclusive Reviewed-By lists. I suggest
that we expand it to:

"Reviewed-By: Andres Freund, Thomas Vondra, Thomas Munro, Anastasia
Lubennikova, Robert Haas, Amit Langote"

Thanks for your help with this!

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-09 Thread Andres Freund

On 2017-03-06 20:40:59 -0800, Peter Geoghegan wrote:
> On Mon, Mar 6, 2017 at 3:57 PM, Andres Freund  wrote:
> > I'm ok with not immediately doing so, but I think Peter's design isn't
> > in line with achieving something like this.
>
> I would be okay with doing this if we had a grab-bag of expensive
> checks, that all pretty much require some combination of
> ExclusiveLocks and/or ShareLocks anyway, and are amenable to being
> combined. I could see that happening, but we're a long way off from
> that. When it does happen, we might have other things that are a bit
> like bt_index_parent_check(), in terms of locking requirements and
> degree of thoroughness, that might themselves require other parameters
> specific to the verification that they perform that we cannot
> anticipate. For example, maybe bt_index_parent_check() could have a
> parameter that is a probability that any given IndexTuple will be
> verified against its heap tuple. Then you could have something like a
> PRNG seed argument, so you can check different tuples every day. There
> are many possibilities, and I hope that eventually amcheck has this
> kind of very rich functionality.

> When you use the interface you describe to "stack" several checks at
> once, less important parameters, like the ones I suggest are going to
> be an awkward fit, and so will probably be excluded. I'm not opposed
> to having what amounts to a second interface to what bt_index_check()
> does, to make this kind of thing work where it makes sense. Really,
> bt_index_parent_check() is almost an alternative interface to the same
> functionality today. There can be another one in the future, if it
> serves a purpose, and the locking requirements are roughly the same
> for all the checks. I'd be fine with that. Let's just get the basic
> feature in for now, though.

I disagree quite strongly with pretty much all of that.


But I also think it's more important to get some initial verification
functionality in, than resolving this conflict. I do, also quite
strongly, think we'll be better served with having something like what
you're proposing than nothing, and I don't have time/energy to turn your
patch into what I'm envisioning, nor necessarily the buyin.  Hence I'm
planning to commit the current interface.

Attached is your original patch, and a patch editorializing things. I do
plan to merge those before pushing.

Who would you like to see added to Reviewed-By?

Regards,

Andres
>From 60c670d251dd4244a09215e4f6c22e444d2bff7c Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 9 Mar 2017 15:50:40 -0800
Subject: [PATCH 1/2] Add amcheck extension to contrib.

This is the beginning of a collection of SQL-callable functions to
verify the integrity of data files.  For now it only contains code to
verify B-Tree indexes.

This adds two SQL-callable functions, validating B-Tree consistency to
a varying degree.  Check the, extensive, docs for details.

The goal is to later extend the coverage of the module to further
access methods, possibly including the heap.  Once checks for
additional access methods exist, we'll likely add some "dispatch"
functions that cover multiple access methods.

Author: Peter Geoghegan, editorialized by Andres Freund
Reviewed-By: Andres Freund, Thomas Vondra
Discussion: cam3swzqzlmhmwmbqjzk+prkxrnuz4w90wymuwfkev8mz3de...@mail.gmail.com
---
 contrib/Makefile |1 +
 contrib/amcheck/Makefile |   19 +
 contrib/amcheck/amcheck--1.0.sql |   24 +
 contrib/amcheck/amcheck.control  |5 +
 contrib/amcheck/verify_nbtree.c  | 1238 ++
 doc/src/sgml/amcheck.sgml|  275 +
 doc/src/sgml/contrib.sgml|1 +
 doc/src/sgml/filelist.sgml   |1 +
 src/tools/pgindent/typedefs.list |2 +
 9 files changed, 1566 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 contrib/amcheck/verify_nbtree.c
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 9a74e1b99f..e84eb67008 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		bloom		\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 00..f14c7aa58b
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= verify_nbtree.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
di

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-06 Thread Peter Geoghegan

On Mon, Mar 6, 2017 at 3:57 PM, Andres Freund  wrote:
> I'm ok with not immediately doing so, but I think Peter's design isn't
> in line with achieving something like this.

I would be okay with doing this if we had a grab-bag of expensive
checks, that all pretty much require some combination of
ExclusiveLocks and/or ShareLocks anyway, and are amenable to being
combined. I could see that happening, but we're a long way off from
that. When it does happen, we might have other things that are a bit
like bt_index_parent_check(), in terms of locking requirements and
degree of thoroughness, that might themselves require other parameters
specific to the verification that they perform that we cannot
anticipate. For example, maybe bt_index_parent_check() could have a
parameter that is a probability that any given IndexTuple will be
verified against its heap tuple. Then you could have something like a
PRNG seed argument, so you can check different tuples every day. There
are many possibilities, and I hope that eventually amcheck has this
kind of very rich functionality.

When you use the interface you describe to "stack" several checks at
once, less important parameters, like the ones I suggest are going to
be an awkward fit, and so will probably be excluded. I'm not opposed
to having what amounts to a second interface to what bt_index_check()
does, to make this kind of thing work where it makes sense. Really,
bt_index_parent_check() is almost an alternative interface to the same
functionality today. There can be another one in the future, if it
serves a purpose, and the locking requirements are roughly the same
for all the checks. I'd be fine with that. Let's just get the basic
feature in for now, though.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-06 Thread Andres Freund

On 2017-02-13 12:05:21 -0500, Robert Haas wrote:
> On Fri, Feb 10, 2017 at 8:39 PM, Peter Geoghegan  wrote:
> > On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund  wrote:
> >> Except that the proposed names aren't remotely like that... ;).
> >
> > Revision attached -- V5. We now REVOKE ALL on both functions, as
> > Robert suggested, instead of the previous approach of having a
> > hard-coded superuser check with enforcement.
> >
> >> And I proposed documenting named parameters, and
> >> check_btree(performing_check_requiring_exclusive_locks => true) is just
> >> about as expressive.
> >
> > I have not done this, nor have I renamed the functions. I still think
> > that this is something that we can fix by adding a boolean argument to
> > each function in the future, or something along those lines. I
> > *really* hate the idea of having one function with non-obvious,
> > variable requirements on locking, with locking implications that are
> > not knowable when we PREPARE an SQL statement calling the function. It
> > also removes a useful way of have superusers discriminate against the
> > stronger locking variant bt_index_parent_check() by not granting
> > execute on it (as an anti-footgun measure).
>
> I think Andres is more or less correct that
> "performing_check_requiring_exclusive_locks => true" is just about as
> expressive as calling a different function, but I think that your
> point that the superuser might want to grant access to one function
> but not the other is a good one.  On the other hand, I think Andres
> has a concern that we might have more modes in the future and we don't
> want to end up with 2^n entrypoints.  That also seems valid.  Hmm.

That's part of my concern.  The second part is that I really want to be
able to have a check_relation() (and check_database())function that I
can pass a bunch of arguments determining how expensive checks are going
to be performed.

E.g. I'd like to be able to do something like

SELECT *
FROM check_relation(
   'my_table'::regclass,
   test_btree => 'true',
   test_btree_heap_interlock => 'true',
   test_gin => 'true');

SELECT *
FROM check_current_database(
   test_heap_update_chains => 'true',
   test_heap_clog_interlock => 'true',
   test_btree => 'true',
   test_gin => 'false');

etc.

You can't really trivially replace these with a larger query and/or
function, because of the locking considerations (consider what happens
if somebody concurrently drops a table/index - your whole query errors
out, wasting hours of work).

I'm ok with not immediately doing so, but I think Peter's design isn't
in line with achieving something like this.

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-03-06 Thread Andres Freund

Hi,

On 2017-02-13 11:52:53 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 5:32 PM, Andres Freund  wrote:
> >> > Again, some parts of the code doing something bad isn't a good argument
> >> > for doing again. Releasing locks early is a bad pattern, because backend
> >> > code isn't generally safe against it, we have special code-paths for
> >> > catalog tables to support it for those.
> >>
> >> I don't agree with that.  The reason we keep relation locks until the
> >> end of the transaction is to avoid surprising application code, not
> >> because the system can't tolerate it.  But here it's more likely that
> >> retaining the lock will surprise the user.
> >
> > It's both.  A bunch of code paths rely on early release ony happening on
> > catalogs. E.g., and that's just the first thing that comes to mind,
> > cache invalidations which really only are guaranteed to be delivered and
> > visibile for other cache accesses, if the lock is only released after
> > the end of the transaction.  Object accesses like relation_open() accept
> > invalidations *after* acquiring the lock. Combined with lock releases
> > only happening *after* transaction commits that guarantees an up2date
> > view of the cache; but if others release the lock earlier and have cache
> > invalidations, that doesn't work anymore.  Now you can argue that it's
> > possibly harmless here because there's presumably no way this can ever
> > trigger cache invals - and you're probably right - but this isn't the
> > only place with such assumption and you'd definitely have to argue *why*
> > it's right.
> 
> I agree that code which issues cache invalidations needs to hold a
> blocking lock until commit IF it's critical for other backends to see
> the update.

Right. And I think that needs to be the default assumption, because
otherwise you need to be careful about all the relevant code paths.  I'm
fine with adding something along the lines of
/*
 * Release lock on relation early, this function doesn't generate any
 * cache invalidations, and it's convenient for callers to be able to
 * check multiple functions in one go.
 */
(after checking that's true of course) and then skipping the lock
release.  What I'm absolutely not ok with is doing so without
consideration and documentation.

> I don't really understand your demand for "an argument why it's
> right".  My argument is simple: I can't think of a single thing that
> it would break, and I don't believe there is any such thing.

That's pretty much an argument.  I just want documentation that calls
attention that we're doing something that requires a modicum of care.  I
fail to see why that's unreasonable.  I don't think it's that unlikely
will come around adding something like "hey we can update the free page
statistic here, because we just read the whole damn thing", suddenly
invalidating the logic.  Or we add a heap check that then might trigger
pruning, which quite conceivably could trigger an invalidation. Or ...

> I think you're trying to invent a coding rule that doesn't exist, but
> I can't prove a negative.

We had bugs around this in the past, so ...

> Your argument so far is that "yeah, well, maybe inval doesn't emit
> sinval traffic (duh?) but it might do something else that breaks,
> prove to me that there is no such thing".

I have no idea what you're trying to say with this.  The rest of that
paragraph just seems like bunch of odd hyperbole ("impossible to exceed
the speed of light...").

> From my point of view, it's likely to be common to want to run amcheck
> in series on a bunch of indexes, like by writing a SELECT query
> against pg_class or pg_index and passing the OIDs to amcheck.  With
> the retain-locks design, that query accumulates locks on a large
> number of objects, possibly blowing out the lock table or blocking
> DDL.

On the other hand, retaining the locks actually increases the chance to
run a check to completion.  If you e.g. write a query checking the
indexes for a table, you suddenly can't be sure that subsequent indexes
for the same table can be checked, because the relation lock has been
released inbetween.  Which then'd lead to erroring out after doing part
of the expensive work.

I don't think that refutes your argument, but I also don't think it's
as clear cut as you make it out to be.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-13 Thread Robert Haas

On Fri, Feb 10, 2017 at 8:39 PM, Peter Geoghegan  wrote:
> On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund  wrote:
>> Except that the proposed names aren't remotely like that... ;).
>
> Revision attached -- V5. We now REVOKE ALL on both functions, as
> Robert suggested, instead of the previous approach of having a
> hard-coded superuser check with enforcement.
>
>> And I proposed documenting named parameters, and
>> check_btree(performing_check_requiring_exclusive_locks => true) is just
>> about as expressive.
>
> I have not done this, nor have I renamed the functions. I still think
> that this is something that we can fix by adding a boolean argument to
> each function in the future, or something along those lines. I
> *really* hate the idea of having one function with non-obvious,
> variable requirements on locking, with locking implications that are
> not knowable when we PREPARE an SQL statement calling the function. It
> also removes a useful way of have superusers discriminate against the
> stronger locking variant bt_index_parent_check() by not granting
> execute on it (as an anti-footgun measure).

I think Andres is more or less correct that
"performing_check_requiring_exclusive_locks => true" is just about as
expressive as calling a different function, but I think that your
point that the superuser might want to grant access to one function
but not the other is a good one.  On the other hand, I think Andres
has a concern that we might have more modes in the future and we don't
want to end up with 2^n entrypoints.  That also seems valid.  Hmm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-13 Thread Robert Haas

On Thu, Feb 9, 2017 at 5:32 PM, Andres Freund  wrote:
>> > Again, some parts of the code doing something bad isn't a good argument
>> > for doing again. Releasing locks early is a bad pattern, because backend
>> > code isn't generally safe against it, we have special code-paths for
>> > catalog tables to support it for those.
>>
>> I don't agree with that.  The reason we keep relation locks until the
>> end of the transaction is to avoid surprising application code, not
>> because the system can't tolerate it.  But here it's more likely that
>> retaining the lock will surprise the user.
>
> It's both.  A bunch of code paths rely on early release ony happening on
> catalogs. E.g., and that's just the first thing that comes to mind,
> cache invalidations which really only are guaranteed to be delivered and
> visibile for other cache accesses, if the lock is only released after
> the end of the transaction.  Object accesses like relation_open() accept
> invalidations *after* acquiring the lock. Combined with lock releases
> only happening *after* transaction commits that guarantees an up2date
> view of the cache; but if others release the lock earlier and have cache
> invalidations, that doesn't work anymore.  Now you can argue that it's
> possibly harmless here because there's presumably no way this can ever
> trigger cache invals - and you're probably right - but this isn't the
> only place with such assumption and you'd definitely have to argue *why*
> it's right.

I agree that code which issues cache invalidations needs to hold a
blocking lock until commit IF it's critical for other backends to see
the update.  That's not always the case; for example, if an ALTER
TABLE statement only changes the fillfactor the worst thing that
happens if some other backend fails to see the invalidations is that
some transactions continue to use the old value for a while.  That's
why we now allow fillfactor and some other things using only
ShareUpdateExclusiveLock.  That's not quite the same thing but the
effect is the same: we issue invalidations that other backends aren't
guaranteed to see in a timely fashion, and it works OK.  Yeah, some
other session might not see the change for a while, but we don't view
that as a big problem.

But I think in this case that argument is basically a straw man.  A
utility that's called amcheck is obviously not issuing any cache
invalidations.  It's probably not even changing anything at all; if it
is, maybe it should be called amrepair.  So the fact that CLUSTER has
to hold AEL until commit time really doesn't have anything to do with
whether amcheck needs to hold AES until commit time.

I don't really understand your demand for "an argument why it's
right".  My argument is simple: I can't think of a single thing that
it would break, and I don't believe there is any such thing.
Furthermore, as Peter points out, we do this kind of thing all the
time in other places and it indeed does not break things.  I think
you're trying to invent a coding rule that doesn't exist, but I can't
prove a negative.  Your argument so far is that "yeah, well, maybe
inval doesn't emit sinval traffic (duh?) but it might do something
else that breaks, prove to me that there is no such thing".  But
that's just like asking me to prove that it's impossible to exceed the
speed of light.  On information and belief, nobody currently knows how
to do that and there is good scientific evidence that it cannot be
done, but there's no way to prove that someone won't in the future
discover a refinement in the physical laws of the universe as we
understand them today which sheds a different light on the situation.

>From my point of view, it's likely to be common to want to run amcheck
in series on a bunch of indexes, like by writing a SELECT query
against pg_class or pg_index and passing the OIDs to amcheck.  With
the retain-locks design, that query accumulates locks on a large
number of objects, possibly blowing out the lock table or blocking
DDL.  I think that readily-forseeable problem ought to trump concerns
about purely hypothetical issues caused by releasing locks early.  If
you can come up with an actual issue with releasing locks early that
applies to this particular case, then that's different, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-13 Thread Robert Haas

On Thu, Feb 9, 2017 at 8:15 PM, Peter Geoghegan  wrote:
> BTW, aren't there cases where a PARALLEL SAFE function that needs to
> acquire locks on some arbitrary relation not referenced in the query
> can get locks on its own, which may only last as long as the parallel
> worker's lifetime? This could happen *despite* the fact that the
> author of the function may have imagined that callers could not
> release relation level locks early (i.e., by not releasing them
> directly, and so correctly following this "badly documented
> assumption").

Yes.

> It seems like the existence of PARALLEL RESTRICTED is primarily
> motivated by making stuff that definitely needs the locks to last
> until xact end being sure that that happens -- the docs say so. This
> seems wholly inconsistent with the idea that you're not supposed to
> let that happen under any circumstances. I find all this highly
> confusing. Have I missed some further subtlety that applies with
> PARALLEL RESTRICTED?

That's by no means the only thing that could cause you to mark
something PARALLEL RESTRICTED.  I think you might have missed this bit
of the documentation, a few paragraphs up:

Similarly, functions must be marked PARALLEL
RESTRICTED if they access temporary tables, client connection state,
cursors, prepared statements, or miscellaneous backend-local state which
the system cannot synchronize across workers. For example,
setseed and random are parallel restricted for
this last reason.

That stuff is the main motivation.  The early lock release stuff could
come up for user-written functions, but the things listed in the
paragraph I've quoted here come up for plenty of built-in functions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-11 Thread Amit Kapila

On Fri, Feb 10, 2017 at 6:45 AM, Peter Geoghegan  wrote:
> On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan  wrote:
>>> which isn't an issue here, but reinforces my point about the (badly
>>> documented) assumption that we don't release locks on user relations
>>> early.
>>

I think even if we don't need to retain locks till transaction end in
the present case, but Andres's point has a merit that it is good to
follow an undocumented rule of releasing locks at transaction end for
user relations unless there is some benefit in releasing the locks
early in some situation.  I think that can avoid unnecessary worry
that whether it safe to do in some situation.

>> You are right about the substantive issue, I assume, but I have a hard
>> time agreeing with the idea that it's even badly documented when there
>> are at least 10 counter-examples of blithely doing this. In any case,
>> I find it very confusing when you talk about things as established
>> practice/coding standards when they're very clearly not consistently
>> adhered to at all. You should at least say "let's not make a bad
>> situation any worse", or something, so that I don't need to spend 10
>> minutes pulling my hair out.
>
> BTW, aren't there cases where a PARALLEL SAFE function that needs to
> acquire locks on some arbitrary relation not referenced in the query
> can get locks on its own, which may only last as long as the parallel
> worker's lifetime? This could happen *despite* the fact that the
> author of the function may have imagined that callers could not
> release relation level locks early (i.e., by not releasing them
> directly, and so correctly following this "badly documented
> assumption").
>
> It seems like the existence of PARALLEL RESTRICTED is primarily
> motivated by making stuff that definitely needs the locks to last
> until xact end being sure that that happens -- the docs say so.
>

I don't think so.  I think one of the primary reason of releasing
locks in workers is that we don't have any way to transfer them to the
leader which can release them at transaction end.

> This
> seems wholly inconsistent with the idea that you're not supposed to
> let that happen under any circumstances. I find all this highly
> confusing. Have I missed some further subtlety that applies with
> PARALLEL RESTRICTED?
>

The basic idea of parallel restricted is that any expression will be
considered parallel restricted if that can't be executed in the
worker, for example, a reference to initplans in a query.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-10 Thread Peter Geoghegan

On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund  wrote:
> Except that the proposed names aren't remotely like that... ;).

Revision attached -- V5. We now REVOKE ALL on both functions, as
Robert suggested, instead of the previous approach of having a
hard-coded superuser check with enforcement.

> And I proposed documenting named parameters, and
> check_btree(performing_check_requiring_exclusive_locks => true) is just
> about as expressive.

I have not done this, nor have I renamed the functions. I still think
that this is something that we can fix by adding a boolean argument to
each function in the future, or something along those lines. I
*really* hate the idea of having one function with non-obvious,
variable requirements on locking, with locking implications that are
not knowable when we PREPARE an SQL statement calling the function. It
also removes a useful way of have superusers discriminate against the
stronger locking variant bt_index_parent_check() by not granting
execute on it (as an anti-footgun measure). I don't think that it's
too much to ask for you to concede this one point to me (and to
Robert).

I have acted on every other item you raised, without exception,
including retaining the locks for the duration of the transaction. I
also made the functions PARALLEL RESTRICTED, which seemed like a good
idea.

Other notes:

* A nice thing about the additional level check that you suggested is
that it's a cross-level check that is perfectly safe to perform with
only an AccessShareLock, unlike the extra bt_index_parent_check()
check, since the number of levels only ever increases, no matter what.
There is no possible reason why we should ever find that a child page
does not agree with its self-described parent about the level it's on.
I'm glad you thought of this.

* I kept ereport() DEBUG1 calls, but now use non-error errcode for
those. I thought that preserving the various ereport() fields was
worth it, since these traces are rather information dense in a way
that is harder to deal with when using raw elog()s. If you hate this,
you can still change it.

* I have not added code to check the relation permissions, which might
matter if and when the superuser grants execute privs on amcheck
functions. Anyone want to argue that I should have added
relation-level permissions checks? I tend to think it's pointless,
since the superuser almost certainly won't think to consider that
additional factor. I cannot imagine how either function could be used
to leak information, in any case. No error message dumps the contents
of pages (only item pointers).

* pgindent was run against this revision. You were right to want to
get that out of the way now, since it needed to be fixed up by hand to
look reasonable. typedef list also updated.

-- 
Peter Geoghegan

0001-Add-amcheck-extension-to-contrib.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Peter Geoghegan

On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan  wrote:
>> which isn't an issue here, but reinforces my point about the (badly
>> documented) assumption that we don't release locks on user relations
>> early.
>
> You are right about the substantive issue, I assume, but I have a hard
> time agreeing with the idea that it's even badly documented when there
> are at least 10 counter-examples of blithely doing this. In any case,
> I find it very confusing when you talk about things as established
> practice/coding standards when they're very clearly not consistently
> adhered to at all. You should at least say "let's not make a bad
> situation any worse", or something, so that I don't need to spend 10
> minutes pulling my hair out.

BTW, aren't there cases where a PARALLEL SAFE function that needs to
acquire locks on some arbitrary relation not referenced in the query
can get locks on its own, which may only last as long as the parallel
worker's lifetime? This could happen *despite* the fact that the
author of the function may have imagined that callers could not
release relation level locks early (i.e., by not releasing them
directly, and so correctly following this "badly documented
assumption").

It seems like the existence of PARALLEL RESTRICTED is primarily
motivated by making stuff that definitely needs the locks to last
until xact end being sure that that happens -- the docs say so. This
seems wholly inconsistent with the idea that you're not supposed to
let that happen under any circumstances. I find all this highly
confusing. Have I missed some further subtlety that applies with
PARALLEL RESTRICTED?

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Peter Geoghegan

On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund  wrote:
> Except that the proposed names aren't remotely like that... ;).
>
> And I proposed documenting named parameters, and
> check_btree(performing_check_requiring_exclusive_locks => true) is just
> about as expressive.

I cannot think of an example, offhand, of where it isn't self-evident
what relation level locks will be acquired when a statement is
prepared. I guess you could say that VACUUM truncation is a bit like
that, but it's not the same IMV.

Maybe there is an easy compromise here. We can add an argument with a
default, "level". It's an "enum" of the style of pg_prewarm (i.e.,
it's actually a text argument). The default is 'medium' or something
like that. Maybe we don't add any other flags for now, or maybe we add
some generic ones that don't actually changed the behavior. Existing
names are retained.

Doesn't that address your concern about an inflexible API? I really
dislike the idea of having the lock strength be determined on the
basis of a function arguments value.

>> I don't agree with that.  The reason we keep relation locks until the
>> end of the transaction is to avoid surprising application code, not
>> because the system can't tolerate it.  But here it's more likely that
>> retaining the lock will surprise the user.

> There were other issues than this, but right of the top of my head I
> only remember:
> double
> IndexBuildHeapRangeScan(Relation heapRelation,
> ..
> /*
>  * Since caller should hold ShareLock 
> or better, normally
>  * the only way to see this is if it 
> was inserted earlier
>  * in our own transaction.  However, 
> it can happen in
>  * system catalogs, since we tend to 
> release write lock
>  * before commit there.  Give a 
> warning if neither case
>  * applies.
>  */
> xwait = 
> HeapTupleHeaderGetXmin(heapTuple->t_data);
> if 
> (!TransactionIdIsCurrentTransactionId(xwait))
> {
> if (!is_system_catalog)
> elog(WARNING, 
> "concurrent insert in progress within table \"%s\"",
>  
> RelationGetRelationName(heapRelation));
>
> which isn't an issue here, but reinforces my point about the (badly
> documented) assumption that we don't release locks on user relations
> early.

You are right about the substantive issue, I assume, but I have a hard
time agreeing with the idea that it's even badly documented when there
are at least 10 counter-examples of blithely doing this. In any case,
I find it very confusing when you talk about things as established
practice/coding standards when they're very clearly not consistently
adhered to at all. You should at least say "let's not make a bad
situation any worse", or something, so that I don't need to spend 10
minutes pulling my hair out.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Andres Freund

On 2017-02-09 17:12:58 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 4:57 PM, Andres Freund  wrote:
> > Meh.  I don't think that's a serious problem, nor is without precedent
> > (say same toplevel DDL with differing lock levels depending on
> > options). Nor do the function names confer that any more efficiently
> > than parameters.
> 
> Really?  check_btree(blah, true) vs. check_btree(blah, false) is no
> more or less clear than check_btree_for_plausible_errors(blah) vs.
> check_btree_in_unreasonable_detail(blah)?

Except that the proposed names aren't remotely like that... ;).

And I proposed documenting named parameters, and
check_btree(performing_check_requiring_exclusive_locks => true) is just
about as expressive.

> > Again, some parts of the code doing something bad isn't a good argument
> > for doing again. Releasing locks early is a bad pattern, because backend
> > code isn't generally safe against it, we have special code-paths for
> > catalog tables to support it for those.
> 
> I don't agree with that.  The reason we keep relation locks until the
> end of the transaction is to avoid surprising application code, not
> because the system can't tolerate it.  But here it's more likely that
> retaining the lock will surprise the user.

It's both.  A bunch of code paths rely on early release ony happening on
catalogs. E.g., and that's just the first thing that comes to mind,
cache invalidations which really only are guaranteed to be delivered and
visibile for other cache accesses, if the lock is only released after
the end of the transaction.  Object accesses like relation_open() accept
invalidations *after* acquiring the lock. Combined with lock releases
only happening *after* transaction commits that guarantees an up2date
view of the cache; but if others release the lock earlier and have cache
invalidations, that doesn't work anymore.  Now you can argue that it's
possibly harmless here because there's presumably no way this can ever
trigger cache invals - and you're probably right - but this isn't the
only place with such assumption and you'd definitely have to argue *why*
it's right.

There were other issues than this, but right of the top of my head I
only remember:
double
IndexBuildHeapRangeScan(Relation heapRelation,
..
/*
 * Since caller should hold ShareLock 
or better, normally
 * the only way to see this is if it 
was inserted earlier
 * in our own transaction.  However, it 
can happen in
 * system catalogs, since we tend to 
release write lock
 * before commit there.  Give a warning 
if neither case
 * applies.
 */
xwait = 
HeapTupleHeaderGetXmin(heapTuple->t_data);
if 
(!TransactionIdIsCurrentTransactionId(xwait))
{
if (!is_system_catalog)
elog(WARNING, 
"concurrent insert in progress within table \"%s\"",

RelationGetRelationName(heapRelation));

which isn't an issue here, but reinforces my point about the (badly
documented) assumption that we don't release locks on user relations
early.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Robert Haas

On Thu, Feb 9, 2017 at 4:57 PM, Andres Freund  wrote:
> On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote:
>> > I'd really like to have something like relation_check(), index_check()
>> > which calls the correct functions for the relevan index types (and
>> > possibly warns about !nbtree indexes not being checked?).
>>
>> I feel that that's something for an in-core facility that exposes this
>> through DDL. One good reason to not do that just yet is that I'm not
>> completely comfortable with what a uniform interface looks like here.
>> And frankly, I really just want to get this out of the way, to prove
>> the general concept of verification tooling. amcheck v1 is a good
>> start, but much work remains. I think that no one will take the idea
>> completely seriously until it is usable as a core extension. Let's get
>> the ball rolling.
>
> I'm not convinced at all by that argument.  If we want something in
> core, having this in contrib doesn't help much.  Nor do I see much
> point in explicit syntax.

I don't see any need for this to be in core, nor do I think we need
explicit syntax.  But given that, I also don't think that we need the
generalized infrastructure you proposed in the quoted section above.

>> I strongly disagree. I would agree if the lock strength were the same
>> in both cases, but it isn't. Varying the strength of heavyweight lock
>> taken based on an argument to a function is a massive foot-gun IMV.
>
> Meh.  I don't think that's a serious problem, nor is without precedent
> (say same toplevel DDL with differing lock levels depending on
> options). Nor do the function names confer that any more efficiently
> than parameters.

Really?  check_btree(blah, true) vs. check_btree(blah, false) is no
more or less clear than check_btree_for_plausible_errors(blah) vs.
check_btree_in_unreasonable_detail(blah)?

>> I do think that that's the logical way to split out functionality. I
>> don't think that there is likely to be a problem with adding stuff in
>> the future. It'll either be something that we can do in passing
>> anyway, in which case we can just add it to the existing checks
>> harmlessly. Or, it will be something like a tool that verifies the
>> heap is consistent with a given index, which is vastly more expensive
>> in terms of CPU and I/O costs, and yet is probably no worse than
>> bt_index_check() in terms of lock strength (AccessShareLock).
>
> But you quite possibly want to do all three in one pass - and that'll be
> ugly with your API.  With your API I'll not be able to do the parent
> check and the heap check at once.  Lest we add fourth function that does
> all of it.

This however seems like a good point.

> Again, some parts of the code doing something bad isn't a good argument
> for doing again. Releasing locks early is a bad pattern, because backend
> code isn't generally safe against it, we have special code-paths for
> catalog tables to support it for those.

I don't agree with that.  The reason we keep relation locks until the
end of the transaction is to avoid surprising application code, not
because the system can't tolerate it.  But here it's more likely that
retaining the lock will surprise the user.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Andres Freund

On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote:
> > I'd really like to have something like relation_check(), index_check()
> > which calls the correct functions for the relevan index types (and
> > possibly warns about !nbtree indexes not being checked?).
> 
> I feel that that's something for an in-core facility that exposes this
> through DDL. One good reason to not do that just yet is that I'm not
> completely comfortable with what a uniform interface looks like here.
> And frankly, I really just want to get this out of the way, to prove
> the general concept of verification tooling. amcheck v1 is a good
> start, but much work remains. I think that no one will take the idea
> completely seriously until it is usable as a core extension. Let's get
> the ball rolling.

I'm not convinced at all by that argument.  If we want something in
core, having this in contrib doesn't help much.  Nor do I see much
point in explicit syntax.


> >> +#define CHECK_SUPERUSER() { \
> >> + if (!superuser()) \
> >> + ereport(ERROR, \
> >> + 
> >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> >> +  (errmsg("must be superuser to use 
> >> verification functions"; }
> >
> > Please make this either a function or just copy - the macro doesn't buy
> > us anything except weirder looking syntax.
> 
> It's actually from pageinspect, but okay.

We have a lot of dodgy stuff in places ;)


> >> +PG_FUNCTION_INFO_V1(bt_index_check);
> >> +PG_FUNCTION_INFO_V1(bt_index_parent_check);
> >
> > I think it'd be good to avoid adding a lot of functions for each
> > additional type of check we can think of.  I think adding a named 'bool
> > check_parents = false' argument or such would be better.  Then we can
> > simply add more and more checks subsequently, and we can have
> > 'bool all_checks = false' so people can trivially check everything,
> > instead of having to add new checks in each new function.
> 
> I strongly disagree. I would agree if the lock strength were the same
> in both cases, but it isn't. Varying the strength of heavyweight lock
> taken based on an argument to a function is a massive foot-gun IMV.

Meh.  I don't think that's a serious problem, nor is without precedent
(say same toplevel DDL with differing lock levels depending on
options). Nor do the function names confer that any more efficiently
than parameters.


> I do think that that's the logical way to split out functionality. I
> don't think that there is likely to be a problem with adding stuff in
> the future. It'll either be something that we can do in passing
> anyway, in which case we can just add it to the existing checks
> harmlessly. Or, it will be something like a tool that verifies the
> heap is consistent with a given index, which is vastly more expensive
> in terms of CPU and I/O costs, and yet is probably no worse than
> bt_index_check() in terms of lock strength (AccessShareLock).

But you quite possibly want to do all three in one pass - and that'll be
ugly with your API.  With your API I'll not be able to do the parent
check and the heap check at once.  Lest we add fourth function that does
all of it.


> My thinking here is that users really ought to use bt_index_check() in
> almost all cases, not bt_index_parent_check(), which is more of a tool
> for hackers that are developing new B-Tree features. The distinction
> is blurry, in fact, but the lock strength requirements of
> bt_index_parent_check() make it vastly less useful.

A quite reasonable patter is to do this verification on standbys that
you can either bring up separately and/or just temporarily pause.  Or
just backups before declaring them valid.


> >> +/*
> >> + * bt_index_check(index regclass)
> >> + *
> >> + * Verify integrity of B-Tree index.
> >> + *
> >> + * Only acquires AccessShareLock on index relation.  Does not consider
> >> + * invariants that exist between parent/child pages.
> >> + */
> 
> > I'm *VERY* strongly against releasing locks on user relations early.
> 
> Why? pageinspect does this.

Again, some parts of the code doing something bad isn't a good argument
for doing again. Releasing locks early is a bad pattern, because backend
code isn't generally safe against it, we have special code-paths for
catalog tables to support it for those.


> >> + /*
> >> +  * We must lock table before index to avoid deadlocks.  However, if 
> >> the
> >> +  * passed indrelid isn't an index then IndexGetRelation() will fail.
> >> +  * Rather than emitting a not-very-helpful error message, postpone
> >> +  * complaining, expecting that the is-it-an-index test below will 
> >> fail.
> >> +  */
> >> + heapid = IndexGetRelation(indrelid, true);
> >> + if (OidIsValid(heapid))
> >> + heaprel = heap_open(heapid, ShareLock);
> >> + else
> >> + heaprel = NULL;
> >> +
> >> + /*
> >> +  * Open the target index relations separ

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Peter Geoghegan

On Thu, Feb 9, 2017 at 12:05 AM, Andres Freund  wrote:
> reindex_index isn't necessarily comparable, because
> RangeVarCallbackForReindexIndex() is, on first blush at least, careful
> enough about the locking, and reindex_relation only gets the index list
> after building the index:
> /*
>  * Find and lock index, and check permissions on table; use callback 
> to
>  * obtain lock on table first, to avoid deadlock hazard.  The lock 
> level
>  * used here must match the index lock obtained in reindex_index().
>  */
>
> I think you ought to do something similar.

Okay. Looking into it, but see my response to your later remarks below.

> I'd really like to have something like relation_check(), index_check()
> which calls the correct functions for the relevan index types (and
> possibly warns about !nbtree indexes not being checked?).

I feel that that's something for an in-core facility that exposes this
through DDL. One good reason to not do that just yet is that I'm not
completely comfortable with what a uniform interface looks like here.
And frankly, I really just want to get this out of the way, to prove
the general concept of verification tooling. amcheck v1 is a good
start, but much work remains. I think that no one will take the idea
completely seriously until it is usable as a core extension. Let's get
the ball rolling.

>> +#define CHECK_SUPERUSER() { \
>> + if (!superuser()) \
>> + ereport(ERROR, \
>> + 
>> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
>> +  (errmsg("must be superuser to use 
>> verification functions"; }
>
> Please make this either a function or just copy - the macro doesn't buy
> us anything except weirder looking syntax.

It's actually from pageinspect, but okay.

>> +#define CORRUPTION   ERROR
>> +#define CONCERN  DEBUG1
>> +#define POSITION DEBUG2
>> +#endif
>
> Please remove this block and use the normal elevels.

Okay.

>> +/*
>> + * State associated with verifying a B-Tree index
>> + */
>> +typedef struct BtreeCheckState
>> +{
>> + /*
>> +  * Unchanging state, established at start of verification:
>> +  *
>> +  * rel: B-Tree Index Relation
>> +  * exclusivelock:   ExclusiveLock held on rel; else AccessShareLock
>> +  * checkstrategy:   Buffer access strategy
>> +  * targetcontext:   Target page memory context
>> +  */
>> + Relationrel;
>> + boolexclusivelock;
>> + BufferAccessStrategycheckstrategy;
>> + MemoryContext   targetcontext;
>
> Can you move the field comments to the individual fields? Separating
> them makes it more likely to get out of date (obviously keep the
> header).  I'd also rename 'exclusivelock' to 'exclusivelylocked' or
> such.

Okay.

>> + /*
>> +  * Mutable state, for verification of particular page:
>> +  *
>> +  * target:  Main target page
>> +  * targetblock: Main target page's block number
>> +  * targetlsn:   Main target page's LSN
>
> Same here.

Okay.

>> +PG_FUNCTION_INFO_V1(bt_index_check);
>> +PG_FUNCTION_INFO_V1(bt_index_parent_check);
>
> I think it'd be good to avoid adding a lot of functions for each
> additional type of check we can think of.  I think adding a named 'bool
> check_parents = false' argument or such would be better.  Then we can
> simply add more and more checks subsequently, and we can have
> 'bool all_checks = false' so people can trivially check everything,
> instead of having to add new checks in each new function.

I strongly disagree. I would agree if the lock strength were the same
in both cases, but it isn't. Varying the strength of heavyweight lock
taken based on an argument to a function is a massive foot-gun IMV. I
do think that that's the logical way to split out functionality. I
don't think that there is likely to be a problem with adding stuff in
the future. It'll either be something that we can do in passing
anyway, in which case we can just add it to the existing checks
harmlessly. Or, it will be something like a tool that verifies the
heap is consistent with a given index, which is vastly more expensive
in terms of CPU and I/O costs, and yet is probably no worse than
bt_index_check() in terms of lock strength (AccessShareLock).

My thinking here is that users really ought to use bt_index_check() in
almost all cases, not bt_index_parent_check(), which is more of a tool
for hackers that are developing new B-Tree features. The distinction
is blurry, in fact, but the lock strength requirements of
bt_index_parent_check() make it vastly less useful. Especially
considering that it seems fairly unlikely that it will ever catch a
problem that bt_index_check() would not have. bt_index_parent_check()
will become a lot more i

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Andres Freund

On 2017-02-09 14:48:18 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 3:05 AM, Andres Freund  wrote:
> >> +#define CHECK_SUPERUSER() { \
> >> + if (!superuser()) \
> >> + ereport(ERROR, \
> >> + 
> >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> >> +  (errmsg("must be superuser to use 
> >> verification functions"; }
> >
> > Please make this either a function or just copy - the macro doesn't buy
> > us anything except weirder looking syntax.
>
> Better yet, remove the check altogether and just revoke access from
> public.  Then the default policy is superuser-only, but the DBA can
> grant access to who they wish.

That works for me as well.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Robert Haas

On Thu, Feb 9, 2017 at 3:05 AM, Andres Freund  wrote:
>> +#define CHECK_SUPERUSER() { \
>> + if (!superuser()) \
>> + ereport(ERROR, \
>> + 
>> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
>> +  (errmsg("must be superuser to use 
>> verification functions"; }
>
> Please make this either a function or just copy - the macro doesn't buy
> us anything except weirder looking syntax.

Better yet, remove the check altogether and just revoke access from
public.  Then the default policy is superuser-only, but the DBA can
grant access to who they wish.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-02-09 Thread Andres Freund

Hi,

On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote:
> Andres said: "Theoretically we should recheck that the oids still
> match, theoretically the locking here allows the index->table mapping
> to change". I don't know how to act on this feedback, since comparable
> index + heap locking code should have the same problem, but doesn't
> consider it. How is this any different to reindex_index()?

reindex_index isn't necessarily comparable, because
RangeVarCallbackForReindexIndex() is, on first blush at least, careful
enough about the locking, and reindex_relation only gets the index list
after building the index:
/*
 * Find and lock index, and check permissions on table; use callback to
 * obtain lock on table first, to avoid deadlock hazard.  The lock level
 * used here must match the index lock obtained in reindex_index().
 */

I think you ought to do something similar.

I'd really like to have something like relation_check(), index_check()
which calls the correct functions for the relevan index types (and
possibly warns about !nbtree indexes not being checked?).

> @@ -0,0 +1,1227 @@
> +/*-
> + *
> + * verify_nbtree.c
> + *   Verifies the integrity of nbtree indexes based on invariants.
> + *
> + * For B-Tree indexes, verification includes the invariant that each page in
> + * the target index has items in logical order as reported by an insertion
> + * scankey (the insertion scankey sort-wise NULL semantics are needed for
> + * verification).

I'd appreciate some rephrasing of this paragraph. "includes the
invariant that" isn't that easy to understand for an ESL person.

> + *
> + * Copyright (c) 2016, PostgreSQL Global Development Group

... ;)

> +#define CHECK_SUPERUSER() { \
> + if (!superuser()) \
> + ereport(ERROR, \
> + 
> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> +  (errmsg("must be superuser to use 
> verification functions"; }

Please make this either a function or just copy - the macro doesn't buy
us anything except weirder looking syntax.

> +/*
> + * Callers to verification functions should never receive a false positive
> + * indication of corruption.  Therefore, when using amcheck functions for
> + * stress testing, it may be useful to temporally change the CORRUPTION 
> elevel
> + * to PANIC, to immediately halt the server in the event of detecting an
> + * invariant condition violation.  This may preserve more information about 
> the
> + * nature of the underlying problem.  Note that modifying the CORRUPTION
> + * constant to be an elevel < ERROR is not well tested.
> + */
> +#ifdef NOT_USED
> +#define CORRUPTION   PANIC
> +#define CONCERN  WARNING
> +#define POSITION NOTICE
> +#else
> +#define CORRUPTION   ERROR
> +#define CONCERN  DEBUG1
> +#define POSITION DEBUG2
> +#endif

Please remove this block and use the normal elevels.

> +/*
> + * State associated with verifying a B-Tree index
> + */
> +typedef struct BtreeCheckState
> +{
> + /*
> +  * Unchanging state, established at start of verification:
> +  *
> +  * rel: B-Tree Index Relation
> +  * exclusivelock:   ExclusiveLock held on rel; else AccessShareLock
> +  * checkstrategy:   Buffer access strategy
> +  * targetcontext:   Target page memory context
> +  */
> + Relationrel;
> + boolexclusivelock;
> + BufferAccessStrategycheckstrategy;
> + MemoryContext   targetcontext;

Can you move the field comments to the individual fields? Separating
them makes it more likely to get out of date (obviously keep the
header).  I'd also rename 'exclusivelock' to 'exclusivelylocked' or
such.

> + /*
> +  * Mutable state, for verification of particular page:
> +  *
> +  * target:  Main target page
> +  * targetblock: Main target page's block number
> +  * targetlsn:   Main target page's LSN

Same here.

> +PG_FUNCTION_INFO_V1(bt_index_check);
> +PG_FUNCTION_INFO_V1(bt_index_parent_check);

I think it'd be good to avoid adding a lot of functions for each
additional type of check we can think of.  I think adding a named 'bool
check_parents = false' argument or such would be better.  Then we can
simply add more and more checks subsequently, and we can have
'bool all_checks = false' so people can trivially check everything,
instead of having to add new checks in each new function.

> +/*
> + * bt_index_check(index regclass)
> + *
> + * Verify integrity of B-Tree index.
> + *
> + * Only acquires AccessShareLock on index relation.  Does not consider
> + * invariants that exist between parent/child pages.
> + */
> +Datum

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-01-30 Thread Andres Freund

Hi Heikki,

On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote:
> I attach V4 of amcheck.

Is there any chance you could look at the concurrency related parts of
amcheck?  I'd like to push this to be mergeable, but that area is a bit
outside of my area of expertise...

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2017-01-30 Thread Michael Paquier

On Mon, Dec 5, 2016 at 2:15 PM, Haribabu Kommi  wrote:
> Moved to next CF with " needs review" status.

Same, to CF 2017-03 this time.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-12-04 Thread Haribabu Kommi

On Wed, Nov 30, 2016 at 9:06 AM, Peter Geoghegan  wrote:

> On Wed, Nov 23, 2016 at 2:47 PM, Thomas Munro
>  wrote:
> > Actually I meant that at T2 the btree is exactly same as it was at T1,
> > but the order of the values has changed according to the comparator
> > (for example, because a collation definition changed when you updated
> > your operating system), so that the btree is now corrupted.  Based on
> > Goetz Graefe's paper I was wondering if amcheck would be unable to
> > detect this due to the cousin problem.
>
> I love that paper (it's more like a book, really). I'm glad that at
> least one other person in the community has read some of it, because I
> think it has a lot of clues as to how we can begin to address some of
> our problems around bloat without novel redesign. Most of the
> techniques are quite complementary.
>
> I wrote a prototype of suffix truncation for text in two days last
> week. The regression tests pass, and the technique works well for
> certain cases, particularly with many column indexes without
> correlation (the index-only-scan case). Let's see if that becomes a
> real patch.
>
> > Thank you for your patient explanation!
>
> Your questions were excellent, and so deserved my careful consideration.
>
> > Please see my humble attempt
> > to test this scenario and learn something about btrees in the process,
> > attached.  amcheck does detect the violation of the high key
> > invariant.
>
> This is a nice pattern for testing out if the keyspace is sane, and
> how things work. I might incorporate it into my own testing in the
> future.


Moved to next CF with " needs review" status.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-29 Thread Peter Geoghegan

On Wed, Nov 23, 2016 at 2:47 PM, Thomas Munro
 wrote:
> Actually I meant that at T2 the btree is exactly same as it was at T1,
> but the order of the values has changed according to the comparator
> (for example, because a collation definition changed when you updated
> your operating system), so that the btree is now corrupted.  Based on
> Goetz Graefe's paper I was wondering if amcheck would be unable to
> detect this due to the cousin problem.

I love that paper (it's more like a book, really). I'm glad that at
least one other person in the community has read some of it, because I
think it has a lot of clues as to how we can begin to address some of
our problems around bloat without novel redesign. Most of the
techniques are quite complementary.

I wrote a prototype of suffix truncation for text in two days last
week. The regression tests pass, and the technique works well for
certain cases, particularly with many column indexes without
correlation (the index-only-scan case). Let's see if that becomes a
real patch.

> Thank you for your patient explanation!

Your questions were excellent, and so deserved my careful consideration.

> Please see my humble attempt
> to test this scenario and learn something about btrees in the process,
> attached.  amcheck does detect the violation of the high key
> invariant.

This is a nice pattern for testing out if the keyspace is sane, and
how things work. I might incorporate it into my own testing in the
future.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-23 Thread Thomas Munro

On Fri, Nov 18, 2016 at 12:17 PM, Peter Geoghegan  wrote:
> So, the Btree you show is supposed to be good? You've left it up to me
> to draw T2, the corrupt version? If so, here is a drawing of T2 for
> the sake of clarity:
>
>   Btree   =[a|b]
>/   \
>   / \
>  /   \
>[a]---[b]
> | |
> | |
>[c]---[d]

Actually I meant that at T2 the btree is exactly same as it was at T1,
but the order of the values has changed according to the comparator
(for example, because a collation definition changed when you updated
your operating system), so that the btree is now corrupted.  Based on
Goetz Graefe's paper I was wondering if amcheck would be unable to
detect this due to the cousin problem.

>> Now c and b have been switched around.  Won't amcheck still pass since
>> we only verify immediate downlinks and siblings?  Yet searches for b
>> will take a wrong turn at the root.  Do I have this right?  I wonder
>> if there is a way to verify that each page's high key < the 'next' key
>> for all ancestors.
>
> I don't think you have it right, because your argument is predicated
> on classic L&Y, not the Postgres variant. The high key is literally
> guaranteed to be *equal* to the initial first value of the right half
> of a split page post-split (at least initially), which is where the
> new downlink for the new right half comes from in turn, so it will be
> exactly equal as well. There is a comment which says all this within
> _bt_insert_parent(), actually.
>
> In general, I think it's pretty bad that we're so loose about
> representing the key space in downlinks compared to classic L&Y,
> because it probably makes index bloat a lot worse in internal pages,
> messing up the key space long term. As long as we're doing that,
> though, we clearly cannot usefully "verify that each page's high key <
> the 'next' key for all ancestors", because we deviate from L&Y in a
> way that makes that check always fail. It's okay that it always fails
> for us (it doesn't actually indicate corruption), because of our
> duplicate-aware index scan logic (change to invariant).
>
> In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
> cousin problem a non-issue, because we don't follow downlinks that are
> equal to our scankey -- we go to their immediate left and start
> *there*, in order to support duplicates in leaf pages. Index scans for
> 'b' work for both T1 and T2 in Postgres, because we refuse to follow
> the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
> would actually like to change things to make the invariant the classic
> L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
> suffix truncation in internal pages work.
>
> So, we don't have the cousin problem, but since I wish for us to adopt
> the stricter classic L&Y invariant at some point, you could say that I
> also wished that we had the cousin problem.

Thank you for your patient explanation!  Please see my humble attempt
to test this scenario and learn something about btrees in the process,
attached.  amcheck does detect the violation of the high key
invariant.

> Confused yet? :-)

Yes.  But it's getting better slowly :-)

-- 
Thomas Munro
http://www.enterprisedb.com
create extension if not exists amcheck;
create extension if not exists pageinspect;

drop table if exists my_table;
drop type if exists my_type;

-- define an enum ordered a, b, c, d, but do it in a funky way that results in
-- odd OIDs that we can later mess around with
create type my_type as enum ('d');
alter type my_type add value 'c' before 'd';
alter type my_type add value 'b' before 'c';
alter type my_type add value 'a' before 'b';

create table my_table (my_column my_type, counter serial);
create index my_index on my_table(my_column);

create or replace function pretty_print_btree_recurse(pageno int,
  height int,
  level int)
returns void language plpgsql as
$$
declare
  v_spaces text := rpad('', (height - level) * 2, ' ');
  v_ctid text;
  v_subpageno int;
  v_key_hex text;
  v_key_oid oid;
  v_key_label text;
  v_page_description text;
  v_item int;
begin
  if height = level then
v_page_description := 'root';
  elsif level = 0 then
v_page_description := 'leaf';
  else
v_page_description := 'intl';
  end if;
  for v_ctid, v_key_hex, v_item in
  select ctid::text, data, itemoffset
  from bt_page_items('my_index', pageno)
  loop
if v_ctid like '(4294967295,%' then
  continue;
end if;
v_subpageno := (v_ctid::point)[0];
v_key_oid := ('x' || substring(v_key_hex, 1, 2))::bit(8)::int +
 (('x' || substring(v_key_hex, 4, 2))::bit(8)::int << 8);
v_key_label := coalesce((select enumlabel from pg_enum where oid = 
v_key_

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-22 Thread Peter Geoghegan

On Sun, Nov 20, 2016 at 3:42 PM, Robert Haas  wrote:
> OK.  If it's not reasonable to continue checking after an ERROR, then
> I think ERROR is the way to go.  If somebody really doesn't like that
> lack of flexibility (in either direction), they can propose a change
> later for separate consideration.  That limitation is not, in my view,
> a sufficient reason to hold up the patch on the table.

I attach V4 of amcheck. I decided to leave things as they are, as far
as forcing a different elevel goes (you still need to modify a macro).
I didn't change the elevel macro from CONCERN in a couple of cases, as
Andres suggested, because in fact that's generally the same as DEBUG1,
not WARNING (I agreed with Andres that it was WARNING before, but it
isn't unless you #define NOT_USED).

Andres said: "Theoretically we should recheck that the oids still
match, theoretically the locking here allows the index->table mapping
to change". I don't know how to act on this feedback, since comparable
index + heap locking code should have the same problem, but doesn't
consider it. How is this any different to reindex_index()?

Other than that, all feedback items were worked through. I made the
functions PARALLEL SAFE, too, since I noticed that that wasn't the
case in passing.

-- 
Peter Geoghegan
From 21c843ae0193ed17b2e5234d67f2e73f7015b3cd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 10 Jun 2014 22:20:40 -0700
Subject: [PATCH] Add amcheck extension to contrib

The new extension makes available two SQL-callable functions, each of
which accept a single argument (a regclass/nbtree index relation
argument) and return void.

The SQL-callable function bt_index_check() checks invariants of the
target index by walking the tree, performing various checks of the
sanity of the page space using insertion scankeys to compare items.  The
SQL-callable function bt_index_parent_check() performs a superset of the
checks performed by bt_index_check(): the same checks, as well as
another check verifying that each downlink is actually respected as a
lower bound on its child page.

bt_index_check() requires only an AccessShareLock on the target.
bt_index_parent_check() requires an ExclusiveLock on the target,
and ShareLock on its heap relation.
---
 contrib/Makefile |1 +
 contrib/amcheck/Makefile |   19 +
 contrib/amcheck/amcheck--1.0.sql |   20 +
 contrib/amcheck/amcheck.control  |5 +
 contrib/amcheck/verify_nbtree.c  | 1227 ++
 doc/src/sgml/amcheck.sgml|  275 +
 doc/src/sgml/contrib.sgml|1 +
 doc/src/sgml/filelist.sgml   |1 +
 8 files changed, 1549 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 contrib/amcheck/verify_nbtree.c
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 25263c0..4acd50e 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global

 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		bloom		\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 000..f14c7aa
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= verify_nbtree.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 000..b415f0e
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
new file mode 100644
index 000..180f150
--- /dev/null
+++ b/contrib/amcheck/amcheck.control
@@ -0,0 +1,5 @@
+# amcheck extension
+comment = 'verify access method invariants'
+default_version = '1.0'
+module_pathname = '$libdir/amcheck'
+relocatable = true
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
new file

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-20 Thread Robert Haas

On Sat, Nov 19, 2016 at 11:38 PM, Peter Geoghegan  wrote:
> On Sat, Nov 19, 2016 at 6:45 PM, Robert Haas  wrote:
>>> What do you think about new argument with default vs. GUC? I guess
>>> that the GUC might be a lot less of a foot-gun. We might even give it
>>> a suitably scary name, to indicate that it will make the server PANIC.
>>> (I gather that you don't care about other aspects of verbosity -- just
>>> about the ability to make amcheck PANIC in the event of an invariant
>>> violation without recompiling it.)
>>
>> Yikes.  I don't think I want to expose any kind of API that lets the
>> user PANIC the server.  A value < ERROR sounds far more reasonable
>> than a value > ERROR.
>
> In general, I don't want to get into the business of reasoning about
> how well we can limp along when there is a would-be error condition
> within amcheck. Once "the impossible" has actually occurred, it's very
> difficult to reason about what still works. Also, I actually agree
> that making it possible for the tool to force a PANIC through a
> user-visible interface is a bad idea.
>
> Maybe we should just leave it as it is -- experts can recompile the
> tool after modifying it to use an elevel that is != ERROR (the thing I
> mention about elevel < ERROR is already documented in code comments).
> If that breaks, they get to keep both halves.

OK.  If it's not reasonable to continue checking after an ERROR, then
I think ERROR is the way to go.  If somebody really doesn't like that
lack of flexibility (in either direction), they can propose a change
later for separate consideration.  That limitation is not, in my view,
a sufficient reason to hold up the patch on the table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-19 Thread Peter Geoghegan

On Sat, Nov 19, 2016 at 6:45 PM, Robert Haas  wrote:
>> What do you think about new argument with default vs. GUC? I guess
>> that the GUC might be a lot less of a foot-gun. We might even give it
>> a suitably scary name, to indicate that it will make the server PANIC.
>> (I gather that you don't care about other aspects of verbosity -- just
>> about the ability to make amcheck PANIC in the event of an invariant
>> violation without recompiling it.)
>
> Yikes.  I don't think I want to expose any kind of API that lets the
> user PANIC the server.  A value < ERROR sounds far more reasonable
> than a value > ERROR.

In general, I don't want to get into the business of reasoning about
how well we can limp along when there is a would-be error condition
within amcheck. Once "the impossible" has actually occurred, it's very
difficult to reason about what still works. Also, I actually agree
that making it possible for the tool to force a PANIC through a
user-visible interface is a bad idea.

Maybe we should just leave it as it is -- experts can recompile the
tool after modifying it to use an elevel that is != ERROR (the thing I
mention about elevel < ERROR is already documented in code comments).
If that breaks, they get to keep both halves.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-19 Thread Robert Haas

On Fri, Nov 18, 2016 at 6:51 PM, Peter Geoghegan  wrote:
> On Thu, Nov 17, 2016 at 12:04 PM, Peter Geoghegan  wrote:
>>> Hm, if we want that - and it doesn't seem like a bad idea - I think we
>>> should be make it available without recompiling.
>>
>> I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
>> might be broken if it was allowed.
>
> What do you think about new argument with default vs. GUC? I guess
> that the GUC might be a lot less of a foot-gun. We might even give it
> a suitably scary name, to indicate that it will make the server PANIC.
> (I gather that you don't care about other aspects of verbosity -- just
> about the ability to make amcheck PANIC in the event of an invariant
> violation without recompiling it.)

Yikes.  I don't think I want to expose any kind of API that lets the
user PANIC the server.  A value < ERROR sounds far more reasonable
than a value > ERROR.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-18 Thread Peter Geoghegan

On Thu, Nov 17, 2016 at 12:04 PM, Peter Geoghegan  wrote:
>> Hm, if we want that - and it doesn't seem like a bad idea - I think we
>> should be make it available without recompiling.
>
> I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
> might be broken if it was allowed.

What do you think about new argument with default vs. GUC? I guess
that the GUC might be a lot less of a foot-gun. We might even give it
a suitably scary name, to indicate that it will make the server PANIC.
(I gather that you don't care about other aspects of verbosity -- just
about the ability to make amcheck PANIC in the event of an invariant
violation without recompiling it.)

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-17 Thread Peter Geoghegan

On Wed, Nov 16, 2016 at 5:06 PM, Thomas Munro
 wrote:
> + * Ideally, we'd compare every item in the index against every other
> + * item in the index, and not trust opclass obedience of the transitive
> + * law to bridge the gap between children and their grandparents (as
> + * well as great-grandparents, and so on).  We don't go to those
> + * lengths because that would be prohibitively expensive, and probably
> + * not markedly more effective in practice.
> + */
>
> I skimmed the Modern B-Tree Techniques paper recently, and there was a
> section on "the cousin problem" when verifying btrees, which this
> comment reminded me of.

I'm impressed that you made the connection. This comment reminded you
of that section because it was directly influenced by it. I actually
use the term "cousin page" elsewhere, because it's really useful to
distinguish cousins from "true siblings" when talking about page
deletion in particular...and I have a lot to say about page deletion.

> I tried to come up with an example of a pair
> of characters being switched in a collation that would introduce
> undetectable corruption:
>
> T1.  Order   = a < b < c < d
>
>  Btree   =[a|c]
>   /   \
>  / \
> /   \
>   [a]---[c]
>| |
>| |
>   [b]---[d]
>
> T2.  Order   = a < c < b < d

So, the Btree you show is supposed to be good? You've left it up to me
to draw T2, the corrupt version? If so, here is a drawing of T2 for
the sake of clarity:

  Btree   =[a|b]
   /   \
  / \
 /   \
   [a]---[b]
| |
| |
   [c]---[d]

> Now c and b have been switched around.  Won't amcheck still pass since
> we only verify immediate downlinks and siblings?  Yet searches for b
> will take a wrong turn at the root.  Do I have this right?  I wonder
> if there is a way to verify that each page's high key < the 'next' key
> for all ancestors.

I don't think you have it right, because your argument is predicated
on classic L&Y, not the Postgres variant. The high key is literally
guaranteed to be *equal* to the initial first value of the right half
of a split page post-split (at least initially), which is where the
new downlink for the new right half comes from in turn, so it will be
exactly equal as well. There is a comment which says all this within
_bt_insert_parent(), actually.

In general, I think it's pretty bad that we're so loose about
representing the key space in downlinks compared to classic L&Y,
because it probably makes index bloat a lot worse in internal pages,
messing up the key space long term. As long as we're doing that,
though, we clearly cannot usefully "verify that each page's high key <
the 'next' key for all ancestors", because we deviate from L&Y in a
way that makes that check always fail. It's okay that it always fails
for us (it doesn't actually indicate corruption), because of our
duplicate-aware index scan logic (change to invariant).

In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
cousin problem a non-issue, because we don't follow downlinks that are
equal to our scankey -- we go to their immediate left and start
*there*, in order to support duplicates in leaf pages. Index scans for
'b' work for both T1 and T2 in Postgres, because we refuse to follow
the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
would actually like to change things to make the invariant the classic
L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
suffix truncation in internal pages work.

So, we don't have the cousin problem, but since I wish for us to adopt
the stricter classic L&Y invariant at some point, you could say that I
also wished that we had the cousin problem.

Confused yet? :-)

[1] https://archive.org/stream/symmetricconcurr00lani#page/6/mode/2up
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-17 Thread Peter Geoghegan

On Wed, Nov 16, 2016 at 1:03 PM, Andres Freund  wrote:
> I think the patch could use a pgindent run.

Okay.

> I'd really want a function that runs all check on a table.

Why not just use a custom SQL query? The docs have an example of one
such query, that verifies all user indexes, but there could be another
example.

>> +#define CHECK_SUPERUSER() { \
>> + if (!superuser()) \
>> + ereport(ERROR, \
>> + 
>> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
>> +  (errmsg("must be superuser to use 
>> verification functions"; }
>
> Why is this a macro?

This was just copied from pageinspect, TBH. What would you prefer?

>> +#ifdef NOT_USED
>> +#define CORRUPTION   PANIC
>> +#define CONCERN  WARNING
>> +#define POSITION NOTICE
>> +#else
>> +#define CORRUPTION   ERROR
>> +#define CONCERN  DEBUG1
>> +#define POSITION DEBUG2
>> +#endif
>
> Hm, if we want that - and it doesn't seem like a bad idea - I think we
> should be make it available without recompiling.

I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
might be broken if it was allowed.

> Hm, I think it'd be, for the long term, better if we'd move the btree
> check code to amcheck_nbtree.c or such.

Agreed.

>> +Datum
>> +bt_index_parent_check(PG_FUNCTION_ARGS)
>> +{
>> + Oid indrelid = PG_GETARG_OID(0);

> Theoretically we should recheck that the oids still match, theoretically
> the locking here allows the index->table mapping to change. It's not a
> huge window tho...

>> + /* Check for active uses of the index in the current transaction */
>> + CheckTableNotInUse(indrel, "bt_index_parent_check");
>
> Why do we actually care?

This was left-over from duplicating REINDEX code. Indeed, we have no
reason to care at all.

>> + if (!rel->rd_index->indisready)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> +  errmsg("cannot check index \"%s\"",
>> + RelationGetRelationName(rel)),
>> +  errdetail("Index is not yet ready for 
>> insertions")));
>
> Why can't we check this? Seems valuable to allow checking after a
> partial CONCURRENTLY index build?

Okay.

> Why don't you use IndexIsReady et al? Not sure if the patch compiles
> against an earlier release (there weren't valid/ready/live dead fields,
> but instead it was compressed into fewer)?

This version is almost identical to the Github version of amcheck that
we've been using with 9.4+. Agreed on IndexIsReady().

>> +/*
>> + * bt_check_every_level()
>> + *
>> + * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
>> + * logical order, verifying invariants as it goes.
>> + *
>> + * It is the caller's responsibility to acquire appropriate heavyweight 
>> lock on
>> + * the index relation, and advise us if extra checks are safe when an
>> + * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent 
>> any
>> + * kind of physical modification to the index structure, including
>> + * modifications that VACUUM may make.
>> + */
>
> Hm, what kind of locking does the kill_prior_tuple stuff use?

The killing itself, or the setting of LP_DEAD bit? Ordinary index
scans could concurrently set LP_DEAD with only an AccessShareLock.
But, that's just metadata that amcheck doesn't care about. Anything
that needs to actually recyle space based on finding an LP_DEAD bit
set is a write operation, that is precluded from running when
bt_index_parent_check() is called by its ExclusiveLock.
_bt_vacuum_one_page() is only called by one nbtinsert.c code path,
when it looks like a page split might be needed.

>> +static void
>> +bt_check_every_level(Relation rel, bool exclusivelock)
>> +{
>
>> + /*
>> +  * Certain deletion patterns can result in "skinny" B-Tree indexes, 
>> where
>> +  * the fast root and true root differ.
>> +  *
>> +  * Start from the true root, not the fast root, unlike conventional 
>> index
>> +  * scans.  This approach is more thorough, and removes the risk of
>> +  * following a stale fast root from the meta page.
>> +  */
>> + if (metad->btm_fastroot != metad->btm_root)
>> + ereport(CONCERN,
>> + (errcode(ERRCODE_DUPLICATE_OBJECT),
>> +  errmsg("fast root mismatch in index %s",
>> + RelationGetRelationName(rel)),
>> +  errdetail_internal("Fast block %u (level %u) "
>> + 
>> "differs from true root "
>> + "block 
>> %u (level %u).",
>> +

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-16 Thread Thomas Munro

On Thu, Sep 8, 2016 at 6:44 AM, Peter Geoghegan  wrote:
> On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner  wrote:
>> IMV the process is to post a patch to this list to certify that it
>> is yours to contribute and free of IP encumbrances that would
>> prevent us from using it.  I will wait for that post.
>
> I attach my V3

+ * Ideally, we'd compare every item in the index against every other
+ * item in the index, and not trust opclass obedience of the transitive
+ * law to bridge the gap between children and their grandparents (as
+ * well as great-grandparents, and so on).  We don't go to those
+ * lengths because that would be prohibitively expensive, and probably
+ * not markedly more effective in practice.
+ */

I skimmed the Modern B-Tree Techniques paper recently, and there was a
section on "the cousin problem" when verifying btrees, which this
comment reminded me of.  I tried to come up with an example of a pair
of characters being switched in a collation that would introduce
undetectable corruption:

T1.  Order   = a < b < c < d

 Btree   =[a|c]
  /   \
 / \
/   \
  [a]---[c]
   | |
   | |
  [b]---[d]

T2.  Order   = a < c < b < d

Now c and b have been switched around.  Won't amcheck still pass since
we only verify immediate downlinks and siblings?  Yet searches for b
will take a wrong turn at the root.  Do I have this right?  I wonder
if there is a way to verify that each page's high key < the 'next' key
for all ancestors.

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-16 Thread Andres Freund

Hi,


I think the patch could use a pgindent run.

On 2016-09-07 11:44:01 -0700, Peter Geoghegan wrote:
> diff --git a/contrib/amcheck/amcheck--1.0.sql 
> b/contrib/amcheck/amcheck--1.0.sql
> new file mode 100644
> index 000..ebbd6ac
> --- /dev/null
> +++ b/contrib/amcheck/amcheck--1.0.sql
> @@ -0,0 +1,20 @@
> +/* contrib/amcheck/amcheck--1.0.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
> +
> +--
> +-- bt_index_check()
> +--
> +CREATE FUNCTION bt_index_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT;
> +
> +--
> +-- bt_index_parent_check()
> +--
> +CREATE FUNCTION bt_index_parent_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_parent_check'
> +LANGUAGE C STRICT;

I'd really want a function that runs all check on a table.


> +#define CHECK_SUPERUSER() { \
> + if (!superuser()) \
> + ereport(ERROR, \
> + 
> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> +  (errmsg("must be superuser to use 
> verification functions"; }

Why is this a macro?


> +/*
> + * Callers to verification functions should never receive a false positive
> + * indication of corruption.  Therefore, when using amcheck functions for
> + * stress testing, it may be useful to temporally change the CORRUPTION 
> elevel
> + * to PANIC, to immediately halt the server in the event of detecting an
> + * invariant condition violation.  This may preserve more information about 
> the
> + * nature of the underlying problem.  Note that modifying the CORRUPTION
> + * constant to be an elevel < ERROR is not well tested.
> + */
> +#ifdef NOT_USED
> +#define CORRUPTION   PANIC
> +#define CONCERN  WARNING
> +#define POSITION NOTICE
> +#else
> +#define CORRUPTION   ERROR
> +#define CONCERN  DEBUG1
> +#define POSITION DEBUG2
> +#endif

Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.


> +/*
> + * A B-Tree cannot possibly have this many levels, since there must be one
> + * block per level, which is bound by the range of BlockNumber:
> + */
> +#define InvalidBtreeLevel((uint32) InvalidBlockNumber)

Hm, I think it'd be, for the long term, better if we'd move the btree
check code to amcheck_nbtree.c or such.


> +Datum
> +bt_index_parent_check(PG_FUNCTION_ARGS)
> +{
> + Oid indrelid = PG_GETARG_OID(0);
> + Oid heapid;
> + Relationindrel;
> + Relationheaprel;
> +
> + CHECK_SUPERUSER();
> +
> + /*
> +  * We must lock table before index to avoid deadlocks.  However, if the
> +  * passed indrelid isn't an index then IndexGetRelation() will fail.
> +  * Rather than emitting a not-very-helpful error message, postpone
> +  * complaining, expecting that the is-it-an-index test below will fail.
> +  */
> + heapid = IndexGetRelation(indrelid, true);
> + if (OidIsValid(heapid))
> + heaprel = heap_open(heapid, ShareLock);
> + else
> + heaprel = NULL;
> +
> + /*
> +  * Open the target index relations separately (like relation_openrv(), 
> but
> +  * with heap relation locked first to prevent deadlocking).  In hot 
> standby
> +  * mode this will raise an error.
> +  */
> + indrel = index_open(indrelid, ExclusiveLock);

Theoretically we should recheck that the oids still match, theoretically
the locking here allows the index->table mapping to change. It's not a
huge window tho...


> + /* Check for active uses of the index in the current transaction */
> + CheckTableNotInUse(indrel, "bt_index_parent_check");

Why do we actually care?



> +static void
> +btree_index_checkable(Relation rel)
> +{
> + if (rel->rd_rel->relkind != RELKIND_INDEX ||
> + rel->rd_rel->relam != BTREE_AM_OID)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +  errmsg("only nbtree access method indexes are 
> supported"),
> +  errdetail("Relation \"%s\" is not a B-Tree 
> index.",
> +
> RelationGetRelationName(rel;
> +
> + if (RELATION_IS_OTHER_TEMP(rel))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +  errmsg("cannot access temporary tables of 
> other sessions"),
> +  errdetail("Index \"%s\" is associated with 
> temporary relation.",
> +
> RelationGetRelationName(rel;
> +
> + if (!rel->rd_index->indisready)
> + ereport(ERROR,
>

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-10-16 Thread Michael Paquier

On Mon, Oct 17, 2016 at 10:46 AM, Noah Misch  wrote:
> - Detect impossible conditions in the hint bits.  A tuple should not have both
>   HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID.  Every tuple bearing
>   HEAP_ONLY_TUPLE should bear HEAP_UPDATED.  HEAP_HASVARWIDTH should be true
>   if and only if the tuple has a non-NULL value in a negative-typlen column,
>   possibly a dropped column.  A tuple should not have both HEAP_KEYS_UPDATED
>   and HEAP_XMAX_LOCK_ONLY.
>
> - Report evidence of upgrades from older versions.  If the tool sees
>   HEAP_MOVED_IN or HEAP_MOVED_OFF, it can report that the cluster was
>   binary-upgraded from 8.3 or 8.4.  If the user did not upgrade from such a
>   version, the user should assume corruption.
>
> - Check VARSIZE() of each variable-length datum.  Corrupt lengths might direct
>   you to seek past the end of the tuple, or they might imply excess free space
>   at the end of the tuple.
>
> - Verify agreement between CLOG, MULTIXACT, and hint bits.  If the hint bits
>   include HEAP_XMAX_LOCK_ONLY, the multixact should not contain a
>   MultiXactStatusUpdate member.  README.tuplock documents other invariants.
>   If the tool sees a tuple passing HEAP_LOCKED_UPGRADED, it can report that
>   the cluster was binary-upgraded from a version in [8.3, 9.1].
>
> - Verify that TOAST pointers (in non-dropped columns of visible tuples) point
>   to valid data in the TOAST relation.  This is much more expensive than the
>   other checks I've named, so it should be optional.

There is a lot of value in that actually. I got bitten lately by an
issue where this got corrupted because of an incorrect failover flow.

> - If VM_ALL_VISIBLE() or VM_ALL_FROZEN() passes for a particular page, verify
>   that the visibility data stored in the page is compatible with that claim.
>
> - Examine PageHeaderData values.  If pd_checksum is non-zero in a cluster with
>   checksums disabled, the cluster was binary-upgraded from [8.3, 9.2].

Agreed basically on all those items, though the first version does not
need to do everything :)

Honestly just btree checks have enough value to validate the
integration of this module. You can get that as "I'll look at your
patch and provide a review soon because I have a hell lot of cases
where this will be useful".
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-10-16 Thread Noah Misch

On Fri, Oct 14, 2016 at 04:56:39PM -0700, Peter Geoghegan wrote:
> On Mon, Feb 29, 2016 at 4:09 PM, Peter Geoghegan  wrote:
> > To recap, the extension adds some SQL-callable functions that verify
> > certain invariant conditions hold within some particular B-Tree index.
> > These are the conditions that index scans rely on always being true.
> > The tool's scope may eventually cover other AMs, including heapam, but
> > nbtree seems like the best place to start.
> 
> Noah and I discussed possible future directions for amcheck in person
> recently. I would like to get Noah's thoughts again here on how a tool
> like amcheck might reasonably target other access methods for
> verification. In particular, the heapam. MultiXacts were mentioned as
> a structure that could receive verification in a future iteration of
> this tool, but I lack expertise there.

Yes, a heap checker could examine plenty of things.  Incomplete list:

- Detect impossible conditions in the hint bits.  A tuple should not have both
  HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID.  Every tuple bearing
  HEAP_ONLY_TUPLE should bear HEAP_UPDATED.  HEAP_HASVARWIDTH should be true
  if and only if the tuple has a non-NULL value in a negative-typlen column,
  possibly a dropped column.  A tuple should not have both HEAP_KEYS_UPDATED
  and HEAP_XMAX_LOCK_ONLY.

- Report evidence of upgrades from older versions.  If the tool sees
  HEAP_MOVED_IN or HEAP_MOVED_OFF, it can report that the cluster was
  binary-upgraded from 8.3 or 8.4.  If the user did not upgrade from such a
  version, the user should assume corruption.

- Check VARSIZE() of each variable-length datum.  Corrupt lengths might direct
  you to seek past the end of the tuple, or they might imply excess free space
  at the end of the tuple.

- Verify agreement between CLOG, MULTIXACT, and hint bits.  If the hint bits
  include HEAP_XMAX_LOCK_ONLY, the multixact should not contain a
  MultiXactStatusUpdate member.  README.tuplock documents other invariants.
  If the tool sees a tuple passing HEAP_LOCKED_UPGRADED, it can report that
  the cluster was binary-upgraded from a version in [8.3, 9.1].

- Verify that TOAST pointers (in non-dropped columns of visible tuples) point
  to valid data in the TOAST relation.  This is much more expensive than the
  other checks I've named, so it should be optional.

- If VM_ALL_VISIBLE() or VM_ALL_FROZEN() passes for a particular page, verify
  that the visibility data stored in the page is compatible with that claim.

- Examine PageHeaderData values.  If pd_checksum is non-zero in a cluster with
  checksums disabled, the cluster was binary-upgraded from [8.3, 9.2].

> I've placed a lot of emphasis on the importance of having a
> low-overhead verification process, particularly in terms of the
> strength of heavyweight lock that the verification process requires.
> Ideally, it would be possible to run any new verification process in a
> fairly indiscriminate way with only limited impact on live production
> systems.

I suspect you could keep heap checker overhead similar to the cost of "SELECT
count(*) FROM table_name".


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-10-14 Thread Peter Geoghegan

On Mon, Feb 29, 2016 at 4:09 PM, Peter Geoghegan  wrote:
> To recap, the extension adds some SQL-callable functions that verify
> certain invariant conditions hold within some particular B-Tree index.
> These are the conditions that index scans rely on always being true.
> The tool's scope may eventually cover other AMs, including heapam, but
> nbtree seems like the best place to start.

Noah and I discussed possible future directions for amcheck in person
recently. I would like to get Noah's thoughts again here on how a tool
like amcheck might reasonably target other access methods for
verification. In particular, the heapam. MultiXacts were mentioned as
a structure that could receive verification in a future iteration of
this tool, but I lack expertise there.

I've placed a lot of emphasis on the importance of having a
low-overhead verification process, particularly in terms of the
strength of heavyweight lock that the verification process requires.
Ideally, it would be possible to run any new verification process in a
fairly indiscriminate way with only limited impact on live production
systems.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-10-06 Thread Michael Paquier

On Fri, Oct 7, 2016 at 8:05 AM, Peter Geoghegan  wrote:
> Your customer
> databases might feature far more use of Japanese collations, for
> example, which might be an important factor.

Not really :)
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-10-06 Thread Peter Geoghegan

On Sun, Oct 2, 2016 at 6:48 PM, Michael Paquier
 wrote:
> Okay, moved to next CF. I may look at it finally I got some use-cases
> for it, similar to yours I guess..

Let me know how that goes.

One thing I've definitely noticed is that the tool is good at finding
corner-case bugs, so even if you can only test a small fraction of the
number of databases that I've been able to test, there could still be
significant value in your performing your own exercise. Your customer
databases might feature far more use of Japanese collations, for
example, which might be an important factor. (Well, the use of Arabic
code points turned out to be an important factor in the question of
whether or not en_US.utf8 could have been affected by the Glibc +
abbreviated keys bug.)

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-10-02 Thread Michael Paquier

On Thu, Sep 8, 2016 at 3:44 AM, Peter Geoghegan  wrote:
> On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner  wrote:
>> IMV the process is to post a patch to this list to certify that it
>> is yours to contribute and free of IP encumbrances that would
>> prevent us from using it.  I will wait for that post.
>
> I attach my V3. There are only very minor code changes here, such as
> breaking out a separate elevel constant for DEBUG2 traces that show
> information of how the B-Tree is accessed (e.g. current level, block
> number). Tiny tweaks to about 3 comments were also performed. These
> changes are trivial.
>
> I've also removed the tests added, since external sort regression test
> coverage is in a much better place now.

Okay, moved to next CF. I may look at it finally I got some use-cases
for it, similar to yours I guess..
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-09-12 Thread Peter Geoghegan

On Wed, Sep 7, 2016 at 11:44 AM, Peter Geoghegan  wrote:
> I attach my V3.

I should point out that I'm leaving to go on Vacation this weekend.
I'll be away for a week, and will not be answering mail during that
period. If anyone has any questions on this for me, it might be
preferable to ask them before I leave.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-09-07 Thread Peter Geoghegan

On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner  wrote:
> IMV the process is to post a patch to this list to certify that it
> is yours to contribute and free of IP encumbrances that would
> prevent us from using it.  I will wait for that post.

I attach my V3. There are only very minor code changes here, such as
breaking out a separate elevel constant for DEBUG2 traces that show
information of how the B-Tree is accessed (e.g. current level, block
number). Tiny tweaks to about 3 comments were also performed. These
changes are trivial.

I've also removed the tests added, since external sort regression test
coverage is in a much better place now.

Finally, the documentation was updated, to make it closer to the
Github project's documentation. I expanded what was started as the
original amcheck sgml documentation based on the experience of using
amcheck on production databases at Heroku. This isn't a big change,
but it's the biggest in V3. The documentation now emphasizes the
likelihood of amcheck finding software errors, which is all we found.
Jim Gray predicted that fault tolerant systems will tend to have
almost all problems arise from operator error and software faults in
practice, so perhaps I should have predicted that that would be the
general trend.

-- 
Peter Geoghegan
From 03046b49c893c7695bbae365c33e3d7b27a1f9a3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 10 Jun 2014 22:20:40 -0700
Subject: [PATCH] Add amcheck extension to contrib

The new extension makes available two SQL-callable functions, each of
which accept a single argument (a regclass/nbtree index relation
argument) and return void.

The SQL-callable function bt_index_check() checks invariants of the
target index by walking the tree, performing various checks of the
sanity of the page space using insertion scankeys to compare items.  The
SQL-callable function bt_index_parent_check() performs a superset of the
checks performed by bt_index_check(): the same checks, as well as
another check verifying that each downlink is actually respected as a
lower bound on its child page.

bt_index_check() requires only an AccessShareLock on the target.
bt_index_parent_check() requires an ExclusiveLock on the target,
and ShareLock on its heap relation.
---
 contrib/Makefile |1 +
 contrib/amcheck/Makefile |   19 +
 contrib/amcheck/amcheck--1.0.sql |   20 +
 contrib/amcheck/amcheck.c| 1267 ++
 contrib/amcheck/amcheck.control  |5 +
 doc/src/sgml/amcheck.sgml|  278 +
 doc/src/sgml/contrib.sgml|1 +
 doc/src/sgml/filelist.sgml   |1 +
 8 files changed, 1592 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.c
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 25263c0..4acd50e 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global

 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		bloom		\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 000..7bb29aa
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= amcheck.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 000..ebbd6ac
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT;
diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c
new file mode 100644
index 000..5c4d02d
--- /dev/null
+++ b/contrib/amcheck/amcheck.c
@@ -0,0 +1,1267 @@
+/*-
+ *
+ * amcheck.c
+ *		Verifies the integrity of access methods based on invariants.
+ *
+ * Provides SQL-callable functions for verifying that various logical
+ * invariants in the structure of index access method

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-09-02 Thread Peter Geoghegan

On Fri, Sep 2, 2016 at 11:16 AM, Peter Geoghegan  wrote:
> There are only tiny differences, which in any case you can see in the
> commit log on Github. There is no reason why code review needs to
> block on this V3, IMV.

Also, I don't think that we need to include V2's tests now that there
is independent test coverage for external sorts, which is a recent
development. I imagine that the V2 test run duration might be
problematic on some buildfarm animals. I'll omit that in V3.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-09-02 Thread Kevin Grittner

On Fri, Sep 2, 2016 at 1:16 PM, Peter Geoghegan  wrote:
> On Fri, Sep 2, 2016 at 11:11 AM, Kevin Grittner  wrote:
>> I changed the CF entry to "Waiting on Author" pending that.  (I was
>> starting to review the patch for commit, so it's none to early to
>> mention that the last posted version is not what I should be
>> looking at.)
>
> There are only tiny differences, which in any case you can see in the
> commit log on Github. There is no reason why code review needs to
> block on this V3, IMV.

IMV the process is to post a patch to this list to certify that it
is yours to contribute and free of IP encumbrances that would
prevent us from using it.  I will wait for that post.

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-09-02 Thread Peter Geoghegan

On Fri, Sep 2, 2016 at 11:11 AM, Kevin Grittner  wrote:
> I changed the CF entry to "Waiting on Author" pending that.  (I was
> starting to review the patch for commit, so it's none to early to
> mention that the last posted version is not what I should be
> looking at.)

There are only tiny differences, which in any case you can see in the
commit log on Github. There is no reason why code review needs to
block on this V3, IMV.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-09-02 Thread Kevin Grittner

On Fri, Sep 2, 2016 at 11:31 AM, Peter Geoghegan  wrote:
> On Thu, Aug 18, 2016 at 5:14 PM, Peter Geoghegan  wrote:
>> I'd certainly welcome that. There are Debian packages available from
>> the Github version of amcheck, which is otherwise practically
>> identical to the most recent version of the patch posted here:
>>
>> https://github.com/petergeoghegan/amcheck
>>
>> (This version also targets Postgres 9.4+).
>
> There were some extremely minor changes made here compared to my V2
> patch, which, ISTM, means that a V3 patch is necessary. I'll get to
> that next week.

I changed the CF entry to "Waiting on Author" pending that.  (I was
starting to review the patch for commit, so it's none to early to
mention that the last posted version is not what I should be
looking at.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-09-02 Thread Peter Geoghegan

On Thu, Aug 18, 2016 at 5:14 PM, Peter Geoghegan  wrote:
> I'd certainly welcome that. There are Debian packages available from
> the Github version of amcheck, which is otherwise practically
> identical to the most recent version of the patch posted here:
>
> https://github.com/petergeoghegan/amcheck
>
> (This version also targets Postgres 9.4+).

There were some extremely minor changes made here compared to my V2
patch, which, ISTM, means that a V3 patch is necessary. I'll get to
that next week.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-08-18 Thread Peter Geoghegan

On Thu, Aug 18, 2016 at 5:35 PM, Michael Paquier
 wrote:
> This would be packaged from source in my case, but that's no big deal
> :) At least I can see that it is added in the next CF, and that's
> marked as ready for committer for a couple of months now...

If you consider how the code is written, you'll see that it's very
unlikely to destabilize production systems.

Every page is copied into local memory once before being operated on,
and only one buffer lock/pin is held at once (for long enough to do
that copying). If there were bugs in amcheck, it's much more likely
that they'd be "false positive" bugs. In any case, I haven't seen any
issue with the tool itself yet, having now run the tool on thousands
of servers.

I think I'll have a lot more information in about a week, when I've
had time to work through more data.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-08-18 Thread Michael Paquier

On Fri, Aug 19, 2016 at 9:14 AM, Peter Geoghegan  wrote:
> On Thu, Aug 18, 2016 at 5:06 PM, Michael Paquier
>  wrote:
>> Cool. I have been honestly wondering about deploying this tool as well
>> to allow some of the QE tests to perform live checks of btree indexes
>> as we use a bunch of them.
>
> I'd certainly welcome that. There are Debian packages available from
> the Github version of amcheck, which is otherwise practically
> identical to the most recent version of the patch posted here:
>
> https://github.com/petergeoghegan/amcheck

This would be packaged from source in my case, but that's no big deal
:) At least I can see that it is added in the next CF, and that's
marked as ready for committer for a couple of months now...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-08-18 Thread Peter Geoghegan

On Thu, Aug 18, 2016 at 5:06 PM, Michael Paquier
 wrote:
> Cool. I have been honestly wondering about deploying this tool as well
> to allow some of the QE tests to perform live checks of btree indexes
> as we use a bunch of them.

I'd certainly welcome that. There are Debian packages available from
the Github version of amcheck, which is otherwise practically
identical to the most recent version of the patch posted here:

https://github.com/petergeoghegan/amcheck

(This version also targets Postgres 9.4+).

> By the way, I have not looked at the patch,
> but this supports just btree, right? Wouldn't btree_check be a better
> name, or do you think that the interface you provide is generic enough
> that it could be extended in the future for gin, gist, etc.?

Yes, the idea of calling it amcheck was support for other AMs could be
added later.

Personally, I would like to make amcheck verifying the heap through a
B-Tree index as a next step. There is also a good case for the tool
directly verifying heap relations, without involving any index, but
that can come later.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-08-18 Thread Michael Paquier

On Fri, Aug 19, 2016 at 2:40 AM, Peter Geoghegan  wrote:
> Heroku began a selective roll-out of amcheck yesterday. amcheck
> already found a bug in the PostGiS Geography B-Tree opclass:
> [...]
> I'll go report this to the PostGiS people.

Cool. I have been honestly wondering about deploying this tool as well
to allow some of the QE tests to perform live checks of btree indexes
as we use a bunch of them. By the way, I have not looked at the patch,
but this supports just btree, right? Wouldn't btree_check be a better
name, or do you think that the interface you provide is generic enough
that it could be extended in the future for gin, gist, etc.?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-08-18 Thread Peter Geoghegan

On Sat, Mar 12, 2016 at 12:38 PM, Peter Geoghegan  wrote:
> Only insofar as it helps diagnose the underlying issue, when it is a
> more subtle issue. Actually fixing the index is almost certainly a
> REINDEX. Once you're into the messy business of diagnosing a
> problematic opclass, you have to be an expert, and tweaking amcheck
> for your requirements (i.e. rebuilding from source) becomes
> reasonable. Part of the reason that the code is so heavily commented
> is to make it hackable, because I do not feel optimistic that I can
> get an expert-orientated interface right, but I still want to make the
> tool as useful as possible to experts.

Heroku began a selective roll-out of amcheck yesterday. amcheck
already found a bug in the PostGiS Geography B-Tree opclass:

https://github.com/postgis/postgis/blob/svn-trunk/postgis/geography_btree.c#L260

The issue is not that PG_RETURN_BOOL() is used here (that's just a
problem of style). Rather, the issue is that the Geography opclass
support function 1 considers an "empty geometry" equal to all other
possible values (values not limited to other empty geometries). This
breaks the transitive law, which nbtree requires and relies on.

I'll go report this to the PostGiS people.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-04-08 Thread Anastasia Lubennikova


29.03.2016 06:13, Peter Geoghegan:

On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan  wrote:

That's right - I have a small number of feedback items to work
through. I also determined myself that there could be a very low
probability race condition when checking the key space across sibling
pages, and will work to address that.

I attach a V2 revision.

Behavioral changes:

* Fixes very low probability race condition when verifying ordering
across sibling pages with bt_index_check()/AccessShareLock.

Notably, the fix does not involve using the high key from the next
page instead of the first item, despite my recent suggestion that it
would; I changed my mind. In this revision,
bt_right_page_check_scankey() continues to return the first item on
the next/right page (as a scankey), just as in the first version. The
locking considerations here are more complicated than before, and
hopefully correct. I abandoned the high key idea when I realized that
a concurrent page split could happen in the right sibling, making the
high key generally no more safe in the face of concurrent target page
deletion than continuing to use the first data item. Using the first
data item from the right page is better than using the high key item
from the right page verification-wise anyway, so that's good. I found
a new way of locking down and recovering from any race conditions (the
low probability bug in V1 for bt_index_check()/AccessShareLock calls
to the C function bt_right_page_check_scankey()). I won't go into the
gory details here, but I will mention that my new approach is inspired
by how backwards scans work already.

* Adds DEBUG1 traces that indicate the level and page currently being checked.

Advanced users will probably find these traces useful on occasion.
It's nice to be able to see how amcheck walks the B-Tree in practice.

Non-behavioral changes:

* Explains why an ExclusiveLock is needed on the target index by
bt_index_parent_check() (and a ShareLock on its heap relation).

This new explanation talks about why an ExclusiveLock *is* necessary
when checking downlinks against child pages, per Tomas Vondra's
request. (Note that nothing changes about child/downlink verification
itself). Before now, I focused on why they were not necessary, but
Tomas was right to point out that that isn't enough information.

* No longer uses the word "comport" in an error message, per Amit
Langote. Similarly, minor typos fixed, per Tomas.

* Adds a variety of pg_regress-based tests for external sorting. (Via
various CREATE INDEX statements involving a variety of datatypes. Data
is generated pseudo-randomly.)

Notably, the regression tests *never* test external sorting today.

ISTM that amcheck is a convenient, reliable, and portable way of
testing the sorting of collatable types like text, so this is the best
way of testing external sorting. External sorting uses abbreviated
keys for runs, but discards them as runs are written out, and so the
merging of runs does not use abbreviated keys. Abbreviated comparisons
and authoritative comparisons are therefore reliably used within the
same sort. One particular goal here is that something like the recent
strxfrm() debacle might be caught by this testing on some platforms
(if we ever get back to trusting strxfrm() at all, that is). The style
of these new tests is unorthodox, because a variety of collations are
tested, which varies from system to system; we should discuss all
this. The tests can take a minute or two to run on my laptop, but that
could vary significantly. And so, whether or not that needs to be
targeted by something other than "check" and "installcheck" is a
discussion that needs to happen. I imagine a lot of buildfarm animals
have slow storage, but on the other hand I don't think hackers run the
contrib regression tests so often. I prefer to not follow the example
of test_decoding if possible; test_decoding's Makefile only runs tests
when wal_level=logical is expected (that's how the relevant Makefile
targets are scoped), but it's a complicated special case.

Review
--

As already noted after the first bullet point, there are some tricky
considerations on fixing the race when the user calls
bt_index_check(). Reviewers should pay close attention to this; could
I be wrong about that being race-safe even now? This seems to me to be
by far the best place for reviewers to look for problems in this
patch, as it's the only bt_index_check()/AccessShareLock case that
compares anything *across* pages. The measures taken to prevent false
positives described within bt_target_page_check() require expert
review, especially because this is the one case where a faulty
rationale may result in under-protection from false positives, and not
just harmless over-protection.

The rationale for why we can reliably recover from a race described
within the C function bt_right_page_check_scankey() is *complicated*.
I may have failed to make it as simple as possible despite significant
effort. It's

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-28 Thread Peter Geoghegan

On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan  wrote:
> That's right - I have a small number of feedback items to work
> through. I also determined myself that there could be a very low
> probability race condition when checking the key space across sibling
> pages, and will work to address that.

I attach a V2 revision.

Behavioral changes:

* Fixes very low probability race condition when verifying ordering
across sibling pages with bt_index_check()/AccessShareLock.

Notably, the fix does not involve using the high key from the next
page instead of the first item, despite my recent suggestion that it
would; I changed my mind. In this revision,
bt_right_page_check_scankey() continues to return the first item on
the next/right page (as a scankey), just as in the first version. The
locking considerations here are more complicated than before, and
hopefully correct. I abandoned the high key idea when I realized that
a concurrent page split could happen in the right sibling, making the
high key generally no more safe in the face of concurrent target page
deletion than continuing to use the first data item. Using the first
data item from the right page is better than using the high key item
from the right page verification-wise anyway, so that's good. I found
a new way of locking down and recovering from any race conditions (the
low probability bug in V1 for bt_index_check()/AccessShareLock calls
to the C function bt_right_page_check_scankey()). I won't go into the
gory details here, but I will mention that my new approach is inspired
by how backwards scans work already.

* Adds DEBUG1 traces that indicate the level and page currently being checked.

Advanced users will probably find these traces useful on occasion.
It's nice to be able to see how amcheck walks the B-Tree in practice.

Non-behavioral changes:

* Explains why an ExclusiveLock is needed on the target index by
bt_index_parent_check() (and a ShareLock on its heap relation).

This new explanation talks about why an ExclusiveLock *is* necessary
when checking downlinks against child pages, per Tomas Vondra's
request. (Note that nothing changes about child/downlink verification
itself). Before now, I focused on why they were not necessary, but
Tomas was right to point out that that isn't enough information.

* No longer uses the word "comport" in an error message, per Amit
Langote. Similarly, minor typos fixed, per Tomas.

* Adds a variety of pg_regress-based tests for external sorting. (Via
various CREATE INDEX statements involving a variety of datatypes. Data
is generated pseudo-randomly.)

Notably, the regression tests *never* test external sorting today.

ISTM that amcheck is a convenient, reliable, and portable way of
testing the sorting of collatable types like text, so this is the best
way of testing external sorting. External sorting uses abbreviated
keys for runs, but discards them as runs are written out, and so the
merging of runs does not use abbreviated keys. Abbreviated comparisons
and authoritative comparisons are therefore reliably used within the
same sort. One particular goal here is that something like the recent
strxfrm() debacle might be caught by this testing on some platforms
(if we ever get back to trusting strxfrm() at all, that is). The style
of these new tests is unorthodox, because a variety of collations are
tested, which varies from system to system; we should discuss all
this. The tests can take a minute or two to run on my laptop, but that
could vary significantly. And so, whether or not that needs to be
targeted by something other than "check" and "installcheck" is a
discussion that needs to happen. I imagine a lot of buildfarm animals
have slow storage, but on the other hand I don't think hackers run the
contrib regression tests so often. I prefer to not follow the example
of test_decoding if possible; test_decoding's Makefile only runs tests
when wal_level=logical is expected (that's how the relevant Makefile
targets are scoped), but it's a complicated special case.

Review
--

As already noted after the first bullet point, there are some tricky
considerations on fixing the race when the user calls
bt_index_check(). Reviewers should pay close attention to this; could
I be wrong about that being race-safe even now? This seems to me to be
by far the best place for reviewers to look for problems in this
patch, as it's the only bt_index_check()/AccessShareLock case that
compares anything *across* pages. The measures taken to prevent false
positives described within bt_target_page_check() require expert
review, especially because this is the one case where a faulty
rationale may result in under-protection from false positives, and not
just harmless over-protection.

The rationale for why we can reliably recover from a race described
within the C function bt_right_page_check_scankey() is *complicated*.
I may have failed to make it as simple as possible despite significant
effort. It's hard to describe these things

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-22 Thread Peter Geoghegan

On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan  wrote:
> That's right - I have a small number of feedback items to work
> through. I also determined myself that there could be a very low
> probability race condition when checking the key space across sibling
> pages, and will work to address that. If I'm right about that then
> it's not a lot of work to fix; I'm probably just going to use the
> right page's high key rather than its first data item.

I also want to use amcheck to test sorting, especially external
sorting which is currently totally untested. It would be nice to see
testing of strxfrm() on the buildfarm, too -- amcheck provides a nice
way to make sure strxfrm() and strcoll() are in agreement for at least
those cases that are tested, without having to worry about portability
in the same way as a simple pg_regress approach would require.

Note that there are reportedly systems in mainstream use where
strxfrm() is broken; it doesn't agree with strcoll().

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-22 Thread Peter Geoghegan

On Tue, Mar 22, 2016 at 9:33 AM, David Steele  wrote:
> It looks like an updated patch is expected here, though it seems that
> the only requests are for updates to comments.

That's right - I have a small number of feedback items to work
through. I also determined myself that there could be a very low
probability race condition when checking the key space across sibling
pages, and will work to address that. If I'm right about that then
it's not a lot of work to fix; I'm probably just going to use the
right page's high key rather than its first data item.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-22 Thread David Steele

On 3/15/16 3:42 AM, Peter Geoghegan wrote:
> On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
>  wrote:
>> Ah, I see the nuance.  Thanks for the explanation.  Maybe,
>> bt_index_check() and bt_index_parent_child_check() /
>> bt_index_check_parent_child().  IMHO, the latter more clearly highlights
>> the fact that parent/child relationships in the form of down-links are
>> checked.
> 
> Well, the downlink is in the parent, because there is no such thing as
> an "uplink". So I prefer bt_index_parent_check(), since it usefully
> hints at starting from the parent. It's also more concise.
> 
>> By the way, one request (as a non-native speaker of English language, who
>> ends up looking up quite a few words regularly) -
>>
>> Could we use "conform" or "correspond" instead of "comport" in the
>> following error message:
>>
>> "left link/right link pair in index \"%s\" don't comport"
> 
> OK. I'll do something about that.

It looks like an updated patch is expected here, though it seems that
the only requests are for updates to comments.

-- 
-David
da...@pgmasters.net


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-15 Thread Amit Langote

On 2016/03/15 16:42, Peter Geoghegan wrote:
> On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
>  wrote:
>> By the way, one request (as a non-native speaker of English language, who
>> ends up looking up quite a few words regularly) -
>>
>> Could we use "conform" or "correspond" instead of "comport" in the
>> following error message:
>>
>> "left link/right link pair in index \"%s\" don't comport"
> 
> OK. I'll do something about that.

Thanks a lot.

- Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-15 Thread Peter Geoghegan

On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
 wrote:
> Ah, I see the nuance.  Thanks for the explanation.  Maybe,
> bt_index_check() and bt_index_parent_child_check() /
> bt_index_check_parent_child().  IMHO, the latter more clearly highlights
> the fact that parent/child relationships in the form of down-links are
> checked.

Well, the downlink is in the parent, because there is no such thing as
an "uplink". So I prefer bt_index_parent_check(), since it usefully
hints at starting from the parent. It's also more concise.

> By the way, one request (as a non-native speaker of English language, who
> ends up looking up quite a few words regularly) -
>
> Could we use "conform" or "correspond" instead of "comport" in the
> following error message:
>
> "left link/right link pair in index \"%s\" don't comport"

OK. I'll do something about that.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-15 Thread Amit Langote


Hi Peter,

On 2016/03/15 16:11, Peter Geoghegan wrote:
> On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote
>  wrote:
>>> Dunno about that. It's defining characteristic is that it checks child
>>> pages against their parent IMV. Things are not often defined in terms
>>> of their locking requirements.
>>
>> At the risk of sounding a bit verbose, do bt_check_level() for a check
>> that inspects a level at a time and bt_check_multi_level() for a check
>> that spans levels sound descriptive?
> 
> Hmm. But all functions verify multiple levels. What distinguishes
> bt_index_parent_check()'s verification is that the downlinks in
> internal pages are checked against actual child pages (every item in
> the child page, in fact). It's the parent/child relationship that is
> verified in addition to the standard checks of every page on and
> across (not between) every level.

Ah, I see the nuance.  Thanks for the explanation.  Maybe,
bt_index_check() and bt_index_parent_child_check() /
bt_index_check_parent_child().  IMHO, the latter more clearly highlights
the fact that parent/child relationships in the form of down-links are
checked.

By the way, one request (as a non-native speaker of English language, who
ends up looking up quite a few words regularly) -

Could we use "conform" or "correspond" instead of "comport" in the
following error message:

"left link/right link pair in index \"%s\" don't comport"

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-15 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote
 wrote:
>> Dunno about that. It's defining characteristic is that it checks child
>> pages against their parent IMV. Things are not often defined in terms
>> of their locking requirements.
>
> At the risk of sounding a bit verbose, do bt_check_level() for a check
> that inspects a level at a time and bt_check_multi_level() for a check
> that spans levels sound descriptive?

Hmm. But all functions verify multiple levels. What distinguishes
bt_index_parent_check()'s verification is that the downlinks in
internal pages are checked against actual child pages (every item in
the child page, in fact). It's the parent/child relationship that is
verified in addition to the standard checks of every page on and
across (not between) every level.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-14 Thread Amit Langote

On 2016/03/12 6:31, Peter Geoghegan wrote:
> On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra
>  wrote:
>> I've looked at this patch today, mostly to educate myself, so this
>> probably should not count as a full review. Anyway, the patch seems in
>> excellent shape - it'd be great if all patches (including those written
>> by me) had this level of comments/docs.

+1

>> Can we come up with names that more clearly identify the difference
>> between those two functions? I mean, _parent_ does not make it
>> particularly obvious that the second function acquires exclusive lock
>> and performs more thorough checks.
> 
> Dunno about that. It's defining characteristic is that it checks child
> pages against their parent IMV. Things are not often defined in terms
> of their locking requirements.

At the risk of sounding a bit verbose, do bt_check_level() for a check
that inspects a level at a time and bt_check_multi_level() for a check
that spans levels sound descriptive?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-14 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
 wrote:
> But I have some concerns about compatibility with my patches.
> I've tried to call bt_index_check() over my "including" patch [1] and caught
> a segfault.
>
> LOG:  server process (PID 31794) was terminated by signal 11: Segmentation
> fault
> DETAIL:  Failed process was running: select bt_index_check('idx');
>
> I do hope that my patch will be accepted in 9.6, so this conflict looks
> really bad.
> I think that error is caused by changes in pages layout. To save some space
> nonkey attributes are truncated

> [1] https://commitfest.postgresql.org/9/433/

I posted a review of your "Covering + unique indexes" patch, where I
made an educated guess about what the problem is here (I sort of
hinted at what I thought it was already, in this thread, actually). I
haven't actually tested this theory of mine myself just yet, but let
me know what you think of it on the thread for your patch.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 10:39 AM, Robert Haas  wrote:
> I don't particularly like that interface.  I also suggest that it
> would be better to leave throttling to a future commit, and focus on
> getting the basic feature in first.

Works for me. I don't think throttling is an especially compelling feature.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-14 Thread Robert Haas

On Fri, Mar 11, 2016 at 7:45 PM, Peter Geoghegan  wrote:
> On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby  wrote:
>> Right, but you still have the option to enable them if you don't want to
>> swamp your IO system. That's why CIC obeys it too. If I was running a
>> consistency check on a production system I'd certainly want the option to
>> throttle it. Without that option, I don't see running this on production
>> systems as being an option. If that's not a goal then fine, but if it is a
>> goal I think it needs to be there.
>>
>> Isn't it just a few extra lines of code to support it?
>
> I see your point.
>
> I'll add that if people like the interface you propose. (Overloading
> the VACUUM cost delay functions to cause a delay for amcheck
> functions, too). Note that the functions already use an appropriate
> buffer access strategy (it avoids blowing shared_buffers, much like
> VACUUM itself).

I don't particularly like that interface.  I also suggest that it
would be better to leave throttling to a future commit, and focus on
getting the basic feature in first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-13 Thread Jim Nasby


On 3/11/16 6:45 PM, Peter Geoghegan wrote:

I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too).


I thought that had already been overloaded by CIC, but I'm not finding 
reference to it... ANALYZE does use it though, so the ship has already 
sorta sailed.


I'm actually a bit surprised cost delay isn't used anywhere else. As 
more background operations are added I suspect users will want it at 
some point.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-13 Thread Peter Geoghegan

On Sat, Mar 12, 2016 at 7:46 PM, Matt Kelly  wrote:
> You can actually pretty easily produce a test case by setting up streaming
> replication between servers running two different version of glibc.
>
> I actually wrote a tool that spins up a pair of VMs using vagrant and then
> sets them up as streaming replica's using ansible.  It provides a nice one
> liner to get a streaming replica test environment going and it will easily
> provide the cross glibc test case.  Technically, though it belongs to Trip
> because I wrote it on company time.  Let me see if I can open source a
> version of it later this week that way you can use it for testing.

That could be interesting. The earlier prototypes of this tool are
known to have detected glibc collation incompatibilities in real
production systems.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-12 Thread Matt Kelly

>
> On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:
> > On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
> > >
> > > BTW, if you know a good way to corrupt index (and do it
> > > reproducible) I'd be
> > > very glad to see it.
> > You can use for example dd in non-truncate mode to corrupt on-disk
> > page data, say that for example:
> > dd if=/dev/random bs=8192 count=1 \
> > seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
> > conv=notrunc
> I guess /dev/random is not very compatible with the "reproducible"
> requirement. I mean, it will reproducibly break the page, but pretty
> much completely, which is mostly what checksums are for.

You can actually pretty easily produce a test case by setting up streaming
replication between servers running two different version of glibc.

I actually wrote a tool that spins up a pair of VMs using vagrant and then
sets them up as streaming replica's using ansible.  It provides a nice one
liner to get a streaming replica test environment going and it will easily
provide the cross glibc test case.  Technically, though it belongs to Trip
because I wrote it on company time.  Let me see if I can open source a
version of it later this week that way you can use it for testing.

- Matt K.

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-12 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 6:33 PM, Tomas Vondra
 wrote:
> Right, but isn't there a difference between the two functions in this
> respect? Once you find corruption involving relationship between
> multiple pages, then I agree it's complicated to do any reasoning about
> what additional checks are safe.
>
> But does that problem also apply to bt_index_check, which checks pages
> independently?

I think so, yes.

> Admittedly, this also depends on the use case. If all we need to do is
> answering a question "Is the index corrupted?" then sure, bailing out
> on the first error is perfectly appropriate.
>
> But perhaps this would be useful for some recovery/forensics tasks?

Maybe, but I feel like making it possible to change the CORRUPTION
elevel macro was the right trade-off. I don't want to have to reason
about the universe of possible problems that could occur when the tool
must limp on in the event of corruption. For example, I don't want to
have to deal with infinite loops. In practice, an expert would
probably be fine to change the constant themselves if they needed to.

Indexes can always be rebuilt. The tool is for identifying and
diagnosing corruption, but if you want to diagnose a faulty opclass or
something, then I think you need to get out pageinspect. You need
human judgement for that.

> From time to time we need to investigate corruption in a database, i.e.
> see how much of the data is actually corrupted, list pages that need to
> be zeroed to get the cluster up to salvage as much as possible, etc.
> Currently this is tedious because we essentially find/fix the pages one
> by one. It'd be very useful to list all broken pages in one go and then
> fix all of them.
>
> Obviously, that's about heapam checks, but perhaps it would be useful
> for an index too?

Only insofar as it helps diagnose the underlying issue, when it is a
more subtle issue. Actually fixing the index is almost certainly a
REINDEX. Once you're into the messy business of diagnosing a
problematic opclass, you have to be an expert, and tweaking amcheck
for your requirements (i.e. rebuilding from source) becomes
reasonable. Part of the reason that the code is so heavily commented
is to make it hackable, because I do not feel optimistic that I can
get an expert-orientated interface right, but I still want to make the
tool as useful as possible to experts.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Tomas Vondra


On Fri, 2016-03-11 at 16:40 -0800, Peter Geoghegan wrote:
> On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan 
> wrote:
> > 
> > If you want the tool to limp on when it finds an error, that can be
> > done by changing the constant for the CORRUPTION macro in
> > amcheck.c.
> > But having that be dynamically configurable is not really
> > compatible
> > with the goal of having amcheck be run routinely.
> Also, it's just really hard to reason about what remains OK to check
> after the first problem is encountered in the general case. It's
> "unreasonable" for any of the checks to ever fail. So, by that
> standard, assuming that they might fail at all could be called
> paranoid. Who can say what "paranoid enough" should be? I think it's
> useful to have a low-impact, generic check function for B-Tree
> indexes, but I don't think we need to hold back on being exhaustive
> elsewhere.

Right, but isn't there a difference between the two functions in this
respect? Once you find corruption involving relationship between
multiple pages, then I agree it's complicated to do any reasoning about
what additional checks are safe.

But does that problem also apply to bt_index_check, which checks pages
independently?

Admittedly, this also depends on the use case. If all we need to do is
answering a question "Is the index corrupted?" then sure, bailing out
on the first error is perfectly appropriate.

But perhaps this would be useful for some recovery/forensics tasks?

>From time to time we need to investigate corruption in a database, i.e.
see how much of the data is actually corrupted, list pages that need to
be zeroed to get the cluster up to salvage as much as possible, etc.
Currently this is tedious because we essentially find/fix the pages one
by one. It'd be very useful to list all broken pages in one go and then
fix all of them.

Obviously, that's about heapam checks, but perhaps it would be useful
for an index too?

regard

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby  wrote:
> Right, but you still have the option to enable them if you don't want to
> swamp your IO system. That's why CIC obeys it too. If I was running a
> consistency check on a production system I'd certainly want the option to
> throttle it. Without that option, I don't see running this on production
> systems as being an option. If that's not a goal then fine, but if it is a
> goal I think it needs to be there.
>
> Isn't it just a few extra lines of code to support it?

I see your point.

I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too). Note that the functions already use an appropriate
buffer access strategy (it avoids blowing shared_buffers, much like
VACUUM itself).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan  wrote:
> If you want the tool to limp on when it finds an error, that can be
> done by changing the constant for the CORRUPTION macro in amcheck.c.
> But having that be dynamically configurable is not really compatible
> with the goal of having amcheck be run routinely.

Also, it's just really hard to reason about what remains OK to check
after the first problem is encountered in the general case. It's
"unreasonable" for any of the checks to ever fail. So, by that
standard, assuming that they might fail at all could be called
paranoid. Who can say what "paranoid enough" should be? I think it's
useful to have a low-impact, generic check function for B-Tree
indexes, but I don't think we need to hold back on being exhaustive
elsewhere.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Jim Nasby


On 3/11/16 6:17 PM, Peter Geoghegan wrote:

Not sure about the cost delay thing. Delays are disabled by default
for manually issued VACUUM, so have doubts that that's useful.


Right, but you still have the option to enable them if you don't want to 
swamp your IO system. That's why CIC obeys it too. If I was running a 
consistency check on a production system I'd certainly want the option 
to throttle it. Without that option, I don't see running this on 
production systems as being an option. If that's not a goal then fine, 
but if it is a goal I think it needs to be there.


Isn't it just a few extra lines of code to support it?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 3:50 PM, Jim Nasby  wrote:
> I also agree that the nmodule name isn't very clear. If this is meant to be
> the start of a generic consistency checker, lets call it that. Otherwise, it
> should be marked as being specific to btrees, because presumably we might
> eventually want similar tools for GIN, etc. (FWIW I'd vote for a general
> consistency checker).

It's a generic consistency checker -- that's the intent. Robert wanted
to go that way about a year ago, and I think it makes sense (this
module started out life as btreecheck). As I said, I don't really care
what the name ends up being. amcheck seems fine to me, but I'll go
with any reasonable suggestion that reflects the scope of the tool as
a generic consistency checker for AMs, including heapam.

I don't know that I understand the concern about the naming of
bt_index_parent_check(). I called it something that accurately
reflects what it does. That particular SQL-callable function is more
orientated towards helping hackers than towards helping users detect
routine problems, as the docs say, but it could help users working
with hackers, and it might even help users acting on their own. I
don't want to make it sound scary, because it isn't. I don't know what
problems will be detected by this tool most often, and I don't think
it would be wise to try to predict how that will work out. And so,
concealing the internal workings of each function/check by giving them
all totally non-technical names seems like a bad idea. It's not that
hard to understand that a B-Tree has multiple levels, and checking the
levels against each other requires more restrictive/stronger
heavyweight locks. I've only added 2 SQL-callable functions, each of
which has one argument, and the first of which has a generic name and
very light locking requirements, like a SELECT. It's not that hard to
get the general idea, as an ordinary user.

> I know the vacuum race condition would be very rare, but I don't think it
> can be ignored. Last thing you want out of a consistency checker is false
> negatives/positives.

That's what I said. And the docs also say that there should never be a
false positive. That's definitely an important design goal here,
because it enables routine testing.

> I do think it would be reasonable to just wholesale
> block against concurrent vacuums, but I don't think there's any reasonable
> way to do that.

Actually, that's exactly what bt_index_parent_check() does already.
VACUUM requires a SHARE UPDATE EXCLUSIVE lock on the heap relation,
which cannot be concurrently acquired due to bt_index_parent_check()'s
acquisition of a SHARE lock. The locking for bt_index_parent_check()
is almost the same as REINDEX, except that that acquires an ACCESS
EXCLUSIVE lock on the index relation; bt_index_parent_check() only
requires an EXCLUSIVE lock on the index relation.

Not sure about the cost delay thing. Delays are disabled by default
for manually issued VACUUM, so have doubts that that's useful.

If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in amcheck.c.
But having that be dynamically configurable is not really compatible
with the goal of having amcheck be run routinely.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Jim Nasby

On 3/11/16 3:31 PM, Peter Geoghegan wrote:

Can we come up with names that more clearly identify the difference
>between those two functions? I mean,_parent_  does not make it
>particularly obvious that the second function acquires exclusive lock
>and performs more thorough checks.

Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

First, thanks for your work on this. I've wanted it in the past.

I agree the name isn't very clear. Perhaps _recurse?

I also agree that the nmodule name isn't very clear. If this is meant to 
be the start of a generic consistency checker, lets call it that. 
Otherwise, it should be marked as being specific to btrees, because 
presumably we might eventually want similar tools for GIN, etc. (FWIW 
I'd vote for a general consistency checker).

I know the vacuum race condition would be very rare, but I don't think 
it can be ignored. Last thing you want out of a consistency checker is 
false negatives/positives. I do think it would be reasonable to just 
wholesale block against concurrent vacuums, but I don't think there's 
any reasonable way to do that.

I would prefer the ability to do something other than raising an error 
when corruption is found, so that you could find all corruption in an 
index. Obviously could log to a different level. Another option would be 
SRFs that return info about all the corruption found, but that's 
probably overkill.

It'd be nice if you had the option to obey vacuum_cost_delay when 
running this, but that's clearly just a nice-to-have (or maybe just obey 
it all the time, since it defaults to 0).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
 wrote:
> I do hope that my patch will be accepted in 9.6, so this conflict looks
> really bad.

I hope so too. I'll have to look into this issue.

> I think that error is caused by changes in pages layout. To save some space
> nonkey attributes are truncated
> when btree copies the indexed value into inner pages or into high key. You
> can look at index_reform_tuple() calls.

That seems like the kind of problem that would be expected when things
like that change. I think it's going to be hard to add new B-Tree
features without a tool like this, which was a big reason to work on
it; to make these new projects possible to test and review. I see many
opportunities to improve the B-Tree code, just as I imagine you do.
These projects are quite strategically important, because B-Trees are
so frequently used. I think that Postgres B-Trees produce excessive
random I/O for checkpoints for various reasons, and this disadvantages
Postgres when it is compared to other major systems. As things get
better in other areas of Postgres, these hard problems become more
important to solve.

> I wonder, whether you can add some check of catalog version to your module
> or this case requires more changes?

I think that it's just going to be tied to the Postgres version. So,
if your B-Tree patches are committed first, it's on me to make sure
they're handled correctly. Or vice-versa. Not worried that that will
be a problem.

We already take special steps to avoid the "minus infinity" item on
internal pages. I think that in the future, if Postgres B-Trees get
suffix truncation for internal page items, there are new problems for
amcheck (suffix truncation remove unneeded information from internal
page items, sometimes greatly increasing B-Tree fan-out. Internal page
items need only be sufficient to guide index scans correctly.).

Specially, with suffix truncation there might be "minus infinity"
*attributes*, too (these could make it safe to completely remove
attributes/columns past the first distinguishing/distinct attribute on
each item on internal pages). That's a case that amcheck then needs to
care about, just like it currently cares about the existing concept of
minus infinity items. That's how it goes for amcheck.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 1:31 PM, Peter Geoghegan  wrote:
> You could have a race, where
> there was a concurrent page deletion of the left sibling of the child
> page, then a concurrent insertion into the newly expanded keyspace of
> the parent. Therefore, the downlink in the parent (which is the
> "target", to use the patch's terminology) would not be a lower bound
> on items in the page.

Excuse me: I meant the newly expanded keyspace of the *child*. (The
parent's keyspace would have covered everything. It's naturally far
larger than either child's keyspace, since it typically has several
hundred pages.)


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra
 wrote:
> I've looked at this patch today, mostly to educate myself, so this
> probably should not count as a full review. Anyway, the patch seems in
> excellent shape - it'd be great if all patches (including those written
> by me) had this level of comments/docs.

Thanks. I try. This patch is something I've been working on slowly and
steadily for over 2 years. It was time to see it through.

>> Note that no function currently checks that the index is consistent
>> with the heap, which would be very useful (that's probably how I'd
>> eventually target the heapam, actually).
>
> I'm afraid I don't understand what "target the heapam" means. Could you
> elaborate?

Just that that's how I'd make amcheck verify that the heap was sane,
if I was going to undertake that project. Or, I'd start there.

> I agree. It'd be nice to have a tool that also checks for data
> corruption the a lower level (e.g. that varlena headers are not
> corrupted etc.) but that seems like a task for some other tool.

I'm not sure that that's a task for another tool. I think that this
tool is scoped at detecting corruption, and that could work well for
the heap, too. There might be fewer interesting things to test about
the heap when indexes aren't involved. Following HOT chains through an
index, and verifying things like that about the heap as we go could
detect a lot of problematic cases.

> Can we come up with names that more clearly identify the difference
> between those two functions? I mean, _parent_ does not make it
> particularly obvious that the second function acquires exclusive lock
> and performs more thorough checks.

Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

> This also applies to the name of the contrib module - it's not
> particularly obvious what "amcheck" unless you're familiar with the
> concept of access methods. Which is quite unlikely for regular users.
> Maybe something like "check_index" would be more illustrative?

I think that given the heap may be targeted in the future, amcheck is
more future-proof. I think Robert might have liked that name a year
ago or something. In general, I'm not too worried about what the
contrib module is called in the end.

> Well, I wouldn't count myself as an expert here, but do we actually need
> such protection? I mean, we only do such checks when holding an
> exclusive lock on the parent relation, no? And even if we don't vacuum
> can only remove entries from the pages - why should that cause
> violations of any invariants?

I think that a more worked out explanation for why the ExclusiveLock
is needed is appropriate. I meant to do that.

Basically, a heavier lock is needed because of page deletion by
VACUUM, which is the major source of complexity (much more than page
splits, I'd say). In general, the key space can be consolidated by
VACUUM in a way that breaks child page checking because the downlink
we followed from our target page is no longer the current downlink.
Page deletion deletes the right sibling's downlink, and the deleted
page's downlink is used for its sibling. You could have a race, where
there was a concurrent page deletion of the left sibling of the child
page, then a concurrent insertion into the newly expanded keyspace of
the parent. Therefore, the downlink in the parent (which is the
"target", to use the patch's terminology) would not be a lower bound
on items in the page.

That's a very low probability race, because it involves deletion a
page representing a range in the keyspace that has no live items, but
then immediately getting one just in time for our check. But, I'm
pretty determined that false positives like that need to be impossible
(or else it's a bug).

I have a paranoid feeling that there is a similar very low probability
race with the left-right keyspace checks, which don't have relation
ExclusiveLock protection (IOW, I think that that might be buggy). I
need to think about that some more, but current thinking is that it
would hardly matter if we used the highkey from right page rather than
the first data item, which definitely would be correct. And it would
be simpler.

> A few minor comments:

Thanks for catching those issues. Will fix.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 1:09 PM, Peter Geoghegan  wrote:
> Or, you could add code like this to comparetup_index_btree(), to
> simulate a broken opclass:
>
> diff --git a/src/backend/utils/sort/tuplesort.c
> b/src/backend/utils/sort/tuplesort.c
> index 67d86ed..23712ff 100644
> --- a/src/backend/utils/sort/tuplesort.c
> +++ b/src/backend/utils/sort/tuplesort.c
> @@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const
> SortTuple *b,
> compare = ApplySortComparator(a->datum1, a->isnull1,
>
> b->datum1, b->isnull1,
>   sortKey);
> +
> +   if (random() <= (MAX_RANDOM_VALUE / 1000))
> +   compare = -compare;
> if (compare != 0)
> return compare;


Note that this patch that I sketched would make CREATE INDEX produce
corrupt indexes, but the tool's verification itself would not be
affected. Although even if it was (even if _bt_compare() gave the same
wrong answers), it would still very likely detect corruption.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 8:24 AM, Michael Paquier
 wrote:
> You can use for example dd in non-truncate mode to corrupt on-disk
> page data, say that for example:
> dd if=/dev/random bs=8192 count=1 \
> seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
> conv=notrunc

Sure, but that would probably fail at the first hurdle -- the page
header would be corrupt. Which is a valid test, but not all that
interesting.

One testing workflow I tried is overwriting some page in a B-Tree
relfilenode with some other page in the same file:

$:~/pgdata/base/12413$ dd if=somefile of=somefile conv=notrunc bs=8192
count=1 skip=2 seek=3

That should fail due to the key space not being in order across pages,
which is slightly interesting. Or, you could selectively change one
item with a hex editor, as Anastasia did.

Or, you could add code like this to comparetup_index_btree(), to
simulate a broken opclass:

diff --git a/src/backend/utils/sort/tuplesort.c
b/src/backend/utils/sort/tuplesort.c
index 67d86ed..23712ff 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const
SortTuple *b,
compare = ApplySortComparator(a->datum1, a->isnull1,

b->datum1, b->isnull1,
  sortKey);
+
+   if (random() <= (MAX_RANDOM_VALUE / 1000))
+   compare = -compare;
if (compare != 0)
return compare;

There are many options when you want to produce a corrupt B-Tree index!

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Tomas Vondra

Hi,

On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:
> On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
> > 
> > BTW, if you know a good way to corrupt index (and do it
> > reproducible) I'd be
> > very glad to see it.
> You can use for example dd in non-truncate mode to corrupt on-disk
> page data, say that for example:
> dd if=/dev/random bs=8192 count=1 \
> seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
> conv=notrunc

I guess /dev/random is not very compatible with the "reproducible"
requirement. I mean, it will reproducibly break the page, but pretty
much completely, which is mostly what checksums are for.

I think the main goal of the amcheck module is to identify issues that
are much more subtle - perhaps a coding error or unhandled cornercase
which results in page with a valid structure and checksum, but broken
page. Or perhaps issues coming from outside of PostgreSQL, like a
subtle change in the locales.

Simulating those issues requires a tool that allows minor tweaks to the
page, dd is a bit too dull for that.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Michael Paquier

On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
> BTW, if you know a good way to corrupt index (and do it reproducible) I'd be
> very glad to see it.

You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Anastasia Lubennikova


01.03.2016 03:09, Peter Geoghegan:

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).


Hi,
thank you for the contrib module. The patch itself looks great.
It was really useful to have such a tool while working on b-tree patches.


Invariants


nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.

With an ExclusiveLock + ShareLock, some addition invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

Interface
===

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.


I've done some testing. And I can say that both announced functions work 
well.
BTW, if you know a good way to corrupt index (and do it reproducible) 
I'd be very glad to see it.


I created an index, then opened the file and changed the order of elements.
And bt_index_check() found the error successfully.

/* Ensure that the data is changed. */
SELECT * FROM bt_page_items('idx', 1);

 itemoffset | ctid  | itemlen | nulls | vars | data
+---+-+---+--+-
  1 | (0,3) |  16 | f | f| 03 00 00 00 00 00 00 00
  2 | (0,2) |  16 | f | f| 02 00 00 00 00 00 00 00
  3 | (0,1) |  16 | f | f| 01 00 00 00 00 00 00 00
(3 rows)


/* Great, amcheck found an error. */
select bt_index_check('idx');
ERROR:  page order invariant violated for index "idx"
DETAIL:  Lower index tid=(1,1) (points to heap tid=(0,3)) higher index 
tid=(1,2) (points to heap tid=(0,2)) page lsn=0/17800D0.



Locking
==

We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
pal

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-10 Thread Tomas Vondra

Hi,

I've looked at this patch today, mostly to educate myself, so this
probably should not count as a full review. Anyway, the patch seems in
excellent shape - it'd be great if all patches (including those written
by me) had this level of comments/docs.

On Mon, 2016-02-29 at 16:09 -0800, Peter Geoghegan wrote:
> I was assigned an "action point" during the FOSDEM developer meeting:
> "Post new version of btree consistency checker patch". I attach a new
> WIP version of my consistency checker tool, amcheck.

...

> To recap, the extension adds some SQL-callable functions that verify
> certain invariant conditions hold within some particular B-Tree index.
> These are the conditions that index scans rely on always being true.
> The tool's scope may eventually cover other AMs, including heapam, but
> nbtree seems like the best place to start.
> 
> Note that no function currently checks that the index is consistent
> with the heap, which would be very useful (that's probably how I'd
> eventually target the heapam, actually).

I'm afraid I don't understand what "target the heapam" means. Could you
elaborate?

> Invariants
> 
...
> Obviously, this tool is all about distrusting the structure of a
> B-Tree index. That being the case, it's not always totally clear where
> to draw the line. I think I have the balance about right, though.

I agree. It'd be nice to have a tool that also checks for data
corruption the a lower level (e.g. that varlena headers are not
corrupted etc.) but that seems like a task for some other tool.

> Interface
> ===
> 
> There are only 2 SQL callable functions in the extension, which are
> very similar:
> 
> bt_index_check(index regclass)
> 
> bt_index_parent_check(index regclass)
> 

Can we come up with names that more clearly identify the difference
between those two functions? I mean, _parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.

This also applies to the name of the contrib module - it's not
particularly obvious what "amcheck" unless you're familiar with the
concept of access methods. Which is quite unlikely for regular users.
Maybe something like "check_index" would be more illustrative?

> Locking
> ==
...
> 
> We do, on the other hand, have a detailed rationale for why it's okay
> that we don't have an ExclusiveLock on the index relation for checks
> that span the key space of more than one page by following right links
> to compare items across sibling pages. This isn't the same thing as
> having an explicit interlock like a pin -- our interlock is one
> against *recycling* by vacuum, which is based on recentGlobalXmin.
> This rationale requires expert review.

Well, I wouldn't count myself as an expert here, but do we actually need
such protection? I mean, we only do such checks when holding an
exclusive lock on the parent relation, no? And even if we don't vacuum
can only remove entries from the pages - why should that cause
violations of any invariants?

A few minor comments:

1) This should probably say "one block per level":

/*
 * A B-Tree cannot possibly have this many levels, since there must be 
 * one block per page, and that is bound by the range of BlockNumber:
 */

2) comment before bt_check_every_level() says:

   ... assumed to prevent any kind of physically modification ...

   probably should be "physical modification" instead.

3) s/targecontext/targetcontext/ in BtreeCheckState comment

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] amcheck (B-Tree integrity checking tool)

2016-02-29 Thread Peter Geoghegan

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

Invariants


nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.

With an ExclusiveLock + ShareLock, some addition invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

Interface
===

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.

Locking
==

We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

Performance
==

Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.

-- 
Peter Geoghegan
From 573810d8d3c994ce1a16ecffb2f5d208c0ff93e3 Mon Sep 17 00:00:00 2001

89 matches

Mail list logo