Re: [HACKERS] A design for amcheck heapam verification
On Thu, Oct 5, 2017 at 7:00 PM, Peter Geoghegan wrote: > v3 of the patch series, attached, does it that way -- it adds a > bloom_create(). The new bloom_create() function still allocates its > own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A > separate bloom_init() function (that works with dynamic shared memory) > could easily be added later, for the benefit of parallel hash join. Since Peter E's work on making the documentation sgml files more XML-like has broken the v3 patch doc build, attached is v4, which fixes this bit rot. It also has a few small tweaks here and there to the docs. Nothing worth noting specifically, really -- I just don't like to leave my patches with bit rot for long. (Hat-tip to Thomas Munro for making this easy to detect with his new CF continuous integration tooling.) I should point out that I shipped virtually the same code yesterday, as v1.1 of the Github version of amcheck (also known as amcheck_next). Early adopters will be able to use this new "heapallindexed" functionality in the next few days, once packages become available for the apt and yum community repos. Just as before, the Github version will work on versions of Postgres >= 9.4. This seems like good timing on my part, because we know that this new "heapallindexed" verification will detect the "freeze the dead" bugs that the next point release is set to have fixes for -- that is actually kind of how one of the bugs was found [1]. We may even want to advertise the availability of this check within amcheck_next, in the release notes for the next Postgres point release. [1] https://www.postgresql.org/message-id/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com -- Peter Geoghegan From 7906c7391a9f52d334c2cbc7d3e245ff014629f2 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Tue, 2 May 2017 00:19:24 -0700 Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and bt_index_parent_check(): callers can check that each heap tuple that ought to have an index entry does in fact have one. This happens at the end of the existing verification checks. This is implemented by using a Bloom filter data structure. The implementation performs set membership tests within a callback (the same type of callback that each index AM registers for CREATE INDEX). The Bloom filter is populated during the initial index verification scan. --- contrib/amcheck/Makefile | 2 +- contrib/amcheck/amcheck--1.0--1.1.sql| 28 +++ contrib/amcheck/amcheck.control | 2 +- contrib/amcheck/expected/check_btree.out | 14 +- contrib/amcheck/sql/check_btree.sql | 9 +- contrib/amcheck/verify_nbtree.c | 298 --- doc/src/sgml/amcheck.sgml| 173 ++ src/include/utils/snapmgr.h | 2 +- 8 files changed, 454 insertions(+), 74 deletions(-) create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile index 43bed91..c5764b5 100644 --- a/contrib/amcheck/Makefile +++ b/contrib/amcheck/Makefile @@ -4,7 +4,7 @@ MODULE_big = amcheck OBJS = verify_nbtree.o $(WIN32RES) EXTENSION = amcheck -DATA = amcheck--1.0.sql +DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql PGFILEDESC = "amcheck - function for verifying relation integrity" REGRESS = check check_btree diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql new file mode 100644 index 000..e6cca0a --- /dev/null +++ b/contrib/amcheck/amcheck--1.0--1.1.sql @@ -0,0 +1,28 @@ +/* contrib/amcheck/amcheck--1.0--1.1.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. 
\quit + +-- +-- bt_index_check() +-- +DROP FUNCTION bt_index_check(regclass); +CREATE FUNCTION bt_index_check(index regclass, +heapallindexed boolean DEFAULT false) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_check' +LANGUAGE C STRICT PARALLEL RESTRICTED; + +-- +-- bt_index_parent_check() +-- +DROP FUNCTION bt_index_parent_check(regclass); +CREATE FUNCTION bt_index_parent_check(index regclass, +heapallindexed boolean DEFAULT false) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_parent_check' +LANGUAGE C STRICT PARALLEL RESTRICTED; + +-- Don't want these to be available to public +REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC; +REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC; diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control index 05e2861..4690484 100644 --- a/contrib/amcheck/amcheck.control +++ b/contrib/amcheck/amcheck.control @@ -1,5 +1,5 @@ # amcheck extension comment = 'functions for verifying relation integrity' -default_version = '1.0' +default_version = '1.1' module_pathname = '$libdir/amcheck' relocatable = true diff --git a/contr
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Sep 29, 2017 at 10:54 AM, Peter Geoghegan wrote: >> Something that allocates new memory as the patch's bloom_init() >> function does I'd tend to call 'make' or 'create' or 'new' or >> something, rather than 'init'. > > I tend to agree. I'll adopt that style in the next version. I just > didn't want the caller to have to manage the memory themselves. v3 of the patch series, attached, does it that way -- it adds a bloom_create(). The new bloom_create() function still allocates its own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A separate bloom_init() function (that works with dynamic shared memory) could easily be added later, for the benefit of parallel hash join. Other notable changes: * We now support bloom filters that have bitsets of up to 512MB in size. The previous limit was 128MB. * We now use TransactionXmin for our AccessShareLock xmin cutoff, rather than calling GetOldestXmin(). This is the same cut-off used by xacts that must avoid broken hot chains for their earliest snapshot. This avoids a scan of the proc array, and allows more thorough verification, since GetOldestXmin() was overly restrictive here. * Expanded code comments describing the kinds of problems the new verification capability is expected to be good at catching. For example, there is now a passing reference to the CREATE INDEX CONCURRENTLY bug that was fixed back in February (it's given as an example of a more general problem -- faulty HOT safety assessment). With the new heapallindexed enhancement added by this patch, amcheck can be expected to catch that issue much of the time. We also go into heap-only tuple handling within IndexBuildHeapScan(). The way that CREATE INDEX tries to index the most recent tuple in a HOT chain (while locating the root tuple in the chain, to get the right heap TID for the index) has proven to be very useful as a smoke test while investigating HOT/VACUUM FREEZE bugs in the past couple of weeks [1]. 
I believe it would have caught several historic MultiXact/recovery bugs, too. This all seems worth noting explicitly. [1] https://postgr.es/m/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com -- Peter Geoghegan From 3bed03a9e0506c0b81097b634c5f1b5534a2dcb3 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Tue, 2 May 2017 00:19:24 -0700 Subject: [PATCH 2/2] Add amcheck verification of indexes against heap. Add a new, optional capability to bt_index_check() and bt_index_parent_check(): callers can check that each heap tuple that ought to have an index entry does in fact have one. This happens at the end of the existing verification checks. This is implemented by using a Bloom filter data structure. The implementation performs set membership tests within a callback (the same type of callback that each index AM registers for CREATE INDEX). The Bloom filter is populated during the initial index verification scan. --- contrib/amcheck/Makefile | 2 +- contrib/amcheck/amcheck--1.0--1.1.sql| 28 contrib/amcheck/amcheck.control | 2 +- contrib/amcheck/expected/check_btree.out | 14 +- contrib/amcheck/sql/check_btree.sql | 9 +- contrib/amcheck/verify_nbtree.c | 275 --- doc/src/sgml/amcheck.sgml| 157 ++ src/include/utils/snapmgr.h | 2 +- 8 files changed, 423 insertions(+), 66 deletions(-) create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile index 43bed91..c5764b5 100644 --- a/contrib/amcheck/Makefile +++ b/contrib/amcheck/Makefile @@ -4,7 +4,7 @@ MODULE_big = amcheck OBJS = verify_nbtree.o $(WIN32RES) EXTENSION = amcheck -DATA = amcheck--1.0.sql +DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql PGFILEDESC = "amcheck - function for verifying relation integrity" REGRESS = check check_btree diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql new file mode 100644 index 000..e6cca0a --- /dev/null +++ b/contrib/amcheck/amcheck--1.0--1.1.sql @@ -0,0 +1,28 @@ +/* 
contrib/amcheck/amcheck--1.0--1.1.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit + +-- +-- bt_index_check() +-- +DROP FUNCTION bt_index_check(regclass); +CREATE FUNCTION bt_index_check(index regclass, +heapallindexed boolean DEFAULT false) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_check' +LANGUAGE C STRICT PARALLEL RESTRICTED; + +-- +-- bt_index_parent_check() +-- +DROP FUNCTION bt_index_parent_check(regclass); +CREATE FUNCTION bt_index_parent_check(index regclass, +heapallindexed boolean DEFAULT false) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_parent_check' +LANGUAGE C STRICT PARALLEL RESTRICTED; + +-- Don't want these to be available to public +REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC; +REVOKE ALL O
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Sep 29, 2017 at 1:57 PM, Peter Geoghegan wrote: > On Fri, Sep 29, 2017 at 10:29 AM, Robert Haas wrote: >> I am also wondering whether this patch should consider >> 81c5e46c490e2426db243eada186995da5bb0ba7 as a way of obtaining >> multiple hash values. I suppose that's probably inferior to what is >> already being done on performance grounds, but I'll just throw out a >> mention of it here all the same in case it was overlooked or the >> relevance not spotted... > > Well, we sometimes only want one hash value. This happens when we're > very short on memory (especially relative to the estimated final size > of the set), so it's a fairly common requirement. And, we have a > convenient way to get a second independent uint32 hash function now > (murmurhash32()). Right, so if you wanted to use the extended hash function infrastructure, you'd just call the extended hash function with as many different seeds as the number of hash functions you need. If you need 1, you call it with one seed, say 0. And if you need any larger number, well, cool. Like I say, I'm not at all sure this is better than what you've got right now. But it's an option. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
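[Editor's note: Robert's suggestion above -- call one seeded hash function with as many different seeds as the number of hash functions you need -- can be sketched in a standalone way. This is illustrative only: FNV-1a stands in here for PostgreSQL's extended hash function infrastructure, and all names are invented for the example.]

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One seeded hash function can stand in for a whole family of hash
 * functions: varying the seed yields k different functions.  FNV-1a is
 * used purely as a stand-in for the extended hash support discussed above.
 */
uint32_t
seeded_hash(const unsigned char *elem, size_t len, uint32_t seed)
{
	uint32_t	h = 2166136261u ^ seed; /* FNV offset basis, perturbed by seed */

	for (size_t i = 0; i < len; i++)
	{
		h ^= elem[i];
		h *= 16777619u;			/* FNV prime */
	}
	return h;
}

/* The i-th of k "independent" hash functions, mapped onto a bitset of nbits. */
uint64_t
kth_hash(const unsigned char *elem, size_t len, uint32_t i, uint64_t nbits)
{
	return seeded_hash(elem, len, i) % nbits;
}
```

A Bloom filter needing k functions would simply loop i from 0 to k - 1, as Robert describes for the seeds.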
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Sep 29, 2017 at 10:29 AM, Robert Haas wrote: > I am also wondering whether this patch should consider > 81c5e46c490e2426db243eada186995da5bb0ba7 as a way of obtaining > multiple hash values. I suppose that's probably inferior to what is > already being done on performance grounds, but I'll just throw out a > mention of it here all the same in case it was overlooked or the > relevance not spotted... Well, we sometimes only want one hash value. This happens when we're very short on memory (especially relative to the estimated final size of the set), so it's a fairly common requirement. And, we have a convenient way to get a second independent uint32 hash function now (murmurhash32()). -- Peter Geoghegan
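[Editor's note: a second independent uint32 hash, as Peter mentions, also enables the standard "double hashing" construction (Kirsch and Mitzenmacher), in which just two hashes generate all k probe positions. A minimal sketch, not code from the patch; hasha and hashb are assumed to come from the existing hash function and murmurhash32() respectively.]

```c
#include <stdint.h>

/*
 * Double hashing: derive the i-th Bloom filter probe position from two
 * base hashes, via g_i(x) = (h_a(x) + i * h_b(x)) mod m.  The result is
 * a bit offset into a bitset of nbits bits.
 */
uint64_t
kth_probe(uint32_t hasha, uint32_t hashb, int i, uint64_t nbits)
{
	return ((uint64_t) hasha + (uint64_t) i * hashb) % nbits;
}
```

This keeps the number of full hash computations per element at two, regardless of how large k is.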
Re: [HACKERS] A design for amcheck heapam verification
On Thu, Sep 28, 2017 at 8:34 PM, Thomas Munro wrote: > FWIW I think if I were attacking that problem the first thing I'd > probably try would be getting rid of that internal pointer > filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making > the interface look something like this: > > extern size_t bloom_estimate(int64 total_elems, int work_mem); > extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem); > > Something that allocates new memory as the patch's bloom_init() > function does I'd tend to call 'make' or 'create' or 'new' or > something, rather than 'init'. I tend to agree. I'll adopt that style in the next version. I just didn't want the caller to have to manage the memory themselves. -- Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Thu, Sep 28, 2017 at 11:34 PM, Thomas Munro wrote: > FWIW I think if I were attacking that problem the first thing I'd > probably try would be getting rid of that internal pointer > filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making > the interface look something like this: > > extern size_t bloom_estimate(int64 total_elems, int work_mem); > extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem); Yes, that seems quite convenient and is by now an established coding pattern. I am also wondering whether this patch should consider 81c5e46c490e2426db243eada186995da5bb0ba7 as a way of obtaining multiple hash values. I suppose that's probably inferior to what is already being done on performance grounds, but I'll just throw out a mention of it here all the same in case it was overlooked or the relevance not spotted... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Sep 29, 2017 at 4:17 PM, Michael Paquier wrote: >> As for DSM, I think that that can come later, and can be written by >> somebody closer to that problem. There can be more than one >> initialization function. > > I don't completely disagree with that, there could be multiple > initialization functions. Still, an advantage about designing things > right from the beginning with a set of correct APIs is that we don't > need extra things later and this will never bother module maintainers. > I would think that this utility interface should be minimal and > portable to maintain a long-term stance. FWIW I think if I were attacking that problem the first thing I'd probably try would be getting rid of that internal pointer filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making the interface look something like this: extern size_t bloom_estimate(int64 total_elems, int work_mem); extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem); Something that allocates new memory as the patch's bloom_init() function does I'd tend to call 'make' or 'create' or 'new' or something, rather than 'init'. 'init' has connotations of being the second phase in an allocate-and-init pattern for me. Then bloom_filt_make() would be trivially implemented on top of bloom_estimate() and bloom_init(), and bloom_init() could be used directly in DSM, DSA, traditional shmem without having to add any special DSM support. -- Thomas Munro http://www.enterprisedb.com
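[Editor's note: the estimate/init split Thomas describes might look like the standalone sketch below. The sizing logic is invented for illustration; only the interface shape is the point -- a flexible array member instead of an internal pointer, caller-provided memory, and no allocation inside bloom_init().]

```c
#include <stdint.h>
#include <string.h>

typedef struct bloom_filter
{
	int			k_hash_funcs;
	uint64_t	bitset_bits;
	unsigned char bitset[];		/* flexible array member: no inner pointer */
} bloom_filter;

/* Phase 1: report how many bytes of memory the caller must provide. */
size_t
bloom_estimate(int64_t total_elems, int work_mem_kb)
{
	size_t		bitset_bytes = (size_t) work_mem_kb * 1024;	/* simplified */

	(void) total_elems;			/* a real version would size/cap from this */
	return sizeof(bloom_filter) + bitset_bytes;
}

/* Phase 2: initialize caller-supplied memory; no allocation happens here. */
void
bloom_init(bloom_filter *filter, int64_t total_elems, int work_mem_kb)
{
	size_t		bitset_bytes = (size_t) work_mem_kb * 1024;

	filter->bitset_bits = (uint64_t) bitset_bytes * 8;
	filter->k_hash_funcs = 2;	/* placeholder; see the optimal_k() discussion */
	(void) total_elems;
	memset(filter->bitset, 0, bitset_bytes);
}
```

With this shape, the caller can place the filter in palloc'd memory, DSM, DSA, or traditional shmem alike: obtain bloom_estimate() bytes wherever it likes, then point bloom_init() at them.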
Re: [HACKERS] A design for amcheck heapam verification
On Thu, Sep 28, 2017 at 3:32 AM, Peter Geoghegan wrote: > On Wed, Sep 27, 2017 at 1:45 AM, Michael Paquier > wrote: >> I have signed up as a reviewer of this patch, and I have looked at the >> bloom filter implementation for now. This is the kind of facility that >> people have asked for on this list for many years. >> >> One first thing striking me is that there is no test for this >> implementation, which would be a base stone for other things, it would >> be nice to validate that things are working properly before moving on >> with 0002, and 0001 is a feature on its own. I don't think that it >> would be complicated to have a small module in src/test/modules which >> plugs in a couple of SQL functions on top of bloomfilter.h. > > 0001 is just a library utility. None of the backend lib utilities > (like HyperLogLog, Discrete knapsack, etc) have dedicated test > frameworks. Coding up an SQL interface for these things is a > non-trivial project in itself. > > I have testing the implementation myself, but that was something that > used C code. > > Can you tell me what you have in mind here in detail? For example, > should there be a custom datatype that encapsulates the current state > of the bloom filter? Or, should there be an aggregate function, that > takes a containment argument that is tested at the end? That could be something very simple: - A function to initiate a bloom filter in a session context, with a number of elements in input, which uses for example integers. - A function to add an integer element to it. - A function to query if an integer may exist or not. - A function to free it. The point is just to stress this code, I don't think that it is a project this "large". >> +#define MAX_HASH_FUNCS 10 >> Being able to define the number of hash functions used at >> initialization would be nicer. 
Usually this number is kept way lower >> than the number of elements to check as part of a set, but I see no >> reason to not allow people to play with this API in a more extended >> way. You can then let your module decide what it wants to use. > > The number of hash functions used is itself a function of the amount > of memory available, and an estimate of the overall size of the set > made ahead of time (in the case of amcheck, this is > pg_class.reltuples). The typical interface is that the caller either > specifies the amount of memory, or the required false positive rate > (either way, they must also provide that estimate). + filter->k_hash_funcs = optimal_k(bitset_bits, total_elems); The estimate is from wikipedia-sensei: https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions Being able to enforce that would be nice, not mandatory perhaps. > The value of MAX_HASH_FUNCS, 10, was chosen based on the fact that we > never actually use more than that many hash functions in practice, > given the limitations on the total amount of memory you can use > (128MB). The formula for determining the optimum number of hash > functions is pretty standard stuff. Hm... OK. That could be a default... Not really convinced though. >> +/* >> + * Hash function is taken from sdbm, a public-domain reimplementation of the >> + * ndbm database library. >> + */ >> Reference link? > > http://www.eecs.harvard.edu/margo/papers/usenix91/paper.pdf That would be nicer if added in the code :) > I'm probably going to end up using murmurhash32() instead of the sdbm > hash function anyway, now that Andres has exposed it in a header file. > This probably won't matter in the next version. Yeah, that may be a good idea at the end. 
>> + bitset_bytes = bitset_bits / BITS_PER_BYTE; >> + filter->bitset = palloc0(bitset_bytes); >> + filter->k_hash_funcs = optimal_k(bitset_bits, total_elems); >> + filter->seed = seed; >> I think that doing the allocation within the initialization phase is a >> mistake. Being able to use a DSA would be nice, but I predict as well >> cases where a module may want to have a bloom filter that persists as >> well across multiple transactions, so the allocation should be able to >> live across memory contexts. > > Why not just switch memory context before calling? Again, other > comparable utilities don't have provide this in their interface. > > As for DSM, I think that that can come later, and can be written by > somebody closer to that problem. There can be more than one > initialization function. I don't completely disagree with that, there could be multiple initialization functions. Still, an advantage about designing things right from the beginning with a set of correct APIs is that we don't need extra things later and this will never bother module maintainers. I would think that this utility interface should be minimal and portable to maintain a long-term stance. >> What I think you should do instead to >> make this bloom implementation more modular is to let the caller give >> a pointer to a memory area as well as its size. Then what bloom_init >> should do is to ju
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Sep 27, 2017 at 1:45 AM, Michael Paquier wrote: > I have signed up as a reviewer of this patch, and I have looked at the > bloom filter implementation for now. This is the kind of facility that > people have asked for on this list for many years. > > One first thing striking me is that there is no test for this > implementation, which would be a base stone for other things, it would > be nice to validate that things are working properly before moving on > with 0002, and 0001 is a feature on its own. I don't think that it > would be complicated to have a small module in src/test/modules which > plugs in a couple of SQL functions on top of bloomfilter.h. 0001 is just a library utility. None of the backend lib utilities (like HyperLogLog, Discrete knapsack, etc) have dedicated test frameworks. Coding up an SQL interface for these things is a non-trivial project in itself. I have testing the implementation myself, but that was something that used C code. Can you tell me what you have in mind here in detail? For example, should there be a custom datatype that encapsulates the current state of the bloom filter? Or, should there be an aggregate function, that takes a containment argument that is tested at the end? > +#define MAX_HASH_FUNCS 10 > Being able to define the number of hash functions used at > initialization would be nicer. Usually this number is kept way lower > than the number of elements to check as part of a set, but I see no > reason to not allow people to play with this API in a more extended > way. You can then let your module decide what it wants to use. The number of hash functions used is itself a function of the amount of memory available, and an estimate of the overall size of the set made ahead of time (in the case of amcheck, this is pg_class.reltuples). The typical interface is that the caller either specifies the amount of memory, or the required false positive rate (either way, they must also provide that estimate). 
The value of MAX_HASH_FUNCS, 10, was chosen based on the fact that we never actually use more than that many hash functions in practice, given the limitations on the total amount of memory you can use (128MB). The formula for determining the optimum number of hash functions is pretty standard stuff. > + * work_mem is sized in KB, in line with the general convention. > In what is that a general convention? Using bytes would be more > intuitive IMO.. Still I think that this could be removed, see below > points. Both tuplesort and tuplestore do this. These are infrastructure that is passed work_mem or maintenance_work_mem by convention, where those are sized in KB. > +/* > + * Hash function is taken from sdbm, a public-domain reimplementation of the > + * ndbm database library. > + */ > Reference link? http://www.eecs.harvard.edu/margo/papers/usenix91/paper.pdf I'm probably going to end up using murmurhash32() instead of the sdbm hash function anyway, now that Andres has exposed it in a header file. This probably won't matter in the next version. > + bitset_bytes = bitset_bits / BITS_PER_BYTE; > + filter->bitset = palloc0(bitset_bytes); > + filter->k_hash_funcs = optimal_k(bitset_bits, total_elems); > + filter->seed = seed; > I think that doing the allocation within the initialization phase is a > mistake. Being able to use a DSA would be nice, but I predict as well > cases where a module may want to have a bloom filter that persists as > well across multiple transactions, so the allocation should be able to > live across memory contexts. Why not just switch memory context before calling? Again, other comparable utilities don't have provide this in their interface. As for DSM, I think that that can come later, and can be written by somebody closer to that problem. There can be more than one initialization function. 
> What I think you should do instead to > make this bloom implementation more modular is to let the caller give > a pointer to a memory area as well as its size. Then what bloom_init > should do is to just initialize this area of memory with zeros. This > approach would give a lot of freedom. Not linking a bloom definition > to work_mem would be nice as well. The implementation is probably always going to be bound in size by work_mem in practice, like tuplesort and tuplestore. I would say that that's a natural fit. > + hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0); > I am wondering if it would make sense to be able to enforce the hash > function being used. The default does not look bad to me though, so we > could live with that. I prefer to keep things simple. I'm not aware of any use case that calls for the user to use a custom hash function. That said, I could believe that someone would want to use their own hash value for each bloom_add_element(), when they have one close at hand anyway -- much like addHyperLogLog(). Again, that seems like work for the ultimate consumer of that functionality. It's a trivial twea
Re: [HACKERS] A design for amcheck heapam verification
On Thu, Sep 7, 2017 at 11:26 AM, Peter Geoghegan wrote: > On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan wrote: >> On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera >> wrote: >>> Eh, if you want to optimize it for the case where debug output is not >>> enabled, make sure to use ereport() not elog(). ereport() >>> short-circuits evaluation of arguments, whereas elog() does not. >> >> I should do that, but it's still not really noticeable. > > Since this patch has now bit-rotted, I attach a new revision, V2. > Apart from fixing some Makefile bitrot, this revision also makes some > small tweaks as suggested by Thomas and Alvaro. The documentation is > also revised and expanded, and now discusses practical aspects of the > set membership being tested using a Bloom filter, how that relates to > maintenance_work_mem, and so on. > > Note that this revision does not let the Bloom filter caller use their > own dynamic shared memory, which is something that Thomas asked about. > While that could easily be added, I think it should happen later. I > really just wanted to make sure that my Bloom filter was not in some > way fundamentally incompatible with Thomas' planned enhancements to > (parallel) hash join. I have signed up as a reviewer of this patch, and I have looked at the bloom filter implementation for now. This is the kind of facility that people have asked for on this list for many years. One first thing striking me is that there is no test for this implementation, which would be a base stone for other things, it would be nice to validate that things are working properly before moving on with 0002, and 0001 is a feature on its own. I don't think that it would be complicated to have a small module in src/test/modules which plugs in a couple of SQL functions on top of bloomfilter.h. +#define MAX_HASH_FUNCS 10 Being able to define the number of hash functions used at initialization would be nicer. 
Usually this number is kept way lower than the number of elements to check as part of a set, but I see no reason to not allow people to play with this API in a more extended way. You can then let your module decide what it wants to use. + * work_mem is sized in KB, in line with the general convention. In what is that a general convention? Using bytes would be more intuitive IMO.. Still I think that this could be removed, see below points. +/* + * Hash function is taken from sdbm, a public-domain reimplementation of the + * ndbm database library. + */ Reference link? + bitset_bytes = bitset_bits / BITS_PER_BYTE; + filter->bitset = palloc0(bitset_bytes); + filter->k_hash_funcs = optimal_k(bitset_bits, total_elems); + filter->seed = seed; I think that doing the allocation within the initialization phase is a mistake. Being able to use a DSA would be nice, but I predict as well cases where a module may want to have a bloom filter that persists as well across multiple transactions, so the allocation should be able to live across memory contexts. What I think you should do instead to make this bloom implementation more modular is to let the caller give a pointer to a memory area as well as its size. Then what bloom_init should do is to just initialize this area of memory with zeros. This approach would give a lot of freedom. Not linking a bloom definition to work_mem would be nice as well. + hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0); I am wondering if it would make sense to be able to enforce the hash function being used. The default does not look bad to me though, so we could live with that. +typedef struct bloom_filter bloom_filter; Not allowing callers have a look at the structure contents is definitely the right approach. So, in my opinion, this bloom facility still needs more work. -- Michael -- Sent via pgsql-hackers mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Sep 6, 2017 at 7:26 PM, Peter Geoghegan wrote: > On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan wrote: >> On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera >> wrote: >>> Eh, if you want to optimize it for the case where debug output is not >>> enabled, make sure to use ereport() not elog(). ereport() >>> short-circuits evaluation of arguments, whereas elog() does not. >> >> I should do that, but it's still not really noticeable. > > Since this patch has now bit-rotted, I attach a new revision, V2. I should point out that I am relying on deterministic TOAST compression within index_form_tuple() at present. This could, in theory, become a problem later down the road, when toast_compress_datum() compression becomes configurable via a storage parameter or something (e.g., we use PGLZ_strategy_always, rather than the hard coded PGLZ_strategy_default strategy). While I should definitely have a comment above the new amcheck index_form_tuple() call that points this out, it's not clear if that's all that is required. Normalizing the representation of hashed index tuples to make verification robust against unforeseen variation in TOAST compression strategy seems like needless complexity to me, but I'd like to hear a second opinion on that. -- Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan wrote:
> On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
> wrote:
>> Eh, if you want to optimize it for the case where debug output is not
>> enabled, make sure to use ereport() not elog(). ereport()
>> short-circuits evaluation of arguments, whereas elog() does not.
>
> I should do that, but it's still not really noticeable.
Since this patch has now bit-rotted, I attach a new revision, V2.
Apart from fixing some Makefile bitrot, this revision also makes some
small tweaks as suggested by Thomas and Alvaro. The documentation is
also revised and expanded, and now discusses practical aspects of the
set membership being tested using a Bloom filter, how that relates to
maintenance_work_mem, and so on.
Note that this revision does not let the Bloom filter caller use their
own dynamic shared memory, which is something that Thomas asked about.
While that could easily be added, I think it should happen later. I
really just wanted to make sure that my Bloom filter was not in some
way fundamentally incompatible with Thomas' planned enhancements to
(parallel) hash join.
--
Peter Geoghegan
From d4dda95dd41204315dc12936fac83d2df8676992 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README| 2 +
src/backend/lib/bloomfilter.c | 297 ++
src/include/lib/bloomfilter.h | 26
4 files changed, 327 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe4..191ea9b 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 000..e93f9b0
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,297 @@
+/*-
+ *
+ * bloomfilter.c
+ * Minimal Bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+#include "utils/memutils.h"
+
+#define MAX_HASH_FUNCS 10
+#define NBITS(filt) ((1 << (filt)->bloom_power))
+
+typedef struct bloom_filter
+{
+ /* 2 ^ bloom_power is the size of the bitset (in bits) */
+ intbloom_power;
+ unsigned char *bitset;
+
+ /* K hash functions are used, which are randomly seeded */
+ intk_hash_funcs;
+ uint32 seed;
+} bloom_filter;
+
+static int pow2_tru
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera wrote:
> Eh, if you want to optimize it for the case where debug output is not
> enabled, make sure to use ereport() not elog(). ereport()
> short-circuits evaluation of arguments, whereas elog() does not.

I should do that, but it's still not really noticeable.
--
Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
Peter Geoghegan wrote:
> > Your patch brings us one step closer to that goal. (The book says
> > that this approach is good for sparse bitsets, but your comment says
> > that we expect something near 50%. That's irrelevant anyway since a
> > future centralised popcount() implementation would do this in
> > word-sized chunks with a hardware instruction or branch-free-per-word
> > lookups in a table and not care at all about sparseness.)
>
> I own a copy of Hacker's Delight (well, uh, Daniel Farina lent me his
> copy about 2 years ago!). pop()/popcount() does seem like a clever
> algorithm that we should probably think about adopting in some cases,
> but I should point out that the current caller to my
> bloom_prop_bits_set() function is an elog() DEBUG1 call. This is not
> at all performance critical.

Eh, if you want to optimize it for the case where debug output is not enabled, make sure to use ereport() not elog(). ereport() short-circuits evaluation of arguments, whereas elog() does not.
--
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] A design for amcheck heapam verification
On Tue, Aug 29, 2017 at 7:22 PM, Thomas Munro
wrote:
> Indeed. Thank you for working on this! To summarise a couple of
> ideas that Peter and I discussed off-list a while back: (1) While
> building the hash table for a hash join we could build a Bloom filter
> per future batch and keep it in memory, and then while reading from
> the outer plan we could skip writing tuples out to future batches if
> there is no chance they'll find a match when read back in later (works
> only for inner joins and only pays off in inverse proportion to the
> join's selectivity);
Right. Hash joins do tend to be very selective, though, so I think
that this could help rather a lot. With just 8 or 10 bits per element,
you can eliminate almost all the batch write-outs on the outer side.
No per-worker synchronization for BufFiles is needed when this
happens, either. It seems like that could be very important.
> To use this for anything involving parallelism where a Bloom filter
> must be shared we'd probably finish up having to create a shared
> version of bloom_init() that either uses caller-provided memory and
> avoids the internal pointer, or allocates DSA memory. I suppose you
> could consider splitting your bloom_init() function up into
> bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
> getting rid of the separate pointer to bitset (ie stick char
> bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?
Makes sense. Not hard to add that.
> I was just observing that there is an opportunity for code reuse here.
> This function's definition would ideally be something like:
>
> double
> bloom_prop_bits_set(bloom_filter *filter)
> {
> return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
> }
>
> This isn't an objection to the way you have it, since we don't have a
> popcount function yet! We have several routines in the tree for
> counting bits, though not yet the complete set from Hacker's Delight.
Right. I'm also reminded of the lookup tables for the visibility/freeze map.
> Your patch brings us one step closer to that goal. (The book says
> that this approach is good for sparse bitsets, but your comment says
> that we expect something near 50%. That's irrelevant anyway since a
> future centralised popcount() implementation would do this in
> word-sized chunks with a hardware instruction or branch-free-per-word
> lookups in a table and not care at all about sparseness.)
I own a copy of Hacker's Delight (well, uh, Daniel Farina lent me his
copy about 2 years ago!). pop()/popcount() does seem like a clever
algorithm that we should probably think about adopting in some cases,
but I should point out that the current caller to my
bloom_prop_bits_set() function is an elog() DEBUG1 call. This is not
at all performance critical.
> + * Test if bloom filter definitely lacks element.
>
> I think where "Bloom filter" appears in prose it should have a capital
> letter (person's name).
Agreed.
--
Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Aug 30, 2017 at 1:00 PM, Peter Geoghegan wrote:
> On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
> wrote:
>> Some drive-by comments on the lib patch:
>
> I was hoping that you'd look at this, since you'll probably want to
> use a bloom filter for parallel hash join at some point. I've tried to
> keep this one as simple as possible. I think that there is a good
> chance that it will be usable for parallel hash join with multiple
> batches. You'll need to update the interface a little bit to make that
> work (e.g., bring-your-own-hash interface), but hopefully the changes
> you'll need will be limited to a few small details.
Indeed. Thank you for working on this! To summarise a couple of
ideas that Peter and I discussed off-list a while back: (1) While
building the hash table for a hash join we could build a Bloom filter
per future batch and keep it in memory, and then while reading from
the outer plan we could skip writing tuples out to future batches if
there is no chance they'll find a match when read back in later (works
only for inner joins and only pays off in inverse proportion to the
join's selectivity); (2) We could push a Bloom filter down to scans
(many other databases do this, and at least one person has tried this
with PostgreSQL and found it to pay off[1]).
To use this for anything involving parallelism where a Bloom filter
must be shared we'd probably finish up having to create a shared
version of bloom_init() that either uses caller-provided memory and
avoids the internal pointer, or allocates DSA memory. I suppose you
could consider splitting your bloom_init() function up into
bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
getting rid of the separate pointer to bitset (ie stick char
bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?
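A rough sketch of what that two-phase interface might look like (hypothetical names and layout, loosely modelled on the struct in the patch; none of this is the patch's actual code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Bloom filter header followed directly by its bitset, so the whole thing
 * can live inside caller-provided (e.g. DSA/DSM) memory with no internal
 * pointer.  Hypothetical sketch, not the patch's code.
 */
typedef struct bloom_filter
{
	int			bloom_power;	/* 2 ^ bloom_power is bitset size in bits */
	int			k_hash_funcs;
	uint32_t	seed;
	unsigned char bitset[];		/* flexible array member, no pointer */
} bloom_filter;

/* how much space the caller must provide for a given bitset size */
static size_t
bloom_estimate(int bloom_power)
{
	return offsetof(bloom_filter, bitset) +
		(((size_t) 1 << bloom_power) / 8);
}

/* initialize a filter inside caller-provided memory */
static bloom_filter *
bloom_init(void *caller_space, int bloom_power, int k_hash_funcs,
		   uint32_t seed)
{
	bloom_filter *filter = (bloom_filter *) caller_space;

	filter->bloom_power = bloom_power;
	filter->k_hash_funcs = k_hash_funcs;
	filter->seed = seed;
	memset(filter->bitset, 0, ((size_t) 1 << bloom_power) / 8);
	return filter;
}
```

With this shape, a parallel hash join worker could call bloom_estimate() to size a shared-memory allocation and then bloom_init() on the mapped address; a convenience function that allocates its own memory layers on top trivially.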
>> +bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
>>
>> I think the plan is to use size_t for new stuff[1].
>
> I'd forgotten.
>
>> This is another my_log2(), right?
>>
>> It'd be nice to replace both with fls() or flsl(), though it's
>> annoying to have to think about long vs int64 etc. We already use
>> fls() in two places and supply an implementation in src/port/fls.c for
>> platforms that lack it (Windows?), but not the long version.
>
> win64 longs are only 32-bits, so my_log2() would do the wrong thing
> for me on that platform. pow2_truncate() is provided with a number of
> bits as its argument, not a number of bytes (otherwise this would
> work).
Hmm. Right.
> Ideally, we'd never use the long integer type, because its width is platform
> dependent, and yet it is only ever used as an alternative to int
> because it is wider than int. One example of where this causes
> trouble: logtape.c uses long ints, so external sorts can use half the
> temp space on win64.
Agreed, "long" is terrible.
>> +bloom_prop_bits_set(bloom_filter *filter)
>> +{
>> +intbitset_bytes = NBITS(filter) / BITS_PER_BYTE;
>> +int64bits_set = 0;
>> +inti;
>> +
>> +for (i = 0; i < bitset_bytes; i++)
>> +{
>> +unsigned char byte = filter->bitset[i];
>> +
>> +while (byte)
>> +{
>> +bits_set++;
>> +byte &= (byte - 1);
>> +}
>> +}
>
> I don't follow what you mean here.
I was just observing that there is an opportunity for code reuse here.
This function's definition would ideally be something like:
double
bloom_prop_bits_set(bloom_filter *filter)
{
return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
}
This isn't an objection to the way you have it, since we don't have a
popcount function yet! We have several routines in the tree for
counting bits, though not yet the complete set from Hacker's Delight.
Your patch brings us one step closer to that goal. (The book says
that this approach is good for sparse bitsets, but your comment says
that we expect something near 50%. That's irrelevant anyway since a
future centralised popcount() implementation would do this in
word-sized chunks with a hardware instruction or branch-free-per-word
lookups in a table and not care at all about sparseness.)
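For illustration only (not proposed code): a word-at-a-time popcount along those lines, with the Kernighan loop kept for comparison. __builtin_popcountll is a GCC/Clang intrinsic; a centralised tree-wide version would need a portable fallback such as a per-byte lookup table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* byte-at-a-time Kernighan loop, equivalent to the patch's inner loop */
static int64_t
popcount_bytes(const unsigned char *buf, size_t nbytes)
{
	int64_t		bits_set = 0;

	for (size_t i = 0; i < nbytes; i++)
	{
		unsigned char byte = buf[i];

		while (byte)
		{
			bits_set++;
			byte &= (byte - 1);
		}
	}
	return bits_set;
}

/* word-at-a-time version; indifferent to how sparse the bitset is */
static int64_t
popcount_words(const unsigned char *buf, size_t nbytes)
{
	int64_t		bits_set = 0;
	size_t		i = 0;

	for (; i + sizeof(uint64_t) <= nbytes; i += sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, buf + i, sizeof(word));	/* avoids alignment traps */
		bits_set += __builtin_popcountll(word);
	}
	for (; i < nbytes; i++)		/* trailing bytes */
		bits_set += __builtin_popcountll(buf[i]);
	return bits_set;
}
```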
+ * Test if bloom filter definitely lacks element.
I think where "Bloom filter" appears in prose it should have a capital
letter (person's name).
[1]
http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] A design for amcheck heapam verification
On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
wrote:
> Some drive-by comments on the lib patch:
I was hoping that you'd look at this, since you'll probably want to
use a bloom filter for parallel hash join at some point. I've tried to
keep this one as simple as possible. I think that there is a good
chance that it will be usable for parallel hash join with multiple
batches. You'll need to update the interface a little bit to make that
work (e.g., bring-your-own-hash interface), but hopefully the changes
you'll need will be limited to a few small details.
> +bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
>
> I think the plan is to use size_t for new stuff[1].
I'd forgotten.
> This is another my_log2(), right?
>
> It'd be nice to replace both with fls() or flsl(), though it's
> annoying to have to think about long vs int64 etc. We already use
> fls() in two places and supply an implementation in src/port/fls.c for
> platforms that lack it (Windows?), but not the long version.
win64 longs are only 32-bits, so my_log2() would do the wrong thing
for me on that platform. pow2_truncate() is provided with a number of
bits as its argument, not a number of bytes (otherwise this would
work).
Ideally, we'd never use the long integer type, because its width is platform
dependent, and yet it is only ever used as an alternative to int
because it is wider than int. One example of where this causes
trouble: logtape.c uses long ints, so external sorts can use half the
temp space on win64.
> +/*
> + * Hash function is taken from sdbm, a public-domain reimplementation of the
> + * ndbm database library.
> + */
> +static uint32
> +sdbmhash(unsigned char *elem, Size len)
> +{
> I see that this is used in gawk, BerkeleyDB and all over the place[2].
> Nice. I understand that the point of this is to be a hash function
> that is different from our usual one, for use by k_hashes().
Right. Its only job is to be a credible hash function that isn't
derivative of hash_any().
> Do you
> think it belongs somewhere more common than this? It seems a bit like
> our hash-related code is scattered all over the place but should be
> consolidated, but I suppose that's a separate project.
Unsure. In its defense, there is also a private murmurhash one-liner
within tidbitmap.c. I don't mind changing this, but it's slightly odd
to expose a hash function whose only job is to be completely unrelated
to hash_any().
> Unnecessary braces here and elsewhere for single line body of for loops.
>
> +bloom_prop_bits_set(bloom_filter *filter)
> +{
> +intbitset_bytes = NBITS(filter) / BITS_PER_BYTE;
> +int64bits_set = 0;
> +inti;
> +
> +for (i = 0; i < bitset_bytes; i++)
> +{
> +unsigned char byte = filter->bitset[i];
> +
> +while (byte)
> +{
> +bits_set++;
> +byte &= (byte - 1);
> +}
> +}
I don't follow what you mean here.
--
Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Aug 30, 2017 at 8:34 AM, Thomas Munro wrote:
> It'd be nice to replace both with fls() or flsl(), though it's
> annoying to have to think about long vs int64 etc. We already use
> fls() in two places and supply an implementation in src/port/fls.c for
> platforms that lack it (Windows?), but not the long version.

Yes, you can complain about MSVC compilation here.
--
Michael
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Aug 30, 2017 at 7:58 AM, Peter Geoghegan wrote:
> On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan wrote:
>> I spent only a few hours writing a rough prototype, and came up with
>> something that does an IndexBuildHeapScan() scan following the
>> existing index verification steps. Its amcheck callback does an
>> index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
>> all), and tests it for membership of a bloom filter generated as part
>> of the main B-Tree verification phase. The IndexTuple memory is freed
>> immediately following hashing.
>
> I attach a cleaned-up version of this. It has extensive documentation.
> My bloom filter implementation is broken out as a separate patch,
> added as core infrastructure under "lib".
Some drive-by comments on the lib patch:
+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
I think the plan is to use size_t for new stuff[1].
+/*
+ * Which element of the sequence of powers-of-two is less than or equal to n?
+ *
+ * Used to size bitset, which in practice is never allowed to exceed 2 ^ 30
+ * bits (128MB). This frees us from giving further consideration to int
+ * overflow.
+ */
+static int
+pow2_truncate(int64 target_bitset_size)
+{
+int v = 0;
+
+while (target_bitset_size > 0)
+{
+v++;
+target_bitset_size = target_bitset_size >> 1;
+}
+
+return Min(v - 1, 30);
+}
This is another my_log2(), right?
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, Size len)
+{
+uint32hash = 0;
+inti;
+
+for (i = 0; i < len; elem++, i++)
+{
+hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+}
+
+return hash;
+}
I see that this is used in gawk, BerkeleyDB and all over the place[2].
Nice. I understand that the point of this is to be a hash function
that is different from our usual one, for use by k_hashes(). Do you
think it belongs somewhere more common than this? It seems a bit like
our hash-related code is scattered all over the place but should be
consolidated, but I suppose that's a separate project.
Unnecessary braces here and elsewhere for single line body of for loops.
+bloom_prop_bits_set(bloom_filter *filter)
+{
+intbitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+int64bits_set = 0;
+inti;
+
+for (i = 0; i < bitset_bytes; i++)
+{
+unsigned char byte = filter->bitset[i];
+
+while (byte)
+{
+bits_set++;
+byte &= (byte - 1);
+}
+}
Sorry I didn't follow up with my threat[3] to provide a central
popcount() function to replace the implementations all over the tree.
[1] https://www.postgresql.org/message-id/25076.1489699457%40sss.pgh.pa.us
[2] http://www.cse.yorku.ca/~oz/hash.html
[3]
https://www.postgresql.org/message-id/CAEepm%3D3g1_fjJGp38QGv%2B38BC2HHVkzUq6s69nk3mWLgPHqC3A%40mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] A design for amcheck heapam verification
On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan wrote:
> I spent only a few hours writing a rough prototype, and came up with
> something that does an IndexBuildHeapScan() scan following the
> existing index verification steps. Its amcheck callback does an
> index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
> all), and tests it for membership of a bloom filter generated as part
> of the main B-Tree verification phase. The IndexTuple memory is freed
> immediately following hashing.

I attach a cleaned-up version of this. It has extensive documentation. My bloom filter implementation is broken out as a separate patch, added as core infrastructure under "lib".

I do have some outstanding concerns about V1 of the patch series:

* I'm still uncertain about the question of using IndexBuildHeapScan() during Hot Standby. It seems safe, since we're only using the CONCURRENTLY/AccessShareLock path when this happens, but someone might find that objectionable on general principle. For now, in this first version, it remains possible to call IndexBuildHeapScan() during Hot Standby, to allow the new verification to work there.

* The bloom filter has been experimentally verified, and is based on source material which is directly cited. It would nevertheless be useful to have the hashing stuff scrutinized, because it's possible that I've overlooked some subtlety.

This is only the beginning for heapam verification. Comprehensive coverage can be added later, within routines that specifically target some table, not some index. While this patch series only adds index-to-heap verification for B-Tree indexes, I can imagine someone adopting the same technique to verify that other access methods are consistent with their heap relation. For example, it would be easy to do this with hash indexes. Any other index access method where the same high-level principle that I rely on applies can do index-to-heap verification with just a few tweaks.
I'm referring to the high-level principle that comments specifically point out in the patch: that REINDEX leaves you with an index structure that has exactly the same entries as the old index structure had, though possibly with fewer dead index tuples.

I like my idea of reusing IndexBuildHeapScan() for verification. Very few new LOCs are actually added to amcheck by this patch, and IndexBuildHeapScan() is itself tested.
--
Peter Geoghegan

From 48499cfb58b7bf705e93fb12cc5359ec12cd9c51 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.

Add a new, optional capability to bt_index_check() and bt_index_parent_check(): callers can check that each heap tuple that ought to have an index entry does in fact have one. This happens at the end of the existing verification checks.

This is implemented by using a Bloom filter data structure. The implementation performs set membership tests within a callback (the same type of callback that each index AM registers for CREATE INDEX). The Bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile | 2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql | 28
 contrib/amcheck/amcheck.control | 2 +-
 contrib/amcheck/expected/check_btree.out | 14 +-
 contrib/amcheck/sql/check_btree.sql | 9 +-
 contrib/amcheck/verify_nbtree.c | 236 ---
 doc/src/sgml/amcheck.sgml | 103 +++---
 7 files changed, 345 insertions(+), 49 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
 OBJS = verify_nbtree.o $(WIN32RES)
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+L
Re: [HACKERS] A design for amcheck heapam verification
On Mon, May 1, 2017 at 6:39 PM, Peter Geoghegan wrote:
> On Mon, May 1, 2017 at 6:20 PM, Tom Lane wrote:
>> Maybe you can fix this by assuming that your own session's advertised xmin
>> is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not
>> sure if that rule does what you want.
>
> That's what you might ultimately need to fall back on (that, or
> perhaps repeated calls to GetOldestXmin() to recheck, in the style of
> pg_visibility).

I spent only a few hours writing a rough prototype, and came up with something that does an IndexBuildHeapScan() scan following the existing index verification steps. Its amcheck callback does an index_form_tuple() call, hashes the resulting IndexTuple (heap TID and all), and tests it for membership of a bloom filter generated as part of the main B-Tree verification phase. The IndexTuple memory is freed immediately following hashing.

The general idea here is that whatever IndexTuples ought to be in the index following a fresh REINDEX also ought to have been found in the index already. IndexBuildHeapScan() takes care of almost all of the details for us.

I think I can do this correctly when only an AccessShareLock is acquired on heap + index, provided I also do a separate GetOldestXmin() before even the index scan begins, and do a final recheck of the saved GetOldestXmin() value against heap tuple xmin within the new IndexBuildHeapScan() callback (if we still think that it should have been found by the index scan, then actually throw a corruption related error). When there is only a ShareLock (for bt_parent_index_check() calls), the recheck isn't necessary.

I think I should probably also make the IndexBuildHeapScan()-passed indexInfo structure "ii_Unique = false", since waiting for the outcome of a concurrent conflicting unique index insertion isn't useful, and can cause deadlocks.
While I haven't really made my mind up, this design is extremely simple, and effectively tests IndexBuildHeapScan() at the same time as everything else. The addition of the bloom filter itself isn't trivial, but the code added to verify_nbtree.c is.

The downside of going this way is that I cannot piggyback other types of heap verification on the IndexBuildHeapScan() scan. Still, perhaps it's worth it. Perhaps I should implement this bloom-filter-index-heap verification step as one extra option for the existing B-Tree verification functions. I may later add new verification functions that examine and verify the heap and related SLRUs alone.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
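To make the shape of the check concrete, here is a toy model of it (all names, the tiny filter, and the string "entries" are mine; the patch hashes real index_form_tuple() output into its real Bloom filter): fingerprint every entry the index scan sees, then, during the heap pass, flag any tuple whose would-be entry is definitely absent.

```c
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 4096
#define K_HASHES 3

/* toy stand-in for the patch's Bloom filter bitset */
static unsigned char filter_bitset[FILTER_BITS / 8];

/* sdbm-style hash, seeded so we get K_HASHES distinct functions */
static uint32_t
toy_hash(const char *elem, uint32_t seed)
{
	uint32_t	hash = seed;

	for (; *elem != '\0'; elem++)
		hash = (uint32_t) (unsigned char) *elem +
			(hash << 6) + (hash << 16) - hash;
	return hash;
}

/* called once per entry during the "index scan" phase */
static void
filter_add(const char *index_entry)
{
	for (uint32_t i = 0; i < K_HASHES; i++)
	{
		uint32_t	bit = toy_hash(index_entry, i) % FILTER_BITS;

		filter_bitset[bit / 8] |= (unsigned char) (1 << (bit % 8));
	}
}

/* during the "heap scan": true means definitely absent => corruption */
static int
filter_definitely_lacks(const char *would_be_entry)
{
	for (uint32_t i = 0; i < K_HASHES; i++)
	{
		uint32_t	bit = toy_hash(would_be_entry, i) % FILTER_BITS;

		if ((filter_bitset[bit / 8] & (1 << (bit % 8))) == 0)
			return 1;
	}
	return 0;
}
```

A false positive here only means a genuinely missing entry can go unreported in a given run, never a spurious corruption report, which is why a probabilistic summary of the index suffices for this kind of verification.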
Re: [HACKERS] A design for amcheck heapam verification
On Mon, May 1, 2017 at 9:20 PM, Tom Lane wrote:
> ISTM if you want to do that you have an inherent race condition.
> That is, no matter what you do, the moment after you look the currently
> oldest open transaction could commit, allowing some other session's
> view of RecentGlobalXmin to move past what you think it is, so that
> that session could start pruning stuff.

It can't prune the stuff we care about if we've got a shared content lock on the target buffer. That's the trick pg_visibility uses:

/*
 * Time has passed since we computed OldestXmin, so it's
 * possible that this tuple is all-visible in reality even
 * though it doesn't appear so based on our
 * previously-computed value.  Let's compute a new value so we
 * can be certain whether there is a problem.
 *
 * From a concurrency point of view, it sort of sucks to
 * retake ProcArrayLock here while we're holding the buffer
 * exclusively locked, but it should be safe against
 * deadlocks, because surely GetOldestXmin() should never take
 * a buffer lock.  And this shouldn't happen often, so it's
 * worth being careful so as to avoid false positives.
 */
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] A design for amcheck heapam verification
On Mon, May 1, 2017 at 6:20 PM, Tom Lane wrote:
> Maybe you can fix this by assuming that your own session's advertised xmin
> is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not
> sure if that rule does what you want.

That's what you might ultimately need to fall back on (that, or perhaps repeated calls to GetOldestXmin() to recheck, in the style of pg_visibility). It's useful to do rechecking, rather than just starting with the MVCC snapshot's xmin, because you might be able to determine that the absence of some index tuple in the index (by which I mean its bloom filter) *still* cannot be explained by concurrent recycling. The conclusion that there is a real problem might never have been reached without this extra complexity.

I'm not saying that it's worthwhile to add this complexity, rather than just starting with the MVCC snapshot's xmin in the first place -- I really don't have an opinion either way just yet.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
Re: [HACKERS] A design for amcheck heapam verification
Peter Geoghegan writes:
> If it's not clear what I mean: existing code that cares about
> RecentGlobalXmin is using it as a *conservative* point before which
> every snapshot sees every transaction as committed/aborted (and
> therefore nobody can care if that other backend hot prunes dead tuples
> from before then, or whatever it is). Whereas, amcheck needs to care
> about the possibility that *anyone else* decided that pruning or
> whatever is okay, based on generic criteria, and not what amcheck
> happened to see as RecentGlobalXmin during snapshot acquisition.

ISTM if you want to do that you have an inherent race condition. That is, no matter what you do, the moment after you look the currently oldest open transaction could commit, allowing some other session's view of RecentGlobalXmin to move past what you think it is, so that that session could start pruning stuff.

Maybe you can fix this by assuming that your own session's advertised xmin is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not sure if that rule does what you want.

regards, tom lane
Re: [HACKERS] A design for amcheck heapam verification
On Mon, May 1, 2017 at 4:28 PM, Peter Geoghegan wrote:
> Anyone have an opinion on any of this? Offhand, I think that calling
> GetOldestXmin() once per index when its "amcheck whole index scan"
> finishes would be safe, and yet provide appreciably better test
> coverage than only expecting things visible to our original MVCC
> snapshot to be present in the index. I don't see a great reason to be
> more aggressive and call GetOldestXmin() more often than once per
> whole index scan, though.

Wait, that's wrong, because in general RecentGlobalXmin may advance at any time as new snapshots are acquired by other backends. The only thing that we know for sure is that our MVCC snapshot is an interlock against things being recycled that the snapshot needs to see (according to MVCC semantics). And, we don't just have heap pruning to worry about -- we also have nbtree's LP_DEAD based recycling to worry about, before and during the amcheck full index scan (actually, this is probably the main source of problematic recycling for our verification protocol).

So, I think that we could call GetOldestXmin() once, provided we were willing to recheck in the style of pg_visibility if and when there was an apparent violation that might be explained as caused by concurrent LP_DEAD recycling within nbtree. That seems complicated enough that I'll never be able to convince myself that it's worthwhile before actually trying to write the code.

--
Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Mon, May 1, 2017 at 2:10 PM, Peter Geoghegan wrote:
> Actually, I guess amcheck would need to use its own scan's snapshot
> xmin instead. This is true because it cares about visibility in a way
> that's "backwards" relative to existing code that tests something
> against RecentGlobalXmin. Is there any existing thing that works that
> way?

Looks like pg_visibility has a similar set of concerns, and so sometimes calls GetOldestXmin() to "recompute" what it calls OldestXmin (which I gather is like RecentGlobalXmin, but comes from calling GetOldestXmin() at least once). This happens within pg_visibility's collect_corrupt_items().

So, I could either follow that approach, or, more conservatively, call GetOldestXmin() immediately after each "amcheck whole index scan" finishes, for use later on, when we go to the heap. Within the heap, we expect that any committed tuple whose xmin precedes FooIndex.OldestXmin should be present in that index's bloom filter. Of course, when there are multiple indexes, we might only arrive at the heap much later. (I guess we'd also want to check if the MVCC snapshot's xmin preceded FooIndex.OldestXmin, and set that as FooIndex.OldestXmin when that happened to be the case.)

Anyone have an opinion on any of this? Offhand, I think that calling GetOldestXmin() once per index when its "amcheck whole index scan" finishes would be safe, and yet provide appreciably better test coverage than only expecting things visible to our original MVCC snapshot to be present in the index. I don't see a great reason to be more aggressive and call GetOldestXmin() more often than once per whole index scan, though.

--
Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On 1 May 2017 at 20:46, Robert Haas wrote:
> One problem is that Bloom filters assume you can get
> n independent hash functions for a given value, which we have not got.
> That problem would need to be solved somehow. If you only have one
> hash function, the size of the required bloom filter probably gets
> very large.

There's a simple formula to calculate the optimal number of hash functions and size of the filter given a target false positive rate. But I don't think this is as big of a problem as you imagine.

a) We don't really have only one hash function -- we have a 32-bit hash function, and we could expand that to a larger bit size if we wanted. Bloom filters are never 2^32-bit arrays, for obvious reasons. If you have a 1kbit-sized bloom filter, each probe needs only 10 bits of index, so a 32-bit hash already gives you three fully independent hash functions. If we changed to a 64-bit or 128-bit hash function, you would have enough bits available for a larger set of hash functions and a larger array.

b) You can get a poor man's universal hash out of hash_any or hash_int by just tweaking the input value in a way that doesn't interact in a simple way with the hash function. Even something as simple as xoring it with a random number (i.e. a vector of random numbers that identifies your randomly chosen distinct "hash functions") seems to work fine.

However, for future-proofing and security hardening, I think Postgres should really implement a mathematically rigorous universal hashing scheme, which provides a family of hash functions from which to pick randomly. This prevents users from being able to generate data that would intentionally perform poorly in hash data structures, for example. But it also means you have a whole family of hash functions to pick from for bloom filters.

--
greg
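[Editor's note: the "simple formula" Greg mentions is the standard Bloom filter sizing math, which can be sketched directly. This is an illustrative snippet, not code from the patch; the function name `bloom_params` is invented for the example.]

```python
import math

def bloom_params(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Optimal Bloom filter sizing for n_items elements at a target
    false-positive rate p:  m = -n * ln(p) / (ln 2)^2  bits, and
    k = (m / n) * ln 2  hash functions."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# A 1% false-positive rate costs roughly 9.6 bits per element, with 7 probes.
m, k = bloom_params(1_000_000, 0.01)
```

Note that the per-element cost depends only on the target false-positive rate, not on the element count, which is part of what makes the structure attractive for bounded-memory verification.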
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Apr 28, 2017 at 6:02 PM, Peter Geoghegan wrote:
> - Is committed, and committed before RecentGlobalXmin.

Actually, I guess amcheck would need to use its own scan's snapshot xmin instead. This is true because it cares about visibility in a way that's "backwards" relative to existing code that tests something against RecentGlobalXmin. Is there any existing thing that works that way?

If it's not clear what I mean: existing code that cares about RecentGlobalXmin is using it as a *conservative* point before which every snapshot sees every transaction as committed/aborted (and therefore nobody can care if that other backend hot prunes dead tuples from before then, or whatever it is). Whereas, amcheck needs to care about the possibility that *anyone else* decided that pruning or whatever is okay, based on generic criteria, and not what amcheck happened to see as RecentGlobalXmin during snapshot acquisition.

--
Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Mon, May 1, 2017 at 12:46 PM, Robert Haas wrote:
> Bloom filters are one of those things that come up on this mailing
> list incredibly frequently but rarely get used in committed code; thus
> far, contrib/bloom is the only example we've got, and not for lack of
> other proposals.

They certainly are a fashionable data structure, but it's not as if they're a new idea. The math behind them is very well understood. They solve one narrow class of problem very well.

> One problem is that Bloom filters assume you can get
> n independent hash functions for a given value, which we have not got.
> That problem would need to be solved somehow. If you only have one
> hash function, the size of the required bloom filter probably gets
> very large.

I don't think that that's a problem, because you really only need 2 hash functions [1], which we have already (recall that Andres added Murmur hash to Postgres 10). It's an area that I'd certainly need to do more research on if I'm to go forward with bloom filters, but I'm pretty confident that there is a robust solution to the practical problem of not having arbitrarily many hash functions at hand. (I think that you rarely need all that many independent hash functions, in any case.)

It isn't that hard to evaluate whether or not an implementation has things right, at least for a variety of typical cases. We know how to objectively evaluate a hash function while making only some pretty mild assumptions.

> When hashing index and heap tuples, do you propose to include the heap
> TID in the data getting hashed? I think that would be a good idea,
> because otherwise you're only verifying that every heap tuple has an
> index pointer pointing at something, not that every heap tuple has an
> index tuple pointing at the right thing.

Yes -- I definitely want to hash the heap TID from each IndexTuple.

> I wonder if it's also worth having a zero-error mode, even if it runs
> for a long time. Scan the heap, and probe the index for the value
> computed from each heap tuple. Maybe that's so awful that nobody
> would ever use it, but I'm not sure. It might actually be simpler to
> implement than what you have in mind.

It's easy if you don't mind that the implementation will be an ad-hoc nested loop join. I guess I could do that too, if only because it won't be that hard, and that's really what you want when you know you have corruption. Performance will probably be prohibitively poor when verification needs to be run in any kind of routine way, which is a problem if that's the only way it can work. My sense is that verification needs to be reasonably low overhead, and it needs to perform pretty consistently, even if you only use it for stress-testing new features.

To reiterate, another thing that makes a bloom filter attractive is how it simplifies resource management relative to an approach involving sorting or a hash table. There are a bunch of edge cases that I don't have to worry about around resource management (e.g., a subset of very wide outlier IndexTuples, or two indexes that are of very different sizes associated with the same table that need to receive an even share of memory). As I said, even if I was totally willing to duplicate the effort that went into respecting work_mem as a budget within places like tuplesort.c, having as little infrastructure code as possible is a specific goal for amcheck.

[1] https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

--
Peter Geoghegan
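[Editor's note: the claim that two hash functions suffice refers to double hashing, where the i-th probe is derived as h1 + i*h2 (the technique in the Kirsch/Mitzenmacher paper cited at [1]). The following is a minimal editorial sketch of that scheme; it uses SHA-256 as a stand-in for the two base hashes, whereas the patch itself works with Postgres's own hash functions, and the class name is invented for the example.]

```python
import hashlib

class BloomFilter:
    """Bloom filter sketch using double hashing: the k probe positions
    are derived from just two base hash values as (h1 + i*h2) mod m."""

    def __init__(self, m_bits: int, k: int):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _indexes(self, item: bytes):
        # Split one wide digest into two independent 64-bit base hashes.
        d = hashlib.sha256(item).digest()
        h1 = int.from_bytes(d[:8], 'little')
        # Force h2 odd so the stride is coprime with a power-of-two m.
        h2 = int.from_bytes(d[8:16], 'little') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True means probably present.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))
```

The one-sided error matches the verification design discussed here: a tuple reported absent from the filter definitely lacks an index entry, while false positives only cost detection coverage, never a spurious corruption report.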
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Apr 28, 2017 at 9:02 PM, Peter Geoghegan wrote:
> I'd like to hear feedback on the general idea, and what the
> user-visible interface ought to look like. The non-deterministic false
> negatives may need to be considered by the user visible interface,
> which is the main reason I mention it.

Bloom filters are one of those things that come up on this mailing list incredibly frequently but rarely get used in committed code; thus far, contrib/bloom is the only example we've got, and not for lack of other proposals. One problem is that Bloom filters assume you can get n independent hash functions for a given value, which we have not got. That problem would need to be solved somehow. If you only have one hash function, the size of the required bloom filter probably gets very large.

When hashing index and heap tuples, do you propose to include the heap TID in the data getting hashed? I think that would be a good idea, because otherwise you're only verifying that every heap tuple has an index pointer pointing at something, not that every heap tuple has an index tuple pointing at the right thing.

I wonder if it's also worth having a zero-error mode, even if it runs for a long time. Scan the heap, and probe the index for the value computed from each heap tuple. Maybe that's so awful that nobody would ever use it, but I'm not sure. It might actually be simpler to implement than what you have in mind.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
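[Editor's note: the zero-error mode proposed above, and the point about including the heap TID, can be illustrated with a toy model. This sketch uses an invented dict/set data model standing in for the heap and index; the real check would probe the B-Tree once per heap tuple.]

```python
def find_unindexed(heap: dict, index: set) -> list:
    """Zero-error verification sketch: heap maps TID -> indexed key, and
    index is a set of (key, tid) pairs.  Probing with the TID included
    verifies that the index entry points at the *right* heap tuple, not
    merely that some entry exists for the key."""
    return [(tid, key) for tid, key in heap.items()
            if (key, tid) not in index]
```

Unlike the Bloom filter approach, this reports no false negatives, but it is effectively an ad-hoc nested loop join, one index probe per heap tuple, which is why the thread treats it as too slow for routine use.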
