Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On Wed, Jun 8, 2016 at 1:23 AM, Tom Lanewrote: > "David G. Johnston" writes: >> On Tue, Jun 7, 2016 at 1:35 AM, Michael Paquier >> wrote: >>> I have finally given a shot at improving the docs with the attached. >>> Comments are welcome. > >> [ assorted comments ] > > I adopted most of David's suggestions, whacked it around a bit further > myself, and committed. See what you think. That looks better, thanks. >> It would be nice to give guidance on selecting a bit size for columns and >> a signature length. Yes, Wikipedia covers the topic but to get the reader >> started some discussion of the relevant trade-offs when using larger >> numbers than the default would be nice. I don't suspect using smaller the >> default values is apt to be worthwhile... > > Agreed, but I didn't want to write such text myself. There's room for > further improvement here. I did add a note in the main example about > what happens with a non-default signature length, but that hardly > constitutes guidance. > > BTW, it seemed to me while generating the example that the planner's > costing for bloom index searches was unduly pessimistic; maybe there's > work to do there? I wanted them to do so to prove that index rechecks are necessary as false positives can be returned when scanning the index. We could add an extra example with an index that has a longer signature size... I am not sure that's worth the complication though. I am marking this item as closed, in my view things are looking far better. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
"David G. Johnston"writes: > Do I understand the process correctly? The current 9.6 docs reflect what > was committed and branched as beta1. Ongoing work is done against master > (devel docs). When beta2 is released it is branched from the current > master; not beta1. At that point the current devel docs will become the > new 9.6 docs. There is no beta1 branch. 9.6 is still the master branch, and beta2 will be stamped on master, as will at least the next couple of betas. We will branch off REL9_6_STABLE whenever we're ready to open the tree for 9.7^H^H^H10 development, which likely won't be till September. (The point here is to try to keep people focused on stabilizing 9.6 for awhile longer yet. Last year we opened new development in the summer, and that had a detrimental effect on getting 9.5 out the door.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On Tue, Jun 7, 2016 at 12:23 PM, Tom Lanewrote: > "David G. Johnston" writes: > > On Tue, Jun 7, 2016 at 1:35 AM, Michael Paquier < > michael.paqu...@gmail.com> > > wrote: > >> I have finally given a shot at improving the docs with the attached. > >> Comments are welcome. > > > [ assorted comments ] > > I adopted most of David's suggestions, whacked it around a bit further > myself, and committed. See what you think. > Looks good though I'm waiting for the website to update. Do I understand the process correctly? The current 9.6 docs reflect what was committed and branched as beta1. Ongoing work is done against master (devel docs). When beta2 is released it is branched from the current master; not beta1. At that point the current devel docs will become the new 9.6 docs. Thanks! David J.
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
"David G. Johnston"writes: > On Tue, Jun 7, 2016 at 1:35 AM, Michael Paquier > wrote: >> I have finally given a shot at improving the docs with the attached. >> Comments are welcome. > [ assorted comments ] I adopted most of David's suggestions, whacked it around a bit further myself, and committed. See what you think. > âIt would be nice to give guidance on selecting a bit size for columns and > a signature lengthâ. Yes, Wikipedia covers the topic but to get the reader > started some discussion of the relevant trade-offs when using larger > numbers than the default would be nice. I don't suspect using smaller the > default values is apt to be worthwhile... Agreed, but I didn't want to write such text myself. There's room for further improvement here. I did add a note in the main example about what happens with a non-default signature length, but that hardly constitutes guidance. BTW, it seemed to me while generating the example that the planner's costing for bloom index searches was unduly pessimistic; maybe there's work to do there? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On Tue, Jun 7, 2016 at 1:35 AM, Michael Paquierwrote: > On Fri, Jun 3, 2016 at 11:25 PM, Tom Lane wrote: > > Michael Paquier writes: > >> Actually, the docs could be more polished. > > > > I think the docs could stand to be rewritten from scratch ;-). But > > upthread there was an offer to work on them if we made the code behavior > > saner. I've done the latter part, I don't want to do the former. > > I have finally given a shot at improving the docs with the attached. > Comments are welcome. It would be nice to give guidance on selecting a bit size for columns and a signature length. Yes, Wikipedia covers the topic but to get the reader started some discussion of the relevant trade-offs when using larger numbers than the default would be nice. I don't suspect using smaller the default values is apt to be worthwhile... David J.
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On Tue, Jun 7, 2016 at 1:35 AM, Michael Paquierwrote: > On Fri, Jun 3, 2016 at 11:25 PM, Tom Lane wrote: > > Michael Paquier writes: > >> Actually, the docs could be more polished. > > > > I think the docs could stand to be rewritten from scratch ;-). But > > upthread there was an offer to work on them if we made the code behavior > > saner. I've done the latter part, I don't want to do the former. > > I have finally given a shot at improving the docs with the attached. > Comments are welcome. > Looks good. Thanks! Some minor word-smithing related stuff and one definitional concern: "of all indexed attributes and so it can report false positives" -> of all indexed attributes and as such is prone to reporting false positives; "in the set, however" -> "in the set although" "one only needs a single bloom index (default 80, maximum 4096)" -> the default seems like it would be better placed in the first paragraph of the intro where "whose size in calculated in bits" is mentioned; or better yet dropped altogether since the parameters section covers the defaults. *** "to the number of the column for" - the examples imply that each parameter refers to columns by name, not number. "a bloom index representing first the advantage to be more" - this intro to the example needs some work. maybe: "Here is a more complete example of index definition and usage, as well as a comparison with the equivalent btree index. The bloom index is considerably smaller as well as performs better than the btree index. ---As an aside, is a multi-column index really a fair comparison here? ---Leaving a sequential scan explain analyze in place should be considered. "The Bloom opclass interface" -> The Bloom opclass interface requires a hash function for the indexing datatype and an equality operator for searching. The example...(drop the simple conclusion the word the equality operator part better). "are implemented with the module" - are supplied by this module. (side question, for 10.0 how about we call these extensions instead of modules?) David J.
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On Fri, Jun 3, 2016 at 11:25 PM, Tom Lanewrote: > Michael Paquier writes: >> Actually, the docs could be more polished. > > I think the docs could stand to be rewritten from scratch ;-). But > upthread there was an offer to work on them if we made the code behavior > saner. I've done the latter part, I don't want to do the former. I have finally given a shot at improving the docs with the attached. Comments are welcome. -- Michael diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml index 8667763..c1e204f 100644 --- a/doc/src/sgml/bloom.sgml +++ b/doc/src/sgml/bloom.sgml @@ -8,43 +8,38 @@ - bloom is a module that implements an index access method. It comes - as an example of custom access methods and generic WAL record usage. But it - is also useful in itself. + bloom provides an index access method usable as a + http://en.wikipedia.org/wiki/Bloom_filter;>Bloom filter. - - Introduction - - - The implementation of a - http://en.wikipedia.org/wiki/Bloom_filter;>Bloom filter - allows fast exclusion of non-candidate tuples via signatures. - Since a signature is a lossy representation of all indexed attributes, - search results must be rechecked using heap information. - The user can specify signature length in bits (default 80, maximum 4096) - and the number of bits generated for each index column (default 2, - maximum 4095). - + + A bloom filter is a space-efficient data structure that is used to test + whether an element is a member of a set. In the case of an index access + method, it allows fast exclusion of non-candidate tuples via signatures + whose size is calculated in bits and is controllable at index creation. + - - This index is useful if a table has many attributes and queries include - arbitrary combinations of them. A traditional btree index is - faster than a bloom index, but it can require many indexes to support all - possible queries where one needs only a single bloom index. A Bloom index - supports only equality comparison. Since it's a signature file, and not a - tree, it always must be read fully, but sequentially, so that index search - performance is constant and doesn't depend on a query. - - + + A signature is a lossy representation of all indexed attributes and + so it can report false positives (it is reported that an element exists in + the set, however it does not), so search results must be rechecked using + heap information. + + + + This type of index is useful if a table has many attributes and queries + include arbitrary combinations of them. A traditional btree + index is faster than a bloom index, but it can require many indexes to + support all possible queries where one needs only a single bloom index + (default 80, maximum 4096). + Parameters - bloom indexes accept the following parameters in the - WITH - clause. + A bloom index accepts the following parameters in its + WITH clause. @@ -52,7 +47,8 @@ length - Length of signature in bits + Length of signature in bits. Default is 80 and maximum is + 4096. @@ -62,7 +58,9 @@ col1 col32 - Number of bits generated for each index column + Number of bits generated for each index column. Each parameter refers + to the number of the column for the relation on which an index is + defined. @@ -82,91 +80,88 @@ CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3) - Here, we created a bloom index with a signature length of 80 bits, - and attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. + Here, a bloom index is created with a signature length of 80 bits, and + attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. - Here is a fuller example of index definition and usage: + Here is a more complete example of index definition and usage, a bloom + index representing first the advantage to be more space-efficient + than an equivalent btree index: -CREATE TABLE tbloom AS -SELECT -random()::int as i1, -random()::int as i2, -random()::int as i3, -random()::int as i4, -random()::int as i5, -random()::int as i6, -random()::int as i7, -random()::int as i8, -random()::int as i9, -random()::int as i10, -random()::int as i11, -random()::int as i12, -random()::int as i13 -FROM -generate_series(1,1000); -CREATE INDEX bloomidx ON tbloom USING - bloom (i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12); -SELECT pg_relation_size('bloomidx'); -CREATE index btree_idx ON tbloom(i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12); -SELECT pg_relation_size('btree_idx'); - - - -=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15; - QUERY PLAN
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
Jeff Janeswrites: > One thing from the commit-message: > "On-disk, we can still store it in words, so as to not break on-disk > compatibility with beta1." > Hasn't that ship already sailed? No. > Or is the concern about intra-version pg_upgrade rather than direct > on-disk compatibility? This. You can pg_upgrade across a catversion bump, but not across changes in user table or index contents. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On Thu, Jun 2, 2016 at 9:03 PM, Tom Lanewrote: > I wrote: >> Jeff Janes writes: >>> My biggest gripe with it at the moment is that the signature size should be >>> expressed in bits, and then internally rounded up to a multiple of 16, >>> rather than having it be expressed in 'uint16'. >>> If that were done it would be easier to fix the documentation to be more >>> understandable. > >> +1 ... that sort of definition seems much more future-proof, too. >> IMO it's not too late to change this. (We probably don't want to change >> the on-disk representation of the reloptions, but we could convert from >> bits to words in bloptions().) > > There were no objections to this, but also no action. Attached is a draft > patch ... any complaints? One thing from the commit-message: "On-disk, we can still store it in words, so as to not break on-disk compatibility with beta1." Hasn't that ship already sailed? from beta1 to HEAD: The database cluster was initialized with CATALOG_VERSION_NO 201605051, but the server was compiled with CATALOG_VERSION_NO 201605191. Or is the concern about intra-version pg_upgrade rather than direct on-disk compatibility? Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
Michael Paquierwrites: > Actually, the docs could be more polished. I think the docs could stand to be rewritten from scratch ;-). But upthread there was an offer to work on them if we made the code behavior saner. I've done the latter part, I don't want to do the former. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On Fri, Jun 3, 2016 at 3:11 PM, Michael Paquierwrote: > On Fri, Jun 3, 2016 at 1:03 PM, Tom Lane wrote: >> I wrote: >>> Jeff Janes writes: My biggest gripe with it at the moment is that the signature size should be expressed in bits, and then internally rounded up to a multiple of 16, rather than having it be expressed in 'uint16'. If that were done it would be easier to fix the documentation to be more understandable. >> >>> +1 ... that sort of definition seems much more future-proof, too. >>> IMO it's not too late to change this. (We probably don't want to change >>> the on-disk representation of the reloptions, but we could convert from >>> bits to words in bloptions().) >> >> There were no objections to this, but also no action. Attached is a draft >> patch ... any complaints? > > None. This looks rather sane to me. Actually, the docs could be more polished. "Limitation" should be changed to "Limitations", and "Opclass interface" to "Operator Class Interface". The current wording "Seqscan is slow" is not clear, this should mention a sequential scan instead. And it is not that slow, just slower than the heap index scan of bloom.. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On Fri, Jun 3, 2016 at 1:03 PM, Tom Lanewrote: > I wrote: >> Jeff Janes writes: >>> My biggest gripe with it at the moment is that the signature size should be >>> expressed in bits, and then internally rounded up to a multiple of 16, >>> rather than having it be expressed in 'uint16'. >>> If that were done it would be easier to fix the documentation to be more >>> understandable. > >> +1 ... that sort of definition seems much more future-proof, too. >> IMO it's not too late to change this. (We probably don't want to change >> the on-disk representation of the reloptions, but we could convert from >> bits to words in bloptions().) > > There were no objections to this, but also no action. Attached is a draft > patch ... any complaints? None. This looks rather sane to me. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
I wrote: > Jeff Janeswrites: >> My biggest gripe with it at the moment is that the signature size should be >> expressed in bits, and then internally rounded up to a multiple of 16, >> rather than having it be expressed in 'uint16'. >> If that were done it would be easier to fix the documentation to be more >> understandable. > +1 ... that sort of definition seems much more future-proof, too. > IMO it's not too late to change this. (We probably don't want to change > the on-disk representation of the reloptions, but we could convert from > bits to words in bloptions().) There were no objections to this, but also no action. Attached is a draft patch ... any complaints? regards, tom lane diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h index c21eebf..a9daf63 100644 *** a/contrib/bloom/bloom.h --- b/contrib/bloom/bloom.h *** typedef BloomPageOpaqueData *BloomPageOp *** 79,96 #define BLOOM_HEAD_BLKNO (1) /* first data page */ /* ! * Maximum of bloom signature length in uint16. Actual value ! * is 512 bytes */ ! #define MAX_BLOOM_LENGTH (256) /* Bloom index options */ typedef struct BloomOptions { int32 vl_len_; /* varlena header (do not touch directly!) */ ! int bloomLength; /* length of signature in uint16 */ ! int bitSize[INDEX_MAX_KEYS]; /* signature bits per index ! * key */ } BloomOptions; /* --- 79,108 #define BLOOM_HEAD_BLKNO (1) /* first data page */ /* ! * We store Bloom signatures as arrays of uint16 words. */ ! typedef uint16 BloomSignatureWord; ! ! #define SIGNWORDBITS ((int) (BITS_PER_BYTE * sizeof(BloomSignatureWord))) ! ! /* ! * Default and maximum Bloom signature length in bits. ! */ ! #define DEFAULT_BLOOM_LENGTH (5 * SIGNWORDBITS) ! #define MAX_BLOOM_LENGTH (256 * SIGNWORDBITS) ! ! /* ! * Default and maximum signature bits generated per index key. ! */ ! #define DEFAULT_BLOOM_BITS 2 ! #define MAX_BLOOM_BITS (MAX_BLOOM_LENGTH - 1) /* Bloom index options */ typedef struct BloomOptions { int32 vl_len_; /* varlena header (do not touch directly!) */ ! int bloomLength; /* length of signature in words (not bits!) */ ! int bitSize[INDEX_MAX_KEYS]; /* signature bits per index key */ } BloomOptions; /* *** typedef struct BloomState *** 143,154 /* * Tuples are very different from all other relations */ - typedef uint16 SignType; - typedef struct BloomTuple { ItemPointerData heapPtr; ! SignType sign[FLEXIBLE_ARRAY_MEMBER]; } BloomTuple; #define BLOOMTUPLEHDRSZ offsetof(BloomTuple, sign) --- 155,164 /* * Tuples are very different from all other relations */ typedef struct BloomTuple { ItemPointerData heapPtr; ! BloomSignatureWord sign[FLEXIBLE_ARRAY_MEMBER]; } BloomTuple; #define BLOOMTUPLEHDRSZ offsetof(BloomTuple, sign) *** typedef struct BloomTuple *** 156,162 /* Opaque data structure for bloom index scan */ typedef struct BloomScanOpaqueData { ! SignType *sign; /* Scan signature */ BloomState state; } BloomScanOpaqueData; --- 166,172 /* Opaque data structure for bloom index scan */ typedef struct BloomScanOpaqueData { ! BloomSignatureWord *sign; /* Scan signature */ BloomState state; } BloomScanOpaqueData; *** extern void BloomFillMetapage(Relation i *** 170,176 extern void BloomInitMetapage(Relation index); extern void BloomInitPage(Page page, uint16 flags); extern Buffer BloomNewBuffer(Relation index); ! extern void signValue(BloomState * state, SignType * sign, Datum value, int attno); extern BloomTuple *BloomFormTuple(BloomState * state, ItemPointer iptr, Datum *values, bool *isnull); extern bool BloomPageAddItem(BloomState * state, Page page, BloomTuple * tuple); --- 180,186 extern void BloomInitMetapage(Relation index); extern void BloomInitPage(Page page, uint16 flags); extern Buffer BloomNewBuffer(Relation index); ! extern void signValue(BloomState * state, BloomSignatureWord * sign, Datum value, int attno); extern BloomTuple *BloomFormTuple(BloomState * state, ItemPointer iptr, Datum *values, bool *isnull); extern bool BloomPageAddItem(BloomState * state, Page page, BloomTuple * tuple); diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c index aebf32a..0c954dc 100644 *** a/contrib/bloom/blscan.c --- b/contrib/bloom/blscan.c *** blgetbitmap(IndexScanDesc scan, TIDBitma *** 93,99 /* New search: have to calculate search signature */ ScanKey skey = scan->keyData; ! so->sign = palloc0(sizeof(SignType) * so->state.opts.bloomLength); for (i = 0; i < scan->numberOfKeys; i++) { --- 93,99 /* New search: have to calculate search signature */ ScanKey skey = scan->keyData; ! so->sign = palloc0(sizeof(BloomSignatureWord) * so->state.opts.bloomLength); for (i = 0; i <
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
Jeff Janeswrites: > Given what a Bloom filter is/does, I'm having a hard time seeing how it > makes much sense to support the boolean type. > My biggest gripe with it at the moment is that the signature size should be > expressed in bits, and then internally rounded up to a multiple of 16, > rather than having it be expressed in 'uint16'. > If that were done it would be easier to fix the documentation to be more > understandable. +1 ... that sort of definition seems much more future-proof, too. IMO it's not too late to change this. (We probably don't want to change the on-disk representation of the reloptions, but we could convert from bits to words in bloptions().) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
On May 24, 2016 5:27 PM, "David G. Johnston"wrote: > > Moving my griping to -hackers only > > On Tue, May 24, 2016 at 8:08 PM, Tom Lane wrote: >> >> dig...@126.com writes: >> > postgres=# create unlogged table u_tbl (id int); >> > CREATE TABLE >> > postgres=# create index idx_u_tbl on u_tbl using bloom (id); >> > ERROR: index "idx_u_tbl" already contains data >> >> Yeah, it looks like nobody ever tested bloom's unlogged-index support; >> it doesn't work or even come very close to working. Will fix, thanks >> for the report! > > > I'll tack on my own gripe here, just because. > > It doesn't give me a lot of confidence in what was committed when the summary sentence for the module says: > > " > bloom is a module which implements an index access method. It comes as an example of custom access methods and generic WAL records usage. But it is also useful in itself. > " > > > Honestly, as a user I couldn't care less that bloom is "an example custom access method". I want to know what it does and that it does so reliably, and has a easy-to-use interface. I complained earlier about its lack of direct support for the boolean type. Teodor's response on the thread wasn't particularly encouraging: > Given what a Bloom filter is/does, I'm having a hard time seeing how it makes much sense to support the boolean type. My biggest gripe with it at the moment is that the signature size should be expressed in bits, and then internally rounded up to a multiple of 16, rather than having it be expressed in 'uint16'. If that were done it would be easier to fix the documentation to be more understandable. On the positive side, I've done extensive crash-recovery testing (not with unlogged tables, obviously) and that part seems solid. Cheers, Jeff
Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table
Moving my griping to -hackers only On Tue, May 24, 2016 at 8:08 PM, Tom Lanewrote: > dig...@126.com writes: > > postgres=# create unlogged table u_tbl (id int); > > CREATE TABLE > > postgres=# create index idx_u_tbl on u_tbl using bloom (id); > > ERROR: index "idx_u_tbl" already contains data > > Yeah, it looks like nobody ever tested bloom's unlogged-index support; > it doesn't work or even come very close to working. Will fix, thanks > for the report! > I'll tack on my own gripe here, just because. It doesn't give me a lot of confidence in what was committed when the summary sentence for the module says: " bloom is a module which implements an index access method. It comes as an example of custom access methods and generic WAL records usage. But it is also useful in itself. " Honestly, as a user I couldn't care less that bloom is "an example custom access method". I want to know what it does and that it does so reliably, and has a easy-to-use interface. I complained earlier about its lack of direct support for the boolean type. Teodor's response on the thread wasn't particularly encouraging: https://www.postgresql.org/message-id/5718a59d.4090...@sigaev.ru I also see that the following -hacker thread didn't get resolved: https://www.postgresql.org/message-id/cakfquwykrepeselfwb0b85dat466lely8ao-okpwaqpwtmg...@mail.gmail.com I would not be surprised to see additional problems crop up in the module. Tom's characterization above just reinforces that. David J.