Re: type cache cleanup improvements
Okay, I've applied this piece for now. Not sure I'll have much room to look at the rest. Thank you very much! Rest of patches, rebased. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/From b30af080d7768c2fdb6198e2e40ef93419f60732 Mon Sep 17 00:00:00 2001 From: Teodor Sigaev Date: Fri, 15 Mar 2024 13:55:10 +0300 Subject: [PATCH 3/3] usage of hash_search_with_hash_value --- src/backend/utils/cache/attoptcache.c | 39 src/backend/utils/cache/typcache.c| 52 +++ 2 files changed, 70 insertions(+), 21 deletions(-) diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c index af978ccd4b1..3a18b2e9a77 100644 --- a/src/backend/utils/cache/attoptcache.c +++ b/src/backend/utils/cache/attoptcache.c @@ -44,12 +44,10 @@ typedef struct /* * InvalidateAttoptCacheCallback - * Flush all cache entries when pg_attribute is updated. + * Flush cache entry (or entries) when pg_attribute is updated. * * When pg_attribute is updated, we must flush the cache entry at least - * for that attribute. Currently, we just flush them all. Since attribute - * options are not currently used in performance-critical paths (such as - * query execution), this seems OK. + * for that attribute. */ static void InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) @@ -57,7 +55,16 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) HASH_SEQ_STATUS status; AttoptCacheEntry *attopt; - hash_seq_init(, AttoptCacheHash); + /* + * By convection, zero hash value is passed to the callback as a sign + * that it's time to invalidate the cache. See sinval.c, inval.c and + * InvalidateSystemCachesExtended(). 
+ */ + if (hashvalue == 0) + hash_seq_init(, AttoptCacheHash); + else + hash_seq_init_with_hash_value(, AttoptCacheHash, hashvalue); + while ((attopt = (AttoptCacheEntry *) hash_seq_search()) != NULL) { if (attopt->opts) @@ -70,6 +77,18 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) } } +/* + * Hash function compatible with two-arg system cache hash function. + */ +static uint32 +relatt_cache_syshash(const void *key, Size keysize) +{ + const AttoptCacheKey* ckey = key; + + Assert(keysize == sizeof(*ckey)); + return GetSysCacheHashValue2(ATTNUM, ckey->attrelid, ckey->attnum); +} + /* * InitializeAttoptCache * Initialize the attribute options cache. @@ -82,9 +101,17 @@ InitializeAttoptCache(void) /* Initialize the hash table. */ ctl.keysize = sizeof(AttoptCacheKey); ctl.entrysize = sizeof(AttoptCacheEntry); + + /* + * AttoptCacheEntry takes hash value from the system cache. For + * AttoptCacheHash we use the same hash in order to speedup search by hash + * value. This is used by hash_seq_init_with_hash_value(). + */ + ctl.hash = relatt_cache_syshash; + AttoptCacheHash = hash_create("Attopt cache", 256, , - HASH_ELEM | HASH_BLOBS); + HASH_ELEM | HASH_FUNCTION); /* Make sure we've initialized CacheMemoryContext. */ if (!CacheMemoryContext) diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c index 7936c3b46d0..9145088f44d 100644 --- a/src/backend/utils/cache/typcache.c +++ b/src/backend/utils/cache/typcache.c @@ -339,6 +339,15 @@ static TupleDesc find_or_make_matching_shared_tupledesc(TupleDesc tupdesc); static dsa_pointer share_tupledesc(dsa_area *area, TupleDesc tupdesc, uint32 typmod); +/* + * Hash function compatible with one-arg system cache hash function. 
+ */ +static uint32 +type_cache_syshash(const void *key, Size keysize) +{ + Assert(keysize == sizeof(Oid)); + return GetSysCacheHashValue1(TYPEOID, ObjectIdGetDatum(*(const Oid*)key)); +} /* * lookup_type_cache @@ -364,8 +373,15 @@ lookup_type_cache(Oid type_id, int flags) ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TypeCacheEntry); + /* + * TypeEntry takes hash value from the system cache. For TypeCacheHash + * we use the same hash in order to speedup search by hash value. This + * is used by hash_seq_init_with_hash_value(). + */ + ctl.hash = type_cache_syshash; + TypeCacheHash = hash_create("Type information cache", 64, - , HASH_ELEM | HASH_BLOBS); + , HASH_ELEM | HASH_FUNCTION); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(mapRelTypeEntry); @@ -421,8 +437,7 @@ lookup_type_cache(Oid type_id, int flags) /* These fields can never change, by definition */ typentry->type_id = type_id; - typentry->type_id_hash = GetSysCacheHashValue1(TYPEOID, - ObjectIdGetDatum(type_id)); + typentry->type_id_hash = get_hash_value(TypeCacheHash, _id); /* Keep this part in sync with the code below */ typentry->typlen = typtup->typlen; @@ -2429,20 +2444,27 @@ TypeCacheTypCallback(Datum arg, int cacheid,
Re: type cache cleanup improvements
One thing that first pops out to me is that we can do the refactor of hash_initial_lookup() as an independent piece, without the extra paths introduced. But rather than returning the bucket hash and having the bucket number as an in/out argument of hash_initial_lookup(), there is an argument for reversing them: hash_search_with_hash_value() does not care about the bucket number. Ok, no problem. 02-hash_seq_init_with_hash_value.v5.patch - introduces a hash_seq_init_with_hash_value() method. hash_initial_lookup() is marked as inline, but I suppose modern compilers are smart enough to inline it automatically. Likely so, though it does not hurt to show the intention to the reader. Agreed. So I would like to suggest the attached patch for this first piece. What do you think? I have no objections. It may also be an idea to use `git format-patch` when generating a series of patches. That makes for easier reviews. Thanks, will try. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: type cache cleanup improvements
I think that this patch should be split for clarity, as there are a few things that are independently useful. I guess something like that: Done; the patches should be applied sequentially. > - The typcache changes. 01-map_rel_to_type.v5.patch adds a map from a relation to its type. - Introduction of hash_initial_lookup(), which simplifies 3 places in dynahash.c where the same code is used. The routine should be inlined. - The split in hash_seq_search to force a different type of search is weird, complicating the dynahash interface by hiding what seems like a search mode. Rather than hasHashvalue that's hidden in the middle of HASH_SEQ_STATUS, could it be better to have an entirely different API for the search? That should be a patch on its own, as well. 02-hash_seq_init_with_hash_value.v5.patch - introduces a hash_seq_init_with_hash_value() method. hash_initial_lookup() is marked as inline, but I suppose modern compilers are smart enough to inline it automatically. Using a separate interface for scanning the hash by hash value would make the scan code uglier in cases where we need to treat a special hash value differently, as is done in the cache scans. Look, instead of this simplified code: if (hashvalue == 0) hash_seq_init(&status, TypeCacheHash); else hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue); while ((typentry = hash_seq_search(&status)) != NULL) { ... } we would need to write something like this: if (hashvalue == 0) { hash_seq_init(&status, TypeCacheHash); while ((typentry = hash_seq_search(&status)) != NULL) { ... } } else { hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue); while ((typentry = hash_seq_search_with_hash_value(&status)) != NULL) { ... } } Or perhaps I didn't understand you. I thought about integrating the check inside the existing loop in hash_seq_search(): + rerun: if ((curElem = status->curEntry) != NULL) { /* Continuing scan of curBucket... 
 */ status->curEntry = curElem->link; if (status->curEntry == NULL) /* end of this bucket */ + { + if (status->hasHashvalue) + hash_seq_term(status); ++status->curBucket; + } + else if (status->hasHashvalue && status->hashvalue != +curElem->hashvalue) + goto rerun; return (void *) ELEMENTKEY(curElem); } But to me it looks weird and adds some checks which will take some CPU time. 03-att_with_hash_value.v5.patch - adds usage of the previous patch. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c index af978ccd4b1..3a18b2e9a77 100644 --- a/src/backend/utils/cache/attoptcache.c +++ b/src/backend/utils/cache/attoptcache.c @@ -44,12 +44,10 @@ typedef struct /* * InvalidateAttoptCacheCallback - * Flush all cache entries when pg_attribute is updated. + * Flush cache entry (or entries) when pg_attribute is updated. * * When pg_attribute is updated, we must flush the cache entry at least - * for that attribute. Currently, we just flush them all. Since attribute - * options are not currently used in performance-critical paths (such as - * query execution), this seems OK. + * for that attribute. */ static void InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) @@ -57,7 +55,16 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) HASH_SEQ_STATUS status; AttoptCacheEntry *attopt; - hash_seq_init(&status, AttoptCacheHash); + /* + * By convention, zero hash value is passed to the callback as a sign + * that it's time to invalidate the cache. See sinval.c, inval.c and + * InvalidateSystemCachesExtended(). 
+ */ + if (hashvalue == 0) + hash_seq_init(, AttoptCacheHash); + else + hash_seq_init_with_hash_value(, AttoptCacheHash, hashvalue); + while ((attopt = (AttoptCacheEntry *) hash_seq_search()) != NULL) { if (attopt->opts) @@ -70,6 +77,18 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) } } +/* + * Hash function compatible with two-arg system cache hash function. + */ +static uint32 +relatt_cache_syshash(const void *key, Size keysize) +{ + const AttoptCacheKey* ckey = key; + + Assert(keysize == sizeof(*ckey)); + return GetSysCacheHashValue2(ATTNUM, ckey->attrelid, ckey->attnum); +} + /* * InitializeAttoptCache * Initialize the attribute options cache. @@ -82,9 +101,17 @@ InitializeAttoptCache(void) /* Initialize the hash table. */ ctl.keysize = sizeof(AttoptCacheKey); ctl.entrysize = sizeof(AttoptCacheEntry); + + /* + * AttoptCacheEntry takes hash value from the system cache. For + * AttoptCacheHash we use the same hash in order to speedup search by hash + * value
Re: type cache cleanup improvements
Got it. Here is an updated patch where I added a corresponding comment. Thank you! Playing around, I found one more place which could easily be modified with a hash_seq_init_with_hash_value() call. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c index af978ccd4b1..28980620662 100644 --- a/src/backend/utils/cache/attoptcache.c +++ b/src/backend/utils/cache/attoptcache.c @@ -44,12 +44,10 @@ typedef struct /* * InvalidateAttoptCacheCallback - * Flush all cache entries when pg_attribute is updated. + * Flush cache entry (or entries) when pg_attribute is updated. * * When pg_attribute is updated, we must flush the cache entry at least - * for that attribute. Currently, we just flush them all. Since attribute - * options are not currently used in performance-critical paths (such as - * query execution), this seems OK. + * for that attribute. */ static void InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) @@ -57,7 +55,16 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) HASH_SEQ_STATUS status; AttoptCacheEntry *attopt; - hash_seq_init(&status, AttoptCacheHash); + /* + * By convention, zero hash value is passed to the callback as a sign + * that it's time to invalidate the cache. See sinval.c, inval.c and + * InvalidateSystemCachesExtended(). + */ + if (hashvalue == 0) + hash_seq_init(&status, AttoptCacheHash); + else + hash_seq_init_with_hash_value(&status, AttoptCacheHash, hashvalue); + while ((attopt = (AttoptCacheEntry *) hash_seq_search(&status)) != NULL) { if (attopt->opts) @@ -70,6 +77,17 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) } } +/* + * Hash function compatible with two-arg system cache hash function. 
+ */ +static uint32 +relatt_cache_syshash(const void *key, Size keysize) +{ + const AttoptCacheKey* ckey = key; + + return GetSysCacheHashValue2(ATTNUM, ckey->attrelid, ckey->attnum); +} + /* * InitializeAttoptCache * Initialize the attribute options cache. @@ -82,9 +100,17 @@ InitializeAttoptCache(void) /* Initialize the hash table. */ ctl.keysize = sizeof(AttoptCacheKey); ctl.entrysize = sizeof(AttoptCacheEntry); + + /* + * AttoptCacheEntry takes hash value from the system cache. For + * AttoptCacheHash we use the same hash in order to speedup search by hash + * value. This is used by hash_seq_init_with_hash_value(). + */ + ctl.hash = relatt_cache_syshash; + AttoptCacheHash = hash_create("Attopt cache", 256, , - HASH_ELEM | HASH_BLOBS); + HASH_ELEM | HASH_FUNCTION); /* Make sure we've initialized CacheMemoryContext. */ if (!CacheMemoryContext)
Re: type cache cleanup improvements
Yep, exactly. One time in 2^32 we reset the whole cache instead of one (or several) entries whose hash value happens to be 0. On 08.03.2024 18:35, Tom Lane wrote: Aleksander Alekseev writes: One thing that I couldn't immediately figure out is why 0 hash value is treated as a magic invalid value in TypeCacheTypCallback(): I've not read this patch, but IIRC in some places we have a convention that hash value zero is passed for an sinval reset event (that is, "flush all cache entries"). regards, tom lane -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: type cache cleanup improvements
I would like to tweak the patch a little bit - change some comments, add some Asserts, etc. Do you mind? You are welcome! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: type cache cleanup improvements
Hi! Thank you for interesting in it! These changes look very promising. Unfortunately the proposed patches conflict with each other regardless the order of applying: ``` error: patch failed: src/backend/utils/cache/typcache.c:356 error: src/backend/utils/cache/typcache.c: patch does not apply ``` Try increase -F option of patch. Anyway, union of both patches in attachment -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c index 0d4d0b0a154..ccf798c440c 100644 --- a/src/backend/utils/cache/typcache.c +++ b/src/backend/utils/cache/typcache.c @@ -77,6 +77,15 @@ /* The main type cache hashtable searched by lookup_type_cache */ static HTAB *TypeCacheHash = NULL; +/* The map from relation's oid to its type oid */ +typedef struct mapRelTypeEntry +{ + Oid typrelid; + Oid type_id; +} mapRelTypeEntry; + +static HTAB *mapRelType = NULL; + /* List of type cache entries for domain types */ static TypeCacheEntry *firstDomainTypeEntry = NULL; @@ -330,6 +339,14 @@ static TupleDesc find_or_make_matching_shared_tupledesc(TupleDesc tupdesc); static dsa_pointer share_tupledesc(dsa_area *area, TupleDesc tupdesc, uint32 typmod); +/* + * Hash function compatible with one-arg system cache hash function. + */ +static uint32 +type_cache_syshash(const void *key, Size keysize) +{ + return GetSysCacheHashValue1(TYPEOID, ObjectIdGetDatum(*(const Oid*)key)); +} /* * lookup_type_cache @@ -355,8 +372,21 @@ lookup_type_cache(Oid type_id, int flags) ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TypeCacheEntry); + /* + * Hash function must be compatible to make possible work + * hash_seq_init_with_hash_value(). 
Hash value in TypeEntry is taken + * from system cache and we use the same (compatible) to use it's hash + * value to speedup search by hash value instead of scanning whole hash + */ + ctl.hash = type_cache_syshash; + TypeCacheHash = hash_create("Type information cache", 64, - , HASH_ELEM | HASH_BLOBS); + , HASH_ELEM | HASH_FUNCTION); + + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(mapRelTypeEntry); + mapRelType = hash_create("Map reloid to typeoid", 64, + , HASH_ELEM | HASH_BLOBS); /* Also set up callbacks for SI invalidations */ CacheRegisterRelcacheCallback(TypeCacheRelCallback, (Datum) 0); @@ -407,8 +437,7 @@ lookup_type_cache(Oid type_id, int flags) /* These fields can never change, by definition */ typentry->type_id = type_id; - typentry->type_id_hash = GetSysCacheHashValue1(TYPEOID, - ObjectIdGetDatum(type_id)); + typentry->type_id_hash = get_hash_value(TypeCacheHash, _id); /* Keep this part in sync with the code below */ typentry->typlen = typtup->typlen; @@ -470,6 +499,24 @@ lookup_type_cache(Oid type_id, int flags) ReleaseSysCache(tp); } + /* + * Add record to map relation->type. We don't bother even of type became + * disconnected from relation, it seems to be impossible, but anyway, + * storing old data is safe, in a worstest case we will just do an extra + * cleanup cache entry. + */ + if (OidIsValid(typentry->typrelid) && typentry->typtype == TYPTYPE_COMPOSITE) + { + mapRelTypeEntry *relentry; + + relentry = (mapRelTypeEntry*) hash_search(mapRelType, + >typrelid, + HASH_ENTER, NULL); + + relentry->typrelid = typentry->typrelid; + relentry->type_id = typentry->type_id; + } + /* * Look up opclasses if we haven't already and any dependent info is * requested. 
@@ -2264,6 +2311,37 @@ SharedRecordTypmodRegistryAttach(SharedRecordTypmodRegistry *registry) CurrentSession->shared_typmod_table = typmod_table; } +static void +invalidateCompositeTypeCacheEntry(TypeCacheEntry *typentry) +{ + /* Delete tupdesc if we have it */ + if (typentry->tupDesc != NULL) + { + /* + * Release our refcount, and free the tupdesc if none remain. + * (Can't use DecrTupleDescRefCount because this reference is + * not logged in current resource owner.) + */ + Assert(typentry->tupDesc->tdrefcount > 0); + if (--typentry->tupDesc->tdrefcount == 0) + FreeTupleDesc(typentry->tupDesc); + typentry->tupDesc = NULL; + + /* + * Also clear tupDesc_identifier, so that anything watching + * that will realize that the tupdesc has possibly changed. + * (Alternatively, we could specify that to detect possible + * tupdesc change, one must check for tupDesc != NULL as well + * as tupDesc_identifier being the same as what was previously + * seen. That seems error-prone.) + */ + typentry->tupDesc_identifier = 0; + } + + /* Reset equality/comparison/hashing validity information */ + typentry->flags &= ~TCFLAGS_OPERATOR_FLAGS; +} + /* * TypeCacheRe
type cache cleanup improvements
Hi! I'd like to suggest two independent patches to improve the performance of type cache cleanup. I found a case where type cache cleanup was the cause of low performance: in short, a customer created 100 thousand temporary tables in one transaction. 1 mapRelType.patch It just adds a local map between a relation and its type, as suggested in the comment above TypeCacheRelCallback(). Unfortunately, using the syscache here was impossible, because this callback can be called outside a transaction, which makes catalog lookups impossible. 2 hash_seq_init_with_hash_value.patch TypeCacheTypCallback() loops over the type hash to find the entry with a given hash value. There are two problems here: 1) dynahash has no interface to search for an entry with a given hash value, and 2) the hash calculation algorithm differs from the system cache's, yet the incoming hash value comes from the system cache. The patch addresses both issues. It introduces a hash_seq_init_with_hash_value() call, which initializes a hash sequential scan over the single bucket that could contain an entry with the given hash value, so that hash_seq_search() iterates only over such entries. The patch also changes the hash algorithm to match the syscache. Along the way, the patch does a small refactoring of dynahash, extracting a common function hash_do_lookup() that does the initial lookup in the hash. 
Some artificial performance test is in attachment, command to test is 'time psql < custom_types_and_array.sql', here I show only last rollback time and total execution time: 1) master 92d2ab7554f92b841ea71bcc72eaa8ab11aae662 Time: 33353,288 ms (00:33,353) psql < custom_types_and_array.sql 0,82s user 0,71s system 1% cpu 1:28,36 total 2) mapRelType.patch Time: 7455,581 ms (00:07,456) psql < custom_types_and_array.sql 1,39s user 1,19s system 6% cpu 41,220 total 3) hash_seq_init_with_hash_value.patch Time: 24975,886 ms (00:24,976) psql < custom_types_and_array.sql 1,33s user 1,25s system 3% cpu 1:19,77 total 4) both Time: 89,446 ms psql < custom_types_and_array.sql 0,72s user 0,52s system 10% cpu 12,137 total -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c index f411e33b8e7..49e27ca8ba4 100644 --- a/src/backend/utils/cache/typcache.c +++ b/src/backend/utils/cache/typcache.c @@ -331,6 +331,14 @@ static TupleDesc find_or_make_matching_shared_tupledesc(TupleDesc tupdesc); static dsa_pointer share_tupledesc(dsa_area *area, TupleDesc tupdesc, uint32 typmod); +/* + * Hash function compatible with one-arg system cache hash function. + */ +static uint32 +type_cache_syshash(const void *key, Size keysize) +{ + return GetSysCacheHashValue1(TYPEOID, ObjectIdGetDatum(*(const Oid*)key)); +} /* * lookup_type_cache @@ -356,8 +364,16 @@ lookup_type_cache(Oid type_id, int flags) ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TypeCacheEntry); + /* + * Hash function must be compatible to make possible work + * hash_seq_init_with_hash_value(). 
Hash value in TypeEntry is taken + * from system cache and we use the same (compatible) to use it's hash + * value to speedup search by hash value instead of scanning whole hash + */ + ctl.hash = type_cache_syshash; + TypeCacheHash = hash_create("Type information cache", 64, - , HASH_ELEM | HASH_BLOBS); + , HASH_ELEM | HASH_FUNCTION); /* Also set up callbacks for SI invalidations */ CacheRegisterRelcacheCallback(TypeCacheRelCallback, (Datum) 0); @@ -408,8 +424,7 @@ lookup_type_cache(Oid type_id, int flags) /* These fields can never change, by definition */ typentry->type_id = type_id; - typentry->type_id_hash = GetSysCacheHashValue1(TYPEOID, - ObjectIdGetDatum(type_id)); + typentry->type_id_hash = get_hash_value(TypeCacheHash, _id); /* Keep this part in sync with the code below */ typentry->typlen = typtup->typlen; @@ -2359,20 +2374,21 @@ TypeCacheTypCallback(Datum arg, int cacheid, uint32 hashvalue) TypeCacheEntry *typentry; /* TypeCacheHash must exist, else this callback wouldn't be registered */ - hash_seq_init(, TypeCacheHash); + if (hashvalue == 0) + hash_seq_init(, TypeCacheHash); + else + hash_seq_init_with_hash_value(, TypeCacheHash, hashvalue); + while ((typentry = (TypeCacheEntry *) hash_seq_search()) != NULL) { - /* Is this the targeted type row (or it's a total cache flush)? */ - if (hashvalue == 0 || typentry->type_id_hash == hashvalue) - { - /* - * Mark the data obtained directly from pg_type as invalid. Also, - * if it's a domain, typnotnull might've changed, so we'll need to - * recalculate its constraints. - */ - typentry->flags &= ~(TCFLAGS_HAVE_PG_TYPE_DATA | - TCFLAGS_CHECKED_DOMAIN_CONSTRAINTS); - } + Assert(hashvalue == 0 || typentry->type_id_hash == hashvalue); + /* + * Ma
Re: CREATE TABLE ( .. STORAGE ..)
Hi! Are they both set to name or ColId? Although they are the same. Thank you, fixed, that was just an oversight. 2 For ColumnDef new member storage_name, did you miss the function _copyColumnDef() _equalColumnDef()? Thank you, fixed Regards Wenjing 2021年12月27日 15:51,Teodor Sigaev 写道: Hi! Working on pluggable toaster (mostly, for JSONB improvements, see links below) I had found that STORAGE attribute on column is impossible to set in CREATE TABLE command but COMPRESS option is possible. It looks unreasonable. Suggested patch implements this possibility. [1] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfnyc-2021.pdf [2] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgvision-2021.pdf [3] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfonline-2021.pdf [4] http://www.sai.msu.su/~megera/postgres/talks/bytea-pgconfonline-2021.pdf PS I will propose pluggable toaster patch a bit later -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ create_table_storage-v2.patch.gz Description: application/gzip
Re: Pluggable toaster
I agree ... but I'm also worried about what happens when we have multiple table AMs. One can imagine a new table AM that is specifically optimized for TOAST which can be used with an existing heap table. One can imagine a new table AM for the main table that wants to use something different for TOAST. So, I don't think it's right to imagine that the choice of TOASTer depends solely on the column data type. I'm not really sure how this should work exactly ... but it needs careful thought. Right, that's why we propose a validate method (maybe it's the wrong name, but I don't know a better one) which accepts several arguments, one of which is the table AM OID. If that method returns false, then the toaster isn't usable with the current TAM, storage, and/or compression kinds, etc. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Pluggable toaster
Hi! Maybe doing that kind of compression in TOAST is somehow simpler, but I don't see it. It seems that, in an ideal world, compression should live inside the toaster. 2 Current toast storage stores chunks in the heap access method, and to provide fast access by toast id it builds an index. Ideas: - store chunks directly in a btree; pgsql's btree already has INCLUDE columns, so chunks and visibility data would be stored only in leaf pages. This obviously reduces the number of disk accesses for "untoasting". - use another access method for chunk storage Maybe, but that probably requires more thought - e.g. btree requires the values to be less than 1/3 page, so I wonder how would that play with toasting of values. That's ok, because the chunk size is 2000 bytes right now, and it could be preserved. Seems you'd need a mapping table, to allow M:N mapping between types and toasters, linking it to all "compatible" types. It's not clear to me how would this work with custom data types, domains etc. If a toaster looks into the internal structure, then it must know the type's binary format, so new custom types have little chance of working with an old custom toaster. The default toaster works with any type. Also, what happens to existing values when you change the toaster? What if the toasters don't use the same access method to store the chunks (heap vs. btree)? And so on. varatt_custom contains the OID of the toaster, and a toaster is not allowed to be deleted (at least, in the suggested patches). So, if a column's toaster has been changed, old values will be detoasted by the toaster pointed to in the varatt_custom structure, not the one in the column definition. This is very similar to how the storage attribute works: when we alter the storage attribute, only new values will be stored with the specified storage type. More thought: Now postgres has two options for a column, storage and compression, and now we add a toaster. To me that seems too redundant. It seems storage should be a binary value: inplace (plain, as now) and toastable. 
All other variations, such as the toast limit, compression enabling, and compression kind, should be per-column options for the toaster (that's why we suggest a valid toaster oid for any column with a varlena/toastable datatype). It looks like a good abstraction, but we will have a problem with backward compatibility, and I'm afraid I can't implement it very fast. So you suggest we move all of this to toaster? I'd say -1 to that, because it makes it much harder to e.g. add custom compression method, etc. Hmm, I suggested leaving only the toaster at the upper level. The compression kind could be chosen in the toaster's options (not implemented yet), or we could even make an API interface to compression to make it configurable. Right now a module developer cannot implement a module with a new compression method, and that is a disadvantage. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Pluggable toaster
I'd like to ask your opinion on the next small questions: 1) Maybe we could add a toasteroid to pg_type for the default toaster of a datatype: ALTER TYPE type SET DEFAULT TOASTER toaster; 2) The name of the default toaster is deftoaster, which was chosen at night; maybe heap_toaster is better? ("heap" because the default toaster stores chunks in a heap table.) Thank you! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Pluggable toaster
In my understanding, we want to be able to 1. Access data from a toasted object one slice at a time, by using knowledge of the structure 2. If toasted data is updated, then update a minimum number of slices, without rewriting the existing slices 3. If toasted data is expanded, then allow new slices to be appended to the object without rewriting the existing slices There are more options: 1 Share common parts not only between versions of a row but between all rows in a column. Seems strange, but here are examples: - URLs often have a common prefix, so storing them in a prefix tree (as SP-GiST does) can significantly decrease the storage size - the same for JSON - it is a common case that values share part of their hierarchical structure - one more use case for JSON: if the JSON values use only a few schemas (structures), it is possible to store only the values in toast storage and omit the keys and structure 2 Current toast storage stores chunks in the heap access method, and to provide fast access by toast id it builds an index. Ideas: - store chunks directly in a btree; pgsql's btree already has INCLUDE columns, so chunks and visibility data would be stored only in leaf pages. This obviously reduces the number of disk accesses for "untoasting". - use another access method for chunk storage ISTM that we would want the toast algorithm to be associated with the datatype, not the column? Can you explain your thinking? Hm. I'll try to explain my motivation. 1) A datatype could have more than one suitable toaster, for different use cases: fast retrieval, compact storage, fast updates, etc. As I said above, for jsonb there are several optimal strategies for toasting: for values with a few different structures, for close-to-hierarchical structures, for values whose parts differ by access mode (it is easy to imagine JSON with some keys used for search and some keys only for output to the user) 2) A toaster could be designed to work with different data types. 
The suggested appendable toaster is designed to work with bytea but could work with text. Looking at this point, I have doubts about where to store the connection between a toaster and a datatype. If we add a toasteroid to pg_type, how do we deal with several toasters for one datatype? (And we could want to have different toasters in one table!) If we add a typoid to pg_toaster, then how will it work with several datatypes? The idea of adding a new many-to-many connection table seems workable, but it raises further questions, such as: will any toaster work with any table access method? To resolve this bundle of questions we propose a validate() method of the toaster, which should be called during DDL operations, i.e. when a toaster is assigned to a column or a column's datatype is changed. More thought: Now postgres has two options for a column, storage and compression, and now we add a toaster. To me that seems too redundant. It seems storage should be a binary value: inplace (plain, as now) and toastable. All other variations, such as the toast limit, compression enabling, and compression kind, should be per-column options for the toaster (that's why we suggest a valid toaster oid for any column with a varlena/toastable datatype). It looks like a good abstraction, but we will have a problem with backward compatibility, and I'm afraid I can't implement it very fast. We already have the Expanded toast format, in-memory, which was designed specifically to allow us to access sub-structure of the datatype in-memory. So I was expecting to see an Expanded, on-disk, toast format that roughly matched that concept, since Tom has already shown us the way (varatt_expanded). This would be usable by both JSON and PostGIS. Hm, I don't understand. varatt_custom has a variable-length tail which the toaster can use in any way it likes; the appendable toaster uses it to store the appended tail. 
Some other thoughts: I imagine the data type might want to keep some kind of dictionary inside the main toast pointer, so we could make allowance for an optional datatype-specific private area in the toast pointer itself, allowing a mix of inline and out-of-line data, and/or a table of contents to the slices. I'm thinking we could also tackle these things at the same time: * We want to expand TOAST to 64-bit pointers, so we can have more pointers in a table. * We want to avoid putting the data length into the toast pointer, so we can allow the toasted data to be expanded without rewriting everything (to avoid O(N^2) cost). Right -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
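The prefix-sharing idea discussed in this thread (URLs with a common prefix, as SP-GiST exploits) can be quantified with a toy sketch. This is purely illustrative, with hypothetical data and function names; it is not the proposed on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix shared by all strings in the array. */
static size_t
common_prefix_len(const char **strs, int n)
{
    size_t      len = strlen(strs[0]);

    for (int i = 1; i < n; i++)
    {
        size_t      j = 0;

        while (j < len && strs[i][j] != '\0' && strs[i][j] == strs[0][j])
            j++;
        len = j;
    }
    return len;
}

/* Bytes needed if the shared prefix is stored once instead of per row. */
static size_t
shared_size(const char **strs, int n)
{
    size_t      prefix = common_prefix_len(strs, n);
    size_t      total = prefix;     /* the prefix is stored exactly once */

    for (int i = 0; i < n; i++)
        total += strlen(strs[i]) - prefix;  /* only the suffix per row */
    return total;
}
```

With many rows sharing a long prefix, `shared_size` approaches the sum of the suffix lengths, which is the kind of saving the prefix-tree storage aims at.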
Pluggable toaster
Hi! We are working on a custom toaster for JSONB [1], because the current TOAST is universal for any data type and because of that it has some disadvantages: - "one toast fits all" may not be the best solution for a particular type or/and use case; - it doesn't know the internal structure of the data type, so it cannot choose an optimal toast strategy; - it can't share common parts between different rows or even between versions of a row. Modifying the current toaster for all tasks and cases looks too complex; moreover, it would not work for custom data types. Postgres is an extensible database, so why not extend its extensibility even further, to have pluggable TOAST! We propose to separate the toaster from the heap using a toaster API, similar to the table AM API etc. The following patches are applicable over the patch in [1]. 1) 1_toaster_interface_v1.patch.gz https://github.com/postgrespro/postgres/tree/toaster_interface Introduces syntax for storage and a formal toaster API. Adds a column atttoaster to pg_attribute; by design this column should not be the invalid oid for any toastable datatype, i.e. it must hold a correct oid for any type (not column) with non-plain storage. Since a toaster may support only particular datatypes, the core should check the correctness of the assigned toaster via the toaster's validate method. The new commands can be found in src/test/regress/sql/toaster.sql. The on-disk toast pointer structure now has one more possible struct, varatt_custom, with a fixed header and a variable tail which is used as storage by custom toasters. The format of the built-in toaster is kept, to allow simple pg_upgrade logic. Since a column's toaster could be changed during the table's lifetime, we had two options for the toaster drop operation: - if a column's toaster has been changed, then we need to re-toast all values, which could be extremely expensive. 
In any case, functions/operators should be ready to work with values toasted by different toasters, although any toaster must support the simple toast/detoast operations, which allows any existing code to work with the new approach. Tracking dependencies between toasters and rows looks like a bad idea. - disallow dropping a toaster. We don't believe that there will be many toasters at the same time (the number of AMs isn't very high either, and we don't believe that will change significantly in the near future), so prohibiting the dropping of toasters looks reasonable. In this patch set we chose the second option. The toaster API includes a get_vtable method, which is intended to give access to custom toaster features not covered by this API. The idea is that the toaster returns a structure with some values and/or pointers to the toaster's methods, and the caller can use it for particular purposes; see patch 4). The kind of structure is identified by a magic number, which should be the first field of the structure. Also added contrib/dummy_toaster to simplify checking. psql/pg_dump are modified to support the toaster object concept. 2) 2_toaster_default_v1.patch.gz https://github.com/postgrespro/postgres/tree/toaster_default The built-in toaster is reimplemented (with some refactoring) using the toaster API as a generic (or default) toaster. dummy_toaster here is a minimal workable example: it saves the value directly in the toast pointer and fails if the value is greater than 1kB. 3) 3_toaster_snapshot_v1.patch.gz https://github.com/postgrespro/postgres/tree/toaster_snapshot This patch implements a technique to distinguish row versions in toasted values, to share common parts of toasted values between different versions of rows. 4) 4_bytea_appendable_toaster_v1.patch.gz https://github.com/postgrespro/postgres/tree/bytea_appendable_toaster A contrib module implementing a toaster for non-compressed bytea columns, which allows fast appending to an existing bytea value. The appended tail is stored directly in the toast pointer if there is enough room. 
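The routine-table shape of the proposed API (validate, toast/detoast, get_vtable) can be caricatured as follows. The names, types, and signatures here are mine, not the ones from 1_toaster_interface_v1.patch; the behaviour mimics the described dummy_toaster, which stores the value in the pointer and fails above 1kB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical caricature of a toaster routine table, analogous to the
 * table AM routine struct.  The magic number plays the role of the one
 * returned via get_vtable in the proposal.
 */
typedef struct ToasterRoutine
{
    uint32_t    magic;          /* identifies the vtable flavour */
    bool        (*validate) (uint32_t typeoid); /* may this toaster serve the type? */
    bool        (*toast) (const char *value, size_t len);   /* false = cannot toast */
} ToasterRoutine;

/* dummy_toaster behaviour: accept any type, keep the value inline, fail > 1kB */
static bool
dummy_validate(uint32_t typeoid)
{
    (void) typeoid;
    return true;                /* the dummy accepts any varlena type */
}

static bool
dummy_toast(const char *value, size_t len)
{
    (void) value;
    return len <= 1024;         /* value must fit into the toast pointer */
}

static const ToasterRoutine dummy_toaster_routine = {
    0xD0A57E12,                 /* arbitrary magic for the sketch */
    dummy_validate,
    dummy_toast
};
```

A DDL-time check would call `validate` for the column's type and reject the assignment if it returns false, which is the role the mail assigns to the validate() method.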
Note: the patch modifies byteacat() to support the contrib toaster. This looks ugly; the contrib module should probably create a new concatenation function instead. We are open to any questions, discussions, objections and advice. Thank you. People behind: Oleg Bartunov Nikita Gluhov Nikita Malakhov Teodor Sigaev [1] https://www.postgresql.org/message-id/flat/de83407a-ae3d-a8e1-a788-920eb334f...@sigaev.ru -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 4_bytea_appendable_toaster_v1.patch.gz Description: application/gzip 3_toaster_snapshot_v1.patch.gz Description: application/gzip 2_toaster_default_v1.patch.gz Description: application/gzip 1_toaster_interface_v1.patch.gz Description: application/gzip
CREATE TABLE ( .. STORAGE ..)
Hi! While working on the pluggable toaster (mostly for JSONB improvements, see the links below) I found that the STORAGE attribute of a column cannot be set in the CREATE TABLE command, although the COMPRESSION option can. That looks unreasonable. The suggested patch implements this possibility. [1] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfnyc-2021.pdf [2] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgvision-2021.pdf [3] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfonline-2021.pdf [4] http://www.sai.msu.su/~megera/postgres/talks/bytea-pgconfonline-2021.pdf PS I will propose the pluggable toaster patch a bit later -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ create_table_storage-v1.patch.gz Description: application/gzip
Re: On login trigger: take three
counter, as proposed upthread, since updating that can be made safer. The current coding also isn't instructing concurrent backends to perform a relcache rebuild. I think there are pros and cons to each possible approach, but I prefer the current way (which I have tweaked a bit) for reasons similar to those explained by the original patch author when debating whether to use reference counting instead, in the discussion upthread (e.g. it keeps it all in one place). Also, it seems to be more in line with the documented reason why that pg_database flag was added in the first place. I have debugged a few concurrent scenarios with the current mechanism in place. If you still dislike the logic for resetting the flag, please elaborate on the issues you foresee and one of the alternative approaches can be tried. I've attached the updated patch. Regards, Greg Nancarrow Fujitsu Australia -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 2.sql Description: application/sql
aggregate crash
Hi! Found a crash on a production instance; an assert-enabled build crashes in a pfree() call, with the default config. v11, v12 and head are affected, but it seems you need to be a bit lucky. The bug is that old and new aggregate pass-by-ref values are compared only by the pointer value itself, ignoring the null flag. A function which returns null doesn't care about the actual Datum value it returns, so that comparison isn't enough. The test case shows the bug with ExecInterpExpr(), but there are several similar places (thanks to Nikita Glukhov for help). The attached patch adds a check of the null flag. How to reproduce: http://sigaev.ru/misc/xdump.sql.bz2 bzcat xdump.sql.bz2 | psql postgres && psql postgres < x.sql Backtrace from v12 (note, newValue and oldValue differ on the current call, but oldValue points into pfreed memory): #0 0x00c8405a in GetMemoryChunkContext (pointer=0x80a808250) at ../../../../src/include/utils/memutils.h:130 130 AssertArg(MemoryContextIsValid(context)); (gdb) bt #0 0x00c8405a in GetMemoryChunkContext (pointer=0x80a808250) at ../../../../src/include/utils/memutils.h:130 #1 0x00c85ae5 in pfree (pointer=0x80a808250) at mcxt.c:1058 #2 0x0080475e in ExecAggTransReparent (aggstate=0x80a806370, pertrans=0x80a87e830, newValue=34535940744, newValueIsNull=false, oldValue=34535932496, oldValueIsNull=false) at execExprInterp.c:4209 #3 0x007ff51f in ExecInterpExpr (state=0x80a87f4d8, econtext=0x80a8065a8, isnull=0x7fffd7b7) at execExprInterp.c:1747 #4 0x0082c12b in ExecEvalExprSwitchContext (state=0x80a87f4d8, econtext=0x80a8065a8, isNull=0x7fffd7b7) at ../../../src/include/executor/executor.h:308 #5 0x0082bc0f in advance_aggregates (aggstate=0x80a806370) at nodeAgg.c:679 #6 0x0082b8a6 in agg_retrieve_direct (aggstate=0x80a806370) at nodeAgg.c:1847 #7 0x00828782 in ExecAgg (pstate=0x80a806370) at nodeAgg.c:1572 #8 0x0080e712 in ExecProcNode (node=0x80a806370) at ../../../src/include/executor/executor.h:240 #9 0x0080a4a1 in ExecutePlan (estate=0x80a806120, planstate=0x80a806370, 
use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x80a851cc0, execute_once=true) at execMain.c:1646 #10 0x0080a362 in standard_ExecutorRun (queryDesc=0x80a853120, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364 #11 0x0080a114 in ExecutorRun (queryDesc=0x80a853120, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308 #12 0x00a79d6f in PortalRunSelect (portal=0x80a70d120, forward=true, count=0, dest=0x80a851cc0) at pquery.c:929 #13 0x00a79807 in PortalRun (portal=0x80a70d120, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x80a851cc0, altdest=0x80a851cc0, completionTag=0x7fffdc30 "") at pquery.c:770 #14 0x00a74e49 in exec_simple_query ( query_string=0x800d02950 "SELECT\nT1._Q_001_F_000,\nT1._Q_001_F_001,\nT1._Q_001_F_002RRef,\nT1._Q_001_F_003RRef,\nT1._Q_001_F_004RRef,\nT1._Q_001_F_005RRef,\nMAX(CASE WHEN (T1._Q_001_F_010 > CAST(0 AS NUMERIC)) THEN T2._Q_001_F_009RR"...) 
at postgres.c:1227 #15 0x00a74123 in PostgresMain (argc=1, argv=0x80a6ef8f0, dbname=0x80a6ef850 "postgres", username=0x80a6ef830 "teodor") at postgres.c:4291 #16 0x009a4c3b in BackendRun (port=0x80a6e6000) at postmaster.c:4498 #17 0x009a403a in BackendStartup (port=0x80a6e6000) at postmaster.c:4189 #18 0x009a2f63 in ServerLoop () at postmaster.c:1727 #19 0x009a0a0a in PostmasterMain (argc=3, argv=0x7fffe3c8) at postmaster.c:1400 #20 0x0088deef in main (argc=3, argv=0x7fffe3c8) at main.c:210 -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ x.sql Description: application/sql diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c index 034970648f3..3b5333716d4 100644 --- a/src/backend/executor/execExprInterp.c +++ b/src/backend/executor/execExprInterp.c @@ -1743,7 +1743,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull) * expanded object that is already a child of the aggcontext, * assume we can adopt that value without copying it. */ - if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue)) + if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue) || +fcinfo->isnull != pergroup->transValueIsNull) newVal = ExecAggTransReparent(aggstate, pertrans, newVal, fcinfo->isnull, pergroup->transValue, diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 6ee24eab3d2..e848853d6fd 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/node
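The hazard that the fix above addresses can be demonstrated standalone. This is a hedged sketch with simplified stand-ins (the `needs_reparent_*` helpers and the `Datum` typedef are mine, not the real executor code): when the transition function returns null, the Datum bits it leaves behind are meaningless and may happen to equal the old pointer, so a pointer-only comparison wrongly concludes nothing changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for the real typedef */

/*
 * Simplified versions of the check before ExecAggTransReparent():
 * should the new transition value be adopted/copied?
 */
static bool
needs_reparent_buggy(Datum newVal, bool newIsNull,
                     Datum oldVal, bool oldIsNull)
{
    (void) newIsNull;
    (void) oldIsNull;
    return newVal != oldVal;    /* pointer comparison only: the bug */
}

static bool
needs_reparent_fixed(Datum newVal, bool newIsNull,
                     Datum oldVal, bool oldIsNull)
{
    /* the patch: also notice a null-flag transition */
    return newVal != oldVal || newIsNull != oldIsNull;
}
```

When the pointers coincide but the null flags differ, the buggy check skips the reparenting step, leaving the state pointing at memory that later gets pfree'd, which matches the crash in the backtrace.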
Re: Failure in contrib test _int on loach
Hi! So, if other hackers agree with my reasoning, the suggested fix is sufficient and can be committed. The patch looks right, but I think the comment should be improved in the following piece: if (stack->blkno != GIST_ROOT_BLKNO && - stack->parent->lsn < GistPageGetNSN(stack->page)) + ((stack->parent->lsn < GistPageGetNSN(stack->page)) || +stack->retry_from_parent == true)) { /* * Concurrent split detected. There's no guarantee that the A concurrent split is no longer the only thing detected here, and the comment was missed long ago. This patch seems a good chance to fix it. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [PATCH] btree_gist: fix union implementation for variable length columns
It would be easier to figure this out if the btree_gist code weren't so desperately undocumented. Teodor, do you remember why it's like this? Will look. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: cost_sort() improvements
At least [1] and [2] ran into those issues and raised objections/questions about the correctness of the sort cost estimation. The suggested patch tries to improve the current estimation and solve those issues. Sorry for the long delay, but the issue turned out to be even more complicated than I thought. When I tried to add cost_sort to the GROUP BY patch, I found it didn't work as well as I had hoped :(. The root of the problem is that the suggested cost_sort improvement doesn't take into account the uniqueness of the first column, so its cost is always the same. I experimented with an all-identical column versus an all-unique column and found that execution time can differ significantly (up to 3 times on 1e7 randomly generated integer values). So I went off to read books and papers. I suggest a new algorithm based on ideas in [3], [4]. In the terms of those papers, let me use the designations from my previous email, and let Xi be the number of tuples with key Ki; then the estimation is: log(N! / (X1! * X2! * ..)) ~ sum(Xi * log(N/Xi)) In our case all Xi are the same, because for now we don't have an estimate of group sizes, only an estimate of the number of groups. In this case the formula becomes: N * log(NumberOfGroups). Next, to support correct estimation of multi-column sorts we need to account for each column separately. Let k be a column number and Gk the number of groups defined by k columns: N * sum( FCk * log(Gk) ) where FCk is sum(Cj) over k columns. Gk+1 is defined as estimate_num_groups(NGk), i.e. it's recursive; that means the comparison of the k-th column includes comparisons of all j-th columns with j < k. Next, I found that this formula gives a significant underestimate for N < 1e4, and using [5] (see Chapter 8.2 and Theorem 4.1) I found that we can use the formula N + N * log(N), which effectively adds a multiplier in the simple case; but it's unclear to me how to add that multiplier to the multi-column formula, so I just added a separate term proportional to N. As a demonstration, I attach the results of some tests; *sh and *plt are scripts to reproduce them. 
Note, all charts are normalized because the computed cost is not in milliseconds. p.pl is a parser for the JSON format of explain analyze. N test - sort unique values with different numbers of tuples. NN test - the same as the previous one, but sorting a single value (all values the same). U test - fixed number of total values (1e7) but varying numbers of unique values. I hope the new estimation is much closer to real sort timing. The new estimation function gives results close to the old estimation function on trivial examples, but ~10% more expensive, and three regression tests don't pass; I will look closer later. The patch doesn't include regression test changes. [1] https://commitfest.postgresql.org/18/1124/ [2] https://commitfest.postgresql.org/18/1651/ [3] https://algs4.cs.princeton.edu/home/, course Algorithms, Robert Sedgewick, Kevin Wayne, [4] "Quicksort Is Optimal For Many Equal Keys", Sebastian Wild, arXiv:1608.04906v4 [cs.DS] 1 Nov 2017 [5] "Introduction to algorithms", Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, ISBN 0-07-013143-0 -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index a2a7e0c520..3588cff802 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -1612,6 +1612,252 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm) rterm->pathtarget->width); } +/* + * is_fake_var + * Workaround for generate_append_tlist() which generates fake Vars with + * varno == 0, which would cause the estimate_num_groups() call to fail + */ +static bool +is_fake_var(Expr *expr) +{ + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + return (IsA(expr, Var) && ((Var *) expr)->varno == 0); +} + +/* + * get_width_cost_multiplier + * Returns the relative complexity of comparing two values based on their width. + * The idea behind: long values have more expensive comparisons. 
Return value is + * in cpu_operator_cost units. + */ +static double +get_width_cost_multiplier(PlannerInfo *root, Expr *expr) +{ + double width = -1.0; /* fake value */ + + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + /* Try to find actual stat in corresponding relation */ + if (IsA(expr, Var)) + { + Var *var = (Var *) expr; + + if (var->varno > 0 && var->varno < root->simple_rel_array_size) + { + RelOptInfo *rel = root->simple_rel_array[var->varno]; + + if (rel != NULL && +var->varattno >= rel->min_attr && +var->varattno <= rel->max_attr) + { +int ndx = var->varattno - rel->min_attr; + +if (rel->attr_widths[ndx] > 0) + width = rel->attr_widths[ndx]; + } + } + } + + /* Didn't find any actual stats, use estimati
Re: POC: GROUP BY optimization
I tried to attack the cost_sort() issues and hope that on that basis we can solve the problems with the 0002 patch and improve the incremental sort patch. OK, will do. Thanks for working on this! I hope we now have a better cost_sort(). The obvious way is to try all combinations of pathkeys in get_cheapest_group_keys_order() and choose the cheapest one by cost_sort(). But that requires N! operations, which could be very expensive for a large number of pathkeys, and it doesn't solve the issue with user-knows-what-he-does pathkeys. We could suggest an order of pathkeys as the patch does now, and if the cost_sort() estimate is less than 80% (arbitrarily chosen) of the cost of the user-suggested pathkeys, then use ours, else the user's pathkeys. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
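The 80% heuristic described above can be sketched as a tiny decision function. Names and the enum are mine, not from any patch; 0.8 is the arbitrarily chosen threshold from the mail.

```c
#include <assert.h>

/*
 * Hedged sketch: keep the user-written GROUP BY order unless the suggested
 * reordering is estimated to be at least 20% cheaper by cost_sort().
 */
typedef enum OrderChoice
{
    USE_USER_ORDER,
    USE_SUGGESTED_ORDER
} OrderChoice;

static OrderChoice
choose_group_keys_order(double suggested_cost, double user_cost)
{
    if (suggested_cost < 0.8 * user_cost)
        return USE_SUGGESTED_ORDER; /* clearly cheaper: override the user */
    return USE_USER_ORDER;          /* within the margin: respect the user */
}
```

The margin gives the user-specified order the benefit of the doubt when the two estimates are close, which is the compromise between pure costing and user intent discussed in the thread.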
Re: cost_sort() improvements
Peter Geoghegan wrote: On Thu, Jun 28, 2018 at 9:47 AM, Teodor Sigaev wrote: Current estimation of sort cost has following issues: - it doesn't differ one and many columns sort - it doesn't pay attention to comparison function cost and column width - it doesn't try to count number of calls of comparison function on per column basis I've been suspicious of the arbitrary way in which I/O for external sorts is costed by cost_sort() for a long time. I'm not 100% sure about how we should think about this question, but I am sure that it needs to be improved in *some* way. Agreed, but I think attacking this issue should be a separate patch. And I don't have an idea how to do it, at least right now. Nevertheless, I hope the issue of estimating disk-based sorts isn't a blocker for the CPU cost estimation improvements. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
I'm afraid that's a misunderstanding - my objections did not really go away. I'm just saying my claim that we're not messing with order of grouping keys is not entirely accurate, because we do that in one particular case. I still think we need to be careful when introducing new optimizations in this area - reordering the grouping keys by ndistinct, ORDER BY or whatever. In particular I don't think we should commit these patches that may quite easily cause regressions, and then hope some hypothetical future patch fixes the costing. Those objections still stand. Please have a look at https://commitfest.postgresql.org/18/1706/ where I tried to attack the cost_sort() issues; I hope on that basis we can solve the problems with the 0002 patch and improve the incremental sort patch. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
cost_sort() improvements
Hi! The current estimation of sort cost has the following issues: - it doesn't distinguish single-column from multi-column sorts; - it doesn't pay attention to comparison function cost and column width; - it doesn't try to count the number of comparison function calls on a per-column basis. At least [1] and [2] ran into those issues and raised objections/questions about the correctness of the sort cost estimation. The suggested patch tries to improve the current estimation and solve those issues. Basic ideas: - Let me introduce some designations: i - index of a column, starting with 1; N - number of tuples to sort; Gi - number of groups defined by the first i columns. Obviously, G0 == 1. The number of groups is calculated by estimate_num_groups(). NGi - number of tuples in one group. NG0 == N and NGi = N/G(i-1). Fi - cost of a comparison function call for the i-th column; Wi - average width in bytes of the i-th column; Ci - total cost of one comparison call for the column, calculated as Fi * fine_function(Wi), where fine_function(Wi) looks like: if (Wi <= sizeof(Datum)) return 1.0; // any type passed in a Datum else return 1 + log16(Wi/sizeof(Datum)); log16 is just an empirical choice. - So, the cost for the first column is 2.0 * C0 * log2(N). The first column is always compared; the 2.0 multiplier was determined from the regression tests. It seems to cover the cost of transferring the tuple to the CPU cache, the startup cost of unpacking the tuple, the cost of calling the qsort compare wrapper function, etc. Removing this multiplier causes too optimistic a cost prediction. - The cost for the i-th column: Ci * log2(NGi) => Ci * log2(N/G(i-1)). Here I assume that the i-th comparison function will be called only inside a group defined by the (i-1) leading columns. Note, following the discussion in [1], I add a 1.5 multiplier here to be closer to the worst case, when groups differ significantly; right now there is no way to determine how different they might be. Note, this formula describes the cost of the first column too, except for the difference in multiplier. 
- The final cost is cpu_operator_cost * N * sum(per-column costs described above). Note, for a single column with width <= sizeof(Datum) and F1 = 1 this formula gives exactly the same result as the current one. - For Top-N sort the empiric is close to the old one: use the 2.0 multiplier as a constant under log2, and use log2(Min(NGi, output_tuples)) for the second and following columns. I believe the suggested method could easily be enhanced to support [1] and be used in [2]. Open items: - Fake Vars. I ran into generate_append_tlist(), which generates Vars with varno = 0; it was invented, as far as I can see, in 2002 with the comment 'should be changed in future'. The future hasn't come yet... I've added a workaround to prevent calling estimate_num_groups() with such Vars, but I'm not sure that is enough. - The costs of all comparison functions are currently the same and equal to 1. Maybe it's time to change that. - Empirical constants. I know it's impossible to remove them entirely, but maybe we can find better estimates for them. [1] https://commitfest.postgresql.org/18/1124/ [2] https://commitfest.postgresql.org/18/1651/ -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index a2a7e0c520..d6b19bb9ad 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -1612,6 +1612,231 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm) rterm->pathtarget->width); } +/* + * is_fake_var + * Workaround for generate_append_tlist() which generates fake Vars with + * varno == 0, which would cause the estimate_num_groups() call to fail + */ +static bool +is_fake_var(Expr *expr) +{ + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + return (IsA(expr, Var) && ((Var *) expr)->varno == 0); +} + +/* + * get_width_cost_multiplier + * Returns the relative complexity of comparing two values based on their width. 
+ * The idea behind: long values have more expensive comparisons. The return value is + * in cpu_operator_cost units. + */ +static double +get_width_cost_multiplier(PlannerInfo *root, Expr *expr) +{ + double width = -1.0; /* fake value */ + + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + /* Try to find actual stat in corresponding relation */ + if (IsA(expr, Var)) + { + Var *var = (Var *) expr; + + if (var->varno > 0 && var->varno < root->simple_rel_array_size) + { + RelOptInfo *rel = root->simple_rel_array[var->varno]; + + if (rel != NULL && +var->varattno >= rel->min_attr && +var->varattno <= rel->max_attr) + { +int ndx = var->
Re: Speedup of relation deletes during recovery
We just had a customer hit this issue. I kind of wonder whether this shouldn't be backpatched: currently the execution on the primary is O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the standby - with much higher constants to boot. That means it's very easy to get into situations where the standby starts to lag behind very significantly. +1, we ran into that too -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
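The O(NBuffers * log(ndrels)) behaviour quoted above comes from sorting the set of dropped relations once and then binary-searching it for every buffer, instead of scanning the whole list per buffer. A hedged sketch of that shape, with a simplified one-field id standing in for the full RelFileNode:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t RelFileNodeId; /* simplified: one id instead of RelFileNode */

static int
cmp_id(const void *a, const void *b)
{
    RelFileNodeId x = *(const RelFileNodeId *) a;
    RelFileNodeId y = *(const RelFileNodeId *) b;

    return (x > y) - (x < y);
}

/*
 * One O(log ndrels) lookup per buffer instead of an O(ndrels) scan; over
 * the whole buffer pool this is NBuffers * log(ndrels).  The dropped[]
 * array must be sorted with cmp_id first.
 */
static bool
buffer_belongs_to_dropped_rel(RelFileNodeId buf_rel,
                              const RelFileNodeId *dropped, size_t ndrels)
{
    return bsearch(&buf_rel, dropped, ndrels,
                   sizeof(RelFileNodeId), cmp_id) != NULL;
}
```

Dropping many relations in one transaction (or replaying that on a standby without the optimization) is exactly the case where the linear scan per buffer starts to dominate.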
Re: POC: GROUP BY optimization
, try producing the right output order (if the ORDER BY was specified) Yes. But at this point it's kinda mashed together in ad-hoc manner, using very simple heuristics / assumptions. We'll need to figure out how to combine those optimizations. The heuristics used here: 1) if there is some order, let's use it; 2) if there is a required order, let's match that order to prevent an extra sort node. The incremental sort patch will improve cases where there is a partial match of order. BTW I get compiler warnings that n_preordered_groups may be used uninitialized in add_paths_to_grouping_rel, because it's only set in an if/else branch, and then used further down. Fixed, but I believe the compiler is simply not smart enough here. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 4f93afdebc..ec4095bd2e 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -328,6 +328,12 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/*/ +bool debug_group_by_reorder_by_pathkeys = true; +bool debug_group_by_match_order_by = true; +bool debug_cheapest_group_by = true; +// + /* * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function * returns new lists, original GROUP BY lists stay untouched. @@ -341,6 +347,9 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, ListCell *key; int n; + if (debug_group_by_reorder_by_pathkeys == false) + return 0; + if (pathkeys == NIL || *group_pathkeys == NIL) return 0; @@ -380,6 +389,157 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, return n; } +/* + * Order the tail of the list of group pathkeys by uniqueness, descending. This + * speeds up sorting. Returns newly allocated lists; old ones stay untouched. + * n_preordered defines a head of the list whose order should be preserved. 
+ */ +void +get_cheapest_group_keys_order(PlannerInfo *root, double nrows, + List *target_list, + List **group_pathkeys, List **group_clauses, + int n_preordered) +{ + struct + { + PathKey *pathkey; + SortGroupClause *sgc; + Node *pathkeyExpr; + } + *keys, tmp; + int nkeys = list_length(*group_pathkeys) - n_preordered; + List *pathkeyExprList = NIL, + *new_group_pathkeys = NIL, + *new_group_clauses = NIL; + ListCell *cell; + int i = 0, n_keys_to_est; + + if (nkeys < 2) + return; /* nothing to do */ + + /* + * Will try to match ORDER BY pathkeys in hope that one sort is cheaper than + * two + */ + if (debug_group_by_match_order_by && + n_preordered == 0 && root->sort_pathkeys) + { + bool _save_debug_group_by_reorder_by_pathkeys = + debug_group_by_reorder_by_pathkeys; /* DEBUG ONLY, to be removed */ + + debug_group_by_reorder_by_pathkeys = true; + n_preordered = group_keys_reorder_by_pathkeys(root->sort_pathkeys, + group_pathkeys, + group_clauses); + debug_group_by_reorder_by_pathkeys = _save_debug_group_by_reorder_by_pathkeys; + + nkeys = list_length(*group_pathkeys) - n_preordered; + if (nkeys < 2) + return; /* nothing to do */ + } + + if (!debug_cheapest_group_by) + return; + + keys = palloc(nkeys * sizeof(*keys)); + + /* + * Collect information about pathkey for subsequent usage + */ + for_each_cell(cell, list_nth_cell(*group_pathkeys, n_preordered)) + { + PathKey *pathkey = (PathKey *) lfirst(cell); + + keys[i].pathkey = pathkey; + keys[i].sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + keys[i].pathkeyExpr = get_sortgroupclause_expr(keys[i].sgc, + target_list); + i++; + } + + /* + * Find the cheapest sort order of the columns. We first find the single + * column with the biggest number of groups, then the best pair (whose first + * column is already fixed by the first step), then the best triple, and so on. 
+ */ + for(n_keys_to_est = 1; n_keys_to_est <= nkeys - 1; n_keys_to_est++) + { + ListCell *tail_cell; + int best_i = 0; + double best_est_num_groups = -1; + + /* expand list of columns and remember last cell */ + pathkeyExprList = lappend(pathkeyExprList, NULL); + tail_cell = list_tail(pathkeyExprList); + + /* + * Find the best last column - best means the bigger number of groups; + * the previous columns are already chosen + */ + for(i = n_keys_to_est - 1; i < nkeys; i++) + { + double est_num_groups; + + lfirst(tail_cell) = keys[i].pathkeyExpr; + est_num_groups = estimate_num_groups(root, pathkeyExprList, + nrows, NULL); + + if (est_num_groups > best_est_num_groups) + { +best_est_num_groups = est_num_gro
Re: POC: GROUP BY optimization
Yes. But again, this description is a bit short. First, it works after the first patch and might get some preordered leading pathkeys. Second, it tries to match the ORDER BY clause order if there are no preordered leading pathkeys from the first patch (this was introduced in v7). And third, if there is a tail of unmatched pathkeys from the previous stages, then it will reorder that tail. OK, I haven't looked at v7 yet, but if I understand correctly it tries to maintain the ordering as much as possible? Does that actually help? I mean, the incremental sort patch allows the sorting to happen by pieces, but here we still need to sort all the data, right? Can you give an example demonstrating the benefit? See tst.sql. Queries are marked with opt (optimization is on) and noopt. Query 1: select count(*) from btg group by v, r; Query 2: select count(*) from btg group by n, v, r order by n; For both queries it is possible to reorder the v and r columns; the n column has a single distinct value. On my laptop: Query 1opt vs 1noopt: 3177.500 ms vs 6604.493 ms 2opt vs 2noopt: 5800.307 ms vs 7486.967 ms So, what we see: 1) For query 1 the optimization gives 2 times better performance, for query 2 only 30%. If column 'n' were unique, then the times for queries 1 and 2 should be the same. We could add a check for preordered pathkeys in get_cheapest_group_keys_order(), and if estimate_num_groups(reordered pathkeys) is close to 1, then skip reordering the tail of pathkeys. 2) Planning cost is the same for all queries. So cost_sort() doesn't even take the number of columns into account. FWIW I think it would be useful to have a "development GUC" that would allow us to enable/disable these options during development, because that makes experiments much easier. But then remove them before commit. Added, v9: debug_enable_group_by_match_order_by and debug_enable_cheapest_group_by. 
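The query-1 speedup above comes from putting the column with more distinct values first: later columns are only compared inside ties on the leading ones. A toy sketch (names are mine) that counts how often the second column is actually consulted during a two-column sort:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Row
{
    int         a;
    int         b;
} Row;

static long second_col_compares;

/* Compare by (a, b); count how often the comparison falls through to b. */
static int
cmp_rows(const void *p1, const void *p2)
{
    const Row  *r1 = p1;
    const Row  *r2 = p2;

    if (r1->a != r2->a)
        return (r1->a > r2->a) - (r1->a < r2->a);
    second_col_compares++;      /* tie on the leading column */
    return (r1->b > r2->b) - (r1->b < r2->b);
}

/* Sort the rows and report how many comparisons reached column b. */
static long
count_tail_compares(Row *rows, int n)
{
    second_col_compares = 0;
    qsort(rows, n, sizeof(Row), cmp_rows);
    return second_col_compares;
}
```

With a unique column leading, the tail column is essentially never touched; with a constant column leading, every comparison pays for both columns, which is the asymmetry a distinct-count-aware ordering exploits.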
I also checked compatibility with the incremental sort patch, and everything works except a small merge conflict, which could be resolved right before committing. Next, I had a look at cost_incremental_sort() provided by the incremental sort patch and, it's a pity, it doesn't solve our problem with the impact of the per-column comparison function cost and the number of its calls. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b22b36ec0e..c073eb061a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) -return cur_ec; /* Match! */ + { +/* + * Match! + * + * Copy sortref if it wasn't set yet; this is possible if the ec was + * constructed from a WHERE clause, i.e. it doesn't have a target + * reference at all + */ +if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; +return cur_ec; + } } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index ec66cb9c3c..4f93afdebc 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -26,6 +26,7 @@ #include "optimizer/paths.h" #include "optimizer/tlist.h" #include "utils/lsyscache.h" +#include "utils/selfuncs.h" static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); @@ -327,6 +328,58 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/* + * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function + * returns new lists, original GROUP BY lists stay untouched. 
+ */ +int +group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, + List **group_clauses) +{ + List *new_group_pathkeys= NIL, +*new_group_clauses = NIL; + ListCell *key; + int n; + + if (pathkeys == NIL || *group_pathkeys == NIL) + return 0; + + /* + * For each pathkey it tries to find corresponding GROUP BY pathkey and + * clause. + */ + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + SortGroupClause *sgc; + + /* + * Pathkey should use the same allocated struct, so equality of + * pointers is enough + */ + if (!list_member_ptr(*group_pathkeys, pathkey)) + break; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + new_group_clauses = lappend(new_group_clauses, sgc); + } + + n = list_length(new_group_pathkeys); + + /* + * Just append the rest of pathkeys and clauses + */ + *group_pathkeys = list_concat_unique_ptr(new_group_pathkeys, + *group_pathkeys); + *group_clause
Re: POC: GROUP BY optimization
Again agree. If we have a fixed order of columns (ORDER BY) then we should not try to reorder it. The current patch follows that, if I didn't make a mistake.

This part seems to be more of a misunderstanding between me and Claudio. I believe Claudio was referring to the column order in a GROUP BY, not ORDER BY. In which case we don't add any Sort, of course.

I hope so.

I'm still opposed to adding an arbitrary handicap to prioritize the order specified by the user, for the reasons I explained before. We should aim to make the heuristics/costing reliable enough to make this unnecessary.

+1 -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
I don't see why not to generate all possible orderings (keeping only the cheapest path for each pathkey variant) and let the optimizer sort it out. I'm assuming costing the full N! possible orderings would be prohibitively expensive.

That's true, but as the first step we need to improve cost_sort, and only then try to find the best pathkeys order in an optimal way.

- If the user requested that order, we assume it "somewhat subjectively better" (by giving it a slightly reduced cost).

I don't think so. That's not the SQL way - the DB should determine the optimal way to execute a query. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
So the costing was fairly trivial, we simply do something like comparison_cost = 2.0 * cpu_operator_cost; sort_cost = comparison_cost * tuples * LOG2(tuples); which essentially ignores that there might be multiple columns, or that the columns may have sort operators with different costs.

Agree. And the distribution of keys.

The question is how reliable the heuristics can be. The current patch uses just plain ndistinct, but that seems rather unreliable, and I don't have a clear idea how to improve that - we may have MCVs for the columns and perhaps some extended statistics, but I'm not sure how far we can run with that.

v8 already uses another algorithm.

Essentially what we need is to estimate the number of comparisons for each column, to compute a better comparison_cost.

Exactly.

Prioritization of the user-provided order can be as simple as giving that comparison_cost a small handicap.

I see no point in doing that, and I don't recall a single place in the planner where we do that. If the user specified ORDER BY, we'll slap an explicit Sort on top when needed (which acts as the handicap, but in a clear way). Otherwise we don't do such things - it'd be just plain confusing (consider "ORDER BY a,b" vs. "ORDER BY b,c" with same data types, ndistinct etc. but unexpectedly different costs). Also, what would be a good value for the handicap?

Again agree. If we have a fixed order of columns (ORDER BY) then we should not try to reorder it. The current patch follows that, if I didn't make a mistake. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
OK, I haven't looked at v7 yet, but if I understand correctly it tries to maintain the ordering as much as possible? Does that actually help? I mean, the incremental sort patch allows the sorting to happen by pieces, but here we still need to sort all the data, right? Can you give an example demonstrating the benefit?

SELECT a, SUM(x) FROM ( SELECT a, b, COUNT(c) AS x FROM t1 GROUP BY a, b UNION ALL SELECT a, b, COUNT(c) AS x FROM t2 GROUP BY a, b ) foo GROUP BY a;

and indexes on (a,b) and (b,a) for both relations. The "deduplication" by pathkeys I suggested would mean we might keep only index (a,b) on t1 and (b,a) on t2, which means the grouping by "a" can't leverage index scans on both relations. But if we keep paths for both indexes on each relation, we can.

Yes, that's one of the options.

Isn't "estimation of cost of comparing function/number of unique values in column could be not very accurate and so planner could make a wrong choice" more of an argument against relying on it when doing these optimizations? FWIW it's one of the arguments Tom made in the incremental sort patch, which relies on it too when computing the cost of the incremental sort. I'm sure it's going to be an obstacle there too.

> I saw 2 times difference in real-world application. Again, improving sort cost estimation is a separate task.

Sure. But we also need to ask the other question, i.e. how many people would be negatively affected by the optimization. And I admit I don't know the answer to that, the next example is entirely made up.

Hm, it seems the best way here is improving the cost_sort estimation. Will try, but I think that is a separate patch.

FWIW I think it would be useful to have "development GUC" that would allow us to enable/disable these options during development, because that makes experiments much easier. But then remove them before commit.

Will do -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
tistic, you can do either a greedy search (get the next column with the highest ndistinct coefficient) or an exhaustive search (computing the estimated number of comparisons).

Another challenge is that using only the ndistinct coefficient assumes a uniform distribution of the values. But you can have a column with 1M distinct values, where a single value represents 99% of the rows. And another column with 100k distinct values, with an actual uniform distribution. I'm pretty sure it'd be more efficient to place the 100k column first.

Interesting. Will think about it, thank you. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b22b36ec0e..c073eb061a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) -return cur_ec; /* Match! */ + { +/* + * Match! + * + * Copy sortref if it wasn't set yet, it's possible if ec was + * constructed from WHERE clause, i.e. it doesn't have target + * reference at all + */ +if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; +return cur_ec; + } } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index ec66cb9c3c..4f93afdebc 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -26,6 +26,7 @@ #include "optimizer/paths.h" #include "optimizer/tlist.h" #include "utils/lsyscache.h" +#include "utils/selfuncs.h" static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); @@ -327,6 +328,58 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/* + * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function + * returns new lists, original GROUP BY lists stay untouched. 
+ */ +int +group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, + List **group_clauses) +{ + List *new_group_pathkeys= NIL, +*new_group_clauses = NIL; + ListCell *key; + int n; + + if (pathkeys == NIL || *group_pathkeys == NIL) + return 0; + + /* + * For each pathkey it tries to find corresponding GROUP BY pathkey and + * clause. + */ + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + SortGroupClause *sgc; + + /* + * Pathkey should use the same allocated struct, so equality of + * pointers is enough + */ + if (!list_member_ptr(*group_pathkeys, pathkey)) + break; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + new_group_clauses = lappend(new_group_clauses, sgc); + } + + n = list_length(new_group_pathkeys); + + /* + * Just append the rest of pathkeys and clauses + */ + *group_pathkeys = list_concat_unique_ptr(new_group_pathkeys, + *group_pathkeys); + *group_clauses = list_concat_unique_ptr(new_group_clauses, + *group_clauses); + + return n; +} + /* * get_cheapest_path_for_pathkeys * Find the cheapest path (according to the specified criterion) that @@ -1611,6 +1664,39 @@ pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys) return 0; /* path ordering not useful */ } +/* + * pathkeys_useful_for_grouping + * Count the number of pathkeys that are useful for grouping (instead of + * explicit sort) + * + * Group pathkeys could be reordered, so we don't care about the actual order + * of pathkeys + */ +static int +pathkeys_useful_for_grouping(PlannerInfo *root, List *pathkeys) +{ + ListCell *key; + int n = 0; + + if (root->group_pathkeys == NIL) + return 0;/* no special ordering requested */ + + if (pathkeys == NIL) + return 0;/* unordered path */ + + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + + if (!list_member_ptr(root->group_pathkeys, pathkey)) + break; + + n++; + } + + return n; +} + 
/* * truncate_useless_pathkeys * Shorten the given pathkey list to just the useful pathkeys. @@ -1625,6 +1711,9 @@ truncate_useless_pathkeys(PlannerInfo *root, nuseful = pathkeys_useful_for_merging(root, rel, pathkeys); nuseful2 = pathkeys_useful_for_ordering(root, pathkeys); + if (nuseful2 > nuseful) + nuseful = nuseful2; + nuseful2 = pathkeys_useful_for_grouping(root, pathkeys); if (nuseful2 > nuseful) nuseful = nuseful2; @@ -1660,6 +1749,8 @@ has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel) { if (rel->joininfo != NIL || rel->has_eclass_joins) return true; /* might be able to use pathkeys for merging */ + if (root->group_pathk
Re: \d t: ERROR: XX000: cache lookup failed for relation
Teodor Sigaev wrote: Ah, I think this is the missing, essential component: CREATE INDEX ON t(right(i::text,1)) WHERE i::text LIKE '%1';

Finally, I reproduced it with the attached script. The attachment is a simplified version of the script.

psql uses an ordinary SQL query to get info about the index, with the usual transaction isolation/MVCC. To create a description of the index it calls pg_get_indexdef(), which doesn't use the transaction snapshot; it uses the catalog snapshot, because it accesses the catalog through the system catalog cache. So the difference between an ordinary SQL query and pg_get_indexdef() is the snapshot used. I'm not sure that this is easy to fix, or whether it should be fixed at all.

Simplified query: SELECT c2.relname, i.indexrelid, pg_catalog.pg_get_indexdef(i.indexrelid, 0, true) FROM pg_catalog.pg_class c, pg_catalog.pg_class c2, pg_catalog.pg_index i WHERE c.relname = 't' AND c.oid = i.indrelid AND i.indexrelid = c2.oid -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 1.sh Description: application/shellscript
Re: [PATCH] Trim trailing whitespace in vim and emacs
I use FileStly plugin to vim [1]. But I slightly modify it, see in attachment. FileStyle, sorry. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [PATCH] Trim trailing whitespace in vim and emacs
I once tried to have vim highlight trailing whitespace, spaces before tabs, and overlength lines (>80 chars), and while I could do each thing in isolation, I wasn't able to get it to highlight the three of them at the same time. If you have a recipe for that, I welcome it.

I use FileStly plugin to vim [1]. But I slightly modified it, see the attachment. And an addition in .vimrc:

if expand('%:e') == "c" || expand('%:e') == "h" || expand('%:e') == "cpp" || expand('%:e') == "m" || expand('%:e') == "hpp" || expand('%:e') == "pl" || expand('%:e') == "pm" || expand('%:e') == "y" || expand('%:e') == "l"
else
  let g:filestyle_plugin = 1
endif

[1] https://www.vim.org/scripts/script.php?script_id=5065

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/

" Copyright 2014 Alexander Serebryakov
"
" Licensed under the Apache License, Version 2.0 (the "License");
" you may not use this file except in compliance with the License.
" You may obtain a copy of the License at
"
"     http://www.apache.org/licenses/LICENSE-2.0
"
" Unless required by applicable law or agreed to in writing, software
" distributed under the License is distributed on an "AS IS" BASIS,
" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
" See the License for the specific language governing permissions and
" limitations under the License.

"Plugin checking the file to follow Your Vim settings
if !exists('g:filestyle_plugin')
  let g:filestyle_plugin = 1

  highlight FileStyleTabsError ctermbg=Red guibg=Red
  highlight FileStyleTrailngSpacesError ctermbg=Cyan guibg=Cyan
  highlight FileStyleSpacesError ctermbg=Yellow guibg=Yellow
  highlight FileStyleTooLongLine cterm=inverse gui=inverse

  "Defining auto commands
  augroup filestyle_auto_commands
    autocmd!
    autocmd BufReadPost,BufNewFile * call FileStyleActivate()
    autocmd FileType * call FileStyleCheckFiletype()
    autocmd WinEnter * call FileStyleCheck()
  augroup end

  "Defining plugin commands
  command! FileStyleActivate call FileStyleActivate()
  command! FileStyleDeactivate call FileStyleDeactivate()
  command! FileStyleCheck call FileStyleCheck()
endif

"Turn plugin on
function FileStyleActivate()
  let b:filestyle_active = 1
  call FileStyleCheck()
endfunction

"Turn plugin off
function FileStyleDeactivate()
  let b:filestyle_active = 0
  call clearmatches()
endfunction

"Check filetype to handle specific cases
function FileStyleCheckFiletype()
  "Avoid checking of help files
  if &filetype == 'help'
    call FileStyleDeactivate()
  endif
endfunction

"Highlighting specified pattern
function FileStyleHighlightPattern(highlight)
  call matchadd(a:highlight['highlight'], a:highlight['pattern'])
endfunction

"Checking expandtab option
function FileStyleExpandtabCheck()
  if &expandtab
    let l:highlight = {'highlight': 'FileStyleTabsError',
                     \ 'pattern': '\t\+'}
  else
    let l:highlight = {'highlight': 'FileStyleSpacesError',
                     \ 'pattern': '^\t* \{'.&tabstop.',\}'}
  endif
  call FileStyleHighlightPattern(l:highlight)
endfunction

"Checking trailing spaces
function FileStyleTrailingSpaces()
  let l:highlight = {'highlight': 'FileStyleTrailngSpacesError',
                   \ 'pattern': '\s\+$'}
  call FileStyleHighlightPattern(l:highlight)
endfunction

"Checking inner spaces
function FileStyleInnerSpaces()
  let l:highlight = {'highlight': 'FileStyleTabsError',
                   \ 'pattern': '\ \+\t'}
  call FileStyleHighlightPattern(l:highlight)
endfunction

function FileStyleAllInnerSpaces()
  let l:highlight = {'highlight': 'FileStyleTabsError',
                   \ 'pattern': '\ \{4,\}'}
  call FileStyleHighlightPattern(l:highlight)
endfunction

"Checking long lines
function FileStyleLongLines()
  if &textwidth > 0
    let l:highlight = {'highlight': 'FileStyleTooLongLine',
                     \ 'pattern': '\%' . (&textwidth+1) . 'v.*'}
    call FileStyleHighlightPattern(l:highlight)
  endif
endfunction

"Checking file dependently on settings
function FileStyleCheck()
  if b:filestyle_active == 1
    call clearmatches()
    call FileStyleExpandtabCheck()
    call FileStyleTrailingSpaces()
    call FileStyleInnerSpaces()
    call FileStyleAllInnerSpaces()
    call FileStyleLongLines()
  endif
endfunction
Re: POC: GROUP BY optimization
on of the ordering. But as I argued in other threads, such implicit guarantees are really fragile, can silently break for arbitrary reasons (say, parallel query will do just that) and can be easily fixed by adding a proper ORDER BY. So I don't think this is an issue.

Agree. SQL by design doesn't guarantee a particular order without an explicit ORDER BY clause.

The other random thought is how will/could this interact with the incremental sorts patch. I know Alexander wanted to significantly limit where we actually consider the incremental sort, but I'm not sure if this was one of those places or not (it sure looks like a place where we would greatly benefit from it).

It seems they should not directly interact. This patch tries to find the cheapest column order, Alexander's patch tries to make the sort itself cheaper - they are different tasks. But I will try.

So to summarize this initial review - I do suggest splitting the patch into two parts. One that does the index magic, and one for this reordering optimization. The first part (additional indexes) seems quite fairly safe, likely to get committable soon. The other part (ndistinct reordering) IMHO requires more thought regarding costing and interaction with other query parts.

Thank you for the review! 
-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 4f93afdebc..8897157817 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -380,6 +380,128 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, return n; } +/* + * Work struct for sorting pathkeys + */ +typedef struct PathKeyNumGroup +{ + PathKey *key; + double numGroups; + int order; /* just to make qsort stable */ +} PathKeyNumGroup; + +static int +pathkey_numgroups_cmp(const void *a, const void *b) +{ + const PathKeyNumGroup *pka = (const PathKeyNumGroup*)a, + *pkb = (const PathKeyNumGroup*)b; + + if (pka->numGroups == pkb->numGroups) + /* make qsort stable */ + return (pka->order > pkb->order) ? 1 : -1; + return (pka->numGroups > pkb->numGroups) ? -1 : 1; +} + +/* + * Order the tail of the list of group pathkeys by descending uniqueness. This + * speeds up sorting. Returns newly allocated lists, old ones stay untouched. + * startNum defines the head of the list whose order should be preserved. 
+ */ +void +get_cheapest_group_keys_order(PlannerInfo *root, double nrows, + List *target_list, + List **group_pathkeys, List **group_clauses, + int startNum) +{ + PathKeyNumGroup *keys; + int nkeys = list_length(*group_pathkeys) - startNum; + List *pathkeyExprList, + *new_group_pathkeys = NIL, + *new_group_clauses = NIL; + ListCell *cell; + int i = 0; + + if (nkeys < 2) + return; /* nothing to do */ + + /* + * Will try to match ORDER BY pathkeys in the hope that one sort is cheaper + * than two + */ + if (startNum == 0 && root->sort_pathkeys) + { + startNum = group_keys_reorder_by_pathkeys(root->sort_pathkeys, + group_pathkeys, + group_clauses); + + nkeys = list_length(*group_pathkeys) - startNum; + if (nkeys < 2) + return; /* nothing to do */ + } + + keys = palloc(nkeys * sizeof(*keys)); + pathkeyExprList = list_make1(NULL); + + /* + * for each pathkey to be ordered we count the number of groups, to sort + * them in descending order + */ + for_each_cell(cell, list_nth_cell(*group_pathkeys, startNum)) + { + PathKey *pathkey = (PathKey *) lfirst(cell); + Node *pathkeyExpr; + SortGroupClause *sgc; + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + pathkeyExpr = get_sortgroupclause_expr(sgc, target_list); + linitial(pathkeyExprList) = pathkeyExpr; + keys[i].numGroups= estimate_num_groups(root, pathkeyExprList, + nrows, NULL); + keys[i].key = pathkey; + keys[i].order = i; + + i++; + } + + list_free(pathkeyExprList); + + qsort(keys, nkeys, sizeof(*keys), pathkey_numgroups_cmp); + + /* + * Construct result lists + */ + i = 0; + foreach(cell, *group_pathkeys) + { + PathKey *pathkey; + SortGroupClause *sgc; + + if (i < startNum) + /* presorted head */ + pathkey = (PathKey *) lfirst(cell); + else + /* pathkey, sorted above */ + pathkey = keys[i - startNum].key; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + new_group_clauses = lappend(new_group_clauses, sgc); + + i++; + } + + pfree(keys); + + /* Just append the rest of the GROUP BY clauses */ + new_group_clauses = list_concat_unique_ptr(new_group_clauses, *group_clauses); + + *group_pathkeys = new_group_pathkeys; + *group_clauses = new_group_clauses; +} +
Re: \d t: ERROR: XX000: cache lookup failed for relation
Ah, I think this is the missing, essential component: CREATE INDEX ON t(right(i::text,1)) WHERE i::text LIKE '%1';

Finally, I reproduced it with the attached script. INSERT 0 99 <- first insertion ERROR: cache lookup failed for relation 1032219 ALTER TABLE ERROR: cache lookup failed for relation 1033478 ALTER TABLE ERROR: cache lookup failed for relation 1034073 ALTER TABLE ERROR: cache lookup failed for relation 1034650 ALTER TABLE ERROR: cache lookup failed for relation 1035238 ALTER TABLE ERROR: cache lookup failed for relation 1035837

Will investigate. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 1.sh Description: application/shellscript
Re: \d t: ERROR: XX000: cache lookup failed for relation
Is that considered an actionable problem?

I think so, but I'm not able to reproduce it. I wrote a script to simplify reproduction, but it doesn't reproduce it either.

And how long does one need to wait for it to reproduce? I waited for one hour. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 1.sh Description: application/shellscript
Re: New committers announced at PGCon 2018
Etsuro Fujita Peter Geoghegan Amit Kapila Alexander Korotkov Thomas Munro Michael Paquier Tomas Vondra Congratulations to all! +7! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Few comments on commit 857f9c36 (skip full index scans )
I think it will always be set to BTREE_VERSION (See _bt_restore_meta). You are right, pushed. Thank you! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
Cosmetic change: remove the find_sort_group_clause_by_sortref() function added in the v5 patch because it duplicates the existing get_sortgroupref_clause(). -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b22b36ec0e..c073eb061a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) -return cur_ec; /* Match! */ + { +/* + * Match! + * + * Copy sortref if it wasn't set yet, it's possible if ec was + * constructed from WHERE clause, i.e. it doesn't have target + * reference at all + */ +if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; +return cur_ec; + } } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index ec66cb9c3c..ca3c78dbbe 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -26,6 +26,7 @@ #include "optimizer/paths.h" #include "optimizer/tlist.h" #include "utils/lsyscache.h" +#include "utils/selfuncs.h" static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); @@ -327,6 +328,171 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/* + * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function + * returns new lists, original GROUP BY lists stay untouched. + */ +int +group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, + List **group_clauses) +{ + List *new_group_pathkeys= NIL, +*new_group_clauses = NIL; + ListCell *key; + int n; + + if (pathkeys == NIL || *group_pathkeys == NIL) + return 0; + + /* + * For each pathkey it tries to find corresponding GROUP BY pathkey and + * clause. 
+ */ + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + SortGroupClause *sgc; + + /* + * Pathkey should use the same allocated struct, so equality of + * pointers is enough + */ + if (!list_member_ptr(*group_pathkeys, pathkey)) + break; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + new_group_clauses = lappend(new_group_clauses, sgc); + } + + n = list_length(new_group_pathkeys); + + /* + * Just append the rest of pathkeys and clauses + */ + *group_pathkeys = list_concat_unique_ptr(new_group_pathkeys, + *group_pathkeys); + *group_clauses = list_concat_unique_ptr(new_group_clauses, + *group_clauses); + + return n; +} + +/* + * Work struct for sorting pathkeys + */ +typedef struct PathKeyNumGroup +{ + PathKey *key; + double numGroups; + int order; /* just to make qsort stable */ +} PathKeyNumGroup; + +static int +pathkey_numgroups_cmp(const void *a, const void *b) +{ + const PathKeyNumGroup *pka = (const PathKeyNumGroup*)a, + *pkb = (const PathKeyNumGroup*)b; + + if (pka->numGroups == pkb->numGroups) + /* make qsort stable */ + return (pka->order > pkb->order) ? 1 : -1; + return (pka->numGroups > pkb->numGroups) ? -1 : 1; +} + +/* + * Order the tail of the list of group pathkeys by descending uniqueness. This + * speeds up sorting. Returns newly allocated lists, old ones stay untouched. 
+ */ +void +sort_group_key_by_distinct(PlannerInfo *root, double nrows, List *targetList, + List **groupPathkeys, List **groupClauses, + int startNum) +{ + PathKeyNumGroup *keys; + int nkeys = list_length(*groupPathkeys) - startNum; + List *pathkeyExprList, + *newGroupPathkeys = NIL, + *newGroupClauses = NIL; + ListCell *cell; + int i = 0; + + if (nkeys < 2) + return; /* nothing to do */ + + keys = palloc(nkeys * sizeof(*keys)); + pathkeyExprList = list_make1(NULL); + + /* + * for each pathkey that isn't supported by an underlying index (or + * something else) we count the number of groups, to sort them in + * descending order + */ + for_each_cell(cell, list_nth_cell(*groupPathkeys, startNum)) + { + PathKey *pathkey = (PathKey *) lfirst(cell); + Node *pathkeyExpr; + SortGroupClause *sgc; + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *groupClauses); + pathkeyExpr = get_sortgroupclause_expr(sgc, targetList); + linitial(pathkeyExprList) = pathkeyExpr; + keys[i].numGroups= estimate_num_groups(root, pathkeyExprList, + nrows, NULL); + keys[i].key = pathkey; + keys[i].order = i; + + i++; + } + + list_free(pathkeyExprList); + + qsort(keys, nkeys, sizeof(*keys), pathkey_numgroups_cmp); + i = 0
Re: Few comments on commit 857f9c36 (skip full index scans )
The metapage upgrade should be performed under a critical section.

Agree. But after a closer look I found that the btm_version change isn't WAL-logged (near line 2251 in _bt_newroot). So btm_version is not propagated to replicas/backups/etc. I believe it should be fixed. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
POC: GROUP BY optimization
Hi! As far as I know, columns in GROUP BY can be reordered without loss of correctness, but reordering can give a benefit in performance. The suggested patch implements that in several ways:

1) Reorder GROUP BY columns by number of unique values, in descending order. A simple example shows a performance difference of two times (see opt_group_by_demostrator.sql; on my notebook the first two queries demonstrate that). The idea is: if the first column is unique then the sort comparator will never compute the difference of the following columns.

2) Reorder GROUP BY columns to match existing pathkeys which are the result of an index scan or ORDER BY clause. This prevents adding a sort node - suppose, it's a clear win.

3) The patch makes a suggestion to use an index for the GROUP BY clause, but unlike the ORDER BY or merge join case it doesn't pay attention to the actual order of columns, because of 2).

The patch doesn't change any cost estimation computation, although 1) could take advantage of it.

Some issues/problems/notices:
1) I didn't play around with GROUPING SETS at all. As far as I can see, the current coding prevents any optimization around it and, I suppose, it should be addressed in another patch.
2) I'm not completely satisfied with the counting of the number of groups per column; it looks ugly without refactoring estimate_num_groups(). At least, now it could be called multiple times for each column. Maybe this number should be added to the PathKey structure?
3) EquivalenceClass->ec_sortref. If the planner faces a column for the first time in a WHERE clause, it doesn't fill the target reference field because it is unknown yet. The patch looks for a correspondence between PathKey (group_pathkeys) and SortGroupClause (groupClause) and fails in this case. So, get_eclass_for_sort_expr() now updates the ec_sortref field if it isn't initialized yet.
4) The algorithm to reorder columns is proportional to N^2 where N is the number of columns, but I hope it isn't a problem because the number of GROUP BY columns usually isn't very big. 
-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b22b36ec0e..c073eb061a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) -return cur_ec; /* Match! */ + { +/* + * Match! + * + * Copy sortref if it wasn't set yet, it's possible if ec was + * constructed from WHERE clause, i.e. it doesn't have target + * reference at all + */ +if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; +return cur_ec; + } } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index ec66cb9c3c..6f61bd20de 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -26,6 +26,7 @@ #include "optimizer/paths.h" #include "optimizer/tlist.h" #include "utils/lsyscache.h" +#include "utils/selfuncs.h" static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); @@ -327,6 +328,192 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/* + * Find a group clause by its tle reference + */ +static SortGroupClause* +find_sort_group_clause_by_sortref(List *groupClauses, Index tleSortGroupRef) +{ + ListCell *cell; + + foreach(cell, groupClauses) + { + SortGroupClause *sgc = (SortGroupClause*) lfirst(cell); + + if (sgc->tleSortGroupRef == tleSortGroupRef) + return sgc; + } + + elog(ERROR, "could not find target reference as sort group clause"); + + return NULL; +} + +/* + * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function + * returns new lists, original GROUP BY lists stay untouched. 
+ */ +int +group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, + List **group_clauses) +{ + List *new_group_pathkeys= NIL, +*new_group_clauses = NIL; + ListCell *key; + int n; + + if (pathkeys == NIL || *group_pathkeys == NIL) + return 0; + + /* + * For each pathkey it tries to find corresponding GROUP BY pathkey and + * clause. + */ + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + SortGroupClause *sgc; + + /* + * Pathkey should use the same allocated struct, so equality of + * pointers is enough + */ + if (!list_member_ptr(*group_pathkeys, pathkey)) + break; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = find_sort_group_clause_by_sortref(*group_clauses, +pathkey->pk_eclass->ec_sortref); + new_group_clauses = lappend(new_group_clauses, sgc); + } + + n = list_length(new_group_pathkeys); + + /* + *
Re: index scan over composite type
Thank you. It seems to work; at least I can't find a counter-example.

Tom Lane wrote: Teodor Sigaev <teo...@sigaev.ru> writes: I'm not understand why postgres prefers to sort table instead of using index only scan when query is a simple inner join on composite type. Query with equality clause with constant works fine with index scan but join not. Could somebody point me why? Thank you.

Hmm ... the reason why not seems to be that canonicalize_ec_expression() improperly adds a RelabelType node, causing the composite-type Vars to not be recognized as matching the eclass they should match. The attached patch fixes it and doesn't seem to break anything in the regression tests. This raises the question of why we don't treat type RECORD more like a true polymorphic type, but that's a can of worms I don't particularly want to open right now. For the moment, this is the only IsPolymorphicType call in the planner AFAICS, so there's some reason to hope that we don't have more bugs of the same ilk. regards, tom lane

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
index scan over composite type
Hi! I don't understand why Postgres prefers to sort the table instead of using an index-only scan when the query is a simple inner join on a composite type. A query with an equality clause against a constant works fine with an index scan, but the join does not. Could somebody point out why? Thank you. I'm also unable to force a merge join with index scans using any combination of enable_* variables. The attached script is a self-contained test; the Pg config file is the default. explain select a.idv, b.idv from a, b where a.idv = b.idv; Merge Join (cost=25751.64..27751.64 rows=10 width=74) Merge Cond: (a.idv = b.idv) -> Sort (cost=12875.82..13125.82 rows=10 width=37) Sort Key: a.idv -> Seq Scan on a (cost=0.00..1834.00 rows=10 width=37) -> Sort (cost=12875.82..13125.82 rows=10 width=37) Sort Key: b.idv -> Seq Scan on b (cost=0.00..1834.00 rows=10 width=37) -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ test.sql Description: application/sql
Re: Postgres 11 release notes
857f9c36cda520030381bd8c2af20adf0ce0e1d4 Skip full index scan during cleanup of B-tree indexes when possible I read that and thought it was too detailed to be in the release notes. It is not that it is unimportant, but it is hard to see how people would notice the difference or change their behavior based on this change. Hm, it could be something like: Allow skipping the index scan during vacuum of append-only tables, which decreases vacuum time for such tables. c0cbe00fee6d0a5e0ec72c6d68a035e674edc4cc Add explicit cast from scalar jsonb to all numeric and bool types. Please edit: Add function json(b)_to_tsvector to create usable text search queries matching JSON/JSONB values (Dmitry Dolgov) to Add function json(b)_to_tsvector to create usable vectors to search in json(b) (Dmitry Dolgov) or something similar. Your wording is about queries, but those functions are actually about creating a tsvector from json(b). OK, how is this text? Add function json(b)_to_tsvector() to create text search query for matching JSON/JSONB values (Dmitry Dolgov) Not a query - the *_to_tsvector functions produce a tsvector. So: Add function json(b)_to_tsvector() to use customizable full text search over json(b) columns. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Postgres 11 release notes
Bruce Momjian wrote: I have committed the first draft of the Postgres 11 release notes. I will add more markup soon. You can view the most current version here: http://momjian.us/pgsql_docs/release-11.html I expect a torrent of feedback. ;-) Hi! It seems you missed: 857f9c36cda520030381bd8c2af20adf0ce0e1d4 Skip full index scan during cleanup of B-tree indexes when possible c0cbe00fee6d0a5e0ec72c6d68a035e674edc4cc Add explicit cast from scalar jsonb to all numeric and bool types. Please edit: Add function json(b)_to_tsvector to create usable text search queries matching JSON/JSONB values (Dmitry Dolgov) to Add function json(b)_to_tsvector to create usable vectors to search in json(b) (Dmitry Dolgov) or something similar. Your wording is about queries, but those functions are actually about creating a tsvector from json(b). Thank you! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: doc fixes: vacuum_cleanup_index_scale_factor
thanks to everyone, pushed -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Cast jsonb to numeric, int, float, bool
1) Does this really pass muster from the translatability standpoint? I doubt it. Huh, I missed that. I think you want the callers to look like if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) cannotCastJsonbValue(v.type, "double precision"); where the subroutine contains the whole ereport() call, and its lookup table entries are e.g. gettext_noop("cannot cast jsonb string to type %s") Thanks for your idea, patch is attached -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 9d2b89f90c..3ef71bbade 100644 --- a/src/backend/utils/adt/jsonb.c +++ b/src/backend/utils/adt/jsonb.c @@ -1857,7 +1857,7 @@ jsonb_object_agg_finalfn(PG_FUNCTION_ARGS) /* * Extract scalar value from raw-scalar pseudo-array jsonb. */ -static JsonbValue * +static bool JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) { JsonbIterator *it; @@ -1865,7 +1865,11 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) JsonbValue tmp; if (!JsonContainerIsArray(jbc) || !JsonContainerIsScalar(jbc)) - return NULL; + { + /* inform caller about actual type of container */ + res->type = (JsonContainerIsArray(jbc)) ? 
jbvArray : jbvObject; + return false; + } /* * A root scalar is stored as an array of one element, so we get the array @@ -1887,7 +1891,40 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) tok = JsonbIteratorNext(, , true); Assert(tok == WJB_DONE); - return res; + return true; +} + +/* + * Emit correct, translable cast error message + */ +static void +cannotCastJsonbValue(enum jbvType type, const char *sqltype) +{ + static struct + { + enum jbvType type; + char *msg; + } + messages[] = + { + { jbvNull, gettext_noop("cannot cast jsonb null to type %s") }, + { jbvString, gettext_noop("cannot cast jsonb string to type %s") }, + { jbvNumeric, gettext_noop("cannot cast jsonb numeric to type %s") }, + { jbvBool, gettext_noop("cannot cast jsonb boolean to type %s") }, + { jbvArray, gettext_noop("cannot cast jsonb array to type %s") }, + { jbvObject, gettext_noop("cannot cast jsonb object to type %s") }, + { jbvBinary, gettext_noop("cannot cast jsonb array or object to type %s") } + }; + int i; + + for(i=0; i<lengthof(messages); i++) + if (messages[i].type == type) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg(messages[i].msg, sqltype))); + + /* should be unreachable */ + elog(ERROR, "unknown jsonb type: %d", (int)type); } Datum @@ -1897,9 +1934,7 @@ jsonb_bool(PG_FUNCTION_ARGS) JsonbValue v; if (!JsonbExtractScalar(>root, ) || v.type != jbvBool) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be boolean"))); + cannotCastJsonbValue(v.type, "boolean"); PG_FREE_IF_COPY(in, 0); @@ -1914,9 +1949,7 @@ jsonb_numeric(PG_FUNCTION_ARGS) Numeric retValue; if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + cannotCastJsonbValue(v.type, "numeric"); /* * v.val.numeric points into jsonb body, so we need to make a copy to @@ -1937,9 +1970,7 @@ jsonb_int2(PG_FUNCTION_ARGS) Datum retValue; if 
(!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + cannotCastJsonbValue(v.type, "smallint"); retValue = DirectFunctionCall1(numeric_int2, NumericGetDatum(v.val.numeric)); @@ -1957,9 +1988,7 @@ jsonb_int4(PG_FUNCTION_ARGS) Datum retValue; if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + cannotCastJsonbValue(v.type, "integer"); retValue = DirectFunctionCall1(numeric_int4, NumericGetDatum(v.val.numeric)); @@ -1977,9 +2006,7 @@ jsonb_int8(PG_FUNCTION_ARGS) Datum retValue; if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + cannotCastJsonbValue(v.type, "bigint"); retValue = DirectFunctionCall1(numeric_int8, NumericGetDatum(v.val.numeric)); @@ -1997,9 +2024,7 @@ jsonb_float4(PG_FUNCTION_ARGS) Datum retValue; if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + ca
Re: Cast jsonb to numeric, int, float, bool
How about "cannot cast jsonb $json_type to $sql_type" where $json_type is the type inside the jsonb (e.g. string, number, boolean, array, object)? Yes, that sounds pretty good. Does anybody have an objections to patch? -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 9d2b89f90c..9f72984fac 100644 --- a/src/backend/utils/adt/jsonb.c +++ b/src/backend/utils/adt/jsonb.c @@ -1857,7 +1857,7 @@ jsonb_object_agg_finalfn(PG_FUNCTION_ARGS) /* * Extract scalar value from raw-scalar pseudo-array jsonb. */ -static JsonbValue * +static bool JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) { JsonbIterator *it; @@ -1865,7 +1865,11 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) JsonbValue tmp; if (!JsonContainerIsArray(jbc) || !JsonContainerIsScalar(jbc)) - return NULL; + { + /* inform caller about actual type of container */ + res->type = (JsonContainerIsArray(jbc)) ? jbvArray : jbvObject; + return false; + } /* * A root scalar is stored as an array of one element, so we get the array @@ -1887,7 +1891,40 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) tok = JsonbIteratorNext(, , true); Assert(tok == WJB_DONE); - return res; + return true; +} + +/* + * Map jsonb types to human-readable names + */ +static char* +JsonbTypeName(enum jbvType type) +{ + static struct + { + enum jbvType type; + char *typename; + } + names[] = + { + { jbvNull, "null" }, + { jbvString, "string" }, + { jbvNumeric, "numeric" }, + { jbvBool, "boolean" }, + { jbvArray, "array" }, + { jbvObject, "object" }, + { jbvBinary, "array or object" } + }; + int i; + + for(i=0; i<lengthof(names); i++) + if (names[i].type == type) + return names[i].typename; + + /* should be unreachable */ + elog(ERROR, "unknown jsonb type: %d", (int)type); + + return NULL; } Datum @@ -1899,7 +1936,7 @@ jsonb_bool(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvBool) ereport(ERROR, 
(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be boolean"))); + errmsg("cannot cast jsonb %s to boolean", JsonbTypeName(v.type; PG_FREE_IF_COPY(in, 0); @@ -1916,7 +1953,7 @@ jsonb_numeric(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to numeric", JsonbTypeName(v.type; /* * v.val.numeric points into jsonb body, so we need to make a copy to @@ -1939,7 +1976,7 @@ jsonb_int2(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to int2", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_int2, NumericGetDatum(v.val.numeric)); @@ -1959,7 +1996,7 @@ jsonb_int4(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to int4", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_int4, NumericGetDatum(v.val.numeric)); @@ -1979,7 +2016,7 @@ jsonb_int8(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to int8", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_int8, NumericGetDatum(v.val.numeric)); @@ -1999,7 +2036,7 @@ jsonb_float4(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to float4", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_float4, NumericGetDatum(v.val.numeric)); @@ -2019,7 +2056,7 @@ jsonb_float8(PG_FUNCTION_ARGS) if 
(!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to float8", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_float8, NumericGetDatum(v.val.numeric)); diff --git a/src/test/regress/expected/jsonb.out b/src/tes
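The patch in this thread replaces per-callsite error strings with a single table-driven mapping from the jsonb value kind to a human-readable name, interpolated into "cannot cast jsonb %s to %s". A standalone sketch of that lookup-table approach (hypothetical: the enum and names below are illustrative stand-ins for the real jbvType values, and the real code reports via ereport() rather than returning a string):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for jbvType */
typedef enum { MY_NULL, MY_STRING, MY_NUMERIC, MY_BOOL, MY_ARRAY, MY_OBJECT } my_jbv_type;

/* Map a jsonb value kind to a human-readable name via a lookup table */
static const char *
my_jsonb_type_name(my_jbv_type type)
{
    static const struct
    {
        my_jbv_type type;
        const char *name;
    } names[] = {
        { MY_NULL, "null" }, { MY_STRING, "string" }, { MY_NUMERIC, "numeric" },
        { MY_BOOL, "boolean" }, { MY_ARRAY, "array" }, { MY_OBJECT, "object" },
    };

    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        if (names[i].type == type)
            return names[i].name;
    return "unknown";           /* the real code elog(ERROR)s here */
}

/* Build the message a failed cast would report */
static void
format_cast_error(char *buf, size_t buflen, my_jbv_type type, const char *sqltype)
{
    snprintf(buf, buflen, "cannot cast jsonb %s to %s",
             my_jsonb_type_name(type), sqltype);
}
```

One design point the review raised: keeping the whole format string per type (as in the gettext_noop() version) rather than composing it from fragments is what keeps the messages translatable.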
Re: nbtsort.c performs unneeded (though harmless) truncation
Thank you, pushed. Peter Geoghegan wrote: I noticed that we're calling _bt_nonkey_truncate() needlessly when a minimum key is needed at the leftmost page on each level of the tree. This was always a special case, and I think that it should remain as one. Attached patch avoids unneeded truncations, while preserving the generic BTreeTupleGetNAtts() assertions. This isn't a correctness issue, and the extra overhead of unneeded truncation should be negligible, but what we have now seems confusing to me. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Thanks to everyone, v3 is pushed. Teodor Sigaev wrote: I'm not very happy with the rootBuffer argument added everywhere. ginInsertItemPointers() and ginPrepareDataScan() now take both rootBlkno and rootBuffer, and the second can be invalid. As I can see, you did that to call CheckForSerializableConflictIn() in ginEntryInsert(). It seems it could be reverted, with CheckForSerializableConflictIn() added to ginFindLeafPage() with searchMode = false. Implemented, v3 is attached. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: FinishPreparedTransaction missing HOLD_INTERRUPTS section
Thank you, pushed! Stas Kelvich wrote: Hello. It seems that during COMMIT PREPARED, FinishPreparedTransaction() doesn't hold interrupts around writing to WAL and cleaning up the ProcArray and GXact entries. At least RemoveTwoPhaseFile() (which is called in between) can print a warning with ereport(), which in turn will check for interrupts and can therefore cancel the backend or throw an error before GXact clean-up. Other similar places, like CommitTransaction() and PrepareTransaction(), have such hold-interrupts sections. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
I don't very happy with rootBuffer added everywhere. ginInsertItemPointers() and ginPrepareDataScan() now take both args, rootBlkno and rootBuffer, second could be invalid. As I can see, you did it to call CheckForSerializableConflictIn() in ginEntryInsert(). Seems, it could be reverted and CheckForSerializableConflictIn() should be added to ginFindLeafPage() with searchMode = false. Implemented, v3 is attached. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ commit 816a9d9cc81a193a10fb49cf26ff2b0214e0ff9b Author: Teodor Sigaev <teo...@sigaev.ru> Date: Sat Apr 28 20:17:46 2018 +0300 Re-think predicate locking on GIN indexes. The principle behind the locking was not very well thought-out, and not documented. Add a section in the README to explain how it's supposed to work, and change the code so that it actually works that way. This fixes two bugs: 1. If fast update was turned on concurrently, subsequent inserts to the pending list would not conflict with predicate locks that were acquired earlier, on entry pages. The included 'predicate-gin-fastupdate' test demonstrates that. To fix, make all scans acquire a predicate lock on the metapage. That lock represents a scan of the pending list, whether or not there is a pending list at the moment. Forget about the optimization to skip locking/checking for locks, when fastupdate=off. Maybe some of that was safe, but I couldn't convince myself of it, so better to rip it out and keep things simple. 2. If a scan finds no match, it still needs to lock the entry page. The point of predicate locks is to lock the gabs between values, whether or not there is a match. The included 'predicate-gin-nomatch' test tests that case. In addition to those two bug fixes, this removes some unnecessary locking, following the principle laid out in the README. Because all items in a posting tree have the same key value, a lock on the posting tree root is enough to cover all the items. 
(With a very large posting tree, it would possibly be better to lock the posting tree leaf pages instead, so that a "skip scan" with a query like "A & B", you could avoid unnecessary conflict if a new tuple is inserted with A but !B. But let's keep this simple.) Also, some spelling fixes. diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README index 990b5ffa58..cc434b1feb 100644 --- a/src/backend/access/gin/README +++ b/src/backend/access/gin/README @@ -331,6 +331,40 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the deleted pages around with the right-link intact until all concurrent scans have finished.) +Predicate Locking +- + +GIN supports predicate locking, for serializable snapshot isolation. +A predicate locks represent that a scan has scanned a range of values. They +are not concerned with physical pages as such, but the logical key values. +A predicate lock on a page covers the key range that would belong on that +page, whether or not there are any matching tuples there currently. In other +words, a predicate lock on an index page covers the "gaps" between the index +tuples. To minimize false positives, predicate locks are acquired at the +finest level possible. + +* Like in the B-tree index, it is enough to lock only leaf pages, because all + insertions happen at the leaf level. + +* In an equality search (i.e. not a partial match search), if a key entry has + a posting tree, we lock the posting tree root page, to represent a lock on + just that key entry. Otherwise, we lock the entry tree page. We also lock + the entry tree page if no match is found, to lock the "gap" where the entry + would've been, had there been one. + +* In a partial match search, we lock all the entry leaf pages that we scan, + in addition to locks on posting tree roots, to represent the "gaps" between + values. + +* In addition to the locks on entry leaf pages and posting tree roots, all + scans grab a lock the metapage. 
This is to interlock with insertions to + the fast update pending list. An insertion to the pending list can really + belong anywhere in the tree, and the lock on the metapage represents that. + +The interlock for fastupdate pending lists means that with fastupdate=on, +we effectively always grab a full-index lock, so you could get a lot of false +positives. + Compatibility - diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c index 828c7074b7..030d0f4418 100644 --- a/src/backend/access/gin/ginbtree.c +++ b/src/backend/access/gin/ginbtree.c @@ -84,6 +84,9 @@ ginFindLeafPage(GinBtree btree, bool searchMode, Snapshot snapshot) stack->
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
I took a stab at fixing those issues, as well as the bug when fastupdate is turned on concurrently. Does the attached patch look sane to you? That's look sane, and I believe it should be applied but I see some issue in your patch: I don't very happy with rootBuffer added everywhere. ginInsertItemPointers() and ginPrepareDataScan() now take both args, rootBlkno and rootBuffer, second could be invalid. As I can see, you did it to call CheckForSerializableConflictIn() in ginEntryInsert(). Seems, it could be reverted and CheckForSerializableConflictIn() should be added to ginFindLeafPage() with searchMode = false. Rebased patch is attached. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ commit 42f73743d9ddf576d2dd9ece3979b407cd70cbfe Author: Teodor Sigaev <teo...@sigaev.ru> Date: Fri Apr 27 13:14:57 2018 +0300 Re-think predicate locking on GIN indexes. The principle behind the locking was not very well thought-out, and not documented. Add a section in the README to explain how it's supposed to work, and change the code so that it actually works that way. This fixes two bugs: 1. If fast update was turned on concurrently, subsequent inserts to the pending list would not conflict with predicate locks that were acquired earlier, on entry pages. The included 'predicate-gin-fastupdate' test demonstrates that. To fix, make all scans acquire a predicate lock on the metapage. That lock represents a scan of the pending list, whether or not there is a pending list at the moment. Forget about the optimization to skip locking/checking for locks, when fastupdate=off. Maybe some of that was safe, but I couldn't convince myself of it, so better to rip it out and keep things simple. 2. If a scan finds no match, it still needs to lock the entry page. The point of predicate locks is to lock the gabs between values, whether or not there is a match. The included 'predicate-gin-nomatch' test tests that case. 
In addition to those two bug fixes, this removes some unnecessary locking, following the principle laid out in the README. Because all items in a posting tree have the same key value, a lock on the posting tree root is enough to cover all the items. (With a very large posting tree, it would possibly be better to lock the posting tree leaf pages instead, so that a "skip scan" with a query like "A & B", you could avoid unnecessary conflict if a new tuple is inserted with A but !B. But let's keep this simple.) Also, some spelling fixes. diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README index 990b5ffa58..cc434b1feb 100644 --- a/src/backend/access/gin/README +++ b/src/backend/access/gin/README @@ -331,6 +331,40 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the deleted pages around with the right-link intact until all concurrent scans have finished.) +Predicate Locking +- + +GIN supports predicate locking, for serializable snapshot isolation. +A predicate locks represent that a scan has scanned a range of values. They +are not concerned with physical pages as such, but the logical key values. +A predicate lock on a page covers the key range that would belong on that +page, whether or not there are any matching tuples there currently. In other +words, a predicate lock on an index page covers the "gaps" between the index +tuples. To minimize false positives, predicate locks are acquired at the +finest level possible. + +* Like in the B-tree index, it is enough to lock only leaf pages, because all + insertions happen at the leaf level. + +* In an equality search (i.e. not a partial match search), if a key entry has + a posting tree, we lock the posting tree root page, to represent a lock on + just that key entry. Otherwise, we lock the entry tree page. We also lock + the entry tree page if no match is found, to lock the "gap" where the entry + would've been, had there been one. 
+ +* In a partial match search, we lock all the entry leaf pages that we scan, + in addition to locks on posting tree roots, to represent the "gaps" between + values. + +* In addition to the locks on entry leaf pages and posting tree roots, all + scans grab a lock the metapage. This is to interlock with insertions to + the fast update pending list. An insertion to the pending list can really + belong anywhere in the tree, and the lock on the metapage represents that. + +The interlock for fastupdate pending lists means that with fastupdate=on, +we effectively always grab a full-index lock, so you could get a lot of false +positives. + Compatibility - diff --git a/src/backend/access/gin/ginbtree.c b/src/backe
Re: Corrupted btree index on HEAD because of covering indexes
Teodor, are you caught up to the point where it'd be okay to run pgindent, or are there still patches waiting? There is a GIN predicate lock patch from Heikki that I will work on. But I have no objection to running pgindent; I believe the GIN patch will be easy to adjust. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
It looks like you pushed v1, which didn't have the tests and other changes you asked for. Attached patch adds those back. Oops, I missed that, I don't know how... Thank you very much for your quick eye! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
2) Please add a DELETE to the test, as is done in 6db4b49986be3fe59a1f6ba6fabf9852864efc3e Done. I will leave it to you to decide whether or not the original create_index.sql test can now be removed. Leave it in both places. In an ideal world, the amcheck test would create a broken index, but I don't know a correct way to do that :) -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
Thank you, pushed. Peter Geoghegan wrote: On Tue, Apr 24, 2018 at 9:06 AM, Teodor Sigaev <teo...@sigaev.ru> wrote: Perfect! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
Perfect! I would like to commit it but have some suggestions: 1) TRUNCATE bttest_multi; INSERT INTO bttest_multi SELECT i, i%2 FROM generate_series(1, 10) as i; SELECT bt_index_parent_check('bttest_multi_idx', true); To improve test stability it would be better to disable autovacuum: ALTER TABLE bttest_multi SET (autovacuum_enabled = false) 2) Please add a DELETE to the test, as is done in 6db4b49986be3fe59a1f6ba6fabf9852864efc3e 3) It's not directly connected to this patch, but the allocation of BtreeCheckState is not needed; it could be allocated on the stack. 4) Nevertheless, I suggest using palloc0 (or memset(0)) for BtreeCheckState. Currently several fields of that structure may be left uninitialized. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: bms_prev_member won't work correctly if bitmapword is 64-bits
Thank you, pushed David Rowley wrote: bms_prev_member mistakenly has a hardcoded 24 in the function. This should really be BITS_PER_BITMAPWORD - 8 so that it is properly set to 56 if someone compiles with 64-bit bitmapwords. The attached fixes this and also adds a test to exercise the function a bit. [1] indicates there's currently no coverage of this function at all. [1] https://coverage.postgresql.org/src/backend/nodes/bitmapset.c.gcov.html -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
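The fix above replaces a hardcoded 24 with BITS_PER_BITMAPWORD - 8, so the byte-at-a-time scan starts at the word's top byte regardless of word width (56 for 64-bit bitmapwords). A distilled C sketch of that scan over a single 64-bit word (hypothetical and simplified: the real bms_prev_member walks multiple words and uses a leftmost-one lookup table instead of a bit loop):

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BITMAPWORD 64
typedef uint64_t bitmapword;

/*
 * Find the highest set bit in 'w' at or below 'prevbit' (0..63),
 * or -1 if there is none.  The shift must start at
 * BITS_PER_BITMAPWORD - 8; a hardcoded 24 only works for 32-bit words.
 */
static int
prev_set_bit(bitmapword w, int prevbit)
{
    /* mask off bits strictly above prevbit */
    w &= ((bitmapword) 1 << prevbit) | (((bitmapword) 1 << prevbit) - 1);
    if (w == 0)
        return -1;

    /* scan bytes from the most significant byte downward */
    for (int shift = BITS_PER_BITMAPWORD - 8; shift >= 0; shift -= 8)
    {
        bitmapword byte = (w >> shift) & 0xFF;

        if (byte != 0)
        {
            int hi = 7;

            while (((byte >> hi) & 1) == 0)
                hi--;
            return shift + hi;
        }
    }
    return -1;                  /* unreachable: w was nonzero */
}
```

With the 32-bit constant left in place, the loop would never examine the top four bytes of a 64-bit word, silently missing members above bit 31.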
Re: Corrupted btree index on HEAD because of covering indexes
Thank you, pushed I see your point. Maybe don't have the newline between the get and set, though, to match the existing style. And, the note about the assertion seems unnecessary. Removed newline and note "Get/set leaf page highkey's link. During the second phase of deletion, the target leaf page's high key may point to an ancestor page (at all other times, the leaf level high key's link is not used). See the nbtree README for full details." Changed as you suggested. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
Should we use BTreeInnerTupleGetDownLink() as soon as we use BTreeInnerTupleSetDownLink() for setting this? Or even invent a BTreeInnerTupleDownLinkIsValid() macro? I am not sure. Here we actually store an UP link - to the top parent to remove. I'm afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink could cause confusion; on the other hand, introducing BTreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems like over-engineering. After a closer look I changed my opinion. To have clean code, it's much better to have a new pair of get/set macros specifically to manage the link to the top parent during page deletion. This removes the last naked usage of ItemPointer(SetInvalid/IsInvalid/GetBlockNumberNoCheck) and uses self-describing macros. Patch is attached. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index beef089ba8..3be229db1f 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -1602,10 +1602,9 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack) MemSet(, 0, sizeof(IndexTupleData)); trunctuple.t_info = sizeof(IndexTupleData); if (target != leafblkno) - ItemPointerSetBlockNumber(_tid, target); + BTreeTupleSetTopParent(, target); else - ItemPointerSetInvalid(_tid); - BTreeTupleSetNAtts(, 0); + BTreeTupleSetTopParent(, InvalidBlockNumber); if (PageAddItem(page, (Item) , sizeof(IndexTupleData), P_HIKEY, false, false) == InvalidOffsetNumber) @@ -1690,7 +1689,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) BTPageOpaque opaque; bool rightsib_is_rightmost; int targetlevel; - ItemPointer leafhikey; + IndexTuple leafhikey; BlockNumber nextchild; page = BufferGetPage(leafbuf); @@ -1702,7 +1701,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * Remember some information about the leaf page.
*/ itemid = PageGetItemId(page, P_HIKEY); - leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid; + leafhikey = (IndexTuple) PageGetItem(page, itemid); leafleftsib = opaque->btpo_prev; leafrightsib = opaque->btpo_next; @@ -1714,9 +1713,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * parent in the branch. Set 'target' and 'buf' to reference the page * actually being unlinked. */ - if (ItemPointerIsValid(leafhikey)) + target = BTreeTupleGetTopParent(leafhikey); + + if (target != InvalidBlockNumber) { - target = ItemPointerGetBlockNumberNoCheck(leafhikey); Assert(target != leafblkno); /* fetch the block number of the topmost parent's left sibling */ @@ -1919,12 +1919,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * branch. */ if (target != leafblkno) - { - if (nextchild == InvalidBlockNumber) - ItemPointerSetInvalid(leafhikey); - else - ItemPointerSetBlockNumber(leafhikey, nextchild); - } + BTreeTupleSetTopParent(leafhikey, nextchild); /* * Mark the page itself deleted. 
It can be recycled when all current diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index fb8c769df9..67a94cb80a 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -800,11 +800,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record) */ MemSet(, 0, sizeof(IndexTupleData)); trunctuple.t_info = sizeof(IndexTupleData); - if (xlrec->topparent != InvalidBlockNumber) - ItemPointerSetBlockNumber(_tid, xlrec->topparent); - else - ItemPointerSetInvalid(_tid); - BTreeTupleSetNAtts(, 0); + BTreeTupleSetTopParent(, xlrec->topparent); if (PageAddItem(page, (Item) , sizeof(IndexTupleData), P_HIKEY, false, false) == InvalidOffsetNumber) @@ -912,11 +908,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) /* Add a dummy hikey item */ MemSet(, 0, sizeof(IndexTupleData)); trunctuple.t_info = sizeof(IndexTupleData); - if (xlrec->topparent != InvalidBlockNumber) - ItemPointerSetBlockNumber(_tid, xlrec->topparent); - else - ItemPointerSetInvalid(_tid); - BTreeTupleSetNAtts(, 0); + BTreeTupleSetTopParent(, xlrec->topparent); if (PageAddItem(page, (Item) , sizeof(IndexTupleData), P_HIKEY, false, false) == InvalidOffsetNumber) diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 1194be9281..34c71978d5 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -226,6 +226,20 @@ typedef struct BTMetaPageData #define BTreeInnerTupleSetDownLink(itup, blkno) \ ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)) +/* + * Get/set link to top parent to remove. It will be better to have assertion + * BTreeTupleGetNAtts() == 0 but pre-v11 version hasn't set INDEX_ALT_TID_MASK + *
Re: Corrupted btree index on HEAD because of covering indexes
heard of people using bt_index_parent_check() in production, but only when they already knew that their database was corrupt, and wanted to isolate the problem. I imagine that people that use bt_index_parent_check() in production do so because they want as much information as possible, and are willing to do whatever it takes to get more information. That's why I think we need to improve amcheck - ideally, it should not miss any mistake in the tree structure. Agree, but at least this place needs a comment explaining why it's safe. Good idea. Could you write it? I'm afraid I can't give a good explanation of why we believe that nobody updates this page and its high key while the page is unlocked but pinned. I also think that we could have better conventional regression test coverage here. Will try to invent a not-so-large test. Your smaller test takes about 350ms to run on a debug build on my laptop, which seems worth it (honestly, this test should have been there already). Maybe we can add it to the amcheck regression tests instead, since those are run less often. This also allows us to add a test case specifically as part of the amcheck enhancement, which makes a bit more sense. Hm, it seems to me that 350ms is short enough to place it in both the core and amcheck tests. I think basic functionality should be covered by the core tests, as we test insert/create there. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
I tried to minimize Michael's test case and add it to the patch. -if (ItemPointerIsValid(leafhikey)) +if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber) Should we use BTreeInnerTupleGetDownLink() as soon as we use BTreeInnerTupleSetDownLink() for setting this? Or even invent a BTreeInnerTupleDownLinkIsValid() macro? I am not sure. Here we actually store an UP link - to the top parent to remove. I'm afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink could cause confusion; on the other hand, introducing BTreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems like over-engineering. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
I also think that we could have better conventional regression test coverage here. I tried to minimize Michael's test case and add it to patch. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index beef089ba8..e6fe4ba90e 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -1714,7 +1714,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * parent in the branch. Set 'target' and 'buf' to reference the page * actually being unlinked. */ - if (ItemPointerIsValid(leafhikey)) + if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber) { target = ItemPointerGetBlockNumberNoCheck(leafhikey); Assert(target != leafblkno); diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out index fe5b698669..fc81088d4b 100644 --- a/src/test/regress/expected/create_index.out +++ b/src/test/regress/expected/create_index.out @@ -3052,6 +3052,16 @@ explain (costs off) Filter: (NOT b) (4 rows) +-- +-- Test for multilevel page deletion +-- +CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint); +INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,8) i; +ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d); +DELETE FROM delete_test_table WHERE a > 4; +VACUUM delete_test_table; +DELETE FROM delete_test_table WHERE a > 10; +VACUUM delete_test_table; -- -- REINDEX (VERBOSE) -- diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out index 8afb1f2f7e..0aa5357917 100644 --- a/src/test/regress/expected/sanity_check.out +++ b/src/test/regress/expected/sanity_check.out @@ -38,6 +38,7 @@ d_star|f date_tbl|f default_tbl|f defaultexpr_tbl|f +delete_test_table|t dept|f dupindexcols|t e_star|f diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql index 
f7731265a0..f9e7118f0d 100644 --- a/src/test/regress/sql/create_index.sql +++ b/src/test/regress/sql/create_index.sql @@ -1061,6 +1061,17 @@ explain (costs off) explain (costs off) select * from boolindex where not b order by i limit 10; +-- +-- Test for multilevel page deletion +-- +CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint); +INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,8) i; +ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d); +DELETE FROM delete_test_table WHERE a > 4; +VACUUM delete_test_table; +DELETE FROM delete_test_table WHERE a > 10; +VACUUM delete_test_table; + -- -- REINDEX (VERBOSE) --
Re: Corrupted btree index on HEAD because of covering indexes
I would like to go and implement this check now, and if all goes well I may propose that you commit it to the master branch for v11. I don't see this as a new feature. I see it as essential testing infrastructure. What do you think about adding this new check soon? Agree. Hope nobody has found a way to use the amcheck module in production to serve user requests. But it should be implemented before the BETA stage, otherwise we will get a lot of objections. Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use the truncated high key IndexTuple as storage for the block number of the top parent to remove. And it sets BTreeTupleSetNAtts(&trunctuple, 0) - which is stored in ip_posid. But later, in _bt_unlink_halfdead_page(), we check the high key ItemPointer with the ItemPointerIsValid macro - and it returns false, because the offset is actually InvalidOffsetNumber - i.e. 0, which was set by BTreeTupleSetNAtts. And some wrong decisions follow; I didn't look at that. I'm not at all surprised. I strongly suspected that it was some simple issue with the representation, since the INCLUDE patch didn't actually change the page deletion algorithm in any way. +1 A trivial and naive fix is attached, but for me it looks a bit annoying that we store a pointer (leafhikey) somewhere inside an unlocked page. Well, it has worked that way for a long time. No reason to change it now. Agree, but at least this place needs a comment - why it's safe. I also think that we could have better conventional regression test coverage here. Will try to invent a not-so-large test. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
I'll take a look tomorrow. Interesting, contrib/amcheck doesn't find any error in the index. Seems it's a subject for further improvement. Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use the truncated high key IndexTuple as storage for the block number of the top parent to remove. And it sets BTreeTupleSetNAtts(&trunctuple, 0) - which is stored in ip_posid. But later, in _bt_unlink_halfdead_page(), we check the high key ItemPointer with the ItemPointerIsValid macro - and it returns false, because the offset is actually InvalidOffsetNumber - i.e. 0, which was set by BTreeTupleSetNAtts. And some wrong decisions follow; I didn't look at that. A trivial and naive fix is attached, but for me it looks a bit annoying that we store a pointer (leafhikey) somewhere inside an unlocked page. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/

diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index beef089ba8..e6fe4ba90e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1714,7 +1714,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * parent in the branch.  Set 'target' and 'buf' to reference the page
 	 * actually being unlinked.
 	 */
-	if (ItemPointerIsValid(leafhikey))
+	if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)
 	{
 		target = ItemPointerGetBlockNumberNoCheck(leafhikey);
 		Assert(target != leafblkno);
Re: WIP: Covering + unique indexes.
Thank you, pushed Peter Geoghegan wrote: On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev <teo...@sigaev.ru> wrote: Thank you, pushed. Thanks. I saw another preexisting issue, this time one that has been around since 2007. Commit bc292937 forgot to remove a comment above _bt_insertonpg() (the 'afteritem' stuff ended up being moved to the bottom of _bt_findinsertloc(), where it remains today). The attached patch fixes this, and in passing mentions the fact that _bt_insertonpg() only performs retail insertions, and specifically never inserts high key items. I don't think it's necessary to add something about negative infinity items to the same comment block. While it's true that _bt_insertonpg() cannot truncate downlinks to make new minus infinity items, I see that as a narrower issue. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you, pushed. Actually, I see one tiny issue with extra '*' characters here: +* The number of attributes won't be explicitly represented if the +* negative infinity tuple was generated during a page split that +* occurred with a version of Postgres before v11. There must be a +* problem when there is an explicit representation that is +* non-zero, * or when there is no explicit representation and the +* tuple is * evidently not a pre-pg_upgrade tuple. I also suggest fixing this indentation before commit: + /* +*Cannot leak memory here, TupleDescCopy() doesn't allocate any +* inner structure, so, plain pfree() should clean all allocated memory +*/ fixed -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
Will see... Michael Paquier wrote: Hi all, I was just testing the VACUUM truncation logic, and bumped into what looks like a corrupted btree index. Here is a reproducer:

create table aa (a int primary key, b bool);
insert into aa values (generate_series(1,100), false);
checkpoint;
update aa set b = false where a > 50;  -- Dirties a set of shared buffers
delete from aa where a > 75;           -- Delete a set of rows
vacuum aa;
delete from aa where a > 10;
vacuum aa;                             -- error on btree with right sibling

And here is the actual failure during the second vacuum:

ERROR:  XX000: right sibling 4132 of block 2128 is not next child 5396 of block 412 in index "aa_pkey"
LOCATION:  _bt_mark_page_halfdead, nbtpage.c:1564

This works on REL_10_STABLE, so I am adding an open item. I have not investigated the exact problem yet, but bisect is showing me covering indexes as the culprit (8224de4). Thanks, -- Michael -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
I don't understand. We do check the number of attributes on rightmost pages, but we do so separately, in the main loop. For every item that isn't the high key. Comment added, pls, verify. And refactored _bt_check_natts(), I hope, now it's a bit more readable. 4) BTreeTupSetNAtts - seems, it's better to add check of nattrs to fits to BT_N_KEYS_OFFSET_MASK mask, and it should not touch BT_RESERVED_OFFSET_MASK bits, now it will overwrite that bits. An assertion sounds like it would be an improvement, though I don't see that in the patch you posted. I didn't do that in v1, sorry, I was unclear. Attached patch contains all changes suggested in my previous email. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index be0206d58e..1a605f9944 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -698,6 +698,9 @@ nextpage: * "real" data item on the page to the right (if such a first item is * available). * + * - That tuples report that they have the expected number of attributes. + * INCLUDE index pivot tuples should not contain non-key attributes. + * * Furthermore, when state passed shows ShareLock held, and target page is * internal page, function also checks: * @@ -722,43 +725,35 @@ bt_target_page_check(BtreeCheckState *state) elog(DEBUG2, "verifying %u items on %s block %u", max, P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock); - - /* Check the number of attributes in high key if any */ - if (!P_RIGHTMOST(topaque)) + /* + * Check the number of attributes in high key. 
Note, rightmost page doesn't + * contain a high key, so nothing to check + */ + if (!P_RIGHTMOST(topaque) && + !_bt_check_natts(state->rel, state->target, P_HIKEY)) { - if (!_bt_check_natts(state->rel, state->target, P_HIKEY)) - { - ItemId itemid; - IndexTuple itup; - char *itid, - *htid; + ItemId itemid; + IndexTuple itup; - itemid = PageGetItemId(state->target, P_HIKEY); - itup = (IndexTuple) PageGetItem(state->target, itemid); - itid = psprintf("(%u,%u)", state->targetblock, P_HIKEY); - htid = psprintf("(%u,%u)", - ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)), - ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid))); + itemid = PageGetItemId(state->target, P_HIKEY); + itup = (IndexTuple) PageGetItem(state->target, itemid); - ereport(ERROR, - (errcode(ERRCODE_INDEX_CORRUPTED), - errmsg("wrong number of index tuple attributes for index \"%s\"", - RelationGetRelationName(state->rel)), - errdetail_internal("Index tid=%s natts=%u points to %s tid=%s page lsn=%X/%X.", - itid, - BTreeTupGetNAtts(itup, state->rel), - P_ISLEAF(topaque) ? "heap" : "index", - htid, - (uint32) (state->targetlsn >> 32), - (uint32) state->targetlsn))); - } + ereport(ERROR, +(errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("wrong number of high key index tuple attributes in index \"%s\"", + RelationGetRelationName(state->rel)), + errdetail_internal("Index block=%u natts=%u block type=%s page lsn=%X/%X.", + state->targetblock, + BTreeTupleGetNAtts(itup, state->rel), + P_ISLEAF(topaque) ? "heap" : "index", + (uint32) (state->targetlsn >> 32), + (uint32) state->targetlsn))); } - /* * Loop over page items, starting from first non-highkey item, not high - * key (if any). Also, immediately skip "negative infinity" real item (if - * any). + * key (if any). Most tests are not performed for the "negative infinity" + * real item (if any). 
*/ for (offset = P_FIRSTDATAKEY(topaque); offset <= max; @@ -791,7 +786,7 @@ bt_target_page_check(BtreeCheckState *state) tupsize, ItemIdGetLength(itemid), (uint32) (state->targetlsn >> 32), (uint32) state->targetlsn), - errhint("This could be a torn page problem"))); + errhint("This could be a torn page problem."))); /* Check the number of index tuple attributes */ if (!_bt_check_natts(state->rel, state->target, offset)) @@ -806,11 +801,11 @@ bt_target_page_check(BtreeCheckState *state) ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), - errmsg("wrong number of index tuple attributes for index \"%s\"", + errmsg("wrong number of index tuple attributes in index \"%s\"", RelationGetRelationName(state->rel)), errdetail_internal("Index tid=%s natts=%u points to %s tid=%s pa
Re: pgindent run soon?
If there are large refactoring or bug-fix patches that haven't landed yet, then it'd be appropriate to wait for those to get in, but I'm not aware of such at the moment. Pls, wait https://www.postgresql.org/message-id/9c63951d-7696-ecbb-b832-70db7ed3f39b%40sigaev.ru Thank you. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
I mostly agree with your patch, nice work, but I have some notices for your patch:

1) bt_target_page_check():

   if (!P_RIGHTMOST(topaque) &&
       !_bt_check_natts(state->rel, state->target, P_HIKEY))

Seems not very obvious: it looks like we don't need to check nattrs on the rightmost page. Okay, I remember that on the rightmost page there is no hikey at all, but at least a comment should be added. Implicitly bt_target_page_check() already takes into account 'is page rightmost or not?' by using P_FIRSTDATAKEY, so it may be better to move the rightmost check into bt_target_page_check() with some refactoring of the if-logic:

   if (offset > maxoff)
       return true; /* nothing to check, also covers empty rightmost page */

   if (P_ISLEAF)
   {
       if (offnum >= P_FIRSTDATAKEY)
           ...
       else /* if (offnum == P_HIKEY) */
           ...
   }
   else /* !P_ISLEAF */
   {
       if (offnum == P_FIRSTDATAKEY)
           ...
       else if (offnum > P_FIRSTDATAKEY)
           ...
       else /* if (offnum == P_HIKEY) */
           ...
   }

I see only 3 possible nattrs values: 0, nkey and nkey+nincluded, but collapsing the if-clause into three branches would cause difficulties for code readers. Let the compiler optimize that. Sorry for the late notice, but it caught my attention only when I noticed the (!P_RIGHTMOST && !_bt_check_natts) condition.

2) Style notice:

   ItemPointerSetInvalid(&trunctuple.t_tid);
+  BTreeTupSetNAtts(&trunctuple, 0);
   if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,

It's better to have a blank line between BTreeTupSetNAtts() and the if clause.

3) Naming: BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use the full word Tuple in the downlink macros, here we use just Tup. Seems better to have Tuple in both cases. Or Tup, but still in both cases.

4) BTreeTupSetNAtts - seems it's better to add a check that nattrs fits into the BT_N_KEYS_OFFSET_MASK mask, and that it does not touch the BT_RESERVED_OFFSET_MASK bits; now it will overwrite those bits.
Attached patch is rebased to current head and contains some comment improvement in index_truncate_tuple() - you save some amount of memory with TupleDescCopy() call but didn't explain why pfree is enough to free all allocated memory. Peter Geoghegan wrote: On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote: Hmm, what do you think about making BTreeTupGetNAtts() take tupledesc argument, not relation> It anyway doesn't need number of key attributes, only total number of attributes. Then _bt_isequal() would be able to use BTreeTupGetNAtts(). That would make the BTreeTupGetNAtts() assertions quite a bit more verbose, since there is usually no existing tuple descriptor variable, but there is almost always a "rel" variable. The coverage within _bt_isequal() does not seem important, because we only use it with the page high key in rare cases, where _bt_moveright() will already have tested the highkey. I think it's completely OK to fix broken things when you've to touch them. Probably, Teodor would decide to make that by separate commit. So, it's up to him. You're right to say that this old negative infinity tuple code within _bt_insertonpg() is broken code, and not just dead code. The code doesn't just confuse things (e.g. see recent commit 2a67d644). It also seems like it could actually be harmful. This is code that could only ever corrupt your database. I'm fine if Teodor wants to commit that change separately, of course. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index be0206d58e..7efd7ac47b 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -698,6 +698,9 @@ nextpage: * "real" data item on the page to the right (if such a first item is * available). * + * - That tuples report that they have the expected number of attributes. + * INCLUDE index pivot tuples should not contain non-key attributes. 
+ * * Furthermore, when state passed shows ShareLock held, and target page is * internal page, function also checks: * @@ -722,43 +725,32 @@ bt_target_page_check(BtreeCheckState *state) elog(DEBUG2, "verifying %u items on %s block %u", max, P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock); - - /* Check the number of attributes in high key if any */ - if (!P_RIGHTMOST(topaque)) + /* Check the number of attributes in high key */ + if (!P_RIGHTMOST(topaque) && + !_bt_check_natts(state->rel, state->target, P_HIKEY)) { - if (!_bt_check_natts(state->rel, state->target, P_HIKEY)) - { - ItemId itemid; - IndexTuple itup; - char *itid, - *htid; + ItemId itemid; + IndexTuple itup; - itemid = PageGetItemId(state-&g
Re: psql leaks memory on query cancellation
I imagine that this indicates that control-C processing allocates some memory it doesn't free, resulting in an "island" up at the end of memory that prevents glibc from releasing all the free memory it's got. Whether that's an actual leak, or just memory we're holding onto in hopes of reusing it, isn't clear. (valgrind might be useful.) malloc can request memory from the kernel in two ways: sbrk() and mmap(); the first one has the described problem, mmap() doesn't. This is described in mallopt(3) in the section on M_MMAP_THRESHOLD; to test it, try to repeat the test with the MALLOC_MMAP_THRESHOLD_ environment variable set to 8192. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Peter Geoghegan wrote: On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan <p...@bowt.ie> wrote: I did find another problem, though. Looks like the idea to explicitly represent the number of attributes directly has paid off already: pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text); create unique index cov_idx on covering_bug (f1) include(f2); insert into covering_bug select i, i * random() * 1000, i * random() * 10 from generate_series(0,10) i; DEBUG: building index "pg_toast_16451_index" on table "pg_toast_16451" serially CREATE TABLE DEBUG: building index "cov_idx" on table "covering_bug" serially CREATE INDEX ERROR: tuple has wrong number of attributes in index "cov_idx" Actually, this was an error on my part (though I'd still maintain that the check paid off here!). I'll still add defensive assertions inside _bt_newroot(), and anywhere else that they're needed. There is no reason to not add defensive assertions in all code that handles page splits, and needs to fetch a highkey from some other page. We missed a few of those. Agree, I prefer to add more Asserts, maybe even more than actually needed. Assert-documented code :) I'll add an item to the "Decisions to Recheck Mid-Beta" section of the open items page for this patch. We should review the decision to make a call to _bt_check_natts() within _bt_compare(). It might work just as well as an assertion, and it would be unfortunate if workloads that don't use covering indexes had to pay a price for the _bt_check_natts() call, even if it was a small price. I've seen _bt_compare() appear prominently in profiles quite a few times. Could you show a patch? I think we need to move _bt_check_natts() and its call under USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't pay for an unused feature. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
pushed. Hope, second try will be successful. Teodor Sigaev wrote: Thank you, pushed Amit Langote wrote: Hi. On 2018/04/11 0:36, Teodor Sigaev wrote: Does the attached fix look correct? Haven't checked the fix with ATTACH PARTITION though. Attached patch seems to fix the problem. However, I would rather get rid of modifying stmt->indexParams. That seems to be more logical for me. Also, it would be good to check some covering indexes on partitioned tables. See the attached patch. Seems right way, do not modify incoming object and do not copy rather large and deep nested structure as suggested by Amit. Yeah, Alexander's suggested way of using a separate variable for indexParams is better. But it will be better to have a ATTACH PARTITION test too. I have added tests. Actually, instead of modifying existing tests, I think it might be better to have a separate section at the end of indexing.sql to test covering indexes feature for partitioned tables. Attached find updated patch. Thanks, Amit -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
Thanks to everyone, this part is pushed. I will wait a bit before pushing the topic-starter patch, to have time to look at the buildfarm. Alvaro, could you add a comment to CompareIndexInfo() to clarify why it doesn't compare indoptions (like DESC/ASC etc)? It doesn't matter for the uniqueness of the index, but it changes the order, and it's not clear, at least to me, why we don't pay attention to that. In the attached patch I did:
1) fix a bug in CompareIndexInfo() which checks opfamily for including columns
2) correct sizes for all vectors
3) add assertions in various places to be sure that we don't try to use an including column as a key column
4) per Peter Geoghegan's idea, add an error message if somebody adds options to an including column, instead of silently ignoring them.
-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Covering GiST indexes
Interesting work. I don't have time now to study your patch in depth, so add it to the next commitfest, pls. First of all, I'd like to see more tests in the patch, not only CREATE INDEX. Andrey Borodin wrote: Hi, hackers! Looks like we finally have covering indexes! And that's cool! So I decided to create a thread to discuss covering GiST indexes. Here's a prototype patch implementing this functionality. It is quite small (+80 -30) and lacks tests and docs. But it creates a context. I have two concerns. The first one is about INDEX_AM_RESERVED_BIT. B-tree uses it as a base for prefix truncation (I'm not quite sure why it is usually called suffix truncation, but this is a matter for another thread). GiST, probably, will not use [pre\su]fix truncation. But I'd like to use that 13th bit to implement intra-page indexing - a way to improve search within a gist page. See [0,1] Second, currently covering indexes do not allow the same attribute in both the key and include parts: # create index on x using gist(c) include (c); ERROR: included columns must not intersect with key columns But it makes sense, for example, for geometries like PostGIS. Index keys are truncated to small MBRs, while having the whole complex geometry right in the index could be handy. Any feedback will be appreciated. Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/7780A07B-4D04-41E2-B228-166B41D07EEE%40yandex-team.ru [1] https://www.postgresql.org/message-id/CAJEAwVE0rrr+OBT-P0gDCtXbVDkBBG_WcXwCBK=gho4fewu...@mail.gmail.com -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
That's the idea that I tried to express. The point is that we need to tell the user that there is no need to worry about it, rather than that they're wrong to ask about it. Though we should probably actually just throw an error. Or maybe it should be the collation of the underlying table columns. Otherwise the collation returned by an index-only scan would be different from a table scan, no? +1, dangerous -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
Peter Geoghegan wrote: On Wed, Apr 11, 2018 at 2:29 PM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote: But in this case it doesn't even do equality comparison, it just returns the value. That's the idea that I tried to express. The point is that we need to tell the user that there is no need to worry about it, rather than that they're wrong to ask about it. Though we should probably actually just throw an error. Any operation over including columns is a job for a filter, not the index, so the collation will not be used anywhere inside the index. 2Alexander: ComputeIndexAttrs() may use whole vectors (for including columns too) just to simplify coding; in other places it seems better to have exactly sized vectors. Using full-sized vectors could mask an error, for example, setting opfamily to InvalidOid for a key column. But if you insist on that, I'll modify the attached patch to use full-sized vectors everywhere. In the attached patch I did:
1) fix a bug in CompareIndexInfo() which checks opfamily for including columns
2) correct sizes for all vectors
3) add assertions in various places to be sure that we don't try to use an including column as a key column
4) per Peter Geoghegan's idea, add an error message if somebody adds options to an including column, instead of silently ignoring them.
-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 5d73e92901..58b4249784 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -297,6 +297,7 @@ ConstructTupleDescriptor(Relation heapRelation, Oid *classObjectId) { int numatts = indexInfo->ii_NumIndexAttrs; + int numkeyatts = indexInfo->ii_NumIndexKeyAttrs; ListCell *colnames_item = list_head(indexColNames); ListCell *indexpr_item = list_head(indexInfo->ii_Expressions); IndexAmRoutine *amroutine; @@ -375,7 +376,8 @@ ConstructTupleDescriptor(Relation heapRelation, to->attidentity = '\0'; to->attislocal = true; to->attinhcount = 0; - to->attcollation = collationObjectId[i]; + to->attcollation = (i < numkeyatts) ? + collationObjectId[i] : InvalidOid; } else { @@ -411,7 +413,8 @@ ConstructTupleDescriptor(Relation heapRelation, to->attcacheoff = -1; to->atttypmod = exprTypmod(indexkey); to->attislocal = true; - to->attcollation = collationObjectId[i]; + to->attcollation = (i < numkeyatts) ? 
+ collationObjectId[i] : InvalidOid; ReleaseSysCache(tuple); @@ -608,9 +611,9 @@ UpdateIndexRelation(Oid indexoid, indkey = buildint2vector(NULL, indexInfo->ii_NumIndexAttrs); for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++) indkey->values[i] = indexInfo->ii_KeyAttrNumbers[i]; - indcollation = buildoidvector(collationOids, indexInfo->ii_NumIndexAttrs); + indcollation = buildoidvector(collationOids, indexInfo->ii_NumIndexKeyAttrs); indclass = buildoidvector(classOids, indexInfo->ii_NumIndexKeyAttrs); - indoption = buildint2vector(coloptions, indexInfo->ii_NumIndexAttrs); + indoption = buildint2vector(coloptions, indexInfo->ii_NumIndexKeyAttrs); /* * Convert the index expressions (if any) to a text datum @@ -1080,7 +1083,7 @@ index_create(Relation heapRelation, /* Store dependency on collations */ /* The default collation is pinned, so don't bother recording it */ - for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++) + for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++) { if (OidIsValid(collationObjectId[i]) && collationObjectId[i] != DEFAULT_COLLATION_OID) @@ -1831,6 +1834,11 @@ CompareIndexInfo(IndexInfo *info1, IndexInfo *info2, if (info1->ii_NumIndexAttrs != info2->ii_NumIndexAttrs) return false; + /* and same number of key attributes */ + if (info1->ii_NumIndexKeyAttrs != info2->ii_NumIndexKeyAttrs) + return false; + + /* * and columns match through the attribute map (actual attribute numbers * might differ!) 
Note that this implies that index columns that are @@ -1848,6 +1856,10 @@ CompareIndexInfo(IndexInfo *info1, IndexInfo *info2, info1->ii_KeyAttrNumbers[i])) return false; + /* collation and opfamily is not valid for including columns */ + if (i >= info1->ii_NumIndexKeyAttrs) + continue; + if (collations1[i] != collations2[i]) return false; if (opfamilies1[i] != opfamilies2[i]) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 860a60d109..4933608e61 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -1359,7 +1359,8 @@ CheckPredicate(Expr *predicate) /* * Compute per-index-column information, including indexed column numbers - * or index expressions, opclasses, and indoptions. + * or index expressions, opclasses, and indoptions. Note, all output v
Re: Partitioned tables and covering indexes
Actually, the discovered bug is not related to the patch, except that the new test runs into it. The problem is: CompareIndexInfo() checks rd_opfamily for equality for all attributes, not only for key attributes. Patch attached. But it seems to me the field names of the IndexInfo structure are a bit confusing now:

   int ii_NumIndexAttrs;    /* total number of columns in index */
   int ii_NumIndexKeyAttrs; /* number of key columns in index */
   AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];

ii_KeyAttrNumbers contains all columns, i.e. it contains ii_NumIndexAttrs columns, not ii_NumIndexKeyAttrs of them, as one might easily think. I suggest renaming ii_KeyAttrNumbers to ii_AttrNumbers or ii_IndexAttrNumbers. Opinions?

   for (i = 0; i < info1->ii_NumIndexAttrs; i++)
   {
       if (maplen < info2->ii_KeyAttrNumbers[i])

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d73e92901..0002816fcc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1831,6 +1831,10 @@ CompareIndexInfo(IndexInfo *info1, IndexInfo *info2,
 	if (info1->ii_NumIndexAttrs != info2->ii_NumIndexAttrs)
 		return false;
 
+	/* and same number of key attributes */
+	if (info1->ii_NumIndexKeyAttrs != info2->ii_NumIndexKeyAttrs)
+		return false;
+
 	/*
 	 * and columns match through the attribute map (actual attribute numbers
 	 * might differ!)  Note that this implies that index columns that are
@@ -1850,7 +1854,9 @@ CompareIndexInfo(IndexInfo *info1, IndexInfo *info2,
 
 		if (collations1[i] != collations2[i])
 			return false;
-		if (opfamilies1[i] != opfamilies2[i])
+
+		/* opfamily is valid only for key attributes */
+		if (i < info2->ii_NumIndexKeyAttrs && opfamilies1[i] != opfamilies2[i])
 			return false;
 	}
Re: Partitioned tables and covering indexes
The patch makes the buildfarm almost red, so the patch is temporarily reverted. Actually, the discovered bug is not related to the patch, except that the new test runs into it. The problem is: CompareIndexInfo() checks rd_opfamily for equality for all attributes, not only for key attributes. Obviously, CompareIndexInfo() needs more work:

   for (i = 0; i < info1->ii_NumIndexAttrs; i++)
   {
       if (maplen < info2->ii_KeyAttrNumbers[i])

Seems we can move away from the ii_KeyAttrNumbers array. Amit Langote wrote: Hi. On 2018/04/11 0:36, Teodor Sigaev wrote: Does the attached fix look correct? Haven't checked the fix with ATTACH PARTITION though. Attached patch seems to fix the problem. However, I would rather get rid of modifying stmt->indexParams. That seems to be more logical for me. Also, it would be good to check some covering indexes on partitioned tables. See the attached patch. Seems right way, do not modify incoming object and do not copy rather large and deep nested structure as suggested by Amit. Yeah, Alexander's suggested way of using a separate variable for indexParams is better. But it will be better to have a ATTACH PARTITION test too. I have added tests. Actually, instead of modifying existing tests, I think it might be better to have a separate section at the end of indexing.sql to test covering indexes feature for partitioned tables. Attached find updated patch. Thanks, Amit -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
_bt_mark_page_halfdead() looked like it had a problem, but it now looks like I was wrong. I also verified every other BTreeInnerTupleGetDownLink() caller. It now looks like everything is good here.

Right: it tries to find the right page by consulting the parent page and taking the next key.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
Thank you, pushed.

Amit Langote wrote: Hi. On 2018/04/11 0:36, Teodor Sigaev wrote: Does the attached fix look correct? Haven't checked the fix with ATTACH PARTITION though. The attached patch seems to fix the problem. However, I would rather get rid of modifying stmt->indexParams. That seems more logical to me. Also, it would be good to check some covering indexes on partitioned tables. See the attached patch. Seems the right way: do not modify the incoming object, and do not copy the rather large and deeply nested structure, as suggested by Amit. Yeah, Alexander's suggested way of using a separate variable for indexParams is better. But it would be better to have an ATTACH PARTITION test too. I have added tests. Actually, instead of modifying existing tests, I think it might be better to have a separate section at the end of indexing.sql to test the covering indexes feature for partitioned tables. Attached find the updated patch. Thanks, Amit

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
* There is no pfree() within _bt_buildadd() for truncated tuples, even though that's a context where it's clearly not okay.

Agree.

* It might be a good idea to also pfree() the truncated tuple for most other _bt_buildadd() callers. Even though it's arguably okay in other cases, it seems worth being consistent about it (consistent with old nbtree code).

Seems so, though I don't see other calls to pfree() after that.

* There should probably be some documentation around why it's okay that we call index_truncate_tuple() with an exclusive buffer lock held (during a page split). For example, there should probably be a comment on the VARATT_IS_EXTERNAL() situation.

I have no objection to improving the docs/comments.

* Not sure that all calls to BTreeInnerTupleGetDownLink() are limited to inner tuples, which might be worth doing something about (perhaps just renaming the macro).

Which place looks suspicious, in your opinion?

I do not have the time to write a patch right away, but I should be able to post one in a few days. I want to avoid sending several small patches.

No problem, we can wait.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
Does the attached fix look correct? Haven't checked the fix with ATTACH PARTITION though. The attached patch seems to fix the problem. However, I would rather get rid of modifying stmt->indexParams. That seems more logical to me. Also, it would be good to check some covering indexes on partitioned tables. See the attached patch.

Seems the right way: do not modify the incoming object, and do not copy the rather large and deeply nested structure, as suggested by Amit. But it would be better to have an ATTACH PARTITION test too.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: Optimization of range queries
Hi!

12 years ago I proposed a patch which could "union" OR clauses into one range clause where possible. At that time pgsql could not use IS NULL as an index clause, so the patch doesn't support that:
https://www.postgresql.org/message-id/flat/45742C51.9020602%40sigaev.ru
(option number 4; all the others are already committed)

Konstantin Knizhnik wrote: Hi hackers, the Postgres optimizer is not able to build an efficient execution plan for the following query:

explain select * from people_raw where not ("ID"<2068113880 AND "INN" is not null) and "ID"<=2068629726 AND "INN" is not null;
                                  QUERY PLAN
 Bitmap Heap Scan on people_raw  (cost=74937803.72..210449640.49 rows=121521030 width=336)
   Recheck Cond: ("ID" <= 2068629726)
   Filter: (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))
   ->  Bitmap Index Scan on "People_pkey"  (cost=0.00..74907423.47 rows=2077021718 width=0)
         Index Cond: ("ID" <= 2068629726)
(5 rows)

Here the table is very large, but the query affects only a relatively small number of rows located in the range [2068113880, 2068629726]. Unfortunately, the optimizer doesn't take that into account. Moreover, keeping both "IS NOT NULL" and "IS NULL" as operands of the AND is not smart:

   (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))

If I remove the "is not null" condition, the plan is perfect:

explain select * from people_raw where not ("ID"<2068113880) and "ID"<=2068629726;
                                  QUERY PLAN
 Index Scan using "People_pkey" on people_raw  (cost=0.58..196745.57 rows=586160 width=336)
   Index Cond: (("ID" >= 2068113880) AND ("ID" <= 2068629726))
(2 rows)

Before starting an investigation of the problem, I would like to hear opinions, and maybe some advice from people familiar with the optimizer: how difficult would this case be to handle, and where should I look?

Thanks in advance,

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Alvaro Herrera wrote: Heikki Linnakangas wrote: Remember, the purpose of predicate locks is to lock key ranges, not physical pages or tuples in the index. We use leaf pages as a handy shortcut for "any key value that would belong on this page", but it is just an implementation detail. Hmm ... so, thinking about pending list locking, would it work to acquire locks on the posting tree's root of each item in the pending list, when the item is put in the pending list? (even if we insert the item in the pending list instead of its posting tree).

Items in the pending list don't use a posting tree or posting list; the pending list is just a list of pairs (ItemPointer to heap, entry), each represented as an IndexTuple. There is no order in the pending list, so Heikki suggests always locking the metapage instead of locking the whole relation when fastupdate=on. If fastupdate is off, the insertion process will not change the metapage.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thanks to both of you, pushed.

Shinoda, Noriyoshi wrote: Hi! Thank you for your response. I am fine with your proposal. Regards, Noriyoshi Shinoda

From: Alexander Korotkov [a.korot...@postgrespro.ru]
Sent: Monday, April 9, 2018 11:22 PM
To: Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev <teo...@sigaev.ru>; Peter Geoghegan <p...@bowt.ie>; Jeff Janes <jeff.ja...@gmail.com>; Anastasia Lubennikova <a.lubennik...@postgrespro.ru>
Subject: Re: WIP: Covering + unique indexes.

Hi! On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com> wrote: I tested this feature and found a documentation gap in the columns added to the pg_constraint catalog. The attached patch adds a description of the 'conincluding' column to the pg_constraint catalog documentation.

Thank you for pointing this out! I think we need a wordier explanation here. My proposal is attached.

-- Alexander Korotkov, Postgres Professional: http://www.postgrespro.com, The Russian Postgres Company

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Hi!

1. Why do we lock all posting tree pages, even though they all represent the same value? Isn't it enough to lock the root of the posting tree?
2. Why do we lock any posting tree pages at all, if we lock the entry tree page anyway? Isn't the lock on the entry tree page sufficient to cover the key value?
3. Why do we *not* lock the entry leaf page, if there is no match? We still need a lock to remember that we probed for that value and there was no match, so that we conflict with a tuple that might be inserted later.

At least #3 is a bug. The attached patch adds an isolation test that demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so I think we should fix those too, even if they don't lead to incorrect results.

I can't find a hole here. Agree.

I took a stab at fixing those issues, as well as the bug when fastupdate is turned on concurrently. Does the attached patch look sane to you?

I like the idea of using metapage locking, thank you. The patch seems good; will you push it?

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Ugh, I missed your last email, where you describe another locking protocol. Reading.

Teodor Sigaev wrote: Attached is a test case that demonstrates a case where we miss a serialization failure, when fastupdate is turned on concurrently. It works on v10, but fails to throw a serialization error on v11. Thank you for researching! Proof of concept:

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 43b2fce2c5..b8291f96e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10745,6 +10745,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
 			(void) index_reloptions(rel->rd_amroutine->amoptions, newOptions, true);
+			TransferPredicateLocksToHeapRelation(rel);
 			break;
 		default:
 			ereport(ERROR,

It fixes the reported bug, but will give false positives. The right place for that would be the ginoptions function, but ginoptions doesn't have access to the relation structure, and I don't see a reason why it should.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Attached is a test case that demonstrates a case where we miss a serialization failure, when fastupdate is turned on concurrently. It works on v10, but fails to throw a serialization error on v11.

Thank you for researching! Proof of concept:

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 43b2fce2c5..b8291f96e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10745,6 +10745,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
 			(void) index_reloptions(rel->rd_amroutine->amoptions, newOptions, true);
+			TransferPredicateLocksToHeapRelation(rel);
 			break;
 		default:
 			ereport(ERROR,

It fixes the reported bug, but will give false positives. The right place for that would be the ginoptions function, but ginoptions doesn't have access to the relation structure, and I don't see a reason why it should.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you, pushed. Peter Geoghegan wrote: On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teo...@sigaev.ru> wrote: Thank you, fixed I suggest that we remove some unneeded amcheck tests, as in the attached patch. They don't seem to add anything. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thanks to everyone, pushed.

Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int   indnatts = tupleDescriptor->natts;

Thank you, fixed.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: Verbosity of genbki.pl
1. Print nothing at all. That's more in keeping with our modern build practices, but maybe it's too big a change?
2. Print just one message like "Generating postgres.bki and related files", and I guess a second one for fmgroids.h and related files.

I don't have a strong preference. Opinions?

The second option, please. I'd like to see some indication of each stage being done.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you, fixed.

Jeff Janes wrote: On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev <teo...@sigaev.ru> wrote: Thanks to everyone, pushed. Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int   indnatts = tupleDescriptor->natts;

Cheers, Jeff

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Support for testing amcaninclude via pg_indexam_has_property(oid, 'can_include') seems to be missing? Also, the return values of pg_index_column_has_property() for included columns seem a bit dubious... it should probably return NULL for most properties except 'returnable'.

Damn, you're right, it's missing.

I can look at fixing these for you if you like?

If you do that, I will be very grateful.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you. I was looking at the buildfarm and completely forgot about the commitfest site.

Andres Freund wrote: On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote: Thanks to everyone, pushed. Marked the CF entry as committed. Greetings, Andres Freund

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
I'll keep an eye on the buildfarm, since it's late in Russia.

Thank you very much! It's now 23:10 MSK, and I'll be able to follow along for approximately an hour.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/