Re: type cache cleanup improvements
Okay, I've applied this piece for now. Not sure I'll have much room to look at the rest. Thank you very much! Rest of patches, rebased. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/From b30af080d7768c2fdb6198e2e40ef93419f60732 Mon Sep 17 00:00:00 2001 From: Teodor Sigaev Date: Fri, 15 Mar 2024 13:55:10 +0300 Subject: [PATCH 3/3] usage of hash_search_with_hash_value --- src/backend/utils/cache/attoptcache.c | 39 src/backend/utils/cache/typcache.c| 52 +++ 2 files changed, 70 insertions(+), 21 deletions(-) diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c index af978ccd4b1..3a18b2e9a77 100644 --- a/src/backend/utils/cache/attoptcache.c +++ b/src/backend/utils/cache/attoptcache.c @@ -44,12 +44,10 @@ typedef struct /* * InvalidateAttoptCacheCallback - * Flush all cache entries when pg_attribute is updated. + * Flush cache entry (or entries) when pg_attribute is updated. * * When pg_attribute is updated, we must flush the cache entry at least - * for that attribute. Currently, we just flush them all. Since attribute - * options are not currently used in performance-critical paths (such as - * query execution), this seems OK. + * for that attribute. */ static void InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) @@ -57,7 +55,16 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) HASH_SEQ_STATUS status; AttoptCacheEntry *attopt; - hash_seq_init(, AttoptCacheHash); + /* + * By convection, zero hash value is passed to the callback as a sign + * that it's time to invalidate the cache. See sinval.c, inval.c and + * InvalidateSystemCachesExtended(). 
+ */ + if (hashvalue == 0) + hash_seq_init(, AttoptCacheHash); + else + hash_seq_init_with_hash_value(, AttoptCacheHash, hashvalue); + while ((attopt = (AttoptCacheEntry *) hash_seq_search()) != NULL) { if (attopt->opts) @@ -70,6 +77,18 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) } } +/* + * Hash function compatible with two-arg system cache hash function. + */ +static uint32 +relatt_cache_syshash(const void *key, Size keysize) +{ + const AttoptCacheKey* ckey = key; + + Assert(keysize == sizeof(*ckey)); + return GetSysCacheHashValue2(ATTNUM, ckey->attrelid, ckey->attnum); +} + /* * InitializeAttoptCache * Initialize the attribute options cache. @@ -82,9 +101,17 @@ InitializeAttoptCache(void) /* Initialize the hash table. */ ctl.keysize = sizeof(AttoptCacheKey); ctl.entrysize = sizeof(AttoptCacheEntry); + + /* + * AttoptCacheEntry takes hash value from the system cache. For + * AttoptCacheHash we use the same hash in order to speedup search by hash + * value. This is used by hash_seq_init_with_hash_value(). + */ + ctl.hash = relatt_cache_syshash; + AttoptCacheHash = hash_create("Attopt cache", 256, , - HASH_ELEM | HASH_BLOBS); + HASH_ELEM | HASH_FUNCTION); /* Make sure we've initialized CacheMemoryContext. */ if (!CacheMemoryContext) diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c index 7936c3b46d0..9145088f44d 100644 --- a/src/backend/utils/cache/typcache.c +++ b/src/backend/utils/cache/typcache.c @@ -339,6 +339,15 @@ static TupleDesc find_or_make_matching_shared_tupledesc(TupleDesc tupdesc); static dsa_pointer share_tupledesc(dsa_area *area, TupleDesc tupdesc, uint32 typmod); +/* + * Hash function compatible with one-arg system cache hash function. 
+ */ +static uint32 +type_cache_syshash(const void *key, Size keysize) +{ + Assert(keysize == sizeof(Oid)); + return GetSysCacheHashValue1(TYPEOID, ObjectIdGetDatum(*(const Oid*)key)); +} /* * lookup_type_cache @@ -364,8 +373,15 @@ lookup_type_cache(Oid type_id, int flags) ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TypeCacheEntry); + /* + * TypeEntry takes hash value from the system cache. For TypeCacheHash + * we use the same hash in order to speedup search by hash value. This + * is used by hash_seq_init_with_hash_value(). + */ + ctl.hash = type_cache_syshash; + TypeCacheHash = hash_create("Type information cache", 64, - , HASH_ELEM | HASH_BLOBS); + , HASH_ELEM | HASH_FUNCTION); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(mapRelTypeEntry); @@ -421,8 +437,7 @@ lookup_type_cache(Oid type_id, int flags) /* These fields can never change, by definition */ typentry->type_id = type_id; - typentry->type_id_hash = GetSysCacheHashValue1(TYPEOID, - ObjectIdGetDatum(type_id)); + typentry->type_id_hash = get_hash_value(TypeCacheHash, _id); /* Keep this part in sync with the code below */ typentry->typlen = typtup->typlen; @@ -2429,20 +2444,27 @@ TypeCacheTypCallback(Datum arg, int cacheid,
Re: type cache cleanup improvements
One thing that first pops out to me is that we can do the refactor of hash_initial_lookup() as an independent piece, without the extra paths introduced. But rather than returning the bucket hash and having the bucket number as an in/out argument of hash_initial_lookup(), there is an argument for reversing them: hash_search_with_hash_value() does not care about the bucket number. Ok, no problem. 02-hash_seq_init_with_hash_value.v5.patch - introduces a hash_seq_init_with_hash_value() method. hash_initial_lookup() is marked as inline, but I suppose modern compilers are smart enough to inline it automatically. Likely so, though it does not hurt to show the intention to the reader. Agreed. So I would like to suggest the attached patch for this first piece. What do you think? I have no objections. It may also be an idea to use `git format-patch` when generating a series of patches. That makes for easier reviews. Thanks, will try. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: type cache cleanup improvements
I think that this patch should be split for clarity, as there are a few things that are independently useful. I guess something like that: Done; the patches should be applied sequentially. > - The typcache changes. 01-map_rel_to_type.v5.patch adds a map from a relation to its type. - Introduction of hash_initial_lookup(), which simplifies 3 places in dynahash.c where the same code is used. The routine should be inlined. - The split in hash_seq_search to force a different type of search is weird, complicating the dynahash interface by hiding what seems like a search mode. Rather than hasHashvalue that's hidden in the middle of HASH_SEQ_STATUS, could it be better to have an entirely different API for the search? That should be a patch on its own, as well. 02-hash_seq_init_with_hash_value.v5.patch - introduces a hash_seq_init_with_hash_value() method. hash_initial_lookup() is marked as inline, but I suppose modern compilers are smart enough to inline it automatically. Using a separate interface for scanning the hash by hash value would make the scan code uglier in cases where we need to treat a special hash value differently, as is done in the cache scans. Look, instead of this simplified code: if (hashvalue == 0) hash_seq_init(&status, TypeCacheHash); else hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue); while ((typentry = hash_seq_search(&status)) != NULL) { ... } we would need to write something like this: if (hashvalue == 0) { hash_seq_init(&status, TypeCacheHash); while ((typentry = hash_seq_search(&status)) != NULL) { ... } } else { hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue); while ((typentry = hash_seq_search_with_hash_value(&status)) != NULL) { ... } } Or perhaps I didn't understand you. I thought about integrating the check inside the existing loop in hash_seq_search(): + rerun: if ((curElem = status->curEntry) != NULL) { /* Continuing scan of curBucket... 
 */ status->curEntry = curElem->link; if (status->curEntry == NULL) /* end of this bucket */ + { + if (status->hasHashvalue) + hash_seq_term(status); ++status->curBucket; + } + else if (status->hasHashvalue && status->hashvalue != +curElem->hashvalue) + goto rerun; return (void *) ELEMENTKEY(curElem); } But to me it looks weird and adds some checks which will take some CPU time. 03-att_with_hash_value.v5.patch - adds usage of the previous patch. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c index af978ccd4b1..3a18b2e9a77 100644 --- a/src/backend/utils/cache/attoptcache.c +++ b/src/backend/utils/cache/attoptcache.c @@ -44,12 +44,10 @@ typedef struct /* * InvalidateAttoptCacheCallback - * Flush all cache entries when pg_attribute is updated. + * Flush cache entry (or entries) when pg_attribute is updated. * * When pg_attribute is updated, we must flush the cache entry at least - * for that attribute. Currently, we just flush them all. Since attribute - * options are not currently used in performance-critical paths (such as - * query execution), this seems OK. + * for that attribute. */ static void InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) @@ -57,7 +55,16 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) HASH_SEQ_STATUS status; AttoptCacheEntry *attopt; - hash_seq_init(&status, AttoptCacheHash); + /* + * By convention, zero hash value is passed to the callback as a sign + * that it's time to invalidate the cache. See sinval.c, inval.c and + * InvalidateSystemCachesExtended(). 
+ */ + if (hashvalue == 0) + hash_seq_init(, AttoptCacheHash); + else + hash_seq_init_with_hash_value(, AttoptCacheHash, hashvalue); + while ((attopt = (AttoptCacheEntry *) hash_seq_search()) != NULL) { if (attopt->opts) @@ -70,6 +77,18 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) } } +/* + * Hash function compatible with two-arg system cache hash function. + */ +static uint32 +relatt_cache_syshash(const void *key, Size keysize) +{ + const AttoptCacheKey* ckey = key; + + Assert(keysize == sizeof(*ckey)); + return GetSysCacheHashValue2(ATTNUM, ckey->attrelid, ckey->attnum); +} + /* * InitializeAttoptCache * Initialize the attribute options cache. @@ -82,9 +101,17 @@ InitializeAttoptCache(void) /* Initialize the hash table. */ ctl.keysize = sizeof(AttoptCacheKey); ctl.entrysize = sizeof(AttoptCacheEntry); + + /* + * AttoptCacheEntry takes hash value from the system cache. For + * AttoptCacheHash we use the same hash in order to speedup search by hash + * value
Re: type cache cleanup improvements
Got it. Here is an updated patch where I added a corresponding comment. Thank you! Playing around, I found one more place which could easily be modified with a hash_seq_init_with_hash_value() call. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c index af978ccd4b1..28980620662 100644 --- a/src/backend/utils/cache/attoptcache.c +++ b/src/backend/utils/cache/attoptcache.c @@ -44,12 +44,10 @@ typedef struct /* * InvalidateAttoptCacheCallback - * Flush all cache entries when pg_attribute is updated. + * Flush cache entry (or entries) when pg_attribute is updated. * * When pg_attribute is updated, we must flush the cache entry at least - * for that attribute. Currently, we just flush them all. Since attribute - * options are not currently used in performance-critical paths (such as - * query execution), this seems OK. + * for that attribute. */ static void InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) @@ -57,7 +55,16 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) HASH_SEQ_STATUS status; AttoptCacheEntry *attopt; - hash_seq_init(&status, AttoptCacheHash); + /* + * By convention, zero hash value is passed to the callback as a sign + * that it's time to invalidate the cache. See sinval.c, inval.c and + * InvalidateSystemCachesExtended(). + */ + if (hashvalue == 0) + hash_seq_init(&status, AttoptCacheHash); + else + hash_seq_init_with_hash_value(&status, AttoptCacheHash, hashvalue); + while ((attopt = (AttoptCacheEntry *) hash_seq_search(&status)) != NULL) { if (attopt->opts) @@ -70,6 +77,17 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue) } } +/* + * Hash function compatible with two-arg system cache hash function. 
+ */ +static uint32 +relatt_cache_syshash(const void *key, Size keysize) +{ + const AttoptCacheKey* ckey = key; + + return GetSysCacheHashValue2(ATTNUM, ckey->attrelid, ckey->attnum); +} + /* * InitializeAttoptCache * Initialize the attribute options cache. @@ -82,9 +100,17 @@ InitializeAttoptCache(void) /* Initialize the hash table. */ ctl.keysize = sizeof(AttoptCacheKey); ctl.entrysize = sizeof(AttoptCacheEntry); + + /* + * AttoptCacheEntry takes hash value from the system cache. For + * AttoptCacheHash we use the same hash in order to speedup search by hash + * value. This is used by hash_seq_init_with_hash_value(). + */ + ctl.hash = relatt_cache_syshash; + AttoptCacheHash = hash_create("Attopt cache", 256, , - HASH_ELEM | HASH_BLOBS); + HASH_ELEM | HASH_FUNCTION); /* Make sure we've initialized CacheMemoryContext. */ if (!CacheMemoryContext)
Re: type cache cleanup improvements
Yep, exactly. One time in 2^32 we reset the whole cache instead of one (or several) entries whose hash value happens to be 0. On 08.03.2024 18:35, Tom Lane wrote: Aleksander Alekseev writes: One thing that I couldn't immediately figure out is why 0 hash value is treated as a magic invalid value in TypeCacheTypCallback(): I've not read this patch, but IIRC in some places we have a convention that hash value zero is passed for an sinval reset event (that is, "flush all cache entries"). regards, tom lane -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: type cache cleanup improvements
I would like to tweak the patch a little bit - change some comments, add some Asserts, etc. Do you mind? You are welcome! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: type cache cleanup improvements
Hi! Thank you for interesting in it! These changes look very promising. Unfortunately the proposed patches conflict with each other regardless the order of applying: ``` error: patch failed: src/backend/utils/cache/typcache.c:356 error: src/backend/utils/cache/typcache.c: patch does not apply ``` Try increase -F option of patch. Anyway, union of both patches in attachment -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c index 0d4d0b0a154..ccf798c440c 100644 --- a/src/backend/utils/cache/typcache.c +++ b/src/backend/utils/cache/typcache.c @@ -77,6 +77,15 @@ /* The main type cache hashtable searched by lookup_type_cache */ static HTAB *TypeCacheHash = NULL; +/* The map from relation's oid to its type oid */ +typedef struct mapRelTypeEntry +{ + Oid typrelid; + Oid type_id; +} mapRelTypeEntry; + +static HTAB *mapRelType = NULL; + /* List of type cache entries for domain types */ static TypeCacheEntry *firstDomainTypeEntry = NULL; @@ -330,6 +339,14 @@ static TupleDesc find_or_make_matching_shared_tupledesc(TupleDesc tupdesc); static dsa_pointer share_tupledesc(dsa_area *area, TupleDesc tupdesc, uint32 typmod); +/* + * Hash function compatible with one-arg system cache hash function. + */ +static uint32 +type_cache_syshash(const void *key, Size keysize) +{ + return GetSysCacheHashValue1(TYPEOID, ObjectIdGetDatum(*(const Oid*)key)); +} /* * lookup_type_cache @@ -355,8 +372,21 @@ lookup_type_cache(Oid type_id, int flags) ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TypeCacheEntry); + /* + * Hash function must be compatible to make possible work + * hash_seq_init_with_hash_value(). 
Hash value in TypeEntry is taken + * from system cache and we use the same (compatible) to use it's hash + * value to speedup search by hash value instead of scanning whole hash + */ + ctl.hash = type_cache_syshash; + TypeCacheHash = hash_create("Type information cache", 64, - , HASH_ELEM | HASH_BLOBS); + , HASH_ELEM | HASH_FUNCTION); + + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(mapRelTypeEntry); + mapRelType = hash_create("Map reloid to typeoid", 64, + , HASH_ELEM | HASH_BLOBS); /* Also set up callbacks for SI invalidations */ CacheRegisterRelcacheCallback(TypeCacheRelCallback, (Datum) 0); @@ -407,8 +437,7 @@ lookup_type_cache(Oid type_id, int flags) /* These fields can never change, by definition */ typentry->type_id = type_id; - typentry->type_id_hash = GetSysCacheHashValue1(TYPEOID, - ObjectIdGetDatum(type_id)); + typentry->type_id_hash = get_hash_value(TypeCacheHash, _id); /* Keep this part in sync with the code below */ typentry->typlen = typtup->typlen; @@ -470,6 +499,24 @@ lookup_type_cache(Oid type_id, int flags) ReleaseSysCache(tp); } + /* + * Add record to map relation->type. We don't bother even of type became + * disconnected from relation, it seems to be impossible, but anyway, + * storing old data is safe, in a worstest case we will just do an extra + * cleanup cache entry. + */ + if (OidIsValid(typentry->typrelid) && typentry->typtype == TYPTYPE_COMPOSITE) + { + mapRelTypeEntry *relentry; + + relentry = (mapRelTypeEntry*) hash_search(mapRelType, + >typrelid, + HASH_ENTER, NULL); + + relentry->typrelid = typentry->typrelid; + relentry->type_id = typentry->type_id; + } + /* * Look up opclasses if we haven't already and any dependent info is * requested. 
@@ -2264,6 +2311,37 @@ SharedRecordTypmodRegistryAttach(SharedRecordTypmodRegistry *registry) CurrentSession->shared_typmod_table = typmod_table; } +static void +invalidateCompositeTypeCacheEntry(TypeCacheEntry *typentry) +{ + /* Delete tupdesc if we have it */ + if (typentry->tupDesc != NULL) + { + /* + * Release our refcount, and free the tupdesc if none remain. + * (Can't use DecrTupleDescRefCount because this reference is + * not logged in current resource owner.) + */ + Assert(typentry->tupDesc->tdrefcount > 0); + if (--typentry->tupDesc->tdrefcount == 0) + FreeTupleDesc(typentry->tupDesc); + typentry->tupDesc = NULL; + + /* + * Also clear tupDesc_identifier, so that anything watching + * that will realize that the tupdesc has possibly changed. + * (Alternatively, we could specify that to detect possible + * tupdesc change, one must check for tupDesc != NULL as well + * as tupDesc_identifier being the same as what was previously + * seen. That seems error-prone.) + */ + typentry->tupDesc_identifier = 0; + } + + /* Reset equality/comparison/hashing validity information */ + typentry->flags &= ~TCFLAGS_OPERATOR_FLAGS; +} + /* * TypeCacheRe
type cache cleanup improvements
Hi! I'd like to suggest two independent patches to improve the performance of type cache cleanup. I found a case where type cache cleanup was the cause of low performance: in short, a customer created 100 thousand temporary tables in one transaction. 1 mapRelType.patch It just adds a local map between a relation and its type, as suggested in the comment above TypeCacheRelCallback(). Unfortunately, using the syscache here was impossible, because this callback can be called outside a transaction, which makes catalog lookups impossible. 2 hash_seq_init_with_hash_value.patch TypeCacheTypCallback() loops over the type hash to find the entry with a given hash value. There are two problems here: 1) dynahash has no interface to search for an entry with a given hash value, and 2) the hash calculation algorithm differs from the system cache's, yet the incoming hash value comes from the system cache. The patch addresses both issues. It introduces a hash_seq_init_with_hash_value() call, which initializes a hash sequential scan over the single bucket that could contain an entry with the given hash value, so that hash_seq_search() iterates only over such entries. The patch also changes the hash algorithm to match the syscache. Along the way, the patch does a small refactoring of dynahash, extracting a common function hash_do_lookup() that does the initial lookup in the hash. 
Some artificial performance test is in attachment, command to test is 'time psql < custom_types_and_array.sql', here I show only last rollback time and total execution time: 1) master 92d2ab7554f92b841ea71bcc72eaa8ab11aae662 Time: 33353,288 ms (00:33,353) psql < custom_types_and_array.sql 0,82s user 0,71s system 1% cpu 1:28,36 total 2) mapRelType.patch Time: 7455,581 ms (00:07,456) psql < custom_types_and_array.sql 1,39s user 1,19s system 6% cpu 41,220 total 3) hash_seq_init_with_hash_value.patch Time: 24975,886 ms (00:24,976) psql < custom_types_and_array.sql 1,33s user 1,25s system 3% cpu 1:19,77 total 4) both Time: 89,446 ms psql < custom_types_and_array.sql 0,72s user 0,52s system 10% cpu 12,137 total -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c index f411e33b8e7..49e27ca8ba4 100644 --- a/src/backend/utils/cache/typcache.c +++ b/src/backend/utils/cache/typcache.c @@ -331,6 +331,14 @@ static TupleDesc find_or_make_matching_shared_tupledesc(TupleDesc tupdesc); static dsa_pointer share_tupledesc(dsa_area *area, TupleDesc tupdesc, uint32 typmod); +/* + * Hash function compatible with one-arg system cache hash function. + */ +static uint32 +type_cache_syshash(const void *key, Size keysize) +{ + return GetSysCacheHashValue1(TYPEOID, ObjectIdGetDatum(*(const Oid*)key)); +} /* * lookup_type_cache @@ -356,8 +364,16 @@ lookup_type_cache(Oid type_id, int flags) ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TypeCacheEntry); + /* + * Hash function must be compatible to make possible work + * hash_seq_init_with_hash_value(). 
Hash value in TypeEntry is taken + * from system cache and we use the same (compatible) to use it's hash + * value to speedup search by hash value instead of scanning whole hash + */ + ctl.hash = type_cache_syshash; + TypeCacheHash = hash_create("Type information cache", 64, - , HASH_ELEM | HASH_BLOBS); + , HASH_ELEM | HASH_FUNCTION); /* Also set up callbacks for SI invalidations */ CacheRegisterRelcacheCallback(TypeCacheRelCallback, (Datum) 0); @@ -408,8 +424,7 @@ lookup_type_cache(Oid type_id, int flags) /* These fields can never change, by definition */ typentry->type_id = type_id; - typentry->type_id_hash = GetSysCacheHashValue1(TYPEOID, - ObjectIdGetDatum(type_id)); + typentry->type_id_hash = get_hash_value(TypeCacheHash, _id); /* Keep this part in sync with the code below */ typentry->typlen = typtup->typlen; @@ -2359,20 +2374,21 @@ TypeCacheTypCallback(Datum arg, int cacheid, uint32 hashvalue) TypeCacheEntry *typentry; /* TypeCacheHash must exist, else this callback wouldn't be registered */ - hash_seq_init(, TypeCacheHash); + if (hashvalue == 0) + hash_seq_init(, TypeCacheHash); + else + hash_seq_init_with_hash_value(, TypeCacheHash, hashvalue); + while ((typentry = (TypeCacheEntry *) hash_seq_search()) != NULL) { - /* Is this the targeted type row (or it's a total cache flush)? */ - if (hashvalue == 0 || typentry->type_id_hash == hashvalue) - { - /* - * Mark the data obtained directly from pg_type as invalid. Also, - * if it's a domain, typnotnull might've changed, so we'll need to - * recalculate its constraints. - */ - typentry->flags &= ~(TCFLAGS_HAVE_PG_TYPE_DATA | - TCFLAGS_CHECKED_DOMAIN_CONSTRAINTS); - } + Assert(hashvalue == 0 || typentry->type_id_hash == hashvalue); + /* + * Ma
Re: CREATE TABLE ( .. STORAGE ..)
Hi! Are they both set to name or ColId? Although they are the same. Thank you, fixed, that was just an oversight. 2 For ColumnDef new member storage_name, did you miss the function _copyColumnDef() _equalColumnDef()? Thank you, fixed Regards Wenjing 2021年12月27日 15:51,Teodor Sigaev 写道: Hi! Working on pluggable toaster (mostly, for JSONB improvements, see links below) I had found that STORAGE attribute on column is impossible to set in CREATE TABLE command but COMPRESS option is possible. It looks unreasonable. Suggested patch implements this possibility. [1] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfnyc-2021.pdf [2] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgvision-2021.pdf [3] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfonline-2021.pdf [4] http://www.sai.msu.su/~megera/postgres/talks/bytea-pgconfonline-2021.pdf PS I will propose pluggable toaster patch a bit later -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ create_table_storage-v2.patch.gz Description: application/gzip
Re: Pluggable toaster
I agree ... but I'm also worried about what happens when we have multiple table AMs. One can imagine a new table AM that is specifically optimized for TOAST which can be used with an existing heap table. One can imagine a new table AM for the main table that wants to use something different for TOAST. So, I don't think it's right to imagine that the choice of TOASTer depends solely on the column data type. I'm not really sure how this should work exactly ... but it needs careful thought. Right, that's why we propose a validate method (maybe it's the wrong name, but I don't know a better one) which accepts several arguments, one of which is the table AM OID. If that method returns false, then the toaster isn't usable with the current TAM, storage, and/or compression kinds, etc. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Pluggable toaster
Hi! Maybe doing that kind of compression in TOAST is somehow simpler, but I don't see it. It seems that, in an ideal world, compression should live inside the toaster. 2 Current toast storage stores chunks in the heap access method, and to provide fast access by toast id it builds an index. Ideas: - store chunks directly in a btree; pgsql's btree already has INCLUDE columns, so chunks and visibility data would be stored only in leaf pages. This obviously reduces the number of disk accesses for "untoasting". - use another access method for chunk storage Maybe, but that probably requires more thought - e.g. btree requires the values to be less than 1/3 page, so I wonder how would that play with toasting of values. That's ok, because the chunk size is 2000 bytes right now, and it could be preserved. Seems you'd need a mapping table, to allow M:N mapping between types and toasters, linking it to all "compatible" types. It's not clear to me how would this work with custom data types, domains etc. If a toaster looks into the internal structure, then it must know the type's binary format, so new custom types have little chance of working with an old custom toaster. The default toaster works with any type. Also, what happens to existing values when you change the toaster? What if the toasters don't use the same access method to store the chunks (heap vs. btree)? And so on. varatt_custom contains the OID of the toaster, and a toaster is not allowed to be deleted (at least, in the suggested patches). So, if a column's toaster has been changed, old values will be detoasted by the toaster pointed to in the varatt_custom structure, not the one in the column definition. This is very similar to how the storage attribute works: when we alter the storage attribute, only new values will be stored with the specified storage type. More thought: Now postgres has two options for a column, storage and compression, and now we add a toaster. To me that seems too redundant. It seems storage should be a binary value: inplace (plain, as now) and toastable. 
All other variations, such as the toast limit, compression enabling, and compression kind, should be per-column options for the toaster (that's why we suggest a valid toaster oid for any column with a varlena/toastable datatype). It looks like a good abstraction, but we will have a problem with backward compatibility, and I'm afraid I can't implement it very fast. So you suggest we move all of this to toaster? I'd say -1 to that, because it makes it much harder to e.g. add custom compression method, etc. Hmm, I suggested leaving only the toaster at the upper level. The compression kind could be chosen in the toaster's options (not implemented yet), or we could even make an API interface to compression to make it configurable. Right now a module developer cannot implement a module with a new compression method, and that is a disadvantage. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Pluggable toaster
I'd like to ask your opinion on the next small questions: 1) Maybe we could add a toasteroid to pg_type for the default toaster of a datatype: ALTER TYPE type SET DEFAULT TOASTER toaster; 2) The name of the default toaster is deftoaster, which was chosen at night; maybe heap_toaster is better? ("heap" because the default toaster stores chunks in a heap table.) Thank you! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Pluggable toaster
In my understanding, we want to be able to 1. Access data from a toasted object one slice at a time, by using knowledge of the structure 2. If toasted data is updated, then update a minimum number of slices, without rewriting the existing slices 3. If toasted data is expanded, then allow new slices to be appended to the object without rewriting the existing slices There are more options: 1 Share common parts not only between versions of a row but between all rows in a column. Seems strange, but here are examples: - URLs often have a common prefix, so storing them in a prefix tree (as SP-GiST does) can significantly decrease the storage size - the same for JSON - it is a common case that values share part of their hierarchical structure - one more use case for JSON: if the JSON values use only a few schemas (structures), it is possible to store only the values in toast storage and omit the keys and structure 2 Current toast storage stores chunks in the heap access method, and to provide fast access by toast id it builds an index. Ideas: - store chunks directly in a btree; pgsql's btree already has INCLUDE columns, so chunks and visibility data would be stored only in leaf pages. This obviously reduces the number of disk accesses for "untoasting". - use another access method for chunk storage ISTM that we would want the toast algorithm to be associated with the datatype, not the column? Can you explain your thinking? Hm. I'll try to explain my motivation. 1) A datatype could have more than one suitable toaster, for different use cases: fast retrieval, compact storage, fast updates, etc. As I said above, for jsonb there are several optimal strategies for toasting: for values with a few different structures, for close-to-hierarchical structures, for values whose parts differ by access mode (it is easy to imagine JSON with some keys used for search and some keys only for output to the user) 2) A toaster could be designed to work with different data types. 
The suggested appendable toaster is designed to work with bytea but could work with text. Looking at this point, I have doubts about where to store the connection between a toaster and a datatype. If we add a toasteroid to pg_type, how do we deal with several toasters for one datatype? (And we could want to have different toasters in one table!) If we add a typoid to pg_toaster, then how will it work with several datatypes? The idea of adding a new many-to-many connection table seems workable, but it raises further questions, such as: will any toaster work with any table access method? To resolve this bundle of questions we propose a validate() method of the toaster, which should be called during DDL operations, i.e. when a toaster is assigned to a column or a column's datatype is changed. More thought: Now postgres has two options for a column, storage and compression, and now we add a toaster. To me that seems too redundant. It seems storage should be a binary value: inplace (plain, as now) and toastable. All other variations, such as the toast limit, compression enabling, and compression kind, should be per-column options for the toaster (that's why we suggest a valid toaster oid for any column with a varlena/toastable datatype). It looks like a good abstraction, but we will have a problem with backward compatibility, and I'm afraid I can't implement it very fast. We already have the Expanded toast format, in-memory, which was designed specifically to allow us to access sub-structure of the datatype in-memory. So I was expecting to see an Expanded, on-disk, toast format that roughly matched that concept, since Tom has already shown us the way (varatt_expanded). This would be usable by both JSON and PostGIS. Hm, I don't understand. varatt_custom has a variable-length tail which the toaster can use in any way it likes; the appendable toaster uses it to store the appended tail. 
Some other thoughts: I imagine the data type might want to keep some kind of dictionary inside the main toast pointer, so we could make allowance for an optional datatype-specific private area in the toast pointer itself, allowing a mix of inline and out-of-line data, and/or a table of contents to the slices. I'm thinking we could also tackle these things at the same time: * We want to expand TOAST to 64-bit pointers, so we can have more pointers in a table. * We want to avoid putting the data length into the toast pointer, so we can allow the toasted data to be expanded without rewriting everything (to avoid O(N^2) cost). Right -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
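The prefix-sharing idea discussed in this thread (URLs with a common prefix, as SP-GiST exploits) can be quantified with a toy sketch. This is purely illustrative, with hypothetical data and function names; it is not the proposed on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix shared by all strings in the array. */
static size_t
common_prefix_len(const char **strs, int n)
{
    size_t      len = strlen(strs[0]);

    for (int i = 1; i < n; i++)
    {
        size_t      j = 0;

        while (j < len && strs[i][j] != '\0' && strs[i][j] == strs[0][j])
            j++;
        len = j;
    }
    return len;
}

/* Bytes needed if the shared prefix is stored once instead of per row. */
static size_t
shared_size(const char **strs, int n)
{
    size_t      prefix = common_prefix_len(strs, n);
    size_t      total = prefix;     /* the prefix is stored exactly once */

    for (int i = 0; i < n; i++)
        total += strlen(strs[i]) - prefix;  /* only the suffix per row */
    return total;
}
```

With many rows sharing a long prefix, `shared_size` approaches the sum of the suffix lengths, which is the kind of saving the prefix-tree storage aims at.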
Pluggable toaster
Hi! We are working on a custom toaster for JSONB [1], because the current TOAST is universal for any data type and because of that it has some disadvantages: - "one toast fits all" may not be the best solution for a particular type or/and use case; - it doesn't know the internal structure of the data type, so it cannot choose an optimal toast strategy; - it can't share common parts between different rows or even between versions of a row. Modifying the current toaster for all tasks and cases looks too complex; moreover, it would not work for custom data types. Postgres is an extensible database, so why not extend its extensibility even further, to have pluggable TOAST! We propose to separate the toaster from the heap using a toaster API, similar to the table AM API etc. The following patches are applicable over the patch in [1]. 1) 1_toaster_interface_v1.patch.gz https://github.com/postgrespro/postgres/tree/toaster_interface Introduces syntax for storage and a formal toaster API. Adds a column atttoaster to pg_attribute; by design this column should not be the invalid oid for any toastable datatype, i.e. it must hold a correct oid for any type (not column) with non-plain storage. Since a toaster may support only particular datatypes, the core should check the correctness of the assigned toaster via the toaster's validate method. The new commands can be found in src/test/regress/sql/toaster.sql. The on-disk toast pointer structure now has one more possible struct, varatt_custom, with a fixed header and a variable tail which is used as storage by custom toasters. The format of the built-in toaster is kept, to allow simple pg_upgrade logic. Since a column's toaster could be changed during the table's lifetime, we had two options for the toaster drop operation: - if a column's toaster has been changed, then we need to re-toast all values, which could be extremely expensive. 
In any case, functions/operators should be ready to work with values toasted by different toasters, although any toaster must support the simple toast/detoast operations, which allows any existing code to work with the new approach. Tracking dependencies between toasters and rows looks like a bad idea. - disallow dropping a toaster. We don't believe that there will be many toasters at the same time (the number of AMs isn't very high either, and we don't believe that will change significantly in the near future), so prohibiting the dropping of toasters looks reasonable. In this patch set we chose the second option. The toaster API includes a get_vtable method, which is intended to give access to custom toaster features not covered by this API. The idea is that the toaster returns a structure with some values and/or pointers to the toaster's methods, and the caller can use it for particular purposes; see patch 4). The kind of structure is identified by a magic number, which should be the first field of the structure. Also added contrib/dummy_toaster to simplify checking. psql/pg_dump are modified to support the toaster object concept. 2) 2_toaster_default_v1.patch.gz https://github.com/postgrespro/postgres/tree/toaster_default The built-in toaster is reimplemented (with some refactoring) using the toaster API as a generic (or default) toaster. dummy_toaster here is a minimal workable example: it saves the value directly in the toast pointer and fails if the value is greater than 1kB. 3) 3_toaster_snapshot_v1.patch.gz https://github.com/postgrespro/postgres/tree/toaster_snapshot This patch implements a technique to distinguish row versions in toasted values, to share common parts of toasted values between different versions of rows. 4) 4_bytea_appendable_toaster_v1.patch.gz https://github.com/postgrespro/postgres/tree/bytea_appendable_toaster A contrib module implementing a toaster for non-compressed bytea columns, which allows fast appending to an existing bytea value. The appended tail is stored directly in the toast pointer if there is enough room. 
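The routine-table shape of the proposed API (validate, toast/detoast, get_vtable) can be caricatured as follows. The names, types, and signatures here are mine, not the ones from 1_toaster_interface_v1.patch; the behaviour mimics the described dummy_toaster, which stores the value in the pointer and fails above 1kB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical caricature of a toaster routine table, analogous to the
 * table AM routine struct.  The magic number plays the role of the one
 * returned via get_vtable in the proposal.
 */
typedef struct ToasterRoutine
{
    uint32_t    magic;          /* identifies the vtable flavour */
    bool        (*validate) (uint32_t typeoid); /* may this toaster serve the type? */
    bool        (*toast) (const char *value, size_t len);   /* false = cannot toast */
} ToasterRoutine;

/* dummy_toaster behaviour: accept any type, keep the value inline, fail > 1kB */
static bool
dummy_validate(uint32_t typeoid)
{
    (void) typeoid;
    return true;                /* the dummy accepts any varlena type */
}

static bool
dummy_toast(const char *value, size_t len)
{
    (void) value;
    return len <= 1024;         /* value must fit into the toast pointer */
}

static const ToasterRoutine dummy_toaster_routine = {
    0xD0A57E12,                 /* arbitrary magic for the sketch */
    dummy_validate,
    dummy_toast
};
```

A DDL-time check would call `validate` for the column's type and reject the assignment if it returns false, which is the role the mail assigns to the validate() method.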
Note: the patch modifies byteacat() to support the contrib toaster. This looks ugly; the contrib module should probably create a new concatenation function instead. We are open to any questions, discussions, objections and advice. Thank you. People behind: Oleg Bartunov Nikita Gluhov Nikita Malakhov Teodor Sigaev [1] https://www.postgresql.org/message-id/flat/de83407a-ae3d-a8e1-a788-920eb334f...@sigaev.ru -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 4_bytea_appendable_toaster_v1.patch.gz Description: application/gzip 3_toaster_snapshot_v1.patch.gz Description: application/gzip 2_toaster_default_v1.patch.gz Description: application/gzip 1_toaster_interface_v1.patch.gz Description: application/gzip
CREATE TABLE ( .. STORAGE ..)
Hi! While working on the pluggable toaster (mostly for JSONB improvements, see the links below) I found that the STORAGE attribute of a column cannot be set in the CREATE TABLE command, although the COMPRESSION option can. That looks unreasonable. The suggested patch implements this possibility. [1] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfnyc-2021.pdf [2] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgvision-2021.pdf [3] http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfonline-2021.pdf [4] http://www.sai.msu.su/~megera/postgres/talks/bytea-pgconfonline-2021.pdf PS I will propose the pluggable toaster patch a bit later -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ create_table_storage-v1.patch.gz Description: application/gzip
Re: On login trigger: take three
counter, as proposed upthread, since updating that can be made safer. The current coding also isn't instructing concurrent backends to perform a relcache rebuild. I think there are pros and cons to each possible approach, but I prefer the current way (which I have tweaked a bit) for reasons similar to those explained by the original patch author when debating whether to use reference counting instead, in the discussion upthread (e.g. it keeps it all in one place). Also, it seems to be more in line with the documented reason why that pg_database flag was added in the first place. I have debugged a few concurrent scenarios with the current mechanism in place. If you still dislike the logic for resetting the flag, please elaborate on the issues you foresee and one of the alternative approaches can be tried. I've attached the updated patch. Regards, Greg Nancarrow Fujitsu Australia -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 2.sql Description: application/sql
aggregate crash
Hi! Found a crash on a production instance; an assert-enabled build crashes in a pfree() call, with the default config. v11, v12 and head are affected, but it seems you need to be a bit lucky. The bug is that old and new aggregate pass-by-ref values are compared only by the pointer value itself, ignoring the null flag. A function which returns null doesn't care about the actual Datum value it returns, so that comparison isn't enough. The test case shows the bug with ExecInterpExpr(), but there are several similar places (thanks to Nikita Glukhov for help). The attached patch adds a check of the null flag. How to reproduce: http://sigaev.ru/misc/xdump.sql.bz2 bzcat xdump.sql.bz2 | psql postgres && psql postgres < x.sql Backtrace from v12 (note, newValue and oldValue differ on the current call, but oldValue points into pfreed memory): #0 0x00c8405a in GetMemoryChunkContext (pointer=0x80a808250) at ../../../../src/include/utils/memutils.h:130 130 AssertArg(MemoryContextIsValid(context)); (gdb) bt #0 0x00c8405a in GetMemoryChunkContext (pointer=0x80a808250) at ../../../../src/include/utils/memutils.h:130 #1 0x00c85ae5 in pfree (pointer=0x80a808250) at mcxt.c:1058 #2 0x0080475e in ExecAggTransReparent (aggstate=0x80a806370, pertrans=0x80a87e830, newValue=34535940744, newValueIsNull=false, oldValue=34535932496, oldValueIsNull=false) at execExprInterp.c:4209 #3 0x007ff51f in ExecInterpExpr (state=0x80a87f4d8, econtext=0x80a8065a8, isnull=0x7fffd7b7) at execExprInterp.c:1747 #4 0x0082c12b in ExecEvalExprSwitchContext (state=0x80a87f4d8, econtext=0x80a8065a8, isNull=0x7fffd7b7) at ../../../src/include/executor/executor.h:308 #5 0x0082bc0f in advance_aggregates (aggstate=0x80a806370) at nodeAgg.c:679 #6 0x0082b8a6 in agg_retrieve_direct (aggstate=0x80a806370) at nodeAgg.c:1847 #7 0x00828782 in ExecAgg (pstate=0x80a806370) at nodeAgg.c:1572 #8 0x0080e712 in ExecProcNode (node=0x80a806370) at ../../../src/include/executor/executor.h:240 #9 0x0080a4a1 in ExecutePlan (estate=0x80a806120, planstate=0x80a806370, 
use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x80a851cc0, execute_once=true) at execMain.c:1646 #10 0x0080a362 in standard_ExecutorRun (queryDesc=0x80a853120, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364 #11 0x0080a114 in ExecutorRun (queryDesc=0x80a853120, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308 #12 0x00a79d6f in PortalRunSelect (portal=0x80a70d120, forward=true, count=0, dest=0x80a851cc0) at pquery.c:929 #13 0x00a79807 in PortalRun (portal=0x80a70d120, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x80a851cc0, altdest=0x80a851cc0, completionTag=0x7fffdc30 "") at pquery.c:770 #14 0x00a74e49 in exec_simple_query ( query_string=0x800d02950 "SELECT\nT1._Q_001_F_000,\nT1._Q_001_F_001,\nT1._Q_001_F_002RRef,\nT1._Q_001_F_003RRef,\nT1._Q_001_F_004RRef,\nT1._Q_001_F_005RRef,\nMAX(CASE WHEN (T1._Q_001_F_010 > CAST(0 AS NUMERIC)) THEN T2._Q_001_F_009RR"...) 
at postgres.c:1227 #15 0x00a74123 in PostgresMain (argc=1, argv=0x80a6ef8f0, dbname=0x80a6ef850 "postgres", username=0x80a6ef830 "teodor") at postgres.c:4291 #16 0x009a4c3b in BackendRun (port=0x80a6e6000) at postmaster.c:4498 #17 0x009a403a in BackendStartup (port=0x80a6e6000) at postmaster.c:4189 #18 0x009a2f63 in ServerLoop () at postmaster.c:1727 #19 0x009a0a0a in PostmasterMain (argc=3, argv=0x7fffe3c8) at postmaster.c:1400 #20 0x0088deef in main (argc=3, argv=0x7fffe3c8) at main.c:210 -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ x.sql Description: application/sql diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c index 034970648f3..3b5333716d4 100644 --- a/src/backend/executor/execExprInterp.c +++ b/src/backend/executor/execExprInterp.c @@ -1743,7 +1743,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull) * expanded object that is already a child of the aggcontext, * assume we can adopt that value without copying it. */ - if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue)) + if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue) || +fcinfo->isnull != pergroup->transValueIsNull) newVal = ExecAggTransReparent(aggstate, pertrans, newVal, fcinfo->isnull, pergroup->transValue, diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 6ee24eab3d2..e848853d6fd 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/node
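The hazard that the fix above addresses can be demonstrated standalone. This is a hedged sketch with simplified stand-ins (the `needs_reparent_*` helpers and the `Datum` typedef are mine, not the real executor code): when the transition function returns null, the Datum bits it leaves behind are meaningless and may happen to equal the old pointer, so a pointer-only comparison wrongly concludes nothing changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for the real typedef */

/*
 * Simplified versions of the check before ExecAggTransReparent():
 * should the new transition value be adopted/copied?
 */
static bool
needs_reparent_buggy(Datum newVal, bool newIsNull,
                     Datum oldVal, bool oldIsNull)
{
    (void) newIsNull;
    (void) oldIsNull;
    return newVal != oldVal;    /* pointer comparison only: the bug */
}

static bool
needs_reparent_fixed(Datum newVal, bool newIsNull,
                     Datum oldVal, bool oldIsNull)
{
    /* the patch: also notice a null-flag transition */
    return newVal != oldVal || newIsNull != oldIsNull;
}
```

When the pointers coincide but the null flags differ, the buggy check skips the reparenting step, leaving the state pointing at memory that later gets pfree'd, which matches the crash in the backtrace.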
Re: Failure in contrib test _int on loach
Hi! So, if other hackers agree with my reasoning, the suggested fix is sufficient and can be committed. The patch looks right, but I think the comment should be improved in the following piece: if (stack->blkno != GIST_ROOT_BLKNO && - stack->parent->lsn < GistPageGetNSN(stack->page)) + ((stack->parent->lsn < GistPageGetNSN(stack->page)) || +stack->retry_from_parent == true)) { /* * Concurrent split detected. There's no guarantee that the A concurrent split is no longer the only thing detected here, and the comment was missed long ago. This patch seems a good chance to fix it. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [PATCH] btree_gist: fix union implementation for variable length columns
It would be easier to figure this out if the btree_gist code weren't so desperately undocumented. Teodor, do you remember why it's like this? Will look. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: cost_sort() improvements
At least [1] and [2] ran into those issues and raised objections/questions about the correctness of the sort cost estimation. The suggested patch tries to improve the current estimation and solve those issues. Sorry for the long delay, but the issue turned out to be even more complicated than I thought. When I tried to add cost_sort to the GROUP BY patch, I found it didn't work as well as I had hoped :(. The root of the problem is that the suggested cost_sort improvement doesn't take into account the uniqueness of the first column, so its cost is always the same. I experimented with an all-identical column versus an all-unique column and found that execution time can differ significantly (up to 3 times on 1e7 randomly generated integer values). So I went off to read books and papers. I suggest a new algorithm based on ideas in [3], [4]. In the terms of those papers, let me use the designations from my previous email, and let Xi be the number of tuples with key Ki; then the estimation is: log(N! / (X1! * X2! * ..)) ~ sum(Xi * log(N/Xi)) In our case all Xi are the same, because for now we don't have an estimate of group sizes, only an estimate of the number of groups. In this case the formula becomes: N * log(NumberOfGroups). Next, to support correct estimation of multi-column sorts we need to account for each column separately. Let k be a column number and Gk the number of groups defined by k columns: N * sum( FCk * log(Gk) ) where FCk is sum(Cj) over k columns. Gk+1 is defined as estimate_num_groups(NGk), i.e. it's recursive; that means the comparison of the k-th column includes comparisons of all j-th columns with j < k. Next, I found that this formula gives a significant underestimate for N < 1e4, and using [5] (see Chapter 8.2 and Theorem 4.1) I found that we can use the formula N + N * log(N), which effectively adds a multiplier in the simple case; but it's unclear to me how to add that multiplier to the multi-column formula, so I just added a separate term proportional to N. As a demonstration, I attach the results of some tests; *sh and *plt are scripts to reproduce them. 
Note, all charts are normalized because the computed cost is not in milliseconds. p.pl is a parser for the JSON format of explain analyze. N test - sort unique values with different numbers of tuples. NN test - the same as the previous one, but sorting a single value (all values the same). U test - fixed number of total values (1e7) but varying numbers of unique values. I hope the new estimation is much closer to real sort timing. The new estimation function gives results close to the old estimation function on trivial examples, but ~10% more expensive, and three regression tests don't pass; I will look closer later. The patch doesn't include regression test changes. [1] https://commitfest.postgresql.org/18/1124/ [2] https://commitfest.postgresql.org/18/1651/ [3] https://algs4.cs.princeton.edu/home/, course Algorithms, Robert Sedgewick, Kevin Wayne, [4] "Quicksort Is Optimal For Many Equal Keys", Sebastian Wild, arXiv:1608.04906v4 [cs.DS] 1 Nov 2017 [5] "Introduction to algorithms", Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, ISBN 0-07-013143-0 -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index a2a7e0c520..3588cff802 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -1612,6 +1612,252 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm) rterm->pathtarget->width); } +/* + * is_fake_var + * Workaround for generate_append_tlist() which generates fake Vars with + * varno == 0, which would cause the estimate_num_groups() call to fail + */ +static bool +is_fake_var(Expr *expr) +{ + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + return (IsA(expr, Var) && ((Var *) expr)->varno == 0); +} + +/* + * get_width_cost_multiplier + * Returns the relative complexity of comparing two values based on their width. + * The idea behind: long values have more expensive comparisons. 
Return value is + * in cpu_operator_cost units. + */ +static double +get_width_cost_multiplier(PlannerInfo *root, Expr *expr) +{ + double width = -1.0; /* fake value */ + + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + /* Try to find actual stat in corresponding relation */ + if (IsA(expr, Var)) + { + Var *var = (Var *) expr; + + if (var->varno > 0 && var->varno < root->simple_rel_array_size) + { + RelOptInfo *rel = root->simple_rel_array[var->varno]; + + if (rel != NULL && +var->varattno >= rel->min_attr && +var->varattno <= rel->max_attr) + { +int ndx = var->varattno - rel->min_attr; + +if (rel->attr_widths[ndx] > 0) + width = rel->attr_widths[ndx]; + } + } + } + + /* Didn't find any actual stats, use estimati
Re: POC: GROUP BY optimization
I tried to attack the cost_sort() issues and hope that on that basis we can solve the problems with the 0002 patch and improve the incremental sort patch. OK, will do. Thanks for working on this! I hope we now have a better cost_sort(). The obvious way is to try all combinations of pathkeys in get_cheapest_group_keys_order() and choose the cheapest one by cost_sort(). But that requires N! operations, which could be very expensive for a large number of pathkeys, and it doesn't solve the issue with user-knows-what-he-does pathkeys. We could suggest an order of pathkeys as the patch does now, and if the cost_sort() estimate is less than 80% (arbitrarily chosen) of the cost of the user-suggested pathkeys, then use ours, else the user's pathkeys. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
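The 80% heuristic described above can be sketched as a tiny decision function. Names and the enum are mine, not from any patch; 0.8 is the arbitrarily chosen threshold from the mail.

```c
#include <assert.h>

/*
 * Hedged sketch: keep the user-written GROUP BY order unless the suggested
 * reordering is estimated to be at least 20% cheaper by cost_sort().
 */
typedef enum OrderChoice
{
    USE_USER_ORDER,
    USE_SUGGESTED_ORDER
} OrderChoice;

static OrderChoice
choose_group_keys_order(double suggested_cost, double user_cost)
{
    if (suggested_cost < 0.8 * user_cost)
        return USE_SUGGESTED_ORDER; /* clearly cheaper: override the user */
    return USE_USER_ORDER;          /* within the margin: respect the user */
}
```

The margin gives the user-specified order the benefit of the doubt when the two estimates are close, which is the compromise between pure costing and user intent discussed in the thread.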
Re: cost_sort() improvements
Peter Geoghegan wrote: On Thu, Jun 28, 2018 at 9:47 AM, Teodor Sigaev wrote: Current estimation of sort cost has following issues: - it doesn't differ one and many columns sort - it doesn't pay attention to comparison function cost and column width - it doesn't try to count number of calls of comparison function on per column basis I've been suspicious of the arbitrary way in which I/O for external sorts is costed by cost_sort() for a long time. I'm not 100% sure about how we should think about this question, but I am sure that it needs to be improved in *some* way. Agreed, but I think attacking this issue should be a separate patch. And I don't have an idea how to do it, at least right now. Nevertheless, I hope the issue of estimating disk-based sorts isn't a blocker for the CPU cost estimation improvements. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
I'm afraid that's a misunderstanding - my objections did not really go away. I'm just saying my claim that we're not messing with order of grouping keys is not entirely accurate, because we do that in one particular case. I still think we need to be careful when introducing new optimizations in this area - reordering the grouping keys by ndistinct, ORDER BY or whatever. In particular I don't think we should commit these patches that may quite easily cause regressions, and then hope some hypothetical future patch fixes the costing. Those objections still stand. Please have a look at https://commitfest.postgresql.org/18/1706/ where I tried to attack the cost_sort() issues; I hope on that basis we can solve the problems with the 0002 patch and improve the incremental sort patch. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
cost_sort() improvements
Hi! The current estimation of sort cost has the following issues: - it doesn't distinguish single-column from multi-column sorts; - it doesn't pay attention to comparison function cost and column width; - it doesn't try to count the number of comparison function calls on a per-column basis. At least [1] and [2] ran into those issues and raised objections/questions about the correctness of the sort cost estimation. The suggested patch tries to improve the current estimation and solve those issues. Basic ideas: - Let me introduce some designations: i - index of a column, starting with 1; N - number of tuples to sort; Gi - number of groups defined by the first i columns. Obviously, G0 == 1. The number of groups is calculated by estimate_num_groups(). NGi - number of tuples in one group. NG0 == N and NGi = N/G(i-1). Fi - cost of a comparison function call for the i-th column; Wi - average width in bytes of the i-th column; Ci - total cost of one comparison call for the column, calculated as Fi * fine_function(Wi), where fine_function(Wi) looks like: if (Wi <= sizeof(Datum)) return 1.0; // any type passed in a Datum else return 1 + log16(Wi/sizeof(Datum)); log16 is just an empirical choice. - So, the cost for the first column is 2.0 * C0 * log2(N). The first column is always compared; the 2.0 multiplier was determined from the regression tests. It seems to cover the cost of transferring the tuple to the CPU cache, the startup cost of unpacking the tuple, the cost of calling the qsort compare wrapper function, etc. Removing this multiplier causes too optimistic a cost prediction. - The cost for the i-th column: Ci * log2(NGi) => Ci * log2(N/G(i-1)). Here I assume that the i-th comparison function will be called only inside a group defined by the (i-1) leading columns. Note, following the discussion in [1], I add a 1.5 multiplier here to be closer to the worst case, when groups differ significantly; right now there is no way to determine how different they might be. Note, this formula describes the cost of the first column too, except for the difference in multiplier. 
- The final cost is cpu_operator_cost * N * sum(per-column costs described above). Note, for a single column with width <= sizeof(Datum) and F1 = 1 this formula gives exactly the same result as the current one. - For Top-N sort the empiric is close to the old one: use the 2.0 multiplier as a constant under log2, and use log2(Min(NGi, output_tuples)) for the second and following columns. I believe the suggested method could easily be enhanced to support [1] and be used in [2]. Open items: - Fake Vars. I ran into generate_append_tlist(), which generates Vars with varno = 0; it was invented, as far as I can see, in 2002 with the comment 'should be changed in future'. The future hasn't come yet... I've added a workaround to prevent calling estimate_num_groups() with such Vars, but I'm not sure that is enough. - The costs of all comparison functions are currently the same and equal to 1. Maybe it's time to change that. - Empirical constants. I know it's impossible to remove them entirely, but maybe we can find better estimates for them. [1] https://commitfest.postgresql.org/18/1124/ [2] https://commitfest.postgresql.org/18/1651/ -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index a2a7e0c520..d6b19bb9ad 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -1612,6 +1612,231 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm) rterm->pathtarget->width); } +/* + * is_fake_var + * Workaround for generate_append_tlist() which generates fake Vars with + * varno == 0, which would cause the estimate_num_groups() call to fail + */ +static bool +is_fake_var(Expr *expr) +{ + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + return (IsA(expr, Var) && ((Var *) expr)->varno == 0); +} + +/* + * get_width_cost_multiplier + * Returns the relative complexity of comparing two values based on their width. 
+ * The idea behind: long values have more expensive comparisons. The return value is + * in cpu_operator_cost units. + */ +static double +get_width_cost_multiplier(PlannerInfo *root, Expr *expr) +{ + double width = -1.0; /* fake value */ + + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + /* Try to find actual stat in corresponding relation */ + if (IsA(expr, Var)) + { + Var *var = (Var *) expr; + + if (var->varno > 0 && var->varno < root->simple_rel_array_size) + { + RelOptInfo *rel = root->simple_rel_array[var->varno]; + + if (rel != NULL && +var->varattno >= rel->min_attr && +var->varattno <= rel->max_attr) + { +int ndx = var->
Re: Speedup of relation deletes during recovery
We just had a customer hit this issue. I kind of wonder whether this shouldn't be backpatched: currently the execution on the primary is O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the standby - with much higher constants to boot. That means it's very easy to get into situations where the standby starts to lag behind very significantly. +1, we ran into that too -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
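The O(NBuffers * log(ndrels)) behaviour quoted above comes from sorting the set of dropped relations once and then binary-searching it for every buffer, instead of scanning the whole list per buffer. A hedged sketch of that shape, with a simplified one-field id standing in for the full RelFileNode:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t RelFileNodeId; /* simplified: one id instead of RelFileNode */

static int
cmp_id(const void *a, const void *b)
{
    RelFileNodeId x = *(const RelFileNodeId *) a;
    RelFileNodeId y = *(const RelFileNodeId *) b;

    return (x > y) - (x < y);
}

/*
 * One O(log ndrels) lookup per buffer instead of an O(ndrels) scan; over
 * the whole buffer pool this is NBuffers * log(ndrels).  The dropped[]
 * array must be sorted with cmp_id first.
 */
static bool
buffer_belongs_to_dropped_rel(RelFileNodeId buf_rel,
                              const RelFileNodeId *dropped, size_t ndrels)
{
    return bsearch(&buf_rel, dropped, ndrels,
                   sizeof(RelFileNodeId), cmp_id) != NULL;
}
```

Dropping many relations in one transaction (or replaying that on a standby without the optimization) is exactly the case where the linear scan per buffer starts to dominate.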
Re: POC: GROUP BY optimization
, try producing the right output order (if the ORDER BY was specified) Yes. But at this point it's kinda mashed together in ad-hoc manner, using very simple heuristics / assumptions. We'll need to figure out how to combine those optimizations. The heuristics used here: 1) if there is some order, let's use it; 2) if there is a required order, let's match that order to prevent an extra sort node. The incremental sort patch will improve cases where there is a partial match of order. BTW I get compiler warnings that n_preordered_groups may be used uninitialized in add_paths_to_grouping_rel, because it's only set in an if/else branch, and then used further down. Fixed, but I believe the compiler is simply not smart enough here. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 4f93afdebc..ec4095bd2e 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -328,6 +328,12 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/*/ +bool debug_group_by_reorder_by_pathkeys = true; +bool debug_group_by_match_order_by = true; +bool debug_cheapest_group_by = true; +// + /* * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function * returns new lists, original GROUP BY lists stay untouched. @@ -341,6 +347,9 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, ListCell *key; int n; + if (debug_group_by_reorder_by_pathkeys == false) + return 0; + if (pathkeys == NIL || *group_pathkeys == NIL) return 0; @@ -380,6 +389,157 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, return n; } +/* + * Order the tail of the list of group pathkeys by uniqueness, descending. This + * speeds up sorting. Returns newly allocated lists; old ones stay untouched. + * n_preordered defines a head of the list whose order should be preserved. 
+ */ +void +get_cheapest_group_keys_order(PlannerInfo *root, double nrows, + List *target_list, + List **group_pathkeys, List **group_clauses, + int n_preordered) +{ + struct + { + PathKey *pathkey; + SortGroupClause *sgc; + Node *pathkeyExpr; + } + *keys, tmp; + int nkeys = list_length(*group_pathkeys) - n_preordered; + List *pathkeyExprList = NIL, + *new_group_pathkeys = NIL, + *new_group_clauses = NIL; + ListCell *cell; + int i = 0, n_keys_to_est; + + if (nkeys < 2) + return; /* nothing to do */ + + /* + * Will try to match ORDER BY pathkeys in hope that one sort is cheaper than + * two + */ + if (debug_group_by_match_order_by && + n_preordered == 0 && root->sort_pathkeys) + { + bool _save_debug_group_by_reorder_by_pathkeys = + debug_group_by_reorder_by_pathkeys; /* DEBUG ONLY, to be removed */ + + debug_group_by_reorder_by_pathkeys = true; + n_preordered = group_keys_reorder_by_pathkeys(root->sort_pathkeys, + group_pathkeys, + group_clauses); + debug_group_by_reorder_by_pathkeys = _save_debug_group_by_reorder_by_pathkeys; + + nkeys = list_length(*group_pathkeys) - n_preordered; + if (nkeys < 2) + return; /* nothing to do */ + } + + if (!debug_cheapest_group_by) + return; + + keys = palloc(nkeys * sizeof(*keys)); + + /* + * Collect information about pathkey for subsequent usage + */ + for_each_cell(cell, list_nth_cell(*group_pathkeys, n_preordered)) + { + PathKey *pathkey = (PathKey *) lfirst(cell); + + keys[i].pathkey = pathkey; + keys[i].sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + keys[i].pathkeyExpr = get_sortgroupclause_expr(keys[i].sgc, + target_list); + i++; + } + + /* + * Find the cheapest sort order of the columns. We first find the single + * column with the biggest number of groups, then the best pair (whose first + * column is already fixed by the first step), then the best triple, and so on. 
+ */ + for(n_keys_to_est = 1; n_keys_to_est <= nkeys - 1; n_keys_to_est++) + { + ListCell *tail_cell; + int best_i = 0; + double best_est_num_groups = -1; + + /* expand list of columns and remember last cell */ + pathkeyExprList = lappend(pathkeyExprList, NULL); + tail_cell = list_tail(pathkeyExprList); + + /* + * Find the best last column - best means the bigger number of groups; + * the previous columns are already chosen + */ + for(i = n_keys_to_est - 1; i < nkeys; i++) + { + double est_num_groups; + + lfirst(tail_cell) = keys[i].pathkeyExpr; + est_num_groups = estimate_num_groups(root, pathkeyExprList, + nrows, NULL); + + if (est_num_groups > best_est_num_groups) + { +best_est_num_groups = est_num_gro
Re: POC: GROUP BY optimization
Yes. But again, this description is a bit short. First, it works after the first patch and might get some preordered leading pathkeys. Second, it tries to match the ORDER BY clause order if there are no preordered leading pathkeys from the first patch (this was introduced in v7). And third, if there is a tail of unmatched pathkeys from the previous stages, then it will reorder that tail. OK, I haven't looked at v7 yet, but if I understand correctly it tries to maintain the ordering as much as possible? Does that actually help? I mean, the incremental sort patch allows the sorting to happen by pieces, but here we still need to sort all the data, right? Can you give an example demonstrating the benefit? See tst.sql. Queries are marked with opt (optimization is on) and noopt. Query 1: select count(*) from btg group by v, r; Query 2: select count(*) from btg group by n, v, r order by n; For both queries it is possible to reorder the v and r columns; the n column has a single distinct value. On my laptop: Query 1opt vs 1noopt: 3177.500 ms vs 6604.493 ms 2opt vs 2noopt: 5800.307 ms vs 7486.967 ms So, what we see: 1) For query 1 the optimization gives 2 times better performance, for query 2 only 30%. If column 'n' were unique, then the times for queries 1 and 2 should be the same. We could add a check for preordered pathkeys in get_cheapest_group_keys_order(), and if estimate_num_groups(reordered pathkeys) is close to 1, then skip reordering the tail of pathkeys. 2) Planning cost is the same for all queries. So cost_sort() doesn't even take the number of columns into account. FWIW I think it would be useful to have a "development GUC" that would allow us to enable/disable these options during development, because that makes experiments much easier. But then remove them before commit. Added, v9: debug_enable_group_by_match_order_by and debug_enable_cheapest_group_by. 
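The query-1 speedup above comes from putting the column with more distinct values first: later columns are only compared inside ties on the leading ones. A toy sketch (names are mine) that counts how often the second column is actually consulted during a two-column sort:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Row
{
    int         a;
    int         b;
} Row;

static long second_col_compares;

/* Compare by (a, b); count how often the comparison falls through to b. */
static int
cmp_rows(const void *p1, const void *p2)
{
    const Row  *r1 = p1;
    const Row  *r2 = p2;

    if (r1->a != r2->a)
        return (r1->a > r2->a) - (r1->a < r2->a);
    second_col_compares++;      /* tie on the leading column */
    return (r1->b > r2->b) - (r1->b < r2->b);
}

/* Sort the rows and report how many comparisons reached column b. */
static long
count_tail_compares(Row *rows, int n)
{
    second_col_compares = 0;
    qsort(rows, n, sizeof(Row), cmp_rows);
    return second_col_compares;
}
```

With a unique column leading, the tail column is essentially never touched; with a constant column leading, every comparison pays for both columns, which is the asymmetry a distinct-count-aware ordering exploits.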
I also checked compatibility with the incremental sort patch, and everything works except a small merge conflict, which could be resolved right before committing. Next, I had a look at cost_incremental_sort() provided by the incremental sort patch and, it's a pity, it doesn't solve our problem with the impact of the per-column comparison function cost and the number of its calls. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b22b36ec0e..c073eb061a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) -return cur_ec; /* Match! */ + { +/* + * Match! + * + * Copy sortref if it wasn't set yet; this is possible if the ec was + * constructed from a WHERE clause, i.e. it doesn't have a target + * reference at all + */ +if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; +return cur_ec; + } } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index ec66cb9c3c..4f93afdebc 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -26,6 +26,7 @@ #include "optimizer/paths.h" #include "optimizer/tlist.h" #include "utils/lsyscache.h" +#include "utils/selfuncs.h" static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); @@ -327,6 +328,58 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/* + * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function + * returns new lists, original GROUP BY lists stay untouched. 
+ */ +int +group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, + List **group_clauses) +{ + List *new_group_pathkeys= NIL, +*new_group_clauses = NIL; + ListCell *key; + int n; + + if (pathkeys == NIL || *group_pathkeys == NIL) + return 0; + + /* + * For each pathkey it tries to find corresponding GROUP BY pathkey and + * clause. + */ + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + SortGroupClause *sgc; + + /* + * Pathkey should use the same allocated struct, so equality of + * pointers is enough + */ + if (!list_member_ptr(*group_pathkeys, pathkey)) + break; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + new_group_clauses = lappend(new_group_clauses, sgc); + } + + n = list_length(new_group_pathkeys); + + /* + * Just append the rest of pathkeys and clauses + */ + *group_pathkeys = list_concat_unique_ptr(new_group_pathkeys, + *group_pathkeys); + *group_clause
Re: POC: GROUP BY optimization
Again agree. If we have a fixed order of columns (ORDER BY) then we should not try to reorder it. The current patch follows that, if I didn't make a mistake.

This part seems to be more of a misunderstanding between me and Claudio. I believe Claudio was referring to the column order in a GROUP BY, not ORDER BY. In which case we don't add any Sort, of course.

I hope so.

I'm still opposed to adding an arbitrary handicap to prioritize the order specified by the user, for the reasons I explained before. We should aim to make the heuristics/costing reliable enough to make this unnecessary.

+1 -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
I don't see why not to generate all possible orderings (keeping only the cheapest path for each pathkey variant) and let the optimizer sort it out. I'm assuming costing the full N! possible orderings would be prohibitively expensive.

That's true, but as the first step we need to improve cost_sort, and only then try to find the best pathkeys order in an optimal way.

- If the user requested that order, we assume it "somewhat subjectively better" (by giving it a slightly reduced cost).

I don't think so. That's not the SQL way - the DB should determine the optimal way to execute a query. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
So the costing was fairly trivial, we simply do something like comparison_cost = 2.0 * cpu_operator_cost; sort_cost = comparison_cost * tuples * LOG2(tuples); which essentially ignores that there might be multiple columns, or that the columns may have sort operators with different costs.

Agree. And the distribution of keys.

The question is how reliable the heuristics can be. The current patch uses just plain ndistinct, but that seems rather unreliable, and I don't have a clear idea how to improve that - we may have MCVs for the columns and perhaps some extended statistics, but I'm not sure how far we can run with that.

v8 already uses another algorithm.

Essentially what we need is to estimate the number of comparisons for each column, to compute a better comparison_cost.

Exactly.

Prioritization of the user-provided order can be as simple as giving that comparison_cost a small handicap.

I see no point in doing that, and I don't recall a single place in the planner where we do that. If the user specified ORDER BY, we'll slap an explicit Sort on top when needed (which acts as the handicap, but in a clear way). Otherwise we don't do such things - it'd be just plain confusing (consider "ORDER BY a,b" vs. "ORDER BY b,c" with same data types, ndistinct etc. but unexpectedly different costs). Also, what would be a good value for the handicap?

Again agree. If we have a fixed order of columns (ORDER BY) then we should not try to reorder it. The current patch follows that, if I didn't make a mistake. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
OK, I haven't looked at v7 yet, but if I understand correctly it tries to maintain the ordering as much as possible? Does that actually help? I mean, the incremental sort patch allows the sorting to happen by pieces, but here we still need to sort all the data, right? Can you give an example demonstrating the benefit?

SELECT a, SUM(x) FROM ( SELECT a, b, COUNT(c) AS x FROM t1 GROUP BY a, b UNION ALL SELECT a, b, COUNT(c) AS x FROM t2 GROUP BY a, b ) foo GROUP BY a;

and indexes on (a,b) and (b,a) for both relations. The "deduplication" by pathkeys I suggested would mean we might keep only index (a,b) on t1 and (b,a) on t2, which means the grouping by "a" can't leverage index scans on both relations. But if we keep paths for both indexes on each relation, we can.

Yes, that's one of the options.

Isn't "estimation of cost of comparing function/number of unique values in column could be not very accurate and so planner could make a wrong choice" more of an argument against relying on it when doing these optimizations? FWIW it's one of the arguments Tom made in the incremental sort patch, which relies on it too when computing the cost of the incremental sort. I'm sure it's going to be an obstacle there too.

> I saw 2 times difference in real-world application. Again, improving sort cost estimation is a separate task.

Sure. But we also need to ask the other question, i.e. how many people would be negatively affected by the optimization. And I admit I don't know the answer to that, the next example is entirely made up.

Hm, it seems the best way here is improving the cost_sort estimation. Will try, but I think that is a separate patch.

FWIW I think it would be useful to have "development GUC" that would allow us to enable/disable these options during development, because that makes experiments much easier. But then remove them before commit.

Will do -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
tistic, you can do either a greedy search (get the next column with the highest ndistinct coefficient) or an exhaustive search (computing the estimated number of comparisons).

Another challenge is that using only the ndistinct coefficient assumes a uniform distribution of the values. But you can have a column with 1M distinct values, where a single value represents 99% of the rows. And another column with 100k distinct values, with an actual uniform distribution. I'm pretty sure it'd be more efficient to place the 100k column first.

Interesting. Will think about it, thank you. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b22b36ec0e..c073eb061a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) -return cur_ec; /* Match! */ + { +/* + * Match! + * + * Copy sortref if it wasn't set yet, it's possible if ec was + * constructed from WHERE clause, i.e. it doesn't have target + * reference at all + */ +if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; +return cur_ec; + } } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index ec66cb9c3c..4f93afdebc 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -26,6 +26,7 @@ #include "optimizer/paths.h" #include "optimizer/tlist.h" #include "utils/lsyscache.h" +#include "utils/selfuncs.h" static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); @@ -327,6 +328,58 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/* + * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function + * returns new lists, original GROUP BY lists stay untouched. 
+ */ +int +group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, + List **group_clauses) +{ + List *new_group_pathkeys= NIL, +*new_group_clauses = NIL; + ListCell *key; + int n; + + if (pathkeys == NIL || *group_pathkeys == NIL) + return 0; + + /* + * For each pathkey it tries to find corresponding GROUP BY pathkey and + * clause. + */ + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + SortGroupClause *sgc; + + /* + * Pathkey should use the same allocated struct, so equality of + * pointers is enough + */ + if (!list_member_ptr(*group_pathkeys, pathkey)) + break; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + new_group_clauses = lappend(new_group_clauses, sgc); + } + + n = list_length(new_group_pathkeys); + + /* + * Just append the rest of pathkeys and clauses + */ + *group_pathkeys = list_concat_unique_ptr(new_group_pathkeys, + *group_pathkeys); + *group_clauses = list_concat_unique_ptr(new_group_clauses, + *group_clauses); + + return n; +} + /* * get_cheapest_path_for_pathkeys * Find the cheapest path (according to the specified criterion) that @@ -1611,6 +1664,39 @@ pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys) return 0; /* path ordering not useful */ } +/* + * pathkeys_useful_for_grouping + * Count the number of pathkeys that are useful for grouping (instead of + * explicit sort) + * + * Group pathkeys could be reordered, so we don't care about the actual order + * of pathkeys + */ +static int +pathkeys_useful_for_grouping(PlannerInfo *root, List *pathkeys) +{ + ListCell *key; + int n = 0; + + if (root->group_pathkeys == NIL) + return 0;/* no special ordering requested */ + + if (pathkeys == NIL) + return 0;/* unordered path */ + + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + + if (!list_member_ptr(root->group_pathkeys, pathkey)) + break; + + n++; + } + + return n; +} + 
/* * truncate_useless_pathkeys * Shorten the given pathkey list to just the useful pathkeys. @@ -1625,6 +1711,9 @@ truncate_useless_pathkeys(PlannerInfo *root, nuseful = pathkeys_useful_for_merging(root, rel, pathkeys); nuseful2 = pathkeys_useful_for_ordering(root, pathkeys); + if (nuseful2 > nuseful) + nuseful = nuseful2; + nuseful2 = pathkeys_useful_for_grouping(root, pathkeys); if (nuseful2 > nuseful) nuseful = nuseful2; @@ -1660,6 +1749,8 @@ has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel) { if (rel->joininfo != NIL || rel->has_eclass_joins) return true; /* might be able to use pathkeys for merging */ + if (root->group_pathk
Re: \d t: ERROR: XX000: cache lookup failed for relation
Teodor Sigaev wrote: Ah, I think this is the missing, essential component: CREATE INDEX ON t(right(i::text,1)) WHERE i::text LIKE '%1';

Finally, I reproduced it with the attached script. The attachment is a simplified version of the script.

psql uses an ordinary SQL query to get info about the index, with the usual transaction isolation/MVCC. To create a description of the index it calls pg_get_indexdef(), which doesn't use the transaction snapshot; it uses the catalog snapshot, because it accesses the catalog through the system catalog cache. So the difference between an ordinary SQL query and pg_get_indexdef() is the snapshot used. I'm not sure that this is easy to fix, or whether it should be fixed at all.

Simplified query: SELECT c2.relname, i.indexrelid, pg_catalog.pg_get_indexdef(i.indexrelid, 0, true) FROM pg_catalog.pg_class c, pg_catalog.pg_class c2, pg_catalog.pg_index i WHERE c.relname = 't' AND c.oid = i.indrelid AND i.indexrelid = c2.oid -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 1.sh Description: application/shellscript
Re: [PATCH] Trim trailing whitespace in vim and emacs
I use FileStly plugin to vim [1]. But I slightly modify it, see in attachment. FileStyle, sorry. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [PATCH] Trim trailing whitespace in vim and emacs
I once tried to have vim highlight trailing whitespace, spaces before tabs, and overlength lines (>80 chars), and while I could do each thing in isolation, I wasn't able to get it to highlight the three of them at the same time. If you have a recipe for that, I welcome it.

I use FileStly plugin to vim [1]. But I slightly modified it, see the attachment. And an addition in .vimrc:

if expand('%:e') == "c" || expand('%:e') == "h" || expand('%:e') == "cpp" || expand('%:e') == "m" || expand('%:e') == "hpp" || expand('%:e') == "pl" || expand('%:e') == "pm" || expand('%:e') == "y" || expand('%:e') == "l"
else
  let g:filestyle_plugin = 1
endif

[1] https://www.vim.org/scripts/script.php?script_id=5065

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/

" Copyright 2014 Alexander Serebryakov
"
" Licensed under the Apache License, Version 2.0 (the "License");
" you may not use this file except in compliance with the License.
" You may obtain a copy of the License at
"
"     http://www.apache.org/licenses/LICENSE-2.0
"
" Unless required by applicable law or agreed to in writing, software
" distributed under the License is distributed on an "AS IS" BASIS,
" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
" See the License for the specific language governing permissions and
" limitations under the License.

"Plugin checking the file to follow Your Vim settings
if !exists('g:filestyle_plugin')
  let g:filestyle_plugin = 1

  highlight FileStyleTabsError ctermbg=Red guibg=Red
  highlight FileStyleTrailngSpacesError ctermbg=Cyan guibg=Cyan
  highlight FileStyleSpacesError ctermbg=Yellow guibg=Yellow
  highlight FileStyleTooLongLine cterm=inverse gui=inverse

  "Defining auto commands
  augroup filestyle_auto_commands
    autocmd!
    autocmd BufReadPost,BufNewFile * call FileStyleActivate()
    autocmd FileType * call FileStyleCheckFiletype()
    autocmd WinEnter * call FileStyleCheck()
  augroup end

  "Defining plugin commands
  command! FileStyleActivate call FileStyleActivate()
  command! FileStyleDeactivate call FileStyleDeactivate()
  command! FileStyleCheck call FileStyleCheck()
endif

"Turn plugin on
function FileStyleActivate()
  let b:filestyle_active = 1
  call FileStyleCheck()
endfunction

"Turn plugin off
function FileStyleDeactivate()
  let b:filestyle_active = 0
  call clearmatches()
endfunction

"Check filetype to handle specific cases
function FileStyleCheckFiletype()
  "Avoid checking of help files
  if &filetype == 'help'
    call FileStyleDeactivate()
  endif
endfunction

"Highlighting specified pattern
function FileStyleHighlightPattern(highlight)
  call matchadd(a:highlight['highlight'], a:highlight['pattern'])
endfunction

"Checking expandtab option
function FileStyleExpandtabCheck()
  if &expandtab
    let l:highlight = {'highlight': 'FileStyleTabsError',
                     \ 'pattern': '\t\+'}
  else
    let l:highlight = {'highlight': 'FileStyleSpacesError',
                     \ 'pattern': '^\t* \{'.&tabstop.',\}'}
  endif
  call FileStyleHighlightPattern(l:highlight)
endfunction

"Checking trailing spaces
function FileStyleTrailingSpaces()
  let l:highlight = {'highlight': 'FileStyleTrailngSpacesError',
                   \ 'pattern': '\s\+$'}
  call FileStyleHighlightPattern(l:highlight)
endfunction

"Checking inner spaces
function FileStyleInnerSpaces()
  let l:highlight = {'highlight': 'FileStyleTabsError',
                   \ 'pattern': '\ \+\t'}
  call FileStyleHighlightPattern(l:highlight)
endfunction

function FileStyleAllInnerSpaces()
  let l:highlight = {'highlight': 'FileStyleTabsError',
                   \ 'pattern': '\ \{4,\}'}
  call FileStyleHighlightPattern(l:highlight)
endfunction

"Checking long lines
function FileStyleLongLines()
  if &textwidth > 0
    let l:highlight = {'highlight': 'FileStyleTooLongLine',
                     \ 'pattern': '\%' . (&textwidth+1) . 'v.*'}
    call FileStyleHighlightPattern(l:highlight)
  endif
endfunction

"Checking file dependently on settings
function FileStyleCheck()
  if b:filestyle_active == 1
    call clearmatches()
    call FileStyleExpandtabCheck()
    call FileStyleTrailingSpaces()
    call FileStyleInnerSpaces()
    call FileStyleAllInnerSpaces()
    call FileStyleLongLines()
  endif
endfunction
Re: POC: GROUP BY optimization
on of the ordering. But as I argued in other threads, such implicit guarantees are really fragile, can silently break for arbitrary reasons (say, parallel query will do just that) and can be easily fixed by adding a proper ORDER BY. So I don't think this is an issue.

Agree. SQL by design doesn't guarantee a particular order without an explicit ORDER BY clause.

The other random thought is how will/could this interact with the incremental sorts patch. I know Alexander wanted to significantly limit where we actually consider the incremental sort, but I'm not sure if this was one of those places or not (it sure looks like a place where we would greatly benefit from it).

It seems they should not directly interact. This patch tries to find the cheapest column order, Alexander's patch tries to make the sort itself cheaper - they are different tasks. But I will try.

So to summarize this initial review - I do suggest splitting the patch into two parts. One that does the index magic, and one for this reordering optimization. The first part (additional indexes) seems quite fairly safe, likely to get committable soon. The other part (ndistinct reordering) IMHO requires more thought regarding costing and interaction with other query parts.

Thank you for the review! 
-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 4f93afdebc..8897157817 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -380,6 +380,128 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, return n; } +/* + * Work struct for sorting pathkeys + */ +typedef struct PathKeyNumGroup +{ + PathKey *key; + double numGroups; + int order; /* just to make qsort stable */ +} PathKeyNumGroup; + +static int +pathkey_numgroups_cmp(const void *a, const void *b) +{ + const PathKeyNumGroup *pka = (const PathKeyNumGroup*)a, + *pkb = (const PathKeyNumGroup*)b; + + if (pka->numGroups == pkb->numGroups) + /* make qsort stable */ + return (pka->order > pkb->order) ? 1 : -1; + return (pka->numGroups > pkb->numGroups) ? -1 : 1; +} + +/* + * Order the tail of the list of group pathkeys by descending uniqueness. This + * speeds up sorting. Returns newly allocated lists, old ones stay untouched. + * startNum defines the head of the list whose order should be preserved. 
+ */ +void +get_cheapest_group_keys_order(PlannerInfo *root, double nrows, + List *target_list, + List **group_pathkeys, List **group_clauses, + int startNum) +{ + PathKeyNumGroup *keys; + int nkeys = list_length(*group_pathkeys) - startNum; + List *pathkeyExprList, + *new_group_pathkeys = NIL, + *new_group_clauses = NIL; + ListCell *cell; + int i = 0; + + if (nkeys < 2) + return; /* nothing to do */ + + /* + * Will try to match ORDER BY pathkeys in the hope that one sort is cheaper + * than two + */ + if (startNum == 0 && root->sort_pathkeys) + { + startNum = group_keys_reorder_by_pathkeys(root->sort_pathkeys, + group_pathkeys, + group_clauses); + + nkeys = list_length(*group_pathkeys) - startNum; + if (nkeys < 2) + return; /* nothing to do */ + } + + keys = palloc(nkeys * sizeof(*keys)); + pathkeyExprList = list_make1(NULL); + + /* + * for each pathkey to be ordered we count the number of groups, to sort + * them in descending order + */ + for_each_cell(cell, list_nth_cell(*group_pathkeys, startNum)) + { + PathKey *pathkey = (PathKey *) lfirst(cell); + Node *pathkeyExpr; + SortGroupClause *sgc; + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + pathkeyExpr = get_sortgroupclause_expr(sgc, target_list); + linitial(pathkeyExprList) = pathkeyExpr; + keys[i].numGroups= estimate_num_groups(root, pathkeyExprList, + nrows, NULL); + keys[i].key = pathkey; + keys[i].order = i; + + i++; + } + + list_free(pathkeyExprList); + + qsort(keys, nkeys, sizeof(*keys), pathkey_numgroups_cmp); + + /* + * Construct result lists + */ + i = 0; + foreach(cell, *group_pathkeys) + { + PathKey *pathkey; + SortGroupClause *sgc; + + if (i < startNum) + /* presorted head */ + pathkey = (PathKey *) lfirst(cell); + else + /* pathkey, sorted above */ + pathkey = keys[i - startNum].key; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + new_group_clauses = lappend(new_group_clauses, sgc); + + i++; + } + + pfree(keys); + + /* Just append the rest of the GROUP BY clauses */ + new_group_clauses = list_concat_unique_ptr(new_group_clauses, *group_clauses); + + *group_pathkeys = new_group_pathkeys; + *group_clauses = new_group_clauses; +} +
Re: \d t: ERROR: XX000: cache lookup failed for relation
Ah, I think this is the missing, essential component: CREATE INDEX ON t(right(i::text,1)) WHERE i::text LIKE '%1';

Finally, I reproduced it with the attached script. INSERT 0 99 <- first insertion ERROR: cache lookup failed for relation 1032219 ALTER TABLE ERROR: cache lookup failed for relation 1033478 ALTER TABLE ERROR: cache lookup failed for relation 1034073 ALTER TABLE ERROR: cache lookup failed for relation 1034650 ALTER TABLE ERROR: cache lookup failed for relation 1035238 ALTER TABLE ERROR: cache lookup failed for relation 1035837

Will investigate. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 1.sh Description: application/shellscript
Re: \d t: ERROR: XX000: cache lookup failed for relation
Is that considered an actionable problem?

I think so, but I'm not able to reproduce it. I wrote a script to simplify reproduction, but it doesn't reproduce it either.

And how long does one need to wait for it to reproduce? I waited for one hour. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ 1.sh Description: application/shellscript
Re: New committers announced at PGCon 2018
Etsuro Fujita Peter Geoghegan Amit Kapila Alexander Korotkov Thomas Munro Michael Paquier Tomas Vondra Congratulations to all! +7! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Few comments on commit 857f9c36 (skip full index scans )
I think it will always be set to BTREE_VERSION (See _bt_restore_meta). You are right, pushed. Thank you! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: POC: GROUP BY optimization
Cosmetic change: remove the find_sort_group_clause_by_sortref() function added in the v5 patch because it duplicates the existing get_sortgroupref_clause(). -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b22b36ec0e..c073eb061a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) -return cur_ec; /* Match! */ + { +/* + * Match! + * + * Copy sortref if it wasn't set yet, it's possible if ec was + * constructed from WHERE clause, i.e. it doesn't have target + * reference at all + */ +if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; +return cur_ec; + } } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index ec66cb9c3c..ca3c78dbbe 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -26,6 +26,7 @@ #include "optimizer/paths.h" #include "optimizer/tlist.h" #include "utils/lsyscache.h" +#include "utils/selfuncs.h" static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); @@ -327,6 +328,171 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/* + * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function + * returns new lists, original GROUP BY lists stay untouched. + */ +int +group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, + List **group_clauses) +{ + List *new_group_pathkeys= NIL, +*new_group_clauses = NIL; + ListCell *key; + int n; + + if (pathkeys == NIL || *group_pathkeys == NIL) + return 0; + + /* + * For each pathkey it tries to find corresponding GROUP BY pathkey and + * clause. 
+ */ + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + SortGroupClause *sgc; + + /* + * Pathkey should use the same allocated struct, so equality of + * pointers is enough + */ + if (!list_member_ptr(*group_pathkeys, pathkey)) + break; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *group_clauses); + new_group_clauses = lappend(new_group_clauses, sgc); + } + + n = list_length(new_group_pathkeys); + + /* + * Just append the rest of pathkeys and clauses + */ + *group_pathkeys = list_concat_unique_ptr(new_group_pathkeys, + *group_pathkeys); + *group_clauses = list_concat_unique_ptr(new_group_clauses, + *group_clauses); + + return n; +} + +/* + * Work struct for sorting pathkeys + */ +typedef struct PathKeyNumGroup +{ + PathKey *key; + double numGroups; + int order; /* just to make qsort stable */ +} PathKeyNumGroup; + +static int +pathkey_numgroups_cmp(const void *a, const void *b) +{ + const PathKeyNumGroup *pka = (const PathKeyNumGroup*)a, + *pkb = (const PathKeyNumGroup*)b; + + if (pka->numGroups == pkb->numGroups) + /* make qsort stable */ + return (pka->order > pkb->order) ? 1 : -1; + return (pka->numGroups > pkb->numGroups) ? -1 : 1; +} + +/* + * Order the tail of the list of group pathkeys by descending uniqueness. This + * speeds up sorting. Returns newly allocated lists, old ones stay untouched. 
+ */ +void +sort_group_key_by_distinct(PlannerInfo *root, double nrows, List *targetList, + List **groupPathkeys, List **groupClauses, + int startNum) +{ + PathKeyNumGroup *keys; + int nkeys = list_length(*groupPathkeys) - startNum; + List *pathkeyExprList, + *newGroupPathkeys = NIL, + *newGroupClauses = NIL; + ListCell *cell; + int i = 0; + + if (nkeys < 2) + return; /* nothing to do */ + + keys = palloc(nkeys * sizeof(*keys)); + pathkeyExprList = list_make1(NULL); + + /* + * for each pathkey that isn't supported by an underlying index (or + * something else) we count the number of groups, to sort them in + * descending order + */ + for_each_cell(cell, list_nth_cell(*groupPathkeys, startNum)) + { + PathKey *pathkey = (PathKey *) lfirst(cell); + Node *pathkeyExpr; + SortGroupClause *sgc; + + sgc = get_sortgroupref_clause(pathkey->pk_eclass->ec_sortref, + *groupClauses); + pathkeyExpr = get_sortgroupclause_expr(sgc, targetList); + linitial(pathkeyExprList) = pathkeyExpr; + keys[i].numGroups= estimate_num_groups(root, pathkeyExprList, + nrows, NULL); + keys[i].key = pathkey; + keys[i].order = i; + + i++; + } + + list_free(pathkeyExprList); + + qsort(keys, nkeys, sizeof(*keys), pathkey_numgroups_cmp); + i = 0
Re: Few comments on commit 857f9c36 (skip full index scans )
The metapage upgrade should be performed under a critical section.

Agree. But after a closer look I found that the btm_version change isn't WAL-logged (near line 2251 in _bt_newroot). So btm_version is not propagated to replicas/backups/etc. I believe it should be fixed. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
POC: GROUP BY optimization
Hi! As far as I know, columns in GROUP BY can be reordered without loss of correctness, but reordering can give a benefit in performance. The suggested patch implements that in several ways:

1) Reorder GROUP BY columns by number of unique values, in descending order. A simple example shows a performance difference of two times (see opt_group_by_demostrator.sql; on my notebook the first two queries demonstrate that). The idea is: if the first column is unique then the sort comparator will never compute the difference of the following columns.

2) Reorder GROUP BY columns to match existing pathkeys which are the result of an index scan or ORDER BY clause. This prevents adding a sort node - suppose, it's a clear win.

3) The patch makes a suggestion to use an index for the GROUP BY clause, but unlike the ORDER BY or merge join case it doesn't pay attention to the actual order of columns, because of 2).

The patch doesn't change any cost estimation computation, although 1) could take advantage of it.

Some issues/problems/notices:
1) I didn't play around with GROUPING SETS at all. As far as I can see, the current coding prevents any optimization around it and, I suppose, it should be addressed in another patch.
2) I'm not completely satisfied with the counting of the number of groups per column; it looks ugly without refactoring estimate_num_groups(). At least, now it could be called multiple times for each column. Maybe this number should be added to the PathKey structure?
3) EquivalenceClass->ec_sortref. If the planner faces a column for the first time in a WHERE clause, it doesn't fill the target reference field because it is unknown yet. The patch looks for a correspondence between PathKey (group_pathkeys) and SortGroupClause (groupClause) and fails in this case. So, get_eclass_for_sort_expr() now updates the ec_sortref field if it isn't initialized yet.
4) The algorithm to reorder columns is proportional to N^2 where N is the number of columns, but I hope it isn't a problem because the number of GROUP BY columns usually isn't very big. 
-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b22b36ec0e..c073eb061a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -686,7 +686,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) -return cur_ec; /* Match! */ + { +/* + * Match! + * + * Copy sortref if it wasn't set yet, it's possible if ec was + * constructed from WHERE clause, i.e. it doesn't have target + * reference at all + */ +if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; +return cur_ec; + } } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index ec66cb9c3c..6f61bd20de 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -26,6 +26,7 @@ #include "optimizer/paths.h" #include "optimizer/tlist.h" #include "utils/lsyscache.h" +#include "utils/selfuncs.h" static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); @@ -327,6 +328,192 @@ pathkeys_contained_in(List *keys1, List *keys2) return false; } +/* + * Find a group clause by its tle reference + */ +static SortGroupClause* +find_sort_group_clause_by_sortref(List *groupClauses, Index tleSortGroupRef) +{ + ListCell *cell; + + foreach(cell, groupClauses) + { + SortGroupClause *sgc = (SortGroupClause*) lfirst(cell); + + if (sgc->tleSortGroupRef == tleSortGroupRef) + return sgc; + } + + elog(ERROR, "could not find target reference as sort group clause"); + + return NULL; +} + +/* + * Reorder GROUP BY pathkeys and clauses to match order of pathkeys. Function + * returns new lists, original GROUP BY lists stay untouched. 
+ */ +int +group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, + List **group_clauses) +{ + List *new_group_pathkeys= NIL, +*new_group_clauses = NIL; + ListCell *key; + int n; + + if (pathkeys == NIL || *group_pathkeys == NIL) + return 0; + + /* + * For each pathkey it tries to find corresponding GROUP BY pathkey and + * clause. + */ + foreach(key, pathkeys) + { + PathKey *pathkey = (PathKey *) lfirst(key); + SortGroupClause *sgc; + + /* + * Pathkey should use the same allocated struct, so equality of + * pointers is enough + */ + if (!list_member_ptr(*group_pathkeys, pathkey)) + break; + + new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + sgc = find_sort_group_clause_by_sortref(*group_clauses, +pathkey->pk_eclass->ec_sortref); + new_group_clauses = lappend(new_group_clauses, sgc); + } + + n = list_length(new_group_pathkeys); + + /* + *
Re: index scan over composite type
Thank you. It seems to work; at least I can't find a counter-example.

Tom Lane wrote: Teodor Sigaev <teo...@sigaev.ru> writes: I'm not understand why postgres prefers to sort table instead of using index only scan when query is a simple inner join on composite type. Query with equality clause with constant works fine with index scan but join not. Could somebody point me why? Thank you.

Hmm ... the reason why not seems to be that canonicalize_ec_expression() improperly adds a RelabelType node, causing the composite-type Vars to not be recognized as matching the eclass they should match. The attached patch fixes it and doesn't seem to break anything in the regression tests. This raises the question of why we don't treat type RECORD more like a true polymorphic type, but that's a can of worms I don't particularly want to open right now. For the moment, this is the only IsPolymorphicType call in the planner AFAICS, so there's some reason to hope that we don't have more bugs of the same ilk. regards, tom lane

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
index scan over composite type
Hi! I don't understand why Postgres prefers to sort the table instead of using an index-only scan when the query is a simple inner join on a composite type. A query with an equality clause against a constant works fine with an index scan, but the join does not. Could somebody point out why? Thank you. I'm also unable to force a merge join with index scans using any combination of enable_* variables. The attached script is a self-contained test; the Pg config file is the default. explain select a.idv, b.idv from a, b where a.idv = b.idv; Merge Join (cost=25751.64..27751.64 rows=10 width=74) Merge Cond: (a.idv = b.idv) -> Sort (cost=12875.82..13125.82 rows=10 width=37) Sort Key: a.idv -> Seq Scan on a (cost=0.00..1834.00 rows=10 width=37) -> Sort (cost=12875.82..13125.82 rows=10 width=37) Sort Key: b.idv -> Seq Scan on b (cost=0.00..1834.00 rows=10 width=37) -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ test.sql Description: application/sql
Re: Postgres 11 release notes
857f9c36cda520030381bd8c2af20adf0ce0e1d4 Skip full index scan during cleanup of B-tree indexes when possible I read that and thought it was too detailed to be in the release notes. It is not that it is unimportant, but it is hard to see how people would notice the difference or change their behavior based on this change. Hm, it could be something like: Allow skipping the index scan during vacuum of append-only tables, which decreases vacuum time for such tables. c0cbe00fee6d0a5e0ec72c6d68a035e674edc4cc Add explicit cast from scalar jsonb to all numeric and bool types. Please edit: Add function json(b)_to_tsvector to create usable text search queries matching JSON/JSONB values (Dmitry Dolgov) to Add function json(b)_to_tsvector to create usable vectors to search in json(b) (Dmitry Dolgov) or something similar. Your wording is about queries, but those functions are actually about creating a tsvector from json(b). OK, how is this text? Add function json(b)_to_tsvector() to create text search query for matching JSON/JSONB values (Dmitry Dolgov) Not a query - the *_to_tsvector functions produce a tsvector. So: Add function json(b)_to_tsvector() to use customizable full text search over json(b) columns. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Postgres 11 release notes
Bruce Momjian wrote: I have committed the first draft of the Postgres 11 release notes. I will add more markup soon. You can view the most current version here: http://momjian.us/pgsql_docs/release-11.html I expect a torrent of feedback. ;-) Hi! It seems you missed: 857f9c36cda520030381bd8c2af20adf0ce0e1d4 Skip full index scan during cleanup of B-tree indexes when possible c0cbe00fee6d0a5e0ec72c6d68a035e674edc4cc Add explicit cast from scalar jsonb to all numeric and bool types. Please edit: Add function json(b)_to_tsvector to create usable text search queries matching JSON/JSONB values (Dmitry Dolgov) to Add function json(b)_to_tsvector to create usable vectors to search in json(b) (Dmitry Dolgov) or something similar. Your wording is about queries, but those functions are actually about creating a tsvector from json(b). Thank you! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: doc fixes: vacuum_cleanup_index_scale_factor
thanks to everyone, pushed -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Cast jsonb to numeric, int, float, bool
1) Does this really pass muster from the translatability standpoint? I doubt it. Huh, I missed that. I think you want the callers to look like if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) cannotCastJsonbValue(v.type, "double precision"); where the subroutine contains the whole ereport() call, and its lookup table entries are e.g. gettext_noop("cannot cast jsonb string to type %s") Thanks for your idea, patch is attached -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 9d2b89f90c..3ef71bbade 100644 --- a/src/backend/utils/adt/jsonb.c +++ b/src/backend/utils/adt/jsonb.c @@ -1857,7 +1857,7 @@ jsonb_object_agg_finalfn(PG_FUNCTION_ARGS) /* * Extract scalar value from raw-scalar pseudo-array jsonb. */ -static JsonbValue * +static bool JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) { JsonbIterator *it; @@ -1865,7 +1865,11 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) JsonbValue tmp; if (!JsonContainerIsArray(jbc) || !JsonContainerIsScalar(jbc)) - return NULL; + { + /* inform caller about actual type of container */ + res->type = (JsonContainerIsArray(jbc)) ? 
jbvArray : jbvObject; + return false; + } /* * A root scalar is stored as an array of one element, so we get the array @@ -1887,7 +1891,40 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) tok = JsonbIteratorNext(, , true); Assert(tok == WJB_DONE); - return res; + return true; +} + +/* + * Emit correct, translable cast error message + */ +static void +cannotCastJsonbValue(enum jbvType type, const char *sqltype) +{ + static struct + { + enum jbvType type; + char *msg; + } + messages[] = + { + { jbvNull, gettext_noop("cannot cast jsonb null to type %s") }, + { jbvString, gettext_noop("cannot cast jsonb string to type %s") }, + { jbvNumeric, gettext_noop("cannot cast jsonb numeric to type %s") }, + { jbvBool, gettext_noop("cannot cast jsonb boolean to type %s") }, + { jbvArray, gettext_noop("cannot cast jsonb array to type %s") }, + { jbvObject, gettext_noop("cannot cast jsonb object to type %s") }, + { jbvBinary, gettext_noop("cannot cast jsonb array or object to type %s") } + }; + int i; + + for(i=0; i<lengthof(messages); i++) + if (messages[i].type == type) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg(messages[i].msg, sqltype))); + + /* should be unreachable */ + elog(ERROR, "unknown jsonb type: %d", (int)type); } Datum @@ -1897,9 +1934,7 @@ jsonb_bool(PG_FUNCTION_ARGS) JsonbValue v; if (!JsonbExtractScalar(>root, ) || v.type != jbvBool) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be boolean"))); + cannotCastJsonbValue(v.type, "boolean"); PG_FREE_IF_COPY(in, 0); @@ -1914,9 +1949,7 @@ jsonb_numeric(PG_FUNCTION_ARGS) Numeric retValue; if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + cannotCastJsonbValue(v.type, "numeric"); /* * v.val.numeric points into jsonb body, so we need to make a copy to @@ -1937,9 +1970,7 @@ jsonb_int2(PG_FUNCTION_ARGS) Datum retValue; if 
(!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + cannotCastJsonbValue(v.type, "smallint"); retValue = DirectFunctionCall1(numeric_int2, NumericGetDatum(v.val.numeric)); @@ -1957,9 +1988,7 @@ jsonb_int4(PG_FUNCTION_ARGS) Datum retValue; if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + cannotCastJsonbValue(v.type, "integer"); retValue = DirectFunctionCall1(numeric_int4, NumericGetDatum(v.val.numeric)); @@ -1977,9 +2006,7 @@ jsonb_int8(PG_FUNCTION_ARGS) Datum retValue; if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + cannotCastJsonbValue(v.type, "bigint"); retValue = DirectFunctionCall1(numeric_int8, NumericGetDatum(v.val.numeric)); @@ -1997,9 +2024,7 @@ jsonb_float4(PG_FUNCTION_ARGS) Datum retValue; if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + ca
Re: Cast jsonb to numeric, int, float, bool
How about "cannot cast jsonb $json_type to $sql_type" where $json_type is the type inside the jsonb (e.g. string, number, boolean, array, object)? Yes, that sounds pretty good. Does anybody have an objections to patch? -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 9d2b89f90c..9f72984fac 100644 --- a/src/backend/utils/adt/jsonb.c +++ b/src/backend/utils/adt/jsonb.c @@ -1857,7 +1857,7 @@ jsonb_object_agg_finalfn(PG_FUNCTION_ARGS) /* * Extract scalar value from raw-scalar pseudo-array jsonb. */ -static JsonbValue * +static bool JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) { JsonbIterator *it; @@ -1865,7 +1865,11 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) JsonbValue tmp; if (!JsonContainerIsArray(jbc) || !JsonContainerIsScalar(jbc)) - return NULL; + { + /* inform caller about actual type of container */ + res->type = (JsonContainerIsArray(jbc)) ? jbvArray : jbvObject; + return false; + } /* * A root scalar is stored as an array of one element, so we get the array @@ -1887,7 +1891,40 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) tok = JsonbIteratorNext(, , true); Assert(tok == WJB_DONE); - return res; + return true; +} + +/* + * Map jsonb types to human-readable names + */ +static char* +JsonbTypeName(enum jbvType type) +{ + static struct + { + enum jbvType type; + char *typename; + } + names[] = + { + { jbvNull, "null" }, + { jbvString, "string" }, + { jbvNumeric, "numeric" }, + { jbvBool, "boolean" }, + { jbvArray, "array" }, + { jbvObject, "object" }, + { jbvBinary, "array or object" } + }; + int i; + + for(i=0; i<lengthof(names); i++) + if (names[i].type == type) + return names[i].typename; + + /* should be unreachable */ + elog(ERROR, "unknown jsonb type: %d", (int)type); + + return NULL; } Datum @@ -1899,7 +1936,7 @@ jsonb_bool(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvBool) ereport(ERROR, 
(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be boolean"))); + errmsg("cannot cast jsonb %s to boolean", JsonbTypeName(v.type; PG_FREE_IF_COPY(in, 0); @@ -1916,7 +1953,7 @@ jsonb_numeric(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to numeric", JsonbTypeName(v.type; /* * v.val.numeric points into jsonb body, so we need to make a copy to @@ -1939,7 +1976,7 @@ jsonb_int2(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to int2", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_int2, NumericGetDatum(v.val.numeric)); @@ -1959,7 +1996,7 @@ jsonb_int4(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to int4", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_int4, NumericGetDatum(v.val.numeric)); @@ -1979,7 +2016,7 @@ jsonb_int8(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to int8", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_int8, NumericGetDatum(v.val.numeric)); @@ -1999,7 +2036,7 @@ jsonb_float4(PG_FUNCTION_ARGS) if (!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to float4", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_float4, NumericGetDatum(v.val.numeric)); @@ -2019,7 +2056,7 @@ jsonb_float8(PG_FUNCTION_ARGS) if 
(!JsonbExtractScalar(>root, ) || v.type != jbvNumeric) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("jsonb value must be numeric"))); + errmsg("cannot cast jsonb %s to float8", JsonbTypeName(v.type; retValue = DirectFunctionCall1(numeric_float8, NumericGetDatum(v.val.numeric)); diff --git a/src/test/regress/expected/jsonb.out b/src/tes
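The patch in this thread replaces per-callsite error strings with a single table-driven mapping from the jsonb value kind to a human-readable name, interpolated into "cannot cast jsonb %s to %s". A standalone sketch of that lookup-table approach (hypothetical: the enum and names below are illustrative stand-ins for the real jbvType values, and the real code reports via ereport() rather than returning a string):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for jbvType */
typedef enum { MY_NULL, MY_STRING, MY_NUMERIC, MY_BOOL, MY_ARRAY, MY_OBJECT } my_jbv_type;

/* Map a jsonb value kind to a human-readable name via a lookup table */
static const char *
my_jsonb_type_name(my_jbv_type type)
{
    static const struct
    {
        my_jbv_type type;
        const char *name;
    } names[] = {
        { MY_NULL, "null" }, { MY_STRING, "string" }, { MY_NUMERIC, "numeric" },
        { MY_BOOL, "boolean" }, { MY_ARRAY, "array" }, { MY_OBJECT, "object" },
    };

    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        if (names[i].type == type)
            return names[i].name;
    return "unknown";           /* the real code elog(ERROR)s here */
}

/* Build the message a failed cast would report */
static void
format_cast_error(char *buf, size_t buflen, my_jbv_type type, const char *sqltype)
{
    snprintf(buf, buflen, "cannot cast jsonb %s to %s",
             my_jsonb_type_name(type), sqltype);
}
```

One design point the review raised: keeping the whole format string per type (as in the gettext_noop() version) rather than composing it from fragments is what keeps the messages translatable.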
Re: nbtsort.c performs unneeded (though harmless) truncation
Thank you, pushed. Peter Geoghegan wrote: I noticed that we're calling _bt_nonkey_truncate() needlessly when a minimum key is needed at the leftmost page on each level of the tree. This was always a special case, and I think that it should remain as one. Attached patch avoids unneeded truncations, while preserving the generic BTreeTupleGetNAtts() assertions. This isn't a correctness issue, and the extra overhead of unneeded truncation should be negligible, but what we have now seems confusing to me. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Thanks to everyone, v3 is pushed. Teodor Sigaev wrote: I'm not very happy with the rootBuffer argument added everywhere. ginInsertItemPointers() and ginPrepareDataScan() now take both rootBlkno and rootBuffer, and the second can be invalid. As I can see, you did that to call CheckForSerializableConflictIn() in ginEntryInsert(). It seems it could be reverted, with CheckForSerializableConflictIn() added to ginFindLeafPage() with searchMode = false. Implemented, v3 is attached. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: FinishPreparedTransaction missing HOLD_INTERRUPTS section
Thank you, pushed! Stas Kelvich wrote: Hello. It seems that during COMMIT PREPARED, FinishPreparedTransaction() doesn't hold interrupts around writing to WAL and cleaning up the ProcArray and GXact entries. At least RemoveTwoPhaseFile() (which is called in between) can print a warning with ereport(), which in turn will check for interrupts and can therefore cancel the backend or throw an error before GXact clean-up. Other similar places, like CommitTransaction() and PrepareTransaction(), have such hold-interrupts sections. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
I don't very happy with rootBuffer added everywhere. ginInsertItemPointers() and ginPrepareDataScan() now take both args, rootBlkno and rootBuffer, second could be invalid. As I can see, you did it to call CheckForSerializableConflictIn() in ginEntryInsert(). Seems, it could be reverted and CheckForSerializableConflictIn() should be added to ginFindLeafPage() with searchMode = false. Implemented, v3 is attached. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ commit 816a9d9cc81a193a10fb49cf26ff2b0214e0ff9b Author: Teodor Sigaev <teo...@sigaev.ru> Date: Sat Apr 28 20:17:46 2018 +0300 Re-think predicate locking on GIN indexes. The principle behind the locking was not very well thought-out, and not documented. Add a section in the README to explain how it's supposed to work, and change the code so that it actually works that way. This fixes two bugs: 1. If fast update was turned on concurrently, subsequent inserts to the pending list would not conflict with predicate locks that were acquired earlier, on entry pages. The included 'predicate-gin-fastupdate' test demonstrates that. To fix, make all scans acquire a predicate lock on the metapage. That lock represents a scan of the pending list, whether or not there is a pending list at the moment. Forget about the optimization to skip locking/checking for locks, when fastupdate=off. Maybe some of that was safe, but I couldn't convince myself of it, so better to rip it out and keep things simple. 2. If a scan finds no match, it still needs to lock the entry page. The point of predicate locks is to lock the gabs between values, whether or not there is a match. The included 'predicate-gin-nomatch' test tests that case. In addition to those two bug fixes, this removes some unnecessary locking, following the principle laid out in the README. Because all items in a posting tree have the same key value, a lock on the posting tree root is enough to cover all the items. 
(With a very large posting tree, it would possibly be better to lock the posting tree leaf pages instead, so that a "skip scan" with a query like "A & B", you could avoid unnecessary conflict if a new tuple is inserted with A but !B. But let's keep this simple.) Also, some spelling fixes. diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README index 990b5ffa58..cc434b1feb 100644 --- a/src/backend/access/gin/README +++ b/src/backend/access/gin/README @@ -331,6 +331,40 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the deleted pages around with the right-link intact until all concurrent scans have finished.) +Predicate Locking +- + +GIN supports predicate locking, for serializable snapshot isolation. +A predicate locks represent that a scan has scanned a range of values. They +are not concerned with physical pages as such, but the logical key values. +A predicate lock on a page covers the key range that would belong on that +page, whether or not there are any matching tuples there currently. In other +words, a predicate lock on an index page covers the "gaps" between the index +tuples. To minimize false positives, predicate locks are acquired at the +finest level possible. + +* Like in the B-tree index, it is enough to lock only leaf pages, because all + insertions happen at the leaf level. + +* In an equality search (i.e. not a partial match search), if a key entry has + a posting tree, we lock the posting tree root page, to represent a lock on + just that key entry. Otherwise, we lock the entry tree page. We also lock + the entry tree page if no match is found, to lock the "gap" where the entry + would've been, had there been one. + +* In a partial match search, we lock all the entry leaf pages that we scan, + in addition to locks on posting tree roots, to represent the "gaps" between + values. + +* In addition to the locks on entry leaf pages and posting tree roots, all + scans grab a lock the metapage. 
This is to interlock with insertions to + the fast update pending list. An insertion to the pending list can really + belong anywhere in the tree, and the lock on the metapage represents that. + +The interlock for fastupdate pending lists means that with fastupdate=on, +we effectively always grab a full-index lock, so you could get a lot of false +positives. + Compatibility - diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c index 828c7074b7..030d0f4418 100644 --- a/src/backend/access/gin/ginbtree.c +++ b/src/backend/access/gin/ginbtree.c @@ -84,6 +84,9 @@ ginFindLeafPage(GinBtree btree, bool searchMode, Snapshot snapshot) stack->
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
I took a stab at fixing those issues, as well as the bug when fastupdate is turned on concurrently. Does the attached patch look sane to you? That's look sane, and I believe it should be applied but I see some issue in your patch: I don't very happy with rootBuffer added everywhere. ginInsertItemPointers() and ginPrepareDataScan() now take both args, rootBlkno and rootBuffer, second could be invalid. As I can see, you did it to call CheckForSerializableConflictIn() in ginEntryInsert(). Seems, it could be reverted and CheckForSerializableConflictIn() should be added to ginFindLeafPage() with searchMode = false. Rebased patch is attached. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ commit 42f73743d9ddf576d2dd9ece3979b407cd70cbfe Author: Teodor Sigaev <teo...@sigaev.ru> Date: Fri Apr 27 13:14:57 2018 +0300 Re-think predicate locking on GIN indexes. The principle behind the locking was not very well thought-out, and not documented. Add a section in the README to explain how it's supposed to work, and change the code so that it actually works that way. This fixes two bugs: 1. If fast update was turned on concurrently, subsequent inserts to the pending list would not conflict with predicate locks that were acquired earlier, on entry pages. The included 'predicate-gin-fastupdate' test demonstrates that. To fix, make all scans acquire a predicate lock on the metapage. That lock represents a scan of the pending list, whether or not there is a pending list at the moment. Forget about the optimization to skip locking/checking for locks, when fastupdate=off. Maybe some of that was safe, but I couldn't convince myself of it, so better to rip it out and keep things simple. 2. If a scan finds no match, it still needs to lock the entry page. The point of predicate locks is to lock the gabs between values, whether or not there is a match. The included 'predicate-gin-nomatch' test tests that case. 
In addition to those two bug fixes, this removes some unnecessary locking, following the principle laid out in the README. Because all items in a posting tree have the same key value, a lock on the posting tree root is enough to cover all the items. (With a very large posting tree, it would possibly be better to lock the posting tree leaf pages instead, so that a "skip scan" with a query like "A & B", you could avoid unnecessary conflict if a new tuple is inserted with A but !B. But let's keep this simple.) Also, some spelling fixes. diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README index 990b5ffa58..cc434b1feb 100644 --- a/src/backend/access/gin/README +++ b/src/backend/access/gin/README @@ -331,6 +331,40 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the deleted pages around with the right-link intact until all concurrent scans have finished.) +Predicate Locking +- + +GIN supports predicate locking, for serializable snapshot isolation. +A predicate locks represent that a scan has scanned a range of values. They +are not concerned with physical pages as such, but the logical key values. +A predicate lock on a page covers the key range that would belong on that +page, whether or not there are any matching tuples there currently. In other +words, a predicate lock on an index page covers the "gaps" between the index +tuples. To minimize false positives, predicate locks are acquired at the +finest level possible. + +* Like in the B-tree index, it is enough to lock only leaf pages, because all + insertions happen at the leaf level. + +* In an equality search (i.e. not a partial match search), if a key entry has + a posting tree, we lock the posting tree root page, to represent a lock on + just that key entry. Otherwise, we lock the entry tree page. We also lock + the entry tree page if no match is found, to lock the "gap" where the entry + would've been, had there been one. 
+ +* In a partial match search, we lock all the entry leaf pages that we scan, + in addition to locks on posting tree roots, to represent the "gaps" between + values. + +* In addition to the locks on entry leaf pages and posting tree roots, all + scans grab a lock the metapage. This is to interlock with insertions to + the fast update pending list. An insertion to the pending list can really + belong anywhere in the tree, and the lock on the metapage represents that. + +The interlock for fastupdate pending lists means that with fastupdate=on, +we effectively always grab a full-index lock, so you could get a lot of false +positives. + Compatibility - diff --git a/src/backend/access/gin/ginbtree.c b/src/backe
Re: Corrupted btree index on HEAD because of covering indexes
Teodor, are you caught up to the point where it'd be okay to run pgindent, or are there still patches waiting? There is a GIN predicate lock patch from Heikki that I will work on. But I have no objection to running pgindent; I believe the GIN patch will be easy to adjust. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
It looks like you pushed v1, which didn't have the tests and other changes you asked for. Attached patch adds those back. Oops, I missed that, I don't know how... Thank you very much for your quick eye! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
2) Please add a DELETE to the test, as is done in 6db4b49986be3fe59a1f6ba6fabf9852864efc3e Done. I will leave it to you to decide whether or not the original create_index.sql test can now be removed. Leave it in both places. In an ideal world, the amcheck test would create a broken index, but I don't know a correct way to do that :) -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
Thank you, pushed. Peter Geoghegan wrote: On Tue, Apr 24, 2018 at 9:06 AM, Teodor Sigaev <teo...@sigaev.ru> wrote: Perfect! -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
Perfect! I would like to commit it but have some suggestions: 1) TRUNCATE bttest_multi; INSERT INTO bttest_multi SELECT i, i%2 FROM generate_series(1, 10) as i; SELECT bt_index_parent_check('bttest_multi_idx', true); To improve test stability it would be better to disable autovacuum: ALTER TABLE bttest_multi SET (autovacuum_enabled = false) 2) Please add a DELETE to the test, as is done in 6db4b49986be3fe59a1f6ba6fabf9852864efc3e 3) It's not directly connected to this patch, but the allocation of BtreeCheckState is not needed; it could be allocated on the stack. 4) Nevertheless, I suggest using palloc0 (or memset(0)) for BtreeCheckState. Currently several fields of that structure may be left uninitialized. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: bms_prev_member won't work correctly if bitmapword is 64-bits
Thank you, pushed David Rowley wrote: bms_prev_member mistakenly has a hardcoded 24 in the function. This should really be BITS_PER_BITMAPWORD - 8 so that it is properly set to 56 if someone compiles with 64-bit bitmapwords. The attached fixes this and also adds a test to exercise the function a bit. [1] indicates there's currently no coverage of this function at all. [1] https://coverage.postgresql.org/src/backend/nodes/bitmapset.c.gcov.html -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
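The fix above replaces a hardcoded 24 with BITS_PER_BITMAPWORD - 8, so the byte-at-a-time scan starts at the word's top byte regardless of word width (56 for 64-bit bitmapwords). A distilled C sketch of that scan over a single 64-bit word (hypothetical and simplified: the real bms_prev_member walks multiple words and uses a leftmost-one lookup table instead of a bit loop):

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BITMAPWORD 64
typedef uint64_t bitmapword;

/*
 * Find the highest set bit in 'w' at or below 'prevbit' (0..63),
 * or -1 if there is none.  The shift must start at
 * BITS_PER_BITMAPWORD - 8; a hardcoded 24 only works for 32-bit words.
 */
static int
prev_set_bit(bitmapword w, int prevbit)
{
    /* mask off bits strictly above prevbit */
    w &= ((bitmapword) 1 << prevbit) | (((bitmapword) 1 << prevbit) - 1);
    if (w == 0)
        return -1;

    /* scan bytes from the most significant byte downward */
    for (int shift = BITS_PER_BITMAPWORD - 8; shift >= 0; shift -= 8)
    {
        bitmapword byte = (w >> shift) & 0xFF;

        if (byte != 0)
        {
            int hi = 7;

            while (((byte >> hi) & 1) == 0)
                hi--;
            return shift + hi;
        }
    }
    return -1;                  /* unreachable: w was nonzero */
}
```

With the 32-bit constant left in place, the loop would never examine the top four bytes of a 64-bit word, silently missing members above bit 31.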
Re: Corrupted btree index on HEAD because of covering indexes
Thank you, pushed I see your point. Maybe don't have the newline between the get and set, though, to match the existing style. And, the note about the assertion seems unnecessary. Removed newline and note "Get/set leaf page highkey's link. During the second phase of deletion, the target leaf page's high key may point to an ancestor page (at all other times, the leaf level high key's link is not used). See the nbtree README for full details." Changed as you suggested. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
Should we use BTreeInnerTupleGetDownLink() as soon as we use BTreeInnerTupleSetDownLink() for setting this? Or even invent a BTreeInnerTupleDownLinkIsValid() macro? I am not sure. Here we actually store an UP link - to the top parent to remove. I'm afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink could cause confusion; on the other hand, introducing BTreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems like over-engineering. After a closer look I changed my opinion. To have clean code, it's much better to have a new pair of get/set macros specifically to manage the link to the top parent during page deletion. This removes the last naked usage of ItemPointer(SetInvalid/IsInvalid/GetBlockNumberNoCheck) and uses self-describing macros. Patch is attached. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index beef089ba8..3be229db1f 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -1602,10 +1602,9 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack) MemSet(, 0, sizeof(IndexTupleData)); trunctuple.t_info = sizeof(IndexTupleData); if (target != leafblkno) - ItemPointerSetBlockNumber(_tid, target); + BTreeTupleSetTopParent(, target); else - ItemPointerSetInvalid(_tid); - BTreeTupleSetNAtts(, 0); + BTreeTupleSetTopParent(, InvalidBlockNumber); if (PageAddItem(page, (Item) , sizeof(IndexTupleData), P_HIKEY, false, false) == InvalidOffsetNumber) @@ -1690,7 +1689,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) BTPageOpaque opaque; bool rightsib_is_rightmost; int targetlevel; - ItemPointer leafhikey; + IndexTuple leafhikey; BlockNumber nextchild; page = BufferGetPage(leafbuf); @@ -1702,7 +1701,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * Remember some information about the leaf page.
*/ itemid = PageGetItemId(page, P_HIKEY); - leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid; + leafhikey = (IndexTuple) PageGetItem(page, itemid); leafleftsib = opaque->btpo_prev; leafrightsib = opaque->btpo_next; @@ -1714,9 +1713,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * parent in the branch. Set 'target' and 'buf' to reference the page * actually being unlinked. */ - if (ItemPointerIsValid(leafhikey)) + target = BTreeTupleGetTopParent(leafhikey); + + if (target != InvalidBlockNumber) { - target = ItemPointerGetBlockNumberNoCheck(leafhikey); Assert(target != leafblkno); /* fetch the block number of the topmost parent's left sibling */ @@ -1919,12 +1919,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * branch. */ if (target != leafblkno) - { - if (nextchild == InvalidBlockNumber) - ItemPointerSetInvalid(leafhikey); - else - ItemPointerSetBlockNumber(leafhikey, nextchild); - } + BTreeTupleSetTopParent(leafhikey, nextchild); /* * Mark the page itself deleted. 
It can be recycled when all current diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index fb8c769df9..67a94cb80a 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -800,11 +800,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record) */ MemSet(, 0, sizeof(IndexTupleData)); trunctuple.t_info = sizeof(IndexTupleData); - if (xlrec->topparent != InvalidBlockNumber) - ItemPointerSetBlockNumber(_tid, xlrec->topparent); - else - ItemPointerSetInvalid(_tid); - BTreeTupleSetNAtts(, 0); + BTreeTupleSetTopParent(, xlrec->topparent); if (PageAddItem(page, (Item) , sizeof(IndexTupleData), P_HIKEY, false, false) == InvalidOffsetNumber) @@ -912,11 +908,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) /* Add a dummy hikey item */ MemSet(, 0, sizeof(IndexTupleData)); trunctuple.t_info = sizeof(IndexTupleData); - if (xlrec->topparent != InvalidBlockNumber) - ItemPointerSetBlockNumber(_tid, xlrec->topparent); - else - ItemPointerSetInvalid(_tid); - BTreeTupleSetNAtts(, 0); + BTreeTupleSetTopParent(, xlrec->topparent); if (PageAddItem(page, (Item) , sizeof(IndexTupleData), P_HIKEY, false, false) == InvalidOffsetNumber) diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 1194be9281..34c71978d5 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -226,6 +226,20 @@ typedef struct BTMetaPageData #define BTreeInnerTupleSetDownLink(itup, blkno) \ ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)) +/* + * Get/set link to top parent to remove. It will be better to have assertion + * BTreeTupleGetNAtts() == 0 but pre-v11 version hasn't set INDEX_ALT_TID_MASK + *
Re: Corrupted btree index on HEAD because of covering indexes
heard of people using bt_index_parent_check() in production, but only when they already knew that their database was corrupt, and wanted to isolate the problem. I imagine that people that use bt_index_parent_check() in production do so because they want as much information as possible, and are willing to do whatever it takes to get more information. That's why I think we need to improve amcheck - ideally, it should not miss any mistake in the tree structure. Agree, but at least this place needs a comment explaining why it's safe. Good idea. Could you write it? I'm afraid I can't give a good explanation of why we believe that nobody updates this page and its high key while the page is unlocked but pinned. I also think that we could have better conventional regression test coverage here. Will try to invent a not-so-large test. Your smaller test takes about 350ms to run on a debug build on my laptop, which seems worth it (honestly, this test should have been there already). Maybe we can add it to the amcheck regression tests instead, since those are run less often. This also allows us to add a test case specifically as part of the amcheck enhancement, which makes a bit more sense. Hm, it seems to me that 350ms is short enough to place it in both the core and amcheck tests. I think basic functionality should be covered by the core tests, as we test insert/create there. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
I tried to minimize Michael's test case and add it to the patch. -if (ItemPointerIsValid(leafhikey)) +if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber) Should we use BTreeInnerTupleGetDownLink() as soon as we use BTreeInnerTupleSetDownLink() for setting this? Or even invent a BTreeInnerTupleDownLinkIsValid() macro? I am not sure. Here we actually store an UP link - to the top parent to remove. I'm afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink could cause confusion; on the other hand, introducing BTreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems like over-engineering. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
I also think that we could have better conventional regression test coverage here. I tried to minimize Michael's test case and add it to patch. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index beef089ba8..e6fe4ba90e 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -1714,7 +1714,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * parent in the branch. Set 'target' and 'buf' to reference the page * actually being unlinked. */ - if (ItemPointerIsValid(leafhikey)) + if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber) { target = ItemPointerGetBlockNumberNoCheck(leafhikey); Assert(target != leafblkno); diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out index fe5b698669..fc81088d4b 100644 --- a/src/test/regress/expected/create_index.out +++ b/src/test/regress/expected/create_index.out @@ -3052,6 +3052,16 @@ explain (costs off) Filter: (NOT b) (4 rows) +-- +-- Test for multilevel page deletion +-- +CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint); +INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,8) i; +ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d); +DELETE FROM delete_test_table WHERE a > 4; +VACUUM delete_test_table; +DELETE FROM delete_test_table WHERE a > 10; +VACUUM delete_test_table; -- -- REINDEX (VERBOSE) -- diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out index 8afb1f2f7e..0aa5357917 100644 --- a/src/test/regress/expected/sanity_check.out +++ b/src/test/regress/expected/sanity_check.out @@ -38,6 +38,7 @@ d_star|f date_tbl|f default_tbl|f defaultexpr_tbl|f +delete_test_table|t dept|f dupindexcols|t e_star|f diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql index 
f7731265a0..f9e7118f0d 100644 --- a/src/test/regress/sql/create_index.sql +++ b/src/test/regress/sql/create_index.sql @@ -1061,6 +1061,17 @@ explain (costs off) explain (costs off) select * from boolindex where not b order by i limit 10; +-- +-- Test for multilevel page deletion +-- +CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint); +INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,8) i; +ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d); +DELETE FROM delete_test_table WHERE a > 4; +VACUUM delete_test_table; +DELETE FROM delete_test_table WHERE a > 10; +VACUUM delete_test_table; + -- -- REINDEX (VERBOSE) --
Re: Corrupted btree index on HEAD because of covering indexes
I would like to go and implement this check now, and if all goes well I may propose that you commit it to the master branch for v11. I don't see this as a new feature. I see it as essential testing infrastructure. What do you think about adding this new check soon? Agree. Hope nobody has found a way to use the amcheck module in production to serve user requests. But it should be implemented before the BETA stage, otherwise we will get a lot of objections. Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use the truncated high key IndexTuple as storage for the block number of the top parent to remove. And it sets BTreeTupleSetNAtts(&trunctuple, 0) - which is stored in ip_posid. But later, in _bt_unlink_halfdead_page(), we check the high key ItemPointer with the ItemPointerIsValid macro - and it returns false, because the offset is actually InvalidOffsetNumber - i.e. 0, which was set by BTreeTupleSetNAtts. And some wrong decisions follow; I didn't look at that. I'm not at all surprised. I strongly suspected that it was some simple issue with the representation, since the INCLUDE patch didn't actually change the page deletion algorithm in any way. +1 A trivial and naive fix is attached, but for me it looks a bit annoying that we store a pointer (leafhikey) somewhere inside an unlocked page. Well, it has worked that way for a long time. No reason to change it now. Agree, but at least this place needs a comment - why it's safe. I also think that we could have better conventional regression test coverage here. Will try to invent a not-so-large test. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
I'll take a look tomorrow. Interesting, contrib/amcheck doesn't find any error in the index. Seems it's a subject for further improvement. Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use the truncated high key IndexTuple as storage for the block number of the top parent to remove. And it sets BTreeTupleSetNAtts(&trunctuple, 0) - which is stored in ip_posid. But later, in _bt_unlink_halfdead_page(), we check the high key ItemPointer with the ItemPointerIsValid macro - and it returns false, because the offset is actually InvalidOffsetNumber - i.e. 0, which was set by BTreeTupleSetNAtts. And some wrong decisions follow; I didn't look at that. A trivial and naive fix is attached, but for me it looks a bit annoying that we store a pointer (leafhikey) somewhere inside an unlocked page. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/

diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index beef089ba8..e6fe4ba90e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1714,7 +1714,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * parent in the branch.  Set 'target' and 'buf' to reference the page
 	 * actually being unlinked.
 	 */
-	if (ItemPointerIsValid(leafhikey))
+	if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)
 	{
 		target = ItemPointerGetBlockNumberNoCheck(leafhikey);
 		Assert(target != leafblkno);
Re: WIP: Covering + unique indexes.
Thank you, pushed Peter Geoghegan wrote: On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev <teo...@sigaev.ru> wrote: Thank you, pushed. Thanks. I saw another preexisting issue, this time one that has been around since 2007. Commit bc292937 forgot to remove a comment above _bt_insertonpg() (the 'afteritem' stuff ended up being moved to the bottom of _bt_findinsertloc(), where it remains today). The attached patch fixes this, and in passing mentions the fact that _bt_insertonpg() only performs retail insertions, and specifically never inserts high key items. I don't think it's necessary to add something about negative infinity items to the same comment block. While it's true that _bt_insertonpg() cannot truncate downlinks to make new minus infinity items, I see that as a narrower issue. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you, pushed. Actually, I see one tiny issue with extra '*' characters here: +* The number of attributes won't be explicitly represented if the +* negative infinity tuple was generated during a page split that +* occurred with a version of Postgres before v11. There must be a +* problem when there is an explicit representation that is +* non-zero, * or when there is no explicit representation and the +* tuple is * evidently not a pre-pg_upgrade tuple. I also suggest fixing this indentation before commit: + /* +*Cannot leak memory here, TupleDescCopy() doesn't allocate any +* inner structure, so, plain pfree() should clean all allocated memory +*/ fixed -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Corrupted btree index on HEAD because of covering indexes
Will see... Michael Paquier wrote: Hi all, I was just testing the VACUUM truncation logic, and bumped into what looks like a corrupted btree index. Here is a reproducer:

create table aa (a int primary key, b bool);
insert into aa values (generate_series(1,100), false);
checkpoint;
update aa set b = false where a > 50;  -- Dirties a set of shared buffers
delete from aa where a > 75;           -- Delete a set of rows
vacuum aa;
delete from aa where a > 10;
vacuum aa;                             -- error on btree with right sibling

And here is the actual failure during the second vacuum:

ERROR:  XX000: right sibling 4132 of block 2128 is not next child 5396 of block 412 in index "aa_pkey"
LOCATION:  _bt_mark_page_halfdead, nbtpage.c:1564

This works on REL_10_STABLE, so I am adding an open item. I have not investigated the exact problem yet, but bisect is showing me covering indexes as the culprit (8224de4). Thanks, -- Michael -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
I don't understand. We do check the number of attributes on rightmost pages, but we do so separately, in the main loop. For every item that isn't the high key. Comment added, pls, verify. And refactored _bt_check_natts(), I hope, now it's a bit more readable. 4) BTreeTupSetNAtts - seems, it's better to add check of nattrs to fits to BT_N_KEYS_OFFSET_MASK mask, and it should not touch BT_RESERVED_OFFSET_MASK bits, now it will overwrite that bits. An assertion sounds like it would be an improvement, though I don't see that in the patch you posted. I didn't do that in v1, sorry, I was unclear. Attached patch contains all changes suggested in my previous email. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index be0206d58e..1a605f9944 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -698,6 +698,9 @@ nextpage: * "real" data item on the page to the right (if such a first item is * available). * + * - That tuples report that they have the expected number of attributes. + * INCLUDE index pivot tuples should not contain non-key attributes. + * * Furthermore, when state passed shows ShareLock held, and target page is * internal page, function also checks: * @@ -722,43 +725,35 @@ bt_target_page_check(BtreeCheckState *state) elog(DEBUG2, "verifying %u items on %s block %u", max, P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock); - - /* Check the number of attributes in high key if any */ - if (!P_RIGHTMOST(topaque)) + /* + * Check the number of attributes in high key. 
Note, rightmost page doesn't + * contain a high key, so nothing to check + */ + if (!P_RIGHTMOST(topaque) && + !_bt_check_natts(state->rel, state->target, P_HIKEY)) { - if (!_bt_check_natts(state->rel, state->target, P_HIKEY)) - { - ItemId itemid; - IndexTuple itup; - char *itid, - *htid; + ItemId itemid; + IndexTuple itup; - itemid = PageGetItemId(state->target, P_HIKEY); - itup = (IndexTuple) PageGetItem(state->target, itemid); - itid = psprintf("(%u,%u)", state->targetblock, P_HIKEY); - htid = psprintf("(%u,%u)", - ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)), - ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid))); + itemid = PageGetItemId(state->target, P_HIKEY); + itup = (IndexTuple) PageGetItem(state->target, itemid); - ereport(ERROR, - (errcode(ERRCODE_INDEX_CORRUPTED), - errmsg("wrong number of index tuple attributes for index \"%s\"", - RelationGetRelationName(state->rel)), - errdetail_internal("Index tid=%s natts=%u points to %s tid=%s page lsn=%X/%X.", - itid, - BTreeTupGetNAtts(itup, state->rel), - P_ISLEAF(topaque) ? "heap" : "index", - htid, - (uint32) (state->targetlsn >> 32), - (uint32) state->targetlsn))); - } + ereport(ERROR, +(errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("wrong number of high key index tuple attributes in index \"%s\"", + RelationGetRelationName(state->rel)), + errdetail_internal("Index block=%u natts=%u block type=%s page lsn=%X/%X.", + state->targetblock, + BTreeTupleGetNAtts(itup, state->rel), + P_ISLEAF(topaque) ? "heap" : "index", + (uint32) (state->targetlsn >> 32), + (uint32) state->targetlsn))); } - /* * Loop over page items, starting from first non-highkey item, not high - * key (if any). Also, immediately skip "negative infinity" real item (if - * any). + * key (if any). Most tests are not performed for the "negative infinity" + * real item (if any). 
*/ for (offset = P_FIRSTDATAKEY(topaque); offset <= max; @@ -791,7 +786,7 @@ bt_target_page_check(BtreeCheckState *state) tupsize, ItemIdGetLength(itemid), (uint32) (state->targetlsn >> 32), (uint32) state->targetlsn), - errhint("This could be a torn page problem"))); + errhint("This could be a torn page problem."))); /* Check the number of index tuple attributes */ if (!_bt_check_natts(state->rel, state->target, offset)) @@ -806,11 +801,11 @@ bt_target_page_check(BtreeCheckState *state) ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), - errmsg("wrong number of index tuple attributes for index \"%s\"", + errmsg("wrong number of index tuple attributes in index \"%s\"", RelationGetRelationName(state->rel)), errdetail_internal("Index tid=%s natts=%u points to %s tid=%s pa
Re: pgindent run soon?
If there are large refactoring or bug-fix patches that haven't landed yet, then it'd be appropriate to wait for those to get in, but I'm not aware of such at the moment. Pls, wait https://www.postgresql.org/message-id/9c63951d-7696-ecbb-b832-70db7ed3f39b%40sigaev.ru Thank you. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
I mostly agree with your patch, nice work, but I have some notices for your patch:

1) bt_target_page_check():

   if (!P_RIGHTMOST(topaque) &&
       !_bt_check_natts(state->rel, state->target, P_HIKEY))

Seems not very obvious: it looks like we don't need to check nattrs on the rightmost page. Okay, I remember that on the rightmost page there is no hikey at all, but at least a comment should be added. Implicitly bt_target_page_check() already takes into account 'is page rightmost or not?' by using P_FIRSTDATAKEY, so it may be better to move the rightmost check into bt_target_page_check() with some refactoring of the if-logic:

   if (offset > maxoff)
       return true; /* nothing to check, also covers empty rightmost page */

   if (P_ISLEAF)
   {
       if (offnum >= P_FIRSTDATAKEY)
           ...
       else /* if (offnum == P_HIKEY) */
           ...
   }
   else /* !P_ISLEAF */
   {
       if (offnum == P_FIRSTDATAKEY)
           ...
       else if (offnum > P_FIRSTDATAKEY)
           ...
       else /* if (offnum == P_HIKEY) */
           ...
   }

I see only 3 possible nattrs values: 0, nkey and nkey+nincluded, but collapsing the if-clause into three branches would cause difficulties for code readers. Let the compiler optimize that. Sorry for the late notice, but it caught my attention only when I noticed the (!P_RIGHTMOST && !_bt_check_natts) condition.

2) Style notice:

   ItemPointerSetInvalid(&trunctuple.t_tid);
+  BTreeTupSetNAtts(&trunctuple, 0);
   if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,

It's better to have a blank line between BTreeTupSetNAtts() and the if clause.

3) Naming: BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use the full word Tuple in the downlink macros, here we use just Tup. Seems better to have Tuple in both cases. Or Tup, but still in both cases.

4) BTreeTupSetNAtts - seems it's better to add a check that nattrs fits into the BT_N_KEYS_OFFSET_MASK mask, and that it does not touch the BT_RESERVED_OFFSET_MASK bits; now it will overwrite those bits.
Attached patch is rebased to current head and contains some comment improvement in index_truncate_tuple() - you save some amount of memory with TupleDescCopy() call but didn't explain why pfree is enough to free all allocated memory. Peter Geoghegan wrote: On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote: Hmm, what do you think about making BTreeTupGetNAtts() take tupledesc argument, not relation> It anyway doesn't need number of key attributes, only total number of attributes. Then _bt_isequal() would be able to use BTreeTupGetNAtts(). That would make the BTreeTupGetNAtts() assertions quite a bit more verbose, since there is usually no existing tuple descriptor variable, but there is almost always a "rel" variable. The coverage within _bt_isequal() does not seem important, because we only use it with the page high key in rare cases, where _bt_moveright() will already have tested the highkey. I think it's completely OK to fix broken things when you've to touch them. Probably, Teodor would decide to make that by separate commit. So, it's up to him. You're right to say that this old negative infinity tuple code within _bt_insertonpg() is broken code, and not just dead code. The code doesn't just confuse things (e.g. see recent commit 2a67d644). It also seems like it could actually be harmful. This is code that could only ever corrupt your database. I'm fine if Teodor wants to commit that change separately, of course. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index be0206d58e..7efd7ac47b 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -698,6 +698,9 @@ nextpage: * "real" data item on the page to the right (if such a first item is * available). * + * - That tuples report that they have the expected number of attributes. + * INCLUDE index pivot tuples should not contain non-key attributes. 
+ * * Furthermore, when state passed shows ShareLock held, and target page is * internal page, function also checks: * @@ -722,43 +725,32 @@ bt_target_page_check(BtreeCheckState *state) elog(DEBUG2, "verifying %u items on %s block %u", max, P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock); - - /* Check the number of attributes in high key if any */ - if (!P_RIGHTMOST(topaque)) + /* Check the number of attributes in high key */ + if (!P_RIGHTMOST(topaque) && + !_bt_check_natts(state->rel, state->target, P_HIKEY)) { - if (!_bt_check_natts(state->rel, state->target, P_HIKEY)) - { - ItemId itemid; - IndexTuple itup; - char *itid, - *htid; + ItemId itemid; + IndexTuple itup; - itemid = PageGetItemId(state-&g
Re: psql leaks memory on query cancellation
I imagine that this indicates that control-C processing allocates some memory it doesn't free, resulting in an "island" up at the end of memory that prevents glibc from releasing all the free memory it's got. Whether that's an actual leak, or just memory we're holding onto in hopes of reusing it, isn't clear. (valgrind might be useful.) malloc can request memory from the kernel in two ways: sbrk() and mmap(); the first one has the described problem, mmap() doesn't. This is described in mallopt(3) in the section on M_MMAP_THRESHOLD; to test it, try to repeat the test with the MALLOC_MMAP_THRESHOLD_ environment variable set to 8192. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Peter Geoghegan wrote: On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan <p...@bowt.ie> wrote: I did find another problem, though. Looks like the idea to explicitly represent the number of attributes directly has paid off already: pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text); create unique index cov_idx on covering_bug (f1) include(f2); insert into covering_bug select i, i * random() * 1000, i * random() * 10 from generate_series(0,10) i; DEBUG: building index "pg_toast_16451_index" on table "pg_toast_16451" serially CREATE TABLE DEBUG: building index "cov_idx" on table "covering_bug" serially CREATE INDEX ERROR: tuple has wrong number of attributes in index "cov_idx" Actually, this was an error on my part (though I'd still maintain that the check paid off here!). I'll still add defensive assertions inside _bt_newroot(), and anywhere else that they're needed. There is no reason to not add defensive assertions in all code that handles page splits, and needs to fetch a highkey from some other page. We missed a few of those. Agree, I prefer to add more Asserts, maybe even more than actually needed. Assert-documented code :) I'll add an item to the "Decisions to Recheck Mid-Beta" section of the open items page for this patch. We should review the decision to make a call to _bt_check_natts() within _bt_compare(). It might work just as well as an assertion, and it would be unfortunate if workloads that don't use covering indexes had to pay a price for the _bt_check_natts() call, even if it was a small price. I've seen _bt_compare() appear prominently in profiles quite a few times. Could you show a patch? I think we need to move _bt_check_natts() and its call under USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't pay for an unused feature. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
pushed. Hope, second try will be successful. Teodor Sigaev wrote: Thank you, pushed Amit Langote wrote: Hi. On 2018/04/11 0:36, Teodor Sigaev wrote: Does the attached fix look correct? Haven't checked the fix with ATTACH PARTITION though. Attached patch seems to fix the problem. However, I would rather get rid of modifying stmt->indexParams. That seems to be more logical for me. Also, it would be good to check some covering indexes on partitioned tables. See the attached patch. Seems right way, do not modify incoming object and do not copy rather large and deep nested structure as suggested by Amit. Yeah, Alexander's suggested way of using a separate variable for indexParams is better. But it will be better to have a ATTACH PARTITION test too. I have added tests. Actually, instead of modifying existing tests, I think it might be better to have a separate section at the end of indexing.sql to test covering indexes feature for partitioned tables. Attached find updated patch. Thanks, Amit -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
Thanks to everyone, this part is pushed. I will wait a bit before pushing the topic-starter patch, to have time to look at the buildfarm. Alvaro, could you add a comment to CompareIndexInfo() to clarify why it doesn't compare indoptions (like DESC/ASC etc)? It doesn't matter for the uniqueness of the index, but it changes the order, and it's not clear, at least to me, why we don't pay attention to that. In the attached patch I did:
1) fix a bug in CompareIndexInfo() which checks opfamily for including columns
2) correct sizes for all vectors
3) add assertions in various places to be sure that we don't try to use an including column as a key column
4) per Peter Geoghegan's idea, add an error message if somebody adds options to an including column, instead of silently ignoring them.
-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Covering GiST indexes
Interesting work. I don't have time now to study your patch in depth, so add it to the next commitfest, pls. First of all, I'd like to see more tests in the patch, not only CREATE INDEX. Andrey Borodin wrote: Hi, hackers! Looks like we finally have covering indexes! And that's cool! So I decided to create a thread to discuss covering GiST indexes. Here's a prototype patch implementing this functionality. It is quite small (+80 -30) and lacks tests and docs. But it creates a context. I have two concerns. The first one is about INDEX_AM_RESERVED_BIT. B-tree uses it as a base for prefix truncation (I'm not quite sure why it is usually called suffix truncation, but this is a matter for another thread). GiST, probably, will not use [pre\su]fix truncation. But I'd like to use that 13th bit to implement intra-page indexing - a way to improve search within a gist page. See [0,1] Second, currently covering indexes do not allow the same attribute in both the key and include parts: # create index on x using gist(c) include (c); ERROR: included columns must not intersect with key columns But it makes sense, for example, for geometries like PostGIS. Index keys are truncated to small MBRs, while having the whole complex geometry right in the index could be handy. Any feedback will be appreciated. Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/7780A07B-4D04-41E2-B228-166B41D07EEE%40yandex-team.ru [1] https://www.postgresql.org/message-id/CAJEAwVE0rrr+OBT-P0gDCtXbVDkBBG_WcXwCBK=gho4fewu...@mail.gmail.com -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
That's the idea that I tried to express. The point is that we need to tell the user that there is no need to worry about it, rather than that they're wrong to ask about it. Though we should probably actually just throw an error. Or maybe it should be the collation of the underlying table columns. Otherwise the collation returned by an index-only scan would be different from a table scan, no? +1, dangerous -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
Peter Geoghegan wrote: On Wed, Apr 11, 2018 at 2:29 PM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote: But in this case it doesn't even do equality comparison, it just returns the value. That's the idea that I tried to express. The point is that we need to tell the user that there is no need to worry about it, rather than that they're wrong to ask about it. Though we should probably actually just throw an error. Any operation over including columns is a job for a filter, not the index, so the collation will not be used anywhere inside the index. 2Alexander: ComputeIndexAttrs() may use whole vectors (for including columns too) just to simplify coding; in other places it seems better to have exactly sized vectors. Using full-sized vectors could mask an error, for example, setting opfamily to InvalidOid for a key column. But if you insist on that, I'll modify the attached patch to use full-sized vectors everywhere. In the attached patch I did:
1) fix a bug in CompareIndexInfo() which checks opfamily for including columns
2) correct sizes for all vectors
3) add assertions in various places to be sure that we don't try to use an including column as a key column
4) per Peter Geoghegan's idea, add an error message if somebody adds options to an including column, instead of silently ignoring them.
-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 5d73e92901..58b4249784 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -297,6 +297,7 @@ ConstructTupleDescriptor(Relation heapRelation, Oid *classObjectId) { int numatts = indexInfo->ii_NumIndexAttrs; + int numkeyatts = indexInfo->ii_NumIndexKeyAttrs; ListCell *colnames_item = list_head(indexColNames); ListCell *indexpr_item = list_head(indexInfo->ii_Expressions); IndexAmRoutine *amroutine; @@ -375,7 +376,8 @@ ConstructTupleDescriptor(Relation heapRelation, to->attidentity = '\0'; to->attislocal = true; to->attinhcount = 0; - to->attcollation = collationObjectId[i]; + to->attcollation = (i < numkeyatts) ? + collationObjectId[i] : InvalidOid; } else { @@ -411,7 +413,8 @@ ConstructTupleDescriptor(Relation heapRelation, to->attcacheoff = -1; to->atttypmod = exprTypmod(indexkey); to->attislocal = true; - to->attcollation = collationObjectId[i]; + to->attcollation = (i < numkeyatts) ? 
+ collationObjectId[i] : InvalidOid; ReleaseSysCache(tuple); @@ -608,9 +611,9 @@ UpdateIndexRelation(Oid indexoid, indkey = buildint2vector(NULL, indexInfo->ii_NumIndexAttrs); for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++) indkey->values[i] = indexInfo->ii_KeyAttrNumbers[i]; - indcollation = buildoidvector(collationOids, indexInfo->ii_NumIndexAttrs); + indcollation = buildoidvector(collationOids, indexInfo->ii_NumIndexKeyAttrs); indclass = buildoidvector(classOids, indexInfo->ii_NumIndexKeyAttrs); - indoption = buildint2vector(coloptions, indexInfo->ii_NumIndexAttrs); + indoption = buildint2vector(coloptions, indexInfo->ii_NumIndexKeyAttrs); /* * Convert the index expressions (if any) to a text datum @@ -1080,7 +1083,7 @@ index_create(Relation heapRelation, /* Store dependency on collations */ /* The default collation is pinned, so don't bother recording it */ - for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++) + for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++) { if (OidIsValid(collationObjectId[i]) && collationObjectId[i] != DEFAULT_COLLATION_OID) @@ -1831,6 +1834,11 @@ CompareIndexInfo(IndexInfo *info1, IndexInfo *info2, if (info1->ii_NumIndexAttrs != info2->ii_NumIndexAttrs) return false; + /* and same number of key attributes */ + if (info1->ii_NumIndexKeyAttrs != info2->ii_NumIndexKeyAttrs) + return false; + + /* * and columns match through the attribute map (actual attribute numbers * might differ!) 
Note that this implies that index columns that are @@ -1848,6 +1856,10 @@ CompareIndexInfo(IndexInfo *info1, IndexInfo *info2, info1->ii_KeyAttrNumbers[i])) return false; + /* collation and opfamily is not valid for including columns */ + if (i >= info1->ii_NumIndexKeyAttrs) + continue; + if (collations1[i] != collations2[i]) return false; if (opfamilies1[i] != opfamilies2[i]) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 860a60d109..4933608e61 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -1359,7 +1359,8 @@ CheckPredicate(Expr *predicate) /* * Compute per-index-column information, including indexed column numbers - * or index expressions, opclasses, and indoptions. + * or index expressions, opclasses, and indoptions. Note, all output v
Re: Partitioned tables and covering indexes
Actually, the discovered bug is not related to the patch, except that the new test runs into it. The problem is: CompareIndexInfo() checks rd_opfamily for equality for all attributes, not only for key attributes. Patch attached. But it seems to me the field names of the IndexInfo structure are a bit confusing now:

   int ii_NumIndexAttrs;    /* total number of columns in index */
   int ii_NumIndexKeyAttrs; /* number of key columns in index */
   AttrNumber ii_KeyAttrNumbers[INDEX_MAX_KEYS];

ii_KeyAttrNumbers contains all columns, i.e. it contains ii_NumIndexAttrs columns, not ii_NumIndexKeyAttrs of them, as one might easily think. I suggest renaming ii_KeyAttrNumbers to ii_AttrNumbers or ii_IndexAttrNumbers. Opinions?

   for (i = 0; i < info1->ii_NumIndexAttrs; i++)
   {
       if (maplen < info2->ii_KeyAttrNumbers[i])

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d73e92901..0002816fcc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1831,6 +1831,10 @@ CompareIndexInfo(IndexInfo *info1, IndexInfo *info2,
 	if (info1->ii_NumIndexAttrs != info2->ii_NumIndexAttrs)
 		return false;
 
+	/* and same number of key attributes */
+	if (info1->ii_NumIndexKeyAttrs != info2->ii_NumIndexKeyAttrs)
+		return false;
+
 	/*
 	 * and columns match through the attribute map (actual attribute numbers
 	 * might differ!)  Note that this implies that index columns that are
@@ -1850,7 +1854,9 @@ CompareIndexInfo(IndexInfo *info1, IndexInfo *info2,
 
 		if (collations1[i] != collations2[i])
 			return false;
-		if (opfamilies1[i] != opfamilies2[i])
+
+		/* opfamily is valid only for key attributes */
+		if (i < info2->ii_NumIndexKeyAttrs && opfamilies1[i] != opfamilies2[i])
 			return false;
 	}
Re: Partitioned tables and covering indexes
The patch makes the buildfarm almost red, so the patch is temporarily reverted. Actually, the discovered bug is not related to the patch, except that the new test runs into it. The problem is: CompareIndexInfo() checks rd_opfamily for equality for all attributes, not only for key attributes. Obviously, CompareIndexInfo() needs more work:

   for (i = 0; i < info1->ii_NumIndexAttrs; i++)
   {
       if (maplen < info2->ii_KeyAttrNumbers[i])

Seems we can move away from the ii_KeyAttrNumbers array. Amit Langote wrote: Hi. On 2018/04/11 0:36, Teodor Sigaev wrote: Does the attached fix look correct? Haven't checked the fix with ATTACH PARTITION though. Attached patch seems to fix the problem. However, I would rather get rid of modifying stmt->indexParams. That seems to be more logical for me. Also, it would be good to check some covering indexes on partitioned tables. See the attached patch. Seems right way, do not modify incoming object and do not copy rather large and deep nested structure as suggested by Amit. Yeah, Alexander's suggested way of using a separate variable for indexParams is better. But it will be better to have a ATTACH PARTITION test too. I have added tests. Actually, instead of modifying existing tests, I think it might be better to have a separate section at the end of indexing.sql to test covering indexes feature for partitioned tables. Attached find updated patch. Thanks, Amit -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
_bt_mark_page_halfdead() looked like it had a problem, but it now looks like I was wrong. I also verified every other BTreeInnerTupleGetDownLink() caller. It now looks like everything is good here.

Right: it tries to find the right page by consulting the parent page and taking the next key.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
Thank you, pushed.

Amit Langote wrote: Hi. On 2018/04/11 0:36, Teodor Sigaev wrote: Does the attached fix look correct? Haven't checked the fix with ATTACH PARTITION though. The attached patch seems to fix the problem. However, I would rather get rid of modifying stmt->indexParams. That seems more logical to me. Also, it would be good to check some covering indexes on partitioned tables. See the attached patch. Seems the right way: do not modify the incoming object, and do not copy the rather large and deeply nested structure, as suggested by Amit. Yeah, Alexander's suggested way of using a separate variable for indexParams is better. But it would be better to have an ATTACH PARTITION test too. I have added tests. Actually, instead of modifying existing tests, I think it might be better to have a separate section at the end of indexing.sql to test the covering indexes feature for partitioned tables. Attached find the updated patch. Thanks, Amit

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
* There is no pfree() within _bt_buildadd() for truncated tuples, even though that's a context where it's clearly not okay.

Agree.

* It might be a good idea to also pfree() the truncated tuple for most other _bt_buildadd() callers. Even though it's arguably okay in other cases, it seems worth being consistent about it (consistent with old nbtree code).

Seems so, though I don't see other calls to pfree() after that.

* There should probably be some documentation around why it's okay that we call index_truncate_tuple() with an exclusive buffer lock held (during a page split). For example, there should probably be a comment on the VARATT_IS_EXTERNAL() situation.

I have no objection to improving the docs/comments.

* Not sure that all calls to BTreeInnerTupleGetDownLink() are limited to inner tuples, which might be worth doing something about (perhaps just renaming the macro).

Which place looks suspicious, in your opinion?

I do not have the time to write a patch right away, but I should be able to post one in a few days. I want to avoid sending several small patches.

No problem, we can wait.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: Partitioned tables and covering indexes
Does the attached fix look correct? Haven't checked the fix with ATTACH PARTITION though. The attached patch seems to fix the problem. However, I would rather get rid of modifying stmt->indexParams. That seems more logical to me. Also, it would be good to check some covering indexes on partitioned tables. See the attached patch.

Seems the right way: do not modify the incoming object, and do not copy the rather large and deeply nested structure, as suggested by Amit. But it would be better to have an ATTACH PARTITION test too.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: Optimization of range queries
Hi!

12 years ago I proposed a patch which could "union" OR clauses into one range clause where possible. At that time pgsql could not use IS NULL as an index clause, so the patch doesn't support that:
https://www.postgresql.org/message-id/flat/45742C51.9020602%40sigaev.ru
(option number 4; all the others are already committed)

Konstantin Knizhnik wrote: Hi hackers, the Postgres optimizer is not able to build an efficient execution plan for the following query:

explain select * from people_raw where not ("ID"<2068113880 AND "INN" is not null) and "ID"<=2068629726 AND "INN" is not null;
                                  QUERY PLAN
 Bitmap Heap Scan on people_raw  (cost=74937803.72..210449640.49 rows=121521030 width=336)
   Recheck Cond: ("ID" <= 2068629726)
   Filter: (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))
   ->  Bitmap Index Scan on "People_pkey"  (cost=0.00..74907423.47 rows=2077021718 width=0)
         Index Cond: ("ID" <= 2068629726)
(5 rows)

Here the table is very large, but the query affects only a relatively small number of rows located in the range [2068113880, 2068629726]. Unfortunately, the optimizer doesn't take that into account. Moreover, keeping both "IS NOT NULL" and "IS NULL" as operands of the AND is not smart:

   (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))

If I remove the "is not null" condition, the plan is perfect:

explain select * from people_raw where not ("ID"<2068113880) and "ID"<=2068629726;
                                  QUERY PLAN
 Index Scan using "People_pkey" on people_raw  (cost=0.58..196745.57 rows=586160 width=336)
   Index Cond: (("ID" >= 2068113880) AND ("ID" <= 2068629726))
(2 rows)

Before starting an investigation of the problem, I would like to hear opinions, and maybe some advice from people familiar with the optimizer: how difficult would this case be to handle, and where should I look?

Thanks in advance,

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Alvaro Herrera wrote: Heikki Linnakangas wrote: Remember, the purpose of predicate locks is to lock key ranges, not physical pages or tuples in the index. We use leaf pages as a handy shortcut for "any key value that would belong on this page", but it is just an implementation detail. Hmm ... so, thinking about pending list locking, would it work to acquire locks on the posting tree's root of each item in the pending list, when the item is put in the pending list? (even if we insert the item in the pending list instead of its posting tree).

Items in the pending list don't use a posting tree or posting list; the pending list is just a list of pairs (ItemPointer to heap, entry), each represented as an IndexTuple. There is no order in the pending list, so Heikki suggests always locking the metapage instead of locking the whole relation when fastupdate=on. If fastupdate is off, the insertion process will not change the metapage.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thanks to both of you, pushed.

Shinoda, Noriyoshi wrote: Hi! Thank you for your response. I am fine with your proposal. Regards, Noriyoshi Shinoda

From: Alexander Korotkov [a.korot...@postgrespro.ru]
Sent: Monday, April 9, 2018 11:22 PM
To: Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev <teo...@sigaev.ru>; Peter Geoghegan <p...@bowt.ie>; Jeff Janes <jeff.ja...@gmail.com>; Anastasia Lubennikova <a.lubennik...@postgrespro.ru>
Subject: Re: WIP: Covering + unique indexes.

Hi! On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shin...@hpe.com> wrote: I tested this feature and found a documentation gap in the columns added to the pg_constraint catalog. The attached patch adds a description of the 'conincluding' column to the pg_constraint catalog documentation.

Thank you for pointing this out! I think we need a wordier explanation here. My proposal is attached.

-- Alexander Korotkov, Postgres Professional: http://www.postgrespro.com, The Russian Postgres Company

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Hi!

1. Why do we lock all posting tree pages, even though they all represent the same value? Isn't it enough to lock the root of the posting tree?
2. Why do we lock any posting tree pages at all, if we lock the entry tree page anyway? Isn't the lock on the entry tree page sufficient to cover the key value?
3. Why do we *not* lock the entry leaf page, if there is no match? We still need a lock to remember that we probed for that value and there was no match, so that we conflict with a tuple that might be inserted later.

At least #3 is a bug. The attached patch adds an isolation test that demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so I think we should fix those too, even if they don't lead to incorrect results.

I can't find a hole here. Agree.

I took a stab at fixing those issues, as well as the bug when fastupdate is turned on concurrently. Does the attached patch look sane to you?

I like the idea of using metapage locking, thank you. The patch seems good; will you push it?

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Ugh, I missed your last email, where you describe another locking protocol. Reading.

Teodor Sigaev wrote: Attached is a test case that demonstrates a case where we miss a serialization failure, when fastupdate is turned on concurrently. It works on v10, but fails to throw a serialization error on v11. Thank you for researching! Proof of concept:

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 43b2fce2c5..b8291f96e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10745,6 +10745,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
 			(void) index_reloptions(rel->rd_amroutine->amoptions, newOptions, true);
+			TransferPredicateLocksToHeapRelation(rel);
 			break;
 		default:
 			ereport(ERROR,

It fixes the reported bug, but will give false positives. The right place for that would be the ginoptions function, but ginoptions doesn't have access to the relation structure, and I don't see a reason why it should.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Attached is a test case that demonstrates a case where we miss a serialization failure, when fastupdate is turned on concurrently. It works on v10, but fails to throw a serialization error on v11.

Thank you for researching! Proof of concept:

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 43b2fce2c5..b8291f96e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10745,6 +10745,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
 			(void) index_reloptions(rel->rd_amroutine->amoptions, newOptions, true);
+			TransferPredicateLocksToHeapRelation(rel);
 			break;
 		default:
 			ereport(ERROR,

It fixes the reported bug, but will give false positives. The right place for that would be the ginoptions function, but ginoptions doesn't have access to the relation structure, and I don't see a reason why it should.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you, pushed. Peter Geoghegan wrote: On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teo...@sigaev.ru> wrote: Thank you, fixed I suggest that we remove some unneeded amcheck tests, as in the attached patch. They don't seem to add anything. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thanks to everyone, pushed.

Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int   indnatts = tupleDescriptor->natts;

Thank you, fixed.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: Verbosity of genbki.pl
1. Print nothing at all. That's more in keeping with our modern build practices, but maybe it's too big a change?
2. Print just one message like "Generating postgres.bki and related files", and I guess a second one for fmgroids.h and related files.

I don't have a strong preference. Opinions?

The second option, please. I'd like to see some indication of each stage being done.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you, fixed.

Jeff Janes wrote: On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev <teo...@sigaev.ru> wrote: Thanks to everyone, pushed. Indeed thanks, this will be a nice feature. It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int   indnatts = tupleDescriptor->natts;

Cheers, Jeff

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Support for testing amcaninclude via pg_indexam_has_property(oid, 'can_include') seems to be missing? Also, the return values of pg_index_column_has_property() for included columns seem a bit dubious... it should probably return NULL for most properties except 'returnable'.

Damn, you're right, it's missing.

I can look at fixing these for you if you like?

If you do that, I will be very grateful.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
Thank you. I was looking at the buildfarm and completely forgot about the commitfest site.

Andres Freund wrote: On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote: Thanks to everyone, pushed. Marked the CF entry as committed. Greetings, Andres Freund

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: WIP: Covering + unique indexes.
I'll keep an eye on the buildfarm, since it's late in Russia.

Thank you very much! It's now 23:10 MSK, and I'll be able to follow along for approximately an hour.

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/