from:"Aleksander Alekseev"

[HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-04 Thread Aleksander Alekseev

Hello all,

Current implementation of ResourceOwner uses arrays to store resources
like TupleDescs, Snapshots, etc. When we want to release one of these
resources ResourceOwner finds it with linear scan. Granted, resource
array are usually small so its not a problem most of the time. But it
appears to be a bottleneck when we are working with tables which have a
lot of partitions.

To reproduce this issue:
1. run `./gen.pl 1 | psql my_database postgres`
2. run `pgbench -j 8 -c 8 -f q.sql -T 100 my_database`
3. in second terminal run `sudo perf top -u postgres`

Both gen.pl and q.sql are attached to this message.

You will see that postgres spends a lot of time in ResourceOwnerForget*
procedures:

 32.80%  postgres   [.] list_nth
 20.29%  postgres   [.] ResourceOwnerForgetRelationRef
 12.87%  postgres   [.] find_all_inheritors
  7.90%  postgres   [.] get_tabstat_entry
  6.68%  postgres   [.] ResourceOwnerForgetTupleDesc
  1.17%  postgres   [.] hash_search_with_hash_value
 ... < 1% ...

I would like to suggest a patch (see attachment) witch fixes this
bottleneck. Also I discovered that there is a lot of code duplication in
ResourceOwner. Patch fixes this too. The idea is quite simple. I just
replaced arrays with something that could be considered hash tables,
but with some specific optimizations. 

After applying this patch we can see that bottleneck is gone:

 42.89%  postgres   [.] list_nth
 18.30%  postgres   [.] find_all_inheritors
 10.97%  postgres   [.] get_tabstat_entry
  1.82%  postgres   [.] hash_search_with_hash_value
  1.21%  postgres   [.] SearchCatCache
 ... < 1% ...

For tables with thousands partitions we gain in average 32.5% more TPS.

As far as I can see in the same time patch doesn't make anything worse.
`make check` passes with asserts enabled and disabled. There is no
performance degradation according to both standard pgbench benchmark
and benchmark described above for tables with 10 and 100 partitions.

Best regards,
Aleksander

gen.pl
Description: Perl program


q.sql
Description: application/sql
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 0e7acbf..6abbfa5 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,62 +47,296 @@
 #define MAX_RESOWNER_LOCKS 15
 
 /*
- * ResourceOwner objects look like this
+ * This number is used as initial size of resource arrays (like buffers,
+ * catrefs, etc) in ResourceOwnerData structure. If given number of items is
+ * not enough, we double array size and reallocate memory.
+ *
+ * Should be power of two since we use (arrsize -1) as mask for hash value.
+ *
  */
-typedef struct ResourceOwnerData
+#define RESARRAY_INIT_SIZE 16
+
+/*
+ * How many items could be stored in a resource array of given capacity. If
+ * this number is reached we need to resize an array to prevent hash collisions.
+ *
+ * This computation actually costs only two additions and one binary shift.
+ */
+#define RESARRAY_MAX_ITEMS(capacity) ((capacity)*3/4)
+
+/*
+ * Common structure for storing different type of resources.
+ */
+typedef struct ResourceArray
+{
+	uint8*		itemsarr;		/* buffer for storing values */
+	uint32		itemsizelg:  2;	/* sizeof one item log 2 */
+	uint32		capacity:   30;	/* capacity of array */
+	uint32		nitems;			/* how many items is stored in items array */
+	uint32		maxitems;		/* precalculated RESARRAY_MAX_ITEMS(capacity) */
+} ResourceArray;
+
+/*
+ * Such type of callback function is called when resource stored in
+ * ResourceArray is released using ResourceArrayFree. Callback should
+ * _actually_ release a resource so nitems value will be decremented.
+ */
+typedef void (*ResourceArrayRemoveCallback)(const uint8* ref, bool isCommit);
+
+/* Used as argument to memcmp to determine if ResourceArray[i] is free. */
+static const uint8 RESOURCE_ARRAY_ZERO_ELEMENT[sizeof(void*)] = {0};
+
+/*
+ * Calculate hash_any of given data. For uint32 values use faster hash_uint32.
+ */
+static Datum
+ResourceArrayHash(const uint8* data, int size)
+{
+	uint32		tmp;
+
+	Assert(size == sizeof(uint32) || size == sizeof(void*));
+
+	if(size == sizeof(uint32))
+	{
+		tmp = *((const uint32*)data);
+		return hash_uint32(tmp);
+	}
+	else
+	{
+		return hash_any(data, size);
+	}
+}
+
+/*
+ * Initialize ResourceArray
+ */
+static void
+ResourceArrayInit(ResourceArray *resarr, uint32 itemsize) {
+	Assert(itemsize == sizeof(int) || itemsize == sizeof(void*));
+	Assert(resarr->itemsarr == NULL);
+	Assert(resarr->capacity == 0);
+	Assert(resarr->nitems == 0);
+	Assert(resarr->maxitems == 0);
+
+	resarr->itemsizelg = 0;
+	while(itemsize > 1)
+	{
+		resarr->itemsizelg++;
+		itemsize >>= 1;
+	}
+
+	Assert(resarr->itemsizelg == 2 || resarr->itemsizelg == 3);
+}
+
+/*
+ * Add a resource to ResourceArray
+ *
+ * Caller must have previously done

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-09 Thread Aleksander Alekseev

Hello, Robert

Thanks for your review. I believe I fixed items 1, 2 and 3 (see
attachment). Also I would like to clarify item 4.

> 4. It mixes together multiple ideas in a single patch, not only
> introducing a hashing concept but also striping a brand-new layer of
> abstraction across the resource-owner mechanism.  I am not sure that
> layer of abstraction is a very good idea, but if it needs to be done,
> I think it should be a separate patch.

Do I right understand that you suggest following?

Current patch should be split in two parts. In first patch we create
and use ResourceArray with array-based implementation (abstraction
layer). Then we apply second patch which change ResourceArray
implementation to hashing based (optimization).

Best regards,
Aleksander

On Tue, 8 Dec 2015 17:30:33 -0500
Robert Haas <robertmh...@gmail.com> wrote:

> On Fri, Dec 4, 2015 at 7:15 AM, Aleksander Alekseev
> <a.aleks...@postgrespro.ru> wrote:
> > Current implementation of ResourceOwner uses arrays to store
> > resources like TupleDescs, Snapshots, etc. When we want to release
> > one of these resources ResourceOwner finds it with linear scan.
> > Granted, resource array are usually small so its not a problem most
> > of the time. But it appears to be a bottleneck when we are working
> > with tables which have a lot of partitions.
> >
> > To reproduce this issue:
> > 1. run `./gen.pl 1 | psql my_database postgres`
> > 2. run `pgbench -j 8 -c 8 -f q.sql -T 100 my_database`
> > 3. in second terminal run `sudo perf top -u postgres`
> >
> > Both gen.pl and q.sql are attached to this message.
> >
> > You will see that postgres spends a lot of time in
> > ResourceOwnerForget* procedures:
> >
> >  32.80%  postgres   [.] list_nth
> >  20.29%  postgres   [.] ResourceOwnerForgetRelationRef
> >  12.87%  postgres   [.] find_all_inheritors
> >   7.90%  postgres   [.] get_tabstat_entry
> >   6.68%  postgres   [.] ResourceOwnerForgetTupleDesc
> >   1.17%  postgres   [.] hash_search_with_hash_value
> >  ... < 1% ...
> >
> > I would like to suggest a patch (see attachment) witch fixes this
> > bottleneck. Also I discovered that there is a lot of code
> > duplication in ResourceOwner. Patch fixes this too. The idea is
> > quite simple. I just replaced arrays with something that could be
> > considered hash tables, but with some specific optimizations.
> >
> > After applying this patch we can see that bottleneck is gone:
> >
> >  42.89%  postgres   [.] list_nth
> >  18.30%  postgres   [.] find_all_inheritors
> >  10.97%  postgres   [.] get_tabstat_entry
> >   1.82%  postgres   [.] hash_search_with_hash_value
> >   1.21%  postgres   [.] SearchCatCache
> >  ... < 1% ...
> >
> > For tables with thousands partitions we gain in average 32.5% more
> > TPS.
> >
> > As far as I can see in the same time patch doesn't make anything
> > worse. `make check` passes with asserts enabled and disabled. There
> > is no performance degradation according to both standard pgbench
> > benchmark and benchmark described above for tables with 10 and 100
> > partitions.
> 
> I do think it's interesting to try to speed up the ResourceOwner
> mechanism when there are many resources owned.  We've had some forays
> in that direction in the past, and I think it's useful to continue
> that work.  But I also think that this patch, while interesting, is
> not something we can seriously consider committing in its current
> form, because:
> 
> 1. It lacks adequate comments and submission notes to explain the
> general theory of operation of the patch.
> 
> 2. It seems to use uint8 * rather freely as a substitute for char * or
> void *, which doesn't seem like a great idea.
> 
> 3. It doesn't follow PostgreSQL formatting conventions (uncuddled
> curly braces, space after if and before open paren, etc.).
> 
> 4. It mixes together multiple ideas in a single patch, not only
> introducing a hashing concept but also striping a brand-new layer of
> abstraction across the resource-owner mechanism.  I am not sure that
> layer of abstraction is a very good idea, but if it needs to be done,
> I think it should be a separate patch.
> 

diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 0e7acbf..d2b37d5 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -29,6 +29,323 @@
 #include "utils/snapmgr.h"
 
 /*
+ * ResourceArray is a common structure for stor

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-11 Thread Aleksander Alekseev

>> To be honest, I think this patch is really ugly. [...] I'm not sure
>> exactly what to do about that, but it seems like a problem.

I have an idea. There are actually two types of resources - int-like
(buffers, files) and void*-like (RelationRef, TupleDesc, ...). What if
I split ResourceArray into IntResourceArray and PointerResourceArray? It
would definitely solve ugliness problem --- no more memcpy's, char[]
buffers, etc.

>> It would be advisable for example that hash_any not suddenly become
>> covered by the "must not fail" requirement.

Frankly I can't think of any case when hash_any could or should fail.
Maybe we should just add a "must not fail" constraint to hash_any
description?

Also I could use some other hash implementation. It may be reasonable
in this case since size of data I would like to hash is small and known
in advance.

>> BTW, I do not think you can get away with the requirement that
>> all-zeroes isn't a valid resource representation. It might be okay
>> today but it's hardly future-proof.

Agree. I could store a value that should be considered as "zero" in
ResourceArray. It would be InvalidBuffer for buffers, -1 for files and
NULL for all void*-types. Does such solution sounds OK? 

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-11 Thread Aleksander Alekseev

> It would be InvalidBuffer for buffers, -1 for files and NULL for all
> void*-types. Does such solution sounds OK? 

On second thought I believe there is no reason for storing anything for
void*-like types. I could just hardcode NULL in PointerResourceArray.

Best regards,
Aleksander



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-11 Thread Aleksander Alekseev

Hello all,

Consider following stacktrace:

(gdb) bt
#0  0x7f77c81fae87 in semop () syscall-template.S:81
#1  0x0063b721 in PGSemaphoreLock pg_sema.c:387
#2  0x00692e1b in LWLockAcquire lwlock.c:1026
#3  0x0068d4c5 in LockAcquireExtended lock.c:881
#4  0x0068dcc1 in LockAcquire lock.c:672
#5  0x0068b7a8 in LockRelationOid lmgr.c:112
#6  0x00501d18 in find_inheritance_children pg_inherits.c:120
#7  0x00501d80 in find_all_inheritors pg_inherits.c:182
#8  0x0062db8d in expand_inherited_rtentry prepunion.c:1285
#9  expand_inherited_tables prepunion.c:1212
#10 0x00622705 in subquery_planner planner.c:501
#11 0x00622d31 in standard_planner planner.c:285
#12 0x0069ef0c in pg_plan_query postgres.c:809
#13 0x0069f004 in pg_plan_queries postgres.c:868
#14 0x006a0fc2 in exec_simple_query postgres.c:1033
#15 PostgresMain postgres.c:4032
#16 0x00467479 in BackendRun postmaster.c:4237
#17 BackendStartup postmaster.c:3913
#18 ServerLoop () postmaster.c:1684
#19 0x0064c828 in PostmasterMain postmaster.c:1292
#20 0x00467f3e in main main.c:223

Turns out PostgreSQL can spend a lot of time waiting for a lock in this
particular place, especially if you are running PostgreSQL on 60-core
server. Which obviously is a pretty bad sign.

You can easily reproduce this issue on regular Core i7 server. Use
attached schema.sql file to create a database and run:

pgbench -j 8 -c 8 -f pgbench.sql -T 300 my_database 2>/dev/null &

While this example is running connect to some PostgreSQL process with
gdb and run bt/c from time to time. You will see that PostgreSQL waits
for this particular lock quite often.

The problem is that code between LWLockAcquire (lock.c:881) and
LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using
old-good gettimeofday and logging method I managed to find a bottleneck:

-- proclock = SetupLockInTable [lock.c:892]
 `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
   `-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
 `-- SpinLockAcquire(>mutex) [dynahash.c:1187]

If my measurements are not wrong (I didn't place gettimeofday between
SpinLockAquire/SpinLockRelease, etc) we sometimes spend about 3 ms here
waiting for a spinlock, which doesn't seems right.

I managed to fix this behaviour by modifying choose_nelem_alloc
procedure in dynahash.c (see attached patch). The idea is to double
number of items we allocate when there is no more free items in hash
table. So we need twice less allocations which reduces contention.

This patch doesn't cause any performance degradation according to
pgbench, `make check` passes, etc.

Best regards,
Aleksanderdiff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..48def5e 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -544,19 +544,19 @@ choose_nelem_alloc(Size entrysize)
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(entrysize);
 
 	/*
-	 * The idea here is to choose nelem_alloc at least 32, but round up so
+	 * The idea here is to choose nelem_alloc at least 64, but round up so
 	 * that the allocation request will be a power of 2 or just less. This
 	 * makes little difference for hash tables in shared memory, but for hash
 	 * tables managed by palloc, the allocation request will be rounded up to
 	 * a power of 2 anyway.  If we fail to take this into account, we'll waste
 	 * as much as half the allocated space.
 	 */
-	allocSize = 32 * 4;			/* assume elementSize at least 8 */
+	allocSize = 64 * 4;			/* assume elementSize at least 8 */
 	do
 	{
 		allocSize <<= 1;
 		nelem_alloc = allocSize / elementSize;
-	} while (nelem_alloc < 32);
+	} while (nelem_alloc < 64);
 
 	return nelem_alloc;
 }


schema.sql
Description: application/sql


pgbench.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-11 Thread Aleksander Alekseev

Hello, Tom

I see your point, but I would like to clarify a few things.

1. Do we consider described measurement method good enough to conclude
that sometimes PostgreSQL really spends 3 ms in a spinlock (like a RTT
between two Internet hosts in the same city)? If not, what method
should be used to approve or disapprove this?

2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock
do we consider this a problem?

3. If we consider this a problem, what method is considered appropriate
to find a real reason of such behaviour so we could fix it?

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-14 Thread Aleksander Alekseev

Hello, Robert

Here is my fix for item 4.

Best regards,
Aleksander

On Thu, 10 Dec 2015 11:37:23 -0500
Robert Haas <robertmh...@gmail.com> wrote:

> On Wed, Dec 9, 2015 at 8:30 AM, Aleksander Alekseev
> <a.aleks...@postgrespro.ru> wrote:
> > Hello, Robert
> >
> > Thanks for your review. I believe I fixed items 1, 2 and 3 (see
> > attachment). Also I would like to clarify item 4.
> >
> >> 4. It mixes together multiple ideas in a single patch, not only
> >> introducing a hashing concept but also striping a brand-new layer
> >> of abstraction across the resource-owner mechanism.  I am not sure
> >> that layer of abstraction is a very good idea, but if it needs to
> >> be done, I think it should be a separate patch.
> >
> > Do I right understand that you suggest following?
> >
> > Current patch should be split in two parts. In first patch we create
> > and use ResourceArray with array-based implementation (abstraction
> > layer). Then we apply second patch which change ResourceArray
> > implementation to hashing based (optimization).
> 
> Well, sorta.  To be honest, I think this patch is really ugly.  If we
> were going to do this then, yes, I would want to split the patch into
> two parts along those lines.  But actually I don't really want to do
> it this way at all.  It's not that I don't want the performance
> benefits: I do.  But the current code is really easy to read and
> extremely simple, and this changes it into something that is a heck of
> a lot harder to read and understand.  I'm not sure exactly what to do
> about that, but it seems like a problem.
> 

diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 0e7acbf..39324fe 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -29,6 +29,209 @@
 #include "utils/snapmgr.h"
 
 /*
+ * ResourceArray is a common structure for storing different types of resources.
+ *
+ * ResourceOwner can own `HeapTuple`s, `Relation`s, `Snapshot`s, etc. For
+ * each resource type there are procedures ResourceOwnerRemember* and
+ * ResourceOwnerForget*. There are also ResourceOwnerEnlarge* procedures
+ * which should be called before corresponding ResourceOwnerRemember* calls
+ * (see below). Internally each type of resource is stored in separate
+ * ResourceArray.
+ */
+typedef struct ResourceArray
+{
+	char	   *itemsarr;		/* buffer for storing values */
+	uint32		itemsizelg:2;	/* sizeof one item log 2 */
+	uint32		capacity:30;	/* capacity of array */
+	uint32		nitems;			/* how many items is stored in items array */
+}	ResourceArray;
+
+/*
+ * This number is used as initial size of resource array. If given number of
+ * items is not enough, we double array size and reallocate memory.
+ */
+#define RESARRAY_INIT_SIZE 16
+
+/*
+ * Such type of callback function is called when resource stored in
+ * ResourceArray is released using ResourceArrayFree. Callback should
+ * _actually_ release a resource so nitems value will be decremented.
+ */
+typedef void (*ResourceArrayRemoveCallback) (const void *ref, bool isCommit);
+
+/*
+ * Initialize ResourceArray
+ */
+static void
+ResourceArrayInit(ResourceArray * resarr, uint32 itemsize)
+{
+	Assert(itemsize == sizeof(int) || itemsize == sizeof(void *));
+	Assert(resarr->itemsarr == NULL);
+	Assert(resarr->capacity == 0);
+	Assert(resarr->nitems == 0);
+
+	resarr->itemsizelg = 0;
+	while (itemsize > 1)
+	{
+		resarr->itemsizelg++;
+		itemsize >>= 1;
+	}
+
+	Assert(resarr->itemsizelg == 2 || resarr->itemsizelg == 3);
+}
+
+/*
+ * Add a resource to ResourceArray
+ *
+ * Caller must have previously done ResourceArrayEnlarge()
+ */
+static void
+ResourceArrayAdd(ResourceArray * resarr, const void *dataref)
+{
+	char	   *itemptr;
+	uint32		itemsize = 1 << resarr->itemsizelg;
+
+	Assert(resarr->capacity > 0);
+	Assert(resarr->itemsarr != NULL);
+	Assert(resarr->nitems < resarr->capacity);
+
+	/*
+	 * Read next line as:
+	 *
+	 * itemptr = &(itemsarr[resarr->nitems])
+	 *
+	 * We use binary shift since compiler doesn't know that itemsize is always
+	 * power of two. It would use multiplication instead of efficient binary
+	 * shift in code `resarr->nitems * itemsize`.
+	 */
+	itemptr = resarr->itemsarr + (resarr->nitems << resarr->itemsizelg);
+	memcpy(itemptr, dataref, itemsize);
+	resarr->nitems++;
+}
+
+/*
+ * Remove a resource from ResourceArray
+ *
+ * Returns true on success, false if resource was not found
+ */
+static bool
+ResourceArrayRemove(ResourceArray * resarr, const void *dataref)
+{
+	Datum		idx;
+	char	   *itemptr;
+	char	   *lastitemptr;
+	uint32		itemsize = 1 << resarr->itemsizelg;
+
+	Assert(resarr->capacity > 0);
+	Assert(resarr->itemsarr != NULL);

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-15 Thread Aleksander Alekseev

Hello, Tom.

I was exploring this issue further and discovered something strange.

"PROCLOCK hash" and "LOCK hash" are hash tables in shared memory. All
memory for these tables is in fact pre-allocated. But for some reason
these two tables are created (lock.c:394) with init_size =/= max_size.
It causes small overhead on calling memory allocator after hash table
creation and additional locking/unlocking.

I checked all other hash tables created via ShmemInitHash. All of these
tables have init_size == max_size. I suggest to create "PROCLOCK hash"
and "LOCK hash" with init_size == max_size too (see attached patch).
Granted this change doesn't cause any noticeable performance
improvements according to pgbench. Nevertheless it looks like a very
right thing to do which at least doesn't make anything worse.

If this patch seems to be OK next logical step I believe would be to
remove init_size parameter in ShmemInitHash procedure since in practice
it always equals max_size.

On Fri, 11 Dec 2015 10:30:30 -0500
Tom Lane <t...@sss.pgh.pa.us> wrote:

> Aleksander Alekseev <a.aleks...@postgrespro.ru> writes:
> > Turns out PostgreSQL can spend a lot of time waiting for a lock in
> > this particular place, especially if you are running PostgreSQL on
> > 60-core server. Which obviously is a pretty bad sign.
> > ...
> > I managed to fix this behaviour by modifying choose_nelem_alloc
> > procedure in dynahash.c (see attached patch).
> 
> TBH, this is just voodoo.  I don't know why this change would have
> made any impact on lock acquisition performance, and neither do you,
> and the odds are good that it's pure chance that it changed
> anything.  One likely theory is that you managed to shift around
> memory allocations so that something aligned on a cacheline boundary
> when it hadn't before.  But, of course, the next patch that changes
> allocations anywhere in shared memory could change that back.  There
> are lots of effects like this that appear or disappear based on
> seemingly unrelated code changes when you're measuring edge-case
> performance.
> 
> The patch is not necessarily bad in itself.  As time goes by and
> machines get bigger, it can make sense to allocate more memory at a
> time to reduce memory management overhead.  But arguing for it on the
> basis that it fixes lock allocation behavior with 60 cores is just
> horsepucky.  What you were measuring there was steady-state hash
> table behavior, not the speed of the allocate-some-more-memory code
> path.
> 
>   regards, tom lane
> 
> 

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 76fc615..86d2f88 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,16 +373,14 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		shmem_table_size;
 	bool		found;
 
 	/*
 	 * Compute init/max size to request for lock hashtables.  Note these
 	 * calculations must agree with LockShmemSize!
 	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
+	shmem_table_size = NLOCKENTS();
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
@@ -394,14 +392,13 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
-	   max_table_size,
+	   shmem_table_size,
+	   shmem_table_size,
 	   ,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
-	max_table_size *= 2;
-	init_table_size *= 2;
+	shmem_table_size *= 2;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -413,8 +410,8 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-		   init_table_size,
-		   max_table_size,
+		   shmem_table_size,
+		   shmem_table_size,
 		   ,
  HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-11 Thread Aleksander Alekseev

Oops. s/approve or disapprove/confirm or deny/

On Fri, 11 Dec 2015 19:14:41 +0300
Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:

> Hello, Tom
> 
> I see your point, but I would like to clarify a few things.
> 
> 1. Do we consider described measurement method good enough to conclude
> that sometimes PostgreSQL really spends 3 ms in a spinlock (like a RTT
> between two Internet hosts in the same city)? If not, what method
> should be used to approve or disapprove this?
> 
> 2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock
> do we consider this a problem?
> 
> 3. If we consider this a problem, what method is considered
> appropriate to find a real reason of such behaviour so we could fix
> it?
> 
> Best regards,
> Aleksander
> 
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Aleksander Alekseev

> It'd really like to see it being replaced by a queuing lock
> (i.e. lwlock) before we go there. And then maybe partition the
> freelist, and make nentries an atomic.

I believe I just implemented something like this (see attachment). The
idea is to partition PROCLOCK hash table manually into NUM_LOCK_
PARTITIONS smaller and non-partitioned hash tables. Since these tables
are non-partitioned spinlock is not used and there is no lock
contention.

On 60-core server we gain 3.5-4 more TPS according to benchmark
described above. As I understand there is no performance degradation in
other cases (different CPU, traditional pgbench, etc).

If this patch seems to be OK I believe we could consider applying the
same change not only to PROCLOCK hash table.diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 76fc615..1a86f86 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -249,15 +249,56 @@ static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
  * shared memory; LockMethodLocalHash is local to each backend.
  */
 static HTAB *LockMethodLockHash;
-static HTAB *LockMethodProcLockHash;
+static HTAB *LockMethodProcLockHashPartitions[NUM_LOCK_PARTITIONS];
 static HTAB *LockMethodLocalHash;
 
+/*
+ * LockMethodProcLockHash hash table is partitioned into NUM_LOCK_PARTITIONS
+ * usual (non-partitioned, i.e. without HASH_PARTITION flag) hash tables. The
+ * reason for partitioning LockMethodProcLockHash manually instead of just
+ * using single hash table with HASH_PARTITION flag is following. If hash
+ * table has HASH_PARTITION flag it uses a single spinlock to safely
+ * access some of its fields. Turned out in case of this particular hash
+ * table it causes a considerable performance degradation becouse of lock
+ * contention on servers with tens of cores.
+ */
+#define LockMethodProcLockHash(hashcode) \
+	(LockMethodProcLockHashPartitions[LockHashPartition(hashcode)])
+
+/*
+ * Each partition of LockMethodProcLockHash hash table has an unique name
+ * generated from this template using snprintf.
+ */
+static const char ProcLockHashNameTemplate[] = "PROCLOCK hash, partition %d";
+
+/*
+ * Size of buffer required to store LockMethodProcLockHash partition name.
+ *
+ * 10 is number of digits in 2^32. 2 is length of "%d" string.
+ */
+#define PROC_LOCK_HASH_NAME_BUFF_SIZE (sizeof(ProcLockHashNameTemplate)+10-2)
 
 /* private state for error cleanup */
 static LOCALLOCK *StrongLockInProgress;
 static LOCALLOCK *awaitedLock;
 static ResourceOwner awaitedOwner;
 
+/*
+ * Calculate total number of entries in LockMethodProcLockHash hash table.
+ *
+ * Caller is responsible for having acquired appropriate LWLocks.
+ */
+static long
+GetProcLockHashNumEntries()
+{
+	int			i;
+	long		sum = 0;
+
+	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+		sum += hash_get_num_entries(LockMethodProcLockHashPartitions[i]);
+
+	return sum;
+}
 
 #ifdef LOCK_DEBUG
 
@@ -373,16 +414,16 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		shmem_table_size;
 	bool		found;
+	int			i;
+	char		proclock_hash_name[PROC_LOCK_HASH_NAME_BUFF_SIZE];
 
 	/*
 	 * Compute init/max size to request for lock hashtables.  Note these
 	 * calculations must agree with LockShmemSize!
 	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
+	shmem_table_size = NLOCKENTS();
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
@@ -394,14 +435,13 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
-	   max_table_size,
+	   shmem_table_size,
+	   shmem_table_size,
 	   ,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
-	max_table_size *= 2;
-	init_table_size *= 2;
+	shmem_table_size = (shmem_table_size * 2) / NUM_LOCK_PARTITIONS;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -412,11 +452,17 @@ InitLocks(void)
 	info.hash = proclock_hash;
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
-	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-		   init_table_size,
-		   max_table_size,
-		   ,
- HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+	{
+		snprintf(proclock_hash_name, sizeof(proclock_hash_name),
+ ProcLockHashNameTemplate, i + 1);
+
+		LockMethodProcLockHashPartitions[i] = ShmemInitHash(proclock_hash_name,
+			shmem_table_size,
+			shmem_table_size,
+			,
+  HASH_ELEM | HASH_FUNCTION);
+	}
 
 	/*
 	 * Allocate fast-path structures.
@@ -943,7 +989,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 proclock_hashcode = ProcLockHashCode(>tag, hashcode);
 SHMQueueDelete(>lockLink);
 SHMQueueDelete(>procLink);
-if

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev

> Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.

In fact these numbers are for similar but a little bit different
benchmark (same schema without checks on child tables). Here are exact
numbers for a benchmark described above.



Before:

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 20433
latency average: 93.966 ms
tps = 679.698439 (including connections establishing)
tps = 680.353897 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 19111
latency average: 100.466 ms
tps = 635.763523 (including connections establishing)
tps = 636.112682 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 19218
latency average: 99.906 ms
tps = 639.506848 (including connections establishing)
tps = 639.838757 (excluding connections establishing)


After:

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 95900
latency average: 20.021 ms
tps = 3194.142762 (including connections establishing)
tps = 3196.091843 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 96837
latency average: 19.827 ms
tps = 3225.822355 (including connections establishing)
tps = 3227.762847 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 96143
latency average: 19.970 ms
tps = 3202.637126 (including connections establishing)
tps = 3204.070466 (excluding connections establishing)


Ratio:

$ python

min(3194.0, 3225.0, 3202.0) / max(679.0, 635.0, 639.0)
4.703976435935199


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev

> This idea can improve the situation with ProcLock hash table, but I
> think IIUC what Andres is suggesting would reduce the contention
> around dynahash freelist and can be helpful in many more situations
> including BufMapping locks.

I agree. But as I understand PostgreSQL community doesn't generally
аpprоvе big changes that affects whole system. Especially if original
problem was only in one particular place. Therefore for now I suggest
only a small change. Naturally if it will be accepted there is no
reason not to apply same changes for BufMapping or even dynahash itself
with corresponding PROCLOCK hash refactoring.

BTW could you (or anyone?) please help me find this thread regarding
BufMapping or perhaps provide a benchmark? I would like to reproduce
this issue but I can't find anything relevant in a mailing list. Also
it seems to be a good idea to compare alternative approaches that were
mentioned (atomics ops, group leader). Are there any discussions,
benchmarks or patches regarding this topic?

Frankly I have serious doubts regarding atomics ops since they will more
likely create the same contention that a spinlock does. But perhaps
there is a patch that works not the way I think it could work.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev

> What's in pgbench.sql?

It's from first message of this thread:

http://www.postgresql.org/message-id/20151211170001.78ded9d7@fujitsu


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-29 Thread Aleksander Alekseev

Good news, everyone!

I discovered that NUM_LOCK_OF_PARTITIONS is a bottleneck for a last
patch. Long story short - NUM_LOCK_OF_PARTITIONS = (1 << 7) gives best
performance:

PN = 16, AN =  8, NUM_LOCK_PARTITIONS = (1 << 7): 4782.9
PN = 16, AN =  4, NUM_LOCK_PARTITIONS = (1 << 7): 4089.9
PN = 16, AN =  2, NUM_LOCK_PARTITIONS = (1 << 7): 2822.5

PN =  8, AN =  8, NUM_LOCK_PARTITIONS = (1 << 7): 4391.7
PN =  8, AN =  4, NUM_LOCK_PARTITIONS = (1 << 7): 3665.0
PN =  8, AN =  2, NUM_LOCK_PARTITIONS = (1 << 7): 2205.7

PN =  4, AN =  8, NUM_LOCK_PARTITIONS = (1 << 7): 4031.9
PN =  4, AN =  4, NUM_LOCK_PARTITIONS = (1 << 7): 3378.2
PN =  4, AN =  2, NUM_LOCK_PARTITIONS = (1 << 7): 2142.4

Also I tried to optimize global_free_list_* procedures as I planned.
Optimized version is asymptotically faster but works about 5% slower in
practice because of some additional operations we need to perform.
Patch is attached to this message in case you would like to examine a
code.

Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
because I forgot to run `make clean` after changing lwlock.h (autotools
doesn't rebuild project properly after changing .h files). Here are
correct results:

60-core server,
pgbench -j 64 -c 64 -f pgbench.sql -T 50 -P 1 my_database

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock 
PARTITIONS | (99ccb2) |  |  |  |  array   
---|--|--|--|--|--
 1 << 4|  660.4   |  2296.3  |  1441.8  |  454.4   |  1597.8  
---|--|--|--|--|--
 1 << 5|  536.6   |  3006.5  |  1632.3  |  381.8   |  1896.8  
---|--|--|--|--|--
 1 << 6|  469.9   |  4083.5  |  1693.4  |  427.7   |  2271.5  
---|--|--|--|--|--
 1 << 7|  430.8   |  4653.0  |  2016.3  |  361.1   |  3277.1  

As you may see "spinlock array" version works quite well. It is almost
5 times faster than current dynahash implementation and only 30% slower
than "no locks" version. It also guarantees that we will use all
available memory.

I experimented with various improvements of "spinlock array" as well.
E.g. I tried to borrow elements not one a time, but it didn't cause any
performance improvements.

In case you believe that 5 times faster is not good enough we can also
use sharded-global-free-list.patch with appropriate AN and PN values.
Still this patch doesn't guarantee that all available memory will be
used ("no lock" patch doesn't guarantee it either).

Considering that benchmark above is not for all possible cases, but for
a very specific one, and that "spinlock array" patch has much better
guarantees then "no lock" patch, and that lock-free algorithms are
pretty complicated and error-prone I suggest we call "spinlock array"
solution good enough, at least until someone complain again that
dynahash is a bottleneck. Naturally first I clean a code, write more
comments, then re-check once again that there is no performance
degradation according to usual pgbench, etc.diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..91fcc05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
+			  long init_size,	/* initial table size */ // AALEKSEEV: is ignored, refactor!
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..07a46db 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -87,6 +87,8 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "storage/lock.h"
+#include "storage/lwlock.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -118,6 +120,13 @@ typedef HASHELEMENT *HASHBUCKET;
 /* A hash segment is an array of bucket headers */
 typedef HASHBUCKET *HASHSEGMENT;
 
+// AALEKSEEV: TODO: comment, should be power of two
+#define GLOBAL_FREE_LIST_PARTITIONS_NUM 4
+#define GLOBAL_FREE_LIST_PARTITIONS_MASK (GLOBAL_FREE_LIST_PARTITIONS_NUM-1)
+
+// AALEKSEEV: TODO: comment
+#define GLOBAL_FREE_LIST_ALLOC_NUM 8
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -129,11 +138,24 @@ typedef HASHBUCKET *HASHSEGMENT;
 struct HASHHDR
 {
 	/* In a

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-22 Thread Aleksander Alekseev

> > Actually, I'd like to improve all partitioned hashes instead of
> > improve only one case.  
> 
> Yeah.  I'm not sure that should be an LWLock rather than a spinlock,
> but we can benchmark it both ways.  

I would like to share some preliminary results. I tested four
implementations:

- no locks and no element stealing from other partitions;
- single LWLock per partitioned table;
- single spinlock per partitioned table;
- NUM_LOCK_PARTITIONS spinlocks per partitioned table;

Interestingly "Shared Buffer Lookup Table" (see buf_table.c) has 128
partitions. The constant NUM_BUFFER_PARTITIONS was increased from 16 to
128 in commit 3acc10c9: 

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3acc10c997f916f6a741d0b4876126b7b08e3892;hp=952872698d9443fdf9b808a1376017f00c91065a

Obviously after splitting a freelist into NUM_LOCK_PARTITIONS
partitions (and assuming that all necessary locking/unlocking is done
on calling side) tables can't have more than NUM_LOCK_PARTITIONS
partitions because it would cause race conditions. For this reason I
had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare
behaviour of PostgreSQL depending on different values of
NUM_LOCK_PARTITIONS.

So here are results:

Core i7, pgbench -j 8 -c 8 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock 
PARTITIONS | (99ccb2) |  |  |  |  array   
---|--|--|--|--|--
   |  295.4   |  297.4   |  299.4   |  285.6   |  302.7   
(1 << 4)   |  286.1   |  300.5   |  283.4   |  300.9   |  300.4   
   |  300.0   |  300.0   |  302.1   |  300.7   |  300.3   
---|--|--|--|--|--
   |  |  296.7   |  299.9   |  298.8   |  298.3
(1 << 5)   |      |  301.9   |  302.2   |  305.7   |  306.3
   |  |  287.7   |  301.0   |  303.0   |  304.5
---|--|--|--|--|--
   |  |  296.4   |  300.5   |  302.9   |  304.6   
(1 << 6)   |      |  301.7   |  305.6   |  306.4   |  302.3   
   |  |  299.6   |  304.5   |  306.6   |  300.4   
---|--|--|--|--|--
   |  |  295.9   |  298.7   |  295.3   |  305.0   
(1 << 7)   |      |  299.5   |  300.5   |  299.0   |  310.2   
   |  |  287.8   |  285.9   |  300.2   |  302.2   

Core i7, pgbench -j 8 -c 8 -f big_table.sql -T 30 my_database
(3 test, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock 
PARTITIONS | (99ccb2) |  |  |  |  array   
---|--|--|--|--|--
   |  505.1   |  521.3   |  511.1   |  524.4   |  501.6   
(1 << 4)   |  452.4   |  467.4   |  509.2   |  472.3   |  453.7   
   |  435.2   |  462.4   |  445.8   |  467.9   |  467.0   
---|--|--|--|--|--
   |  |  514.8   |  476.3   |  507.9   |  510.6   
(1 << 5)   |      |  457.5   |  491.2   |  464.6   |  431.7   
   |  |  442.2   |  457.0   |  495.5   |  448.2
---|--|--|--|--|--
   |  |  516.4   |  502.5   |  468.0   |  521.3   
(1 << 6)   |      |  463.6   |  438.7   |  488.8   |  455.4   
   |  |  434.2   |  468.1   |  484.7   |  433.5   
---|--|--|--|--|--
   |  |  513.6   |  459.4   |  519.6   |  510.3   
(1 << 7)   |      |  470.1   |  454.6   |  445.5   |  415.9   
   |  |  459.4   |  489.7   |  457.1   |  452.8   

60-core server, pgbench -j 64 -c 64 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock 
PARTITIONS | (99ccb2) |  |  |  |  array   
---|--|--|--|--|--
   |  3156.2  |  3157.9  |  3542.0  |  3444.3  |  3472.4  
(1 << 4)   |  3268.5  |  3444.7  |  3485.7  |  3486.0  |  3500.5  
   |  3251.2  |  3482.3  |  3398.7  |  3587.1  |  3557.7  
---|--|--|--|--|--
   |  |  3352.7  |  3556.0  |  3543.3  |  3526.8 
(1 << 5)   |      |  3465.0  |  3475.2  |  3486.9  |  3528.4  
   |  |  3410.0  |  3482.0  |  3493.7  |  3444.9  
---|--|--|--|--|--
   |  |  3437.8  |  3413.1  |  3445.8  |  3481.6  
(1 << 6)   |      |  3470.1  |  3478.4  |  3538.5  |  3579.9  
   |  |  3450.8  |  3431.1  |  3509.0  |  3512.5  
---|--|--|--|--|--
   |

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev

> Could we split one freelist in hash to NUM_LOCK_PARTITIONS freelists?
> Each partition will have its own freelist and if freelist is empty
> then partition should search an entry in freelists of other
> partitions. To prevent concurrent access it's needed to add one
> LWLock to hash, each partition should lock LWlock in share mode to
> work with its own freelist and exclusive to work with other freelists.
> 
> Actually, I'd like to improve all partitioned hashes instead of
> improve only one case.

It seems to be a most promising direction of research for now. I will
send a patch and benchmark results soon.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev

> This idea can improve the situation with ProcLock hash table, but I
> think IIUC what Andres is suggesting would reduce the contention
> around dynahash freelist and can be helpful in many more situations
> including BufMapping locks.

I agree. But as I understand PostgreSQL community doesn't generally
approve big changes that affects whole system. Especially if original
problem was only in one particular place. Therefore for now I suggest
only a small change. Naturally if it will be accepted there is no
reason not to apply same changes for BufMapping or even dynahash itself
with corresponding PROCLOCK hash refactoring.

BTW could you (or anyone?) please help me find this thread regarding
BufMapping or perhaps provide a benchmark? I would like to reproduce
this issue but I can't find anything relevant in a mailing list. Also
it seems to be a good idea to compare alternative approaches that were
mentioned (atomics ops, group leader). Are there any discussions,
benchmarks or patches regarding this topic?

Frankly I have serious doubts regarding atomics ops since they will more
likely create the same contention that a spinlock does. But perhaps
there is a patch that works not the way I think it could work.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-28 Thread Aleksander Alekseev

Here is another preliminary result I would like to share. As always you
will find corresponding patch in attachment. It has work in progress
quality.

The idea was suggested by colleague of mine Aleksander Lebedev.

freeList is partitioned like in "no lock" patch. When there is no
enough free items in a freeList we borrow AN items from a global list.
When freeList become too large we return AN items back to global list.
This global list is also partitioned into PN partitions. Each partition
is protected by a spinlock. 

This way we have less lock contention than in "lwlock" or "spinlock
array" versions since we borrow multiple free elements, not one a time.
Also in worst case only AN*(NUM_LOCK_PARTITIONS-1) free items are not
used instead of (Total/NUM_LOCK_PARTITIONS)*(NUM_LOCK_PARTITIONS-1).

On 60-core server amount of TPS depends on AN and PN like this:

   ||||||
   | AN = 1 | AN = 2 | AN = 4 | AN = 8 | AN =16 | AN =32 
---||||||
   |  733.0 | 1120.6 | 1605.5 | 1842.5 | 1545.5 | 1237.0 
PN = 1 |  740.3 | 1127.0 | 1634.2 | 1800.8 | 1573.5 | 1245.1 
   |  742.9 | 1102.1 | 1647.2 | 1853.6 | 1533.4 | 1251.9 
---||||||
   | 1052.0 | 1438.1 | 1755.6 | 1981.0 | 2022.0 | 1816.8 
PN = 2 | 1044.8 | 1453.1 | 1784.0 | 1958.3 | 2033.2 | 1819.2 
   | 1028.7 | 1419.8 | 1809.2 | 1981.2 | 2028.2 | 1790.2 
---||||||
   | 1182.0 | 1521.5 | 1813.2 | 1932.6 | 2035.2 | 1948.4 
PN = 4 | 1212.4 | 1535.4 | 1816.8 | 1927.0 | 2018.7 | 2014.6 
   | 1189.4 | 1528.9 | 1816.9 | 1942.6 | 2011.9 | 2018.3 
---||||||
   | 1148.1 | 1522.2 | 1795.4 | 1926.6 | 2031.7 | 2015.6 
PN = 8 | 1175.6 | 1529.4 | 1807.6 | 1913.5 | 2007.3 | 2062.0 
   | 1169.9 | 1528.0 | 1796.3 | 1926.0 | 2011.1 | 2042.8 
---||||||
   | 1117.7 | 1491.0 | 1803.9 | 1925.3 | 2029.4 | 2056.2 
PN =16 | 1132.8 | 1481.0 | 1809.6 | 1968.1 | 2033.8 | 2068.5 
   | 1131.4 | 1481.8 | 1819.4 | 1946.2 | 2071.1 | 2073.8 

AN = GLOBAL_FREE_LIST_ALLOC_NUMBER
PN = GLOBAL_FREE_LIST_PARTITIONS_NUM

There is no performance degradation on Core i7. By increasing PN or AN
any further we don't gain any more TPS.

As you may see this version is about 30% faster than "lwlock" or
"spinlock array" and 3.1 times faster than master. Still it's about 2.5
slower than "no locks" version which I find frustrating.

Next I will try to speedup this version by modifying global_free_list_*
procedures. Current implementations are not most efficient ones. Also
I'm planning to explore approaches which involve lock free algorithms.

I would like to know your opinion about such approach. For instance
could we spare AN*(NUM_LOCK_PARTITIONS-1) items in a worst case or we
can't by same reason we can't do it for (Total / NUM_LOCK_PARTITIONS) *
(NUM_LOCK_PARTITIONS-1)?
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..91fcc05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
+			  long init_size,	/* initial table size */ // AALEKSEEV: is ignored, refactor!
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..8375c3b 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -87,6 +87,8 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "storage/lock.h"
+#include "storage/lwlock.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -118,6 +120,13 @@ typedef HASHELEMENT *HASHBUCKET;
 /* A hash segment is an array of bucket headers */
 typedef HASHBUCKET *HASHSEGMENT;
 
+// AALEKSEEV: TODO: comment, should be power of two
+#define GLOBAL_FREE_LIST_PARTITIONS_NUM 16
+#define GLOBAL_FREE_LIST_PARTITIONS_MASK (GLOBAL_FREE_LIST_PARTITIONS_NUM-1)
+
+// AALEKSEEV: TODO: comment
+#define GLOBAL_FREE_LIST_ALLOC_NUMBER 16
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -129,11 +138,20 @@ typedef HASHBUCKET *HASHSEGMENT;
 struct HASHHDR
 {
 	/* In a partitioned table, take this lock to touch nentries or

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-24 Thread Aleksander Alekseev

Oops, wrong patches - here are correct ones.diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 0e7acbf..3330c8d 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -29,6 +29,156 @@
 #include "utils/snapmgr.h"
 
 /*
+ * ResourceArray is a common structure for storing different types of resources.
+ *
+ * ResourceOwner can own `HeapTuple`s, `Relation`s, `Snapshot`s, etc. For
+ * each resource type there are procedures ResourceOwnerRemember* and
+ * ResourceOwnerForget*. There are also ResourceOwnerEnlarge* procedures
+ * which should be called before corresponding ResourceOwnerRemember* calls
+ * (see below). Internally each type of resource is stored in separate
+ * ResourceArray.
+ */
+typedef struct ResourceArray
+{
+	Datum	   *itemsarr;		/* buffer for storing values */
+	uint32		capacity;		/* capacity of array */
+	uint32		nitems;			/* how many items is stored in items array */
+}	ResourceArray;
+
+/*
+ * This number is used as initial size of resource array. If given number of
+ * items is not enough, we double array size and reallocate memory.
+ */
+#define RESARRAY_INIT_SIZE 16
+
+/*
+ * Initialize ResourceArray
+ */
+static void
+ResourceArrayInit(ResourceArray * resarr)
+{
+	Assert(resarr->itemsarr == NULL);
+	Assert(resarr->capacity == 0);
+	Assert(resarr->nitems == 0);
+}
+
+/*
+ * Add a resource to ResourceArray
+ *
+ * Caller must have previously done ResourceArrayEnlarge()
+ */
+static void
+ResourceArrayAdd(ResourceArray * resarr, Datum data)
+{
+	Assert(resarr->capacity > 0);
+	Assert(resarr->itemsarr != NULL);
+	Assert(resarr->nitems < resarr->capacity);
+
+	resarr->itemsarr[resarr->nitems] = data;
+	resarr->nitems++;
+}
+
+/*
+ * Remove a resource from ResourceArray
+ *
+ * Returns true on success, false if resource was not found
+ */
+static bool
+ResourceArrayRemove(ResourceArray * resarr, Datum data)
+{
+	int			i,
+j,
+lastidx;
+
+	Assert(resarr->capacity > 0);
+	Assert(resarr->itemsarr != NULL);
+
+	lastidx = ((int) resarr->nitems) - 1;
+
+	for (i = lastidx; i >= 0; i--)
+	{
+		if (resarr->itemsarr[i] == data)
+		{
+			for (j = i; j < lastidx; j++)
+resarr->itemsarr[j] = resarr->itemsarr[j + 1];
+			resarr->nitems--;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+/*
+ * Make sure there is a room for at least one more resource in an array.
+ *
+ * This is separate from actually inserting a resource because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+static void
+ResourceArrayEnlarge(ResourceArray * resarr)
+{
+	uint32		i,
+oldcap,
+oldnitems;
+	Datum	   *olditemsarr;
+
+	if (resarr->nitems < resarr->capacity)
+		return;	/* nothing to do */
+
+	olditemsarr = resarr->itemsarr;
+	oldcap = resarr->capacity;
+	oldnitems = resarr->nitems;
+
+	resarr->capacity = oldcap > 0 ? oldcap * 2 : RESARRAY_INIT_SIZE;
+	resarr->itemsarr = (Datum *)
+		MemoryContextAlloc(TopMemoryContext,
+		   resarr->capacity * sizeof(Datum));
+	resarr->nitems = 0;
+
+	if (olditemsarr != NULL)
+	{
+		for (i = 0; i < oldnitems; i++)
+			ResourceArrayAdd(resarr, olditemsarr[i]);
+		pfree(olditemsarr);
+	}
+}
+
+/*
+ * Get any convenient element.
+ *
+ * Returns true on success, false on failure.
+ */
+static bool
+ResourceArrayGetAny(ResourceArray * resarr, Datum *out)
+{
+	if (resarr->nitems == 0)
+		return false;
+
+	Assert(resarr->capacity > 0);
+
+	*out = resarr->itemsarr[resarr->nitems - 1];
+	return true;
+}
+
+/*
+ * Return ResourceArray to initial state
+ */
+static void
+ResourceArrayFree(ResourceArray * resarr)
+{
+	Assert(resarr->nitems == 0);
+
+	resarr->capacity = 0;
+
+	if (!resarr->itemsarr)
+		return;
+
+	pfree(resarr->itemsarr);
+	resarr->itemsarr = NULL;
+}
+
+/*
  * To speed up bulk releasing or reassigning locks from a resource owner to
  * its parent, each resource owner has a small cache of locks it owns. The
  * lock manager has the same information in its local lock hash table, and
@@ -56,53 +206,22 @@ typedef struct ResourceOwnerData
 	ResourceOwner nextchild;	/* next child of same parent */
 	const char *name;			/* name (just for debugging) */
 
-	/* We have built-in support for remembering owned buffers */
-	int			nbuffers;		/* number of owned buffer pins */
-	Buffer	   *buffers;		/* dynamically allocated array */
-	int			maxbuffers;		/* currently allocated array size */
-
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
 	LOCALLOCK  *locks[MAX_RESOWNER_LOCKS];		/* list of owned locks */
 
-	/* We have built-in support for remembering catcache references */
-	int			ncatrefs;		/* number of owned catcache pins */
-	HeapTuple  *catrefs;		/* dynamically allocated array */
-	int			maxcatrefs;		/* currently allocated array size */
-
-	int			ncatlistrefs;	/* number of owned catcache-list pins */
-	CatCList  **catlistrefs;	/* dynamically allocated array */
-	int

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-24 Thread Aleksander Alekseev

I believe I fixed all flaws mentioned so far (see attachment).

Also I did a new benchmark to make sure that new patch makes PostgreSQL
work faster and doesn't cause performance degradation in some cases. 

"usual pgbench" row corresponds to `pgbench -j 8 -c 8 -T 30 pgbench`
performed on a 4-core PC.

"N partitions" rows correspond to a benchmark described in a first
message of this thread performed on the same PC. N is and argument
given to gen.pl script i.e. number of partitions in generated
partitioned table.

Here are results (3 tests, TPS excluding connections establishing):

Test|  Before   |   After   |  Delta avg   
|---|---|-
|295.7  |295.0  | 
usual pgbench   |303.1  |299.6  |   ~ 0%  
|297.7  |302.7  | 
|---|---|-
|  28022.3  |  27956.1  | 
10 partitions   |  27550.1  |  28916.9  |   ~ 0%  
|  28617.0  |  28022.9  | 
|---|---|-
|   3021.4  |   3184.0  | 
100 partitions  |   2949.1  |   3120.1  | 3% more TPS  
|   2870.6  |   2825.2  | 
|---|---|-
|106.7  |158.6  | 
1000 partitions |105.2  |168.4  | 53% more TPS 
|105.9  |162.0  | 


On Fri, 18 Dec 2015 13:39:01 -0500
Tom Lane  wrote:

> Robert Haas  writes:
> > I don't know, I'm still not very comfortable with this.  And Tom
> > didn't like dictating that hash_any() must be no-fail, though I'm
> > not sure why.
> 
> What I definitely didn't like was assuming at a distance that it would
> be no-fail.  If we're to depend on that, the patch had better attach
> a comment saying so to the header comments of the function(s) it's
> assuming that about.  Otherwise, somebody could hack up hashfunc.c
> in a way that breaks the assumption, without any clue that some code
> in a very-far-away module is critically reliant on it.
> 
> > Let's wait to see what others think.
> 
> A few observations:
> 
> * This bit is too cute by half, if not three-quarters:
> 
> + uint32  itemsizelg:2;   /* sizeof one
> item log 2 */
> + uint32  capacity:30;/* capacity of
> array */
> 
> Is there a good reason to assume that the only things we'll ever store
> in these arrays are of size no more than 8 bytes?  Are we so desperate
> to save space that we cannot spare two separate words for itemsize and
> capacity?  (ISTM it's a good bet that the extra code for accessing
> these bitfields occupies more space than would be saved, considering
> how few ResourceOwners typically exist at one time.)  Let's just make
> it a couple of ints and be done.  Actually, maybe nitems and capacity
> should be size_t, just in case.
> 
> * An alternative design would be to forget itemsizelg altogether and
> insist that everything stored in the resource arrays be a Datum,
> which could then be coerced to/from some form of integer or some form
> of pointer as appropriate.  That would waste some space in the int
> case, but it would considerably simplify both the ResourceArray code
> and the APIs to it, which might be worth the price of assuming we'll
> never store anything bigger than 8 bytes.  It also would make this
> look more like some existing APIs such as the on_exit callbacks.
> 
> * A lot of the code churn comes from the insistence on defining
> callbacks, which I'm dubious that we need.  We could instead have a
> function that is "get any convenient one of the array elements" and
> revise the loops in ResourceOwnerReleaseInternal to be like
> 
>   while ((item = getconvenientitem(resourcearray)))
>   {
>   drop item in exactly the same way as before
>   }
> 
> I find that preferable to the proposed ResourceArrayRemoveAll
> 
> + while (resarr->nitems > 0)
> + {
> + releasecb(resarr->itemsarr, isCommit);
> + }
> 
> which certainly looks like it's an infinite loop; it's assuming (again
> with no documentation) that the callback function will cause the array
> to get smaller somehow.  With the existing coding, it's much more
> clear why we think the loops will terminate.
> 
> * The reason that ResourceOwnerReleaseInternal was not horribly
> inefficient was that its notion of "any convenient one" of the items
> to be deleted next was in fact the one that the corresponding Forget
> function would examine first, thus avoiding an O(N^2) cost to
> re-identify the item to be dropped.  I think we should make an effort
> to be more explicit about that connection in any rewrite.  In
> particular, it looks to me like when a hash array is in use, things
> will get slower not faster because we'll be adding a hash lookup

Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2015-12-24 Thread Aleksander Alekseev

Hello, Tomas.

Great idea!

Did you consider to cache bloom filter or at least part(s) of it
somehow? I think this way we could gain some more TPS. This of course
assuming that creating a bloom filter is really a bottleneck here,
which would be nice be investigated first.

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-30 Thread Aleksander Alekseev

Here is a clean version of the patch. Step 1 is an optimization. Step 2
refactors this:

 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size,   /* initial table size */
+ long init_size,   /* initial table size XXX ignored,
 refactor! */


As I understand there is no performance degradation:

Before
==

Core i7, pgbench -j 8 -c 8 -T 50 pgbench:

```
number of transactions actually processed: 14315
latency average: 27.945 ms
latency stddev: 44.920 ms
tps = 286.097043 (including connections establishing)
tps = 286.127642 (excluding connections establishing)
```

60-core server, pgbench -j 64 -c 64 -T 50 pgbench:

```
number of transactions actually processed: 176127
latency average: 18.162 ms
latency stddev: 23.861 ms
tps = 3521.069672 (including connections establishing)
tps = 3522.166706 (excluding connections establishing)
```


After
=

Core i7, pgbench -j 8 -c 8 -T 50 pgbench:

```
number of transactions actually processed: 14212
latency average: 27.984 ms
latency stddev: 43.250 ms
tps = 284.038588 (including connections establishing)
tps = 285.112772 (excluding connections establishing)
```

60-core server, pgbench -j 64 -c 64 -T 50 pgbench:

```
number of transactions actually processed: 172135
latency average: 17.767 ms
latency stddev: 22.692 ms
tps = 3590.410302 (including connections establishing)
tps = 3594.607165 (excluding connections establishing)
```diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..d361dbb 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
+			  long init_size,	/* initial table size XXX ignored, refactor! */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..a50a2d3 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -87,6 +87,7 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "storage/lock.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -128,12 +129,21 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
+	/*
+	 * In a partitioned table take these locks to touch nentries or freeList.
+	 * If hash table is not partitioned mutexes are not used.
+	 */
+	slock_t		mutex[NUM_LOCK_PARTITIONS];
+
+	/*
+	 * These fields change during entry addition/deletion. If hash table is
+	 * not partitioned only nentries[0] and freeList[0] are used.
+	 */
 
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/* number of entries in hash table */
+	long		nentries[NUM_LOCK_PARTITIONS];
+	/* linked list of free elements */
+	HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];
 
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +176,8 @@ struct HASHHDR
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
 
+#define PARTITION_IDX(hctl, hashcode) (IS_PARTITIONED(hctl) ? LockHashPartition(hashcode) : 0)
+
 /*
  * Top control structure for a hashtable --- in a shared table, each backend
  * has its own copy (OK since no fields change at runtime)
@@ -219,10 +231,10 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int partition_idx);
 static bool

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-30 Thread Aleksander Alekseev

Agree. --enable-depend turned on by default could save me a lot of
time. Now I'm aware regarding --enable-depend and `make clean`. Still
other people will definitely do the same mistake over and over again.

On Wed, 30 Dec 2015 11:49:13 -0300
Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:

> Andres Freund wrote:
> > On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> > > Aleksander Alekseev wrote:
> > > 
> > > > Here is a funny thing - benchmark results I shared 22.12.2015
> > > > are wrong because I forgot to run `make clean` after changing
> > > > lwlock.h (autotools doesn't rebuild project properly after
> > > > changing .h files).
> > > 
> > > Running configure with --enable-depend should avoid this problem.
> > 
> > I still maintain that --enable-depend should be on by default.
> 
> +1  Let's do it.
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Aleksander Alekseev

> Oh, that's an interesting idea.  I guess the problem is that if the
> freelist is unshared, then users might get an error that the lock
> table is full when some other partition still has elements remaining.

True, but I don't believe it should be a real problem assuming we have
a strong hash function. If user get such an error it means that
database is under heavy load and there is not much more free elements
in other tables either. This particular user didn't get lucky, some
other will. Anyway administrator should do something about it -
fight DDoS attack, tune database parameters, etc.

> 3.5-4 more TPS, or 3.5 times more TPS?  Can you share the actual
> numbers?

Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-12 Thread Aleksander Alekseev

Hello, Álvaro

> So you were saying some posts upthread that the new hash tables would
> "borrow" AN elements from the freelist then put AN elements back when
> too many got unused.  Is that idea embodied in this patch?

Right. It didn't satisfy "use all memory" requirement anyway. It's
better to sacrifice a few TPS (according only to my benchmark which
doesn't represent all possible use cases) than accidentally break
someones system after upgrading to the next version of PostgreSQL. See
example with pg_dump given by Robert Haas above.

>> +/* Number of partitions of the shared buffer mapping hashtable */
>> +#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
>> +
>>  /* Number of partitions the shared predicate lock tables are divided
>> into */ #define LOG2_NUM_PREDICATELOCK_PARTITIONS  4
>>  #define NUM_PREDICATELOCK_PARTITIONS  (1 <<
>> LOG2_NUM_PREDICATELOCK_PARTITIONS)  
>
> I find this odd.  Why is it a good idea to define the bufMapping
> partitions like this? What's the explanation for the performance
> numbers you saw?

My mistake. I thought that last table was self-explanatory.

We see that a single spinlock for accessing a freeList (current
implementation) is obviously a bottleneck. Replacing it with say an
array of spinlocks reduces lock contention and therefore gives more TPS
(see first row). Also we can increase concurrency level even further by
increasing number of lock partitions (see columns "no locks", "lwlock"
and "spinlock array"). Previously it couldn't help us (see "master"
column) because of a bottleneck.

If you have any other questions regarding this patch please don't
hesitate to ask.

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-08 Thread Aleksander Alekseev

Hello, Robert.

> So: do we have clear evidence that we need 128 partitions here, or
> might, say, 16 work just fine?

Yes, this number of partitions was chosen based on this benchmark (see
"spinlock array" column):

http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu

In fact we can choose 16, 32, 64 or 128 partitions - any number works
better than current implementation with a single spinlock. But
according to this benchmark 128 partitions give twice as much TPS then
16 partitions, almost as if we didn't have any locks at all (see "no
locks" column).

Still I'm a bit concerned that 128 partitions require (128-16)*
( sizeof(slock_t) + sizeof(void*) + sizeof(long) ) bytes of additional
memory per hash table which is:

(gdb) p (128-16)*(sizeof(slock_t) + sizeof(long) + sizeof(void*))
$1 = 1904

... bytes of shared memory on amd64.

I'm not sure if this is considered OK. If it is I suggest to leave
number 128. If it's not, perhaps we could agree on say 64 partitions as
a compromise. Described benchmark doesn't represent any possible
workload anyway. I personally believe that we don't have so many small
hash tables that 1.8 Kb of additional shared memory per hash table would
be an issue.

> I don't really want to increase the number of lock manager partition
> locks to 128 just for the convenience of this patch, and even less do
> I want to impose the requirement that every future shared hash table
> must use an equal number of partitions.

May I ask why? Does it create a real problem?

> I don't see any reason why the number of free list partitions needs
> to be a constant, or why it needs to be equal to the number of hash
> bucket partitions.  That just seems like a completely unnecessary
> constraint. You can take the hash value, or whatever else you use to
> pick the partition, modulo any number you want. 

True, number of spinlocks / freeLists could be a variable. I believe
it's an old-good simplicity vs flexibility question.

Lets say we allow user (i.e. calling side) to choose spinlocks and free
lists number. So now we should use freeList[hash % items_number] formula
which means division instead of binary shift. Do we really need such an
expensive operation just to calculate index of a free list? Probably
not. Let limit possible items_number values so it could be only power
of two. Now there is only four possible values user would more likely to
choose (16, 32, 64, 128), which is not as flexible anymore.

What about memory? If we allocate mutex, nentries and freeList
dynamically depending on items_number value, HASHHDR should store
pointers to allocated memory which means additional indirection for
almost every access to mutex, nentries and freeList fields. Probably a
bad idea. Otherwise we still waste shared memory, probably even more
memory since we don't want maximum items_number value to be too low.

Etc.

I believe such a change will cause only additional code complexity
(code is already far from trivial in dynahash.c) so next time it will
be much harder to change anything, probably some bugs we will miss and
will not solve any real problem. Why just not to choose a constant which
seems to be good enough instead?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-12 Thread Aleksander Alekseev

Hello, Robert

> It also strikes me that it's probably quite likely that slock_t
> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> For example, on a system where slock_t is just one byte, most likely
> all of those mutexes are going to be in the same cache line, which
> means you're going to get a LOT of false sharing.  It seems like it
> would be sensible to define a new struct that contains an slock_t, a
> long, and a HASHELEMENT *, and then make an array of those structs.
> That wouldn't completely eliminate false sharing, but it would reduce
> it quite a bit.  My guess is that if you did that, you could reduce
> the number of freelists to 8 or less and get pretty much the same
> performance benefit that you're getting right now with 32. And that,
> too, seems likely to be good for single-client performance.

I experimented for a while trying to fit every spinlock in a separate
cache line. Indeed we can gain some speedup this way. Here are
benchmark results on 12-core server for NUM_LOCK_PARTITIONS = 32 (in
this case difference is more notable):

| FREELISTS | SIZE =  32 | SIZE =  64 | SIZE = 128 | SIZE = 256 |
|---|||||
| 4 |   +25.4%   |   +28.7%   |   +28.4%   |   +28.3%   |
| 8 |   +28.2%   |   +29.4%   |   +31.7%   |   +31.4%   |
|16 |   +29.9%   |   +32.6%   |   +31.6%   |   +30.8%   |
|32 |   +33.7%   |   +34.2%   |   +33.6%   |   +32.6%   |

Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to
align FREELIST structure, see attached patch). Cache line on this CPU
is 64 bytes according to /sys/devices/system/cpu/cpu0/cache/index0/*

I would like to remind that without these changes we had the following
picture (please note "NLP = 32" column):

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

 DMN | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-|--|--|--|--
   8 |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%  
  16 |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%  
  32 |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%  
  64 |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%  
 128 |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%  

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

So its +34.2% vs +33.9% and the number of freeLists remains the same.

> I am however wondering if it to set the freelist affinity based on
> something other than the hash value, like say the PID, so that the
> same process doesn't keep switching to a different freelist for every
> buffer eviction.

Also I tried to use PID to determine freeList number:

```
#include 
#include 

...

#define FREELIST_IDX(hctl, hashcode) \
  (IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)

...

// now nentries could be negative in this case
// Assert(FREELIST(hctl, freelist_idx).nentries > 0);

```

Unfortunately this approach gives +33.9% TPS in best case. Also there
is a possibility that nentries will overflow. So far I don't have a
clear idea how to solve this issue effectively.

I'm not sure if further improvements worth an effort.diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..4d9ed3e 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -111,6 +111,12 @@
 #define DEF_DIRSIZE			   256
 #define DEF_FFACTOR			   1	/* default fill factor */
 
+/* Number of freelists to be used for a partitioned hash table. */
+#define NUM_FREELISTS			4
+
+// TODO: comment. Should be power of 2.
+// #define FREELIST_IDX_STEP 8
+#define FREELIST_BUFFER_SIZE 32
 
 /* A hash bucket is a linked list of HASHELEMENTs */
 typedef HASHELEMENT *HASHBUCKET;
@@ -118,6 +124,20 @@ typedef HASHELEMENT *HASHBUCKET;
 /* A hash segment is an array of bucket headers */
 typedef HASHBUCKET *HASHSEGMENT;
 
+// TODO: re-check comments
+// TODO: pgindent
+
+// TODO: comment!
+typedef struct
+{
+	slock_t mutex;	/* a spinlock */
+	long nentries;	/* number of entries */
+	HASHELEMENT* freeList; /* lists of free elements */
+} FREELIST;
+
+// TODO: comment
+typedef struct { uint8 x[FREELIST_BUFFER_SIZE]; } FREELISTBUFF;
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -128,12 +148,16 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-10 Thread Aleksander Alekseev

Hello, Robert

> Basically, the burden for you to impose a new coding rule on everybody
> who uses shared hash tables in the future is very high.

I fixed an issue you described. Number of spinlocks doesn't depend of
NUM_LOCK_PARTITIONS anymore and could be specified for each hash table
on a calling side.

I did a benchmark described in a first message of this thread again.
Currently I don't have access to the same 60-core server so I used more
common 12-core server (24 with HT). According to this benchmark TPS
increment depends on NUM_LOCK_PARTITIONS and default number of
spinlocks this way:

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

 DMN | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-|--|--|--|--
   8 |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%  
  16 |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%  
  32 |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%  
  64 |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%  
 128 |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%  

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

I realize this benchmark doesn't represent any possible workload so for
attached patch I choose NUM_LOCK_PARTITIONS = DEFAULT_MUTEXES_NUM = 32.
It seems to be a reasonable compromise of a speedup according to
"synthetic and meaningless in practice" benchmark and number of used
locks which could mean quite a lot in practice. Still this values could
be easily changed in any moment.

Here are before/after benchmark results for this concrete patch.


BEFORE

pgbench (default):

tps = 1295.798531 (including connections establishing)
tps = 1295.858295 (excluding connections establishing)

pgbench -f pgbench.sql:

tps = 1020.072172 (including connections establishing)
tps = 1020.116888 (excluding connections establishing)


AFTER

pgbench (default):

tps = 1299.369807 (including connections establishing)
tps = 1299.429764 (excluding connections establishing)

pgbench -f pgbench.sql:

tps = 1365.749333 (including connections establishing)
tps = 1365.814384 (excluding connections establishing)


So as I understand this patch solves a lock contention problem and
doesn't make anything else worse.diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index dffc477..9a15a2a 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-			  pgss_max, pgss_max,
+			  pgss_max,
 			  ,
 			  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-  size, size,
+  size,
   ,
   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-			   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+			   SHMEM_INDEX_SIZE,
 			   , hash_flags);
 }
 
@@ -255,17 +255,12 @@ InitShmemIndex(void)
  * exceeded substantially (since it's used to compute directory size and
  * the hash table buckets will get overfull).
  *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e3e9599..fc20a67 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -374,18 +374,10 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-11 Thread Aleksander Alekseev

Hello, Robert

> Thanks, this looks way better.  Some more comments:
> 
> - I don't think there's any good reason for this patch to change the
> calling convention for ShmemInitHash().  Maybe that's a good idea and
> maybe it isn't, but it's a separate issue from what this patch is
> doing, and if we're going to do it at all, it should be discussed
> separately.  Let's leave it out of this patch.
> 
> - I would not provide an option to change the number of freelist
> mutexes.  Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have
> FREELIST_MUTEXES_NUM.  The value of 32 which you selected is fine with
> me.
> 
> - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an
> independent change.  Again, leave it out of this patch and submit it
> separately with its own benchmarking data if you want to argue for it.
> 
> I think if you make these changes this patch will be quite a bit
> smaller and in pretty good shape.
> 
> Thanks for working on this.
> 

Here is a corrected patch. I decided to keep a small change in
InitLocks procedure (lock.c) since ShmemInitHash description clearly
states "For a table whose maximum size is certain, [init_size] should
be equal to max_size". Also I added two items to my TODO list to send
more patches as soon (and if) this patch will be applied.

Thanks for your contribution to this patch.diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e3e9599..5296e77 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -374,18 +374,10 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		max_table_size;
 	bool		found;
 
 	/*
-	 * Compute init/max size to request for lock hashtables.  Note these
-	 * calculations must agree with LockShmemSize!
-	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
-
-	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
 	 * information.
 	 */
@@ -393,16 +385,16 @@ InitLocks(void)
 	info.keysize = sizeof(LOCKTAG);
 	info.entrysize = sizeof(LOCK);
 	info.num_partitions = NUM_LOCK_PARTITIONS;
+	max_table_size = NLOCKENTS();
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
+	   max_table_size,
 	   max_table_size,
 	   ,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
 	max_table_size *= 2;
-	init_table_size *= 2;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -414,7 +406,7 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-		   init_table_size,
+		   max_table_size,
 		   max_table_size,
 		   ,
  HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..1a3244a 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -111,6 +111,11 @@
 #define DEF_DIRSIZE			   256
 #define DEF_FFACTOR			   1	/* default fill factor */
 
+/*
+ * Number of mutexes and respectively number of nentries and freeLists
+ * (see below). Should be power of 2.
+ */
+#define MUTEXES_NUM32
 
 /* A hash bucket is a linked list of HASHELEMENTs */
 typedef HASHELEMENT *HASHBUCKET;
@@ -128,12 +133,23 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
-
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/*
+	 * There are two fields declared below: nentries and freeList. nentries
+	 * stores current number of entries in a hash table. freeList is a linked
+	 * list of free elements.
+	 *
+	 * To keep these fields consistent in a partitioned table we need to
+	 * synchronize access to them using a spinlock. But it turned out that a
+	 * single spinlock can create a bottleneck. To prevent lock contention an
+	 * array of MUTEXES_NUM spinlocks is used. Each spinlock protects one
+	 * element of nentries and freeList arrays.
+	 *
+	 * If hash table is not partitioned only nentries[0] and

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-29 Thread Aleksander Alekseev

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

> Okay, now for some badness.  I've restored a database containing 2
> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
> rows with a sequential id column.  I get a problem if I try to delete
> many rows from it:
> # delete from contacts where id % 3 != 0 ;
> WARNING:  out of shared memory
> WARNING:  out of shared memory
> WARNING:  out of shared memory

I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-22 Thread Aleksander Alekseev

Hi,

> First of all, why not merge both patches into one? They aren't too
> big anyway.

Agree.

> I think comments should be changed, to be more informative here.

> Add a comment here too.

> Maybe you should explain this magic number 7 in the comment above?

Done.

> Then, this thread became too tangled. I think it's worth to write a
> new message with the patch, the test script, some results and brief
> overview of how does it really works. It will make following review
> much easier.

Sure.

HASHHDR represents a hash table. It could be usual or partitioned.
Partitioned table is stored in a shared memory and accessed by multiple
processes simultaneously. To prevent data corruption hash table is
partitioned and each process has to acquire a lock for a corresponding
partition before accessing data in it. Number of partition is determine
by lower bits of key's hash value. Most tricky part is --- dynahash
knows nothing about these locks, all locking is done on calling side.

Since shared memory is pre-allocated and can't grow to allocate memory
in a hash table we use freeList. Also we use nentries field to count
current number of elements in a hash table. Since hash table is used by
multiple processes these fields are protected by a spinlock. Which as
it turned out could cause lock contention and create a bottleneck.

After trying a few approaches I discovered that using one spinlock per
partition works quite well. Here are last benchmark results:

http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu

Note that "no locks" solution cant be used because it doesn't guarantee
that all available memory will be used in some corner cases.

You can find a few more details and a test script in the first message
of this thread. If you have any other questions regarding this patch
please don't hesitate to ask.

Best regards,
Aleksander

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9673fe0..0c8e4fb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-			  pgss_max, pgss_max,
+			  pgss_max,
 			  ,
 			  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-  size, size,
+  size,
   ,
   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-			   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+			   SHMEM_INDEX_SIZE,
 			   , hash_flags);
 }
 
@@ -255,17 +255,12 @@ InitShmemIndex(void)
  * exceeded substantially (since it's used to compute directory size and
  * the hash table buckets will get overfull).
  *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9c2e49c..8d9b36a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,8 +373,7 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		max_table_size;
 	bool		found;
 
 	/*
@@ -382,7 +381,6 @@ InitLocks(void)
 	 * calculations must agree with LockShmemSize!
 	 */
 	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-22 Thread Aleksander Alekseev

This patch affects header files. By any chance didn't you forget to run
`make clean` after applying it? As we discussed above, when you
change .h files autotools doesn't rebuild dependent .c files:

http://www.postgresql.org/message-id/caf4au4y4mtapp2u3ugtbqd_z0chesjwfnavrgk4umfcwa4e...@mail.gmail.com

On Thu, 21 Jan 2016 16:49:12 +0530
Dilip Kumar <dilipbal...@gmail.com> wrote:

> On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev <
> a.aleks...@postgrespro.ru> wrote:
> 
> >
> > increasing number of lock partitions (see columns "no locks",
> > "lwlock" and "spinlock array"). Previously it couldn't help us (see
> > "master" column) because of a bottleneck.
> >
> > If you have any other questions regarding this patch please don't
> > hesitate to ask.
> >
> 
> I have done some performance bench marking for this
> patch.(dynahash-optimization-v10-step1.patch)
> 
> Machine Detail:
> cpu   : POWER8
> cores: 24 (192 with HT)
> 
> Non Default Parameter:
> 
> Shared Buffer= 30GB
> max_wal_size= 10GB
> max_connections=500
> 
> Test1:
> *schema.sql* and *pgbench.sql* are same files which Aleksander has
> attached in first mail of the thread.
> 
> psql -d postgres -f schema.sql
> pgbench -c$ -j$ -f pgbench.sql  postgres
> 
> clientbasepatch
> 1145145
> 2262258
> 4453472
> 8749732
> 1611141128
> 3217271789
> 6427292793
> 12835343520
> 
> With this test i did not see any improvement, though i think it should
> improve the performance, do you have any suggestion to see the
> results same as yours ?
> 
> I have also dump stacks using some script and I have observed many
> stacks dumps as you mentioned in initial thread. And after patch, I
> found that number of lock waiting stacks are reduced.
> 
> Stack Dump:
> ---
> #0  0x7f25842899a7 in semop () from /lib64/libc.so.6
> #1  0x007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at
> pg_sema.c:387
> #2  0x0078955f in LWLockAcquire (lock=0x7f247698a980,
> mode=LW_EXCLUSIVE) at lwlock.c:1028
> #3  0x007804c4 in LockAcquireExtended
> #4  0x0077fe91 in LockAcquire
> #5  0x0077e862 in LockRelationOid
> #6  0x0053bc7b in find_inheritance_children
> #7  0x0053bd4f in find_all_inheritors
> #8  0x006fc0a2 in expand_inherited_rtentry
> #9  0x006fbf91 in expand_inherited_tables
> 
> I have tried to analyze using perf also, I can see that amount of time
> taken in hash_search_with_hash_value is reduced from 3.86%(master) to
> 1.72%(patch).
> 
> I have plan to do further investigation, in different scenarios of
> dynahash.
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2016-01-25 Thread Aleksander Alekseev


> Um, that's not too surprising, because they're exactly the same patch?

Wrong diff. Here is correct one. 

Everything else is right. I just re-checked :)

step2a:

number of transactions actually processed: 16325
latency average: 49.008 ms
latency stddev: 8.780 ms
tps = 163.182594 (including connections establishing)
tps = 163.189818 (excluding connections establishing)

step2b-fixed:

number of transactions actually processed: 16374
latency average: 48.867 ms
latency stddev: 8.223 ms
tps = 163.658269 (including connections establishing)
tps = 163.666318 (excluding connections establishing)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index dabaf5a..39f3115 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -297,6 +297,9 @@ hashvarlena(PG_FUNCTION_ARGS)
  * of 2.  There is no need to do mod a prime (mod is sooo slow!).
  * If you need less than 32 bits, use a bitmask.
  *
+ * This procedure never fails. Its important. Some code notably ResourceOwner
+ * relies on this.
+ *
  * Note: we could easily change this function to return a 64-bit hash value
  * by using the final values of both b and c.  b is perhaps a little less
  * well mixed than c, however.
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 69217b5..02319ef 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -37,17 +37,35 @@
  * which should be called before corresponding ResourceOwnerRemember* calls
  * (see below). Internally each type of resource is stored in separate
  * ResourceArray.
+ *
+ * There are two major reasons for using ResourceArray instead of, say,
+ * regular C arrays.
+ *
+ * Firstly we would like to prevent code duplication. For instance
+ * ResourceArray provides generic Remember/Forget/Enlarge procedures, so
+ * corresponding ResourceOwner* procedures are just a typesafe wrappers for
+ * these procedures.
+ *
+ * Secondly ResourceArray must be more efficient than regular C array.
+ * Current implementation in general could be considered a hash table. It has
+ * O(1) complexity of both Remember and Forget procedures.
  */
 typedef struct ResourceArray
 {
 	Datum	   *itemsarr;		/* buffer for storing values */
+	Datum		invalidval;		/* value that is considered invalid */
 	uint32		capacity;		/* capacity of array */
 	uint32		nitems;			/* how many items is stored in items array */
+	uint32		maxitems;		/* precalculated RESARRAY_MAX_ITEMS(capacity) */
+	uint32		lastidx;		/* index of last item returned by GetAny */
 }	ResourceArray;
 
 /*
  * This number is used as initial size of resource array. If given number of
  * items is not enough, we double array size and reallocate memory.
+ *
+ * Should be power of two since we use (arrsize - 1) as mask for hash value.
+ *
  */
 #define RESARRAY_INIT_SIZE 16
 
@@ -64,14 +82,27 @@ typedef struct ResourceArray
 #define DatumGetBuffer(datum)((Buffer)(datum))
 
 /*
+ * How many items could be stored in a resource array of given capacity. If
+ * this number is reached we need to resize an array to prevent hash collisions.
+ *
+ * This computation actually costs only two additions and one binary shift.
+ */
+#define RESARRAY_MAX_ITEMS(capacity) ((capacity)*3/4)
+
+/*
  * Initialize ResourceArray
  */
 static void
-ResourceArrayInit(ResourceArray * resarr)
+ResourceArrayInit(ResourceArray * resarr, Datum invalidval)
 {
 	Assert(resarr->itemsarr == NULL);
 	Assert(resarr->capacity == 0);
 	Assert(resarr->nitems == 0);
+	Assert(resarr->maxitems == 0);
+	Assert(resarr->invalidval == 0);
+	Assert(resarr->lastidx == 0);
+
+	resarr->invalidval = invalidval;
 }
 
 /*
@@ -82,11 +113,32 @@ ResourceArrayInit(ResourceArray * resarr)
 static void
 ResourceArrayAdd(ResourceArray * resarr, Datum data)
 {
+	Datum		idx;
+	Datum		mask = resarr->capacity - 1;
+
+	Assert(resarr->maxitems > resarr->nitems);
 	Assert(resarr->capacity > 0);
 	Assert(resarr->itemsarr != NULL);
-	Assert(resarr->nitems < resarr->capacity);
+	Assert(data != resarr->invalidval);
+
+	/* For small arrays don't calculate hashes */
+	if (resarr->capacity == RESARRAY_INIT_SIZE)
+	{
+		resarr->itemsarr[resarr->nitems] = data;
+		resarr->nitems++;
+		return;
+	}
+
+	idx = hash_any((void *) , sizeof(data)) & mask;
+
+	while (true)
+	{
+		if (resarr->itemsarr[idx] == resarr->invalidval)
+			break;
+		idx = (idx + 1) & mask;
+	}
 
-	resarr->itemsarr[resarr->nitems] = data;
+	resarr->itemsarr[idx] = data;
 	resarr->nitems++;
 }
 
@@ -98,24 +150,45 @@ ResourceArrayAdd(ResourceArray * resarr, Datum data)
 static bool
 ResourceArrayRemove(ResourceArray * resarr, Datum data)
 {
-	int			i,
+	uint32		i,
 j,
 lastidx;
+	Datum		idx;
+	Datum		mask = resarr->capacity - 1;
 
 	Assert(resarr->capacity > 0);
 	Assert(resarr->itemsarr != NULL);
+	Assert(data != resarr->invalidval);
+
+	/* For small arrays don't calculate hashes */
+	if (resarr->capacity == RESARRAY_INIT_SIZE)
+

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2016-01-25 Thread Aleksander Alekseev

Hello, Tom

> Also, there are defined ways to convert between Datum format and
> other formats (DatumGetPointer() etc).  This code isn't using 'em.

Fixed. But I was not sure how to deal with File and Buffer types since
they are ints (see fd.h and buf.h) and there is no DatumGetInt macro,
only DatumGetInt32/Int64. I don't know if there is a good reason for
this. So for now I just added these definitions right into resowner.c:

```
/* Convert File to Datum */
#define FileGetDatum(file) ((Datum)(file))

/* Convert Datum to File */
#define DatumGetFile(datum)((File)(datum))

/* Convert Buffer to Datum */
#define BufferGetDatum(buffer)((Datum)(buffer))

/* Convert Datum to Buffer */
#define DatumGetBuffer(datum)((Buffer)(datum))
```

... to make all code look similar and all intentions --- clear.

I have a feeling you could suggest a better solution :)

> This is certainly not doing what I had in mind for communication
> between ResourceOwnerGetAny and ResourceOwnerRelease, because the
> latter pays zero attention to "lastidx", but instead does a blind
> hash search regardless.
> 
> In general, I'm suspicious of the impact of this patch on "normal"
> cases with not-very-large resource arrays.  It might be hard to
> measure that because in such cases resowner.c is not a bottleneck;
> but that doesn't mean I want to give up performance in cases that
> perform well today.  You could probably set up a test harness that
> exercises ResourceOwnerAdd/Release/etc in isolation and get good
> timings for them that way, to confirm or disprove whether there's
> a performance loss for small arrays.
> 
> I still suspect it would be advantageous to have two different
> operating modes depending on the size of an individual resource
> array, and not bother with hashing until you got to more than a
> couple dozen entries. Given that you're rehashing anyway when you
> enlarge the array, this wouldn't cost anything except a few more
> lines of code, ISTM --- and you already have a net code savings
> because of the refactoring.

To be honest I don't have much faith in such micro benchmarks. They
don't consider how code will be executed under real load. Therefore any
results of such a benchmark wouldn't give us a clear answer which
solution is preferable. Besides different users can execute the same
code in different fashion.

I compared two implementations - "always use hashing" (step2a.path) and
"use hashing only for large arrays" (step2b.path). Both patches give
the same performance according to benchmark I described in a first
message of this thread.

Since none of these patches is better in respect of problem I'm trying
to solve I suggest to use step2b.path. It seems to be a good idea not to
use hashing for searching in small arrays. Besides in this case
behaviour of PostgreSQL will be changed less after applying a patch.
Users will probably appreciate this.

Best regards,
Aleksanderdiff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index b1d71b5..69217b5 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -29,6 +29,168 @@
 #include "utils/snapmgr.h"
 
 /*
+ * ResourceArray is a common structure for storing different types of resources.
+ *
+ * ResourceOwner can own `HeapTuple`s, `Relation`s, `Snapshot`s, etc. For
+ * each resource type there are procedures ResourceOwnerRemember* and
+ * ResourceOwnerForget*. There are also ResourceOwnerEnlarge* procedures
+ * which should be called before corresponding ResourceOwnerRemember* calls
+ * (see below). Internally each type of resource is stored in separate
+ * ResourceArray.
+ */
+typedef struct ResourceArray
+{
+	Datum	   *itemsarr;		/* buffer for storing values */
+	uint32		capacity;		/* capacity of array */
+	uint32		nitems;			/* how many items is stored in items array */
+}	ResourceArray;
+
+/*
+ * This number is used as initial size of resource array. If given number of
+ * items is not enough, we double array size and reallocate memory.
+ */
+#define RESARRAY_INIT_SIZE 16
+
+/* Convert File to Datum */
+#define FileGetDatum(file) ((Datum)(file))
+
+/* Convert Datum to File */
+#define DatumGetFile(datum)((File)(datum))
+
+/* Convert Buffer to Datum */
+#define BufferGetDatum(buffer)((Datum)(buffer))
+
+/* Convert Datum to Buffer */
+#define DatumGetBuffer(datum)((Buffer)(datum))
+
+/*
+ * Initialize ResourceArray
+ */
+static void
+ResourceArrayInit(ResourceArray * resarr)
+{
+	Assert(resarr->itemsarr == NULL);
+	Assert(resarr->capacity == 0);
+	Assert(resarr->nitems == 0);
+}
+
+/*
+ * Add a resource to ResourceArray
+ *
+ * Caller must have previously done ResourceArrayEnlarge()
+ */
+static void
+ResourceArrayAdd(ResourceArray * resarr, Datum data)
+{
+	Assert(resarr->capacity > 0);
+	Assert(resarr->itemsarr != NULL);
+	Assert(resarr->nitems < resarr->capacity);
+
+	resarr->itemsarr[resarr->nitems] = data;
+	resarr->nitems++;
+}
+
+/*
+ * Remove a resource from ResourceArray
+ *
+ *

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-27 Thread Aleksander Alekseev

> > This patch affects header files. By any chance didn't you forget to
> > run `make clean` after applying it? As we discussed above, when you
> > change .h files autotools doesn't rebuild dependent .c files:
> >
> 
> Yes, actually i always compile using "make clean;make -j20; make
> install" If you want i will run it again may be today or tomorrow and
> post the result.
> 
> 

Most likely HASHHDR.mutex is not only bottleneck in your case so my
patch doesn't help much. Unfortunately I don't have access to any
POWER8 server so I can't investigate this issue. I suggest to use a
gettimeofday trick I described in a first message of this thread. Its
time consuming but it gives a clear understanding which code is keeping
a lock.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2016-01-27 Thread Aleksander Alekseev

Hello, Tom.

I'm a bit concerned regarding assumption that sizeof int never exceeds 4
bytes. While this could be true today for most C compilers, standard
[1][2] doesn't guarantee that. Perhaps we should add something like:

StaticAssertStmt(sizeof(int) <= sizeof(int32),
"int size exceeds int32 size");

It costs nothing but could save a lot of time (not mentioning data
loss) some unlucky user.

[1] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
[2] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-27 Thread Aleksander Alekseev

> This comment certainly requires some changes.

Fixed.

> BTW, could you explain why init_table_size was two times less than 
> max_table_size?

I have no clue. My best guess is that it was a reasonable thing to do in
the past. Then somebody changed a code and now there is little reason
to use init_table_size for partitioned tables.

> Why did you delete these two lines? I wonder if you should rewrite
> them instead?

```
MemSet(hctl, 0, sizeof(HASHHDR));
 
-   hctl->nentries = 0;
-   hctl->freeList = NULL;
```

These fields were initialized with zero values twice. It makes little
sense to me. 

> As far as I understood, this number was obtained experimentally?
> Maybe you should note that in the comment.

These numbers are very platform specific and will be outdated very
soon. I recall that my code was criticized for including exact numbers
not a long time ago. So I suggest to keep this part as is.

> For example, if you have nelem=25 and partitions_number=6.
> 25 / 6 = 4. And then you allocate 24 nelems, while 1 is lost.

Agree. Fixed.

> Except mentioned notes, I suppose the patch is good enough

I guess I will mark this patch as "Ready for Committer" then.
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9673fe0..0c8e4fb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-			  pgss_max, pgss_max,
+			  pgss_max,
 			  ,
 			  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-  size, size,
+  size,
   ,
   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-			   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+			   SHMEM_INDEX_SIZE,
 			   , hash_flags);
 }
 
@@ -255,17 +255,12 @@ InitShmemIndex(void)
  * exceeded substantially (since it's used to compute directory size and
  * the hash table buckets will get overfull).
  *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9c2e49c..8585a76 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,18 +373,10 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		max_table_size;
 	bool		found;
 
 	/*
-	 * Compute init/max size to request for lock hashtables.  Note these
-	 * calculations must agree with LockShmemSize!
-	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
-
-	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
 	 * information.
 	 */
@@ -392,16 +384,15 @@ InitLocks(void)
 	info.keysize = sizeof(LOCKTAG);
 	info.entrysize = sizeof(LOCK);
 	info.num_partitions = NUM_LOCK_PARTITIONS;
+	max_table_size = NLOCKENTS();
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
 	   max_table_size,
 	   ,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
 	max_table_size *= 2;
-	init_table_size *= 2;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -413,7 +404,6 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;

[HACKERS] Small patch for pgstat.c: fix comment + pgindent

2016-03-10 Thread Aleksander Alekseev

Hello

I noticed:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b6fb6471f6afaf649e52f38269fd8c5c60647669

... that comments for procedures pgstat_progress_update_param and
pgstat_progress_end_command are identical. Here is a patch that fixes
this. Also one empty line added after running pgindent on this file.

Best regards,
Aleksanderdiff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ce5da3e..5005fe5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2733,6 +2733,7 @@ pgstat_bestart(void)
 	beentry->st_activity[pgstat_track_activity_query_size - 1] = '\0';
 	beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
 	beentry->st_progress_command_target = InvalidOid;
+
 	/*
 	 * we don't zero st_progress_param here to save cycles; nobody should
 	 * examine it until st_progress_command has been set to something other
@@ -2904,7 +2905,7 @@ pgstat_progress_update_param(int index, int64 val)
 /*---
  * pgstat_progress_end_command() -
  *
- * Update index'th member in st_progress_param[] of own backend entry.
+ * Clear command progress indicator of own backend entry.
  *---
  */
 void

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Small patch: fix warnings during compilation on FreeBSD

2016-03-10 Thread Aleksander Alekseev

Hello

I noticed that master branch of PostgreSQL currently compiles with
warnings on FreeBSD 10.2 RELEASE:

```
pg_locale.c:1290:12: warning: implicit declaration of function
'wcstombs_l' is invalid in C99 [-Wimplicit-function-declaration]

result = wcstombs_l(to, from, tolen, locale);

pg_locale.c:1365:13: warning: implicit declaration of function
'mbstowcs_l' is invalid in C99 [-Wimplicit-function-declaration]

result = mbstowcs_l(to, str, tolen, locale);

2 warnings generated.
```

The problem is that both procedures are declared in xlocale.h file
which is not included in given context. If I remember correctly in
this case compiler assumes that procedures return `int` (instead of
size_t) so I doubt we can ignore this issue.

Frankly I'm not sure what is a right way of fixing this. My best guess
is attached to this message. I tested this patch on Ubuntu Linux 14.04
and FreeBSD 10.2. It solves a described problem and apparently doesn't
break anything (code compiles, regression tests pass, etc).

Best regards,
Aleksanderdiff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 2e6dba1..2d2146c 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,7 +15,15 @@
 #include 
 #ifdef LOCALE_T_IN_XLOCALE
 #include 
-#endif
+#else			/* LOCALE_T_IN_XLOCALE */
+#ifdef HAVE_MBSTOWCS_L
+#include 
+#else
+#ifdef HAVE_WCSTOMBS_L
+#include 
+#endif   /* HAVE_WCSTOMBS_L */
+#endif   /* HAVE_MBSTOWCS_L */
+#endif   /* LOCALE_T_IN_XLOCALE */
 
 #include "utils/guc.h"
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Small patch: fix warnings during compilation on FreeBSD

2016-03-15 Thread Aleksander Alekseev

> Please verify that the committed version solves your problem on
> FreeBSD.

I confirm this patch solves a problem.

> I've checked this on my OS X box, which turns out to have the
> interesting property that xlocale.h declares wcstombs_l(), but only
> if you previously included stdlib.h ... wtf?

Same on FreeBSD.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Small patch: fix comments in contrib/pg_trgm/

2016-03-18 Thread Aleksander Alekseev

Here are a few more patches.

On Wed, 16 Mar 2016 18:11:04 +0300
Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:

> Typos for the most part.
> 



-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 3f013e3..c9e5270 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -258,7 +258,7 @@ timestamp_recv(PG_FUNCTION_ARGS)
  errmsg("timestamp cannot be NaN")));
 #endif
 
-	/* rangecheck: see if timestamp_out would like it */
+	/* range check: see if timestamp_out would like it */
 	if (TIMESTAMP_NOT_FINITE(timestamp))
 		 /* ok */ ;
 	else if (timestamp2tm(timestamp, NULL, tm, , NULL, NULL) != 0 ||
@@ -792,7 +792,7 @@ timestamptz_recv(PG_FUNCTION_ARGS)
 	timestamp = (TimestampTz) pq_getmsgfloat8(buf);
 #endif
 
-	/* rangecheck: see if timestamptz_out would like it */
+	/* range check: see if timestamptz_out would like it */
 	if (TIMESTAMP_NOT_FINITE(timestamp))
 		 /* ok */ ;
 	else if (timestamp2tm(timestamp, , tm, , NULL, NULL) != 0 ||
@@ -1435,7 +1435,7 @@ AdjustIntervalForTypmod(Interval *interval, int32 typmod)
 		else
 			elog(ERROR, "unrecognized interval typmod: %d", typmod);
 
-		/* Need to adjust subsecond precision? */
+		/* Need to adjust sub-second precision? */
 		if (precision != INTERVAL_FULL_PRECISION)
 		{
 			if (precision < 0 || precision > MAX_INTERVAL_PRECISION)
@@ -1635,7 +1635,7 @@ IntegerTimestampToTimestampTz(int64 timestamp)
  * Both inputs must be ordinary finite timestamps (in current usage,
  * they'll be results from GetCurrentTimestamp()).
  *
- * We expect start_time <= stop_time.  If not, we return zeroes; for current
+ * We expect start_time <= stop_time.  If not, we return zeros; for current
  * callers there is no need to be tense about which way division rounds on
  * negative inputs.
  */
@@ -2276,7 +2276,7 @@ timestamp_hash(PG_FUNCTION_ARGS)
 
 
 /*
- * Crosstype comparison functions for timestamp vs timestamptz
+ * Cross-type comparison functions for timestamp vs timestamptz
  */
 
 Datum
@@ -2678,7 +2678,7 @@ overlaps_timestamp(PG_FUNCTION_ARGS)
 	{
 		/*
 		 * For ts1 = ts2 the spec says te1 <> te2 OR te1 = te2, which is a
-		 * rather silly way of saying "true if both are nonnull, else null".
+		 * rather silly way of saying "true if both are non-null, else null".
 		 */
 		if (te1IsNull || te2IsNull)
 			PG_RETURN_NULL();
@@ -2996,7 +2996,7 @@ timestamp_pl_interval(PG_FUNCTION_ARGS)
 		(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
 		 errmsg("timestamp out of range")));
 
-			/* Add days by converting to and from julian */
+			/* Add days by converting to and from Julian */
 			julian = date2j(tm->tm_year, tm->tm_mon, tm->tm_mday) + span->day;
 			j2date(julian, >tm_year, >tm_mon, >tm_mday);
 
@@ -3104,7 +3104,7 @@ timestamptz_pl_interval(PG_FUNCTION_ARGS)
 		(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
 		 errmsg("timestamp out of range")));
 
-			/* Add days by converting to and from julian */
+			/* Add days by converting to and from Julian */
 			julian = date2j(tm->tm_year, tm->tm_mon, tm->tm_mday) + span->day;
 			j2date(julian, >tm_year, >tm_mon, >tm_mday);
 
@@ -3309,7 +3309,7 @@ interval_mul(PG_FUNCTION_ARGS)
 	/*
 	 * The above correctly handles the whole-number part of the month and day
 	 * products, but we have to do something with any fractional part
-	 * resulting when the factor is nonintegral.  We cascade the fractions
+	 * resulting when the factor is non-integral.  We cascade the fractions
 	 * down to lower units using the conversion factors DAYS_PER_MONTH and
 	 * SECS_PER_DAY.  Note we do NOT cascade up, since we are not forced to do
 	 * so by the representation.  The user can choose to cascade up later,
@@ -3319,7 +3319,7 @@ interval_mul(PG_FUNCTION_ARGS)
 	/*
 	 * Fractional months full days into days.
 	 *
-	 * Floating point calculation are inherently inprecise, so these
+	 * Floating point calculation are inherently imprecise, so these
 	 * calculations are crafted to produce the most reliable result possible.
 	 * TSROUND() is needed to more accurately produce whole numbers where
 	 * appropriate.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a325943..9d60a45 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1634,7 +1634,7 @@ static struct config_bool ConfigureNamesBool[] =
 
 	{
 		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
-			gettext_noop("Add sequence number to syslog messags to avoid duplicate suppression."),
+			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
 			NULL
 		},
 		_sequence_numbers,
@@ -2681,7 +2681,7 @@ static struct config_int ConfigureNamesInt[] =
 
 	{
 		{"ssl_renegotiation_limit

Re: [HACKERS] Small patch: fix warnings during compilation on FreeBSD

2016-03-15 Thread Aleksander Alekseev

> Yeah.  In practice, there are exactly two cases we care about: either
> both of these functions will be declared in  like POSIX
> says, or both of them will be in .  There's no need to
> work harder than we have to do to figure that out.
> 
> I'm totally unimpressed with the proposal of depending on the
> __FreeBSD__ macro instead of having a proper configure check.  For
> one thing, we have no idea whether NetBSD or OpenBSD have this same
> issue.  For another, it might be version-specific, or might become so
> if FreeBSD decides to start following POSIX on this point someday.

OK, I'm not an expert in Autotools but this patch (see attachment) seems
to solve a problem.

Here are some notes.

Unfortunately test program requires two include files:

```
#include 
#include 

int main()
{
  size_t tmp = wcstombs_l(NULL, NULL, 0, 0);
  return 0;
}
```

Thus I need two checks - 1) that test program compiles when xlocal.h is
included 2) that test program does not compile if only stdlib.h is
included (what if wcstombs_l is actually declared there?).

On Ubuntu 14.04:

```
$ ./configure
...
checking whether wcstombs_l is available if stdlib.h is included... no
checking whether wcstombs_l is available if stdlib.h and xlocale.h are
included... no ...

$ cat ./src/include/pg_config.h | grep WCSTOMBS_L_IN_XLOCALE
/* #undef WCSTOMBS_L_IN_XLOCALE */

$ make -j2 -s
Writing postgres.bki
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
Writing fmgroids.h
Writing fmgrtab.c
In file included from gram.y:14933:0:
scan.c: In function ‘yy_try_NUL_trans’:
scan.c:10321:23: warning: unused variable ‘yyg’ [-Wunused-variable]
 struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var
may be unused depending upon options. */ ^
$ make check

...
===
 All 161 tests passed. 
===

```

On FreeBSD 10.2:

```
$ ./configure
...
checking whether wcstombs_l is available if stdlib.h is included... no
checking whether wcstombs_l is available if stdlib.h and xlocale.h are
included... yes ...

$ cat ./src/include/pg_config.h | grep WCSTOMBS_L_IN_XLOCALE
#define WCSTOMBS_L_IN_XLOCALE 1

$ gmake -j2 -s
Writing postgres.bki
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
Writing fmgroids.h
Writing fmgrtab.c

$ gmake check

...
===
 All 161 tests passed. 
===
```

As you can see warnings are gone. Warning on Ubuntu was always there
and as comment suggests is irrelevant in current context.

Please note that these changes:

```
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 <<
31) << 31))
```

... were generated but `autoreconf -iv`. I was not sure what to do
about them. Eventually I decided to keep them. Still these changes could
be safely discarded.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/config/c-library.m4 b/config/c-library.m4
index 1d28c45..8140a03 100644
--- a/config/c-library.m4
+++ b/config/c-library.m4
@@ -317,3 +317,40 @@ if test "$pgac_cv_type_locale_t" = 'yes (in xlocale.h)'; then
   AC_DEFINE(LOCALE_T_IN_XLOCALE, 1,
 [Define to 1 if `locale_t' requires .])
 fi])])# PGAC_HEADER_XLOCALE
+
+# PGAC_C_WCSTOMBS_L_IN_XLOCALE
+# -
+# Check if wcstombs_l, mbstowcs_l, etc are declared in xlocale.h
+# and define WCSTOMBS_L_IN_XLOCALE if so.
+#
+AC_DEFUN([PGAC_C_WCSTOMBS_L_IN_XLOCALE],
+[
+saved_cflags=$CFLAGS
+CFLAGS=-Werror
+
+AC_CACHE_CHECK(whether wcstombs_l is available if stdlib.h is included, pgac_cv_wcstombs_l_stdlib_compiles,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM(
+[#include ],
+[size_t tmp = wcstombs_l(NULL, NULL, 0, 0);])],
+[pgac_cv_wcstombs_l_stdlib_compiles=yes],
+[pgac_cv_wcstombs_l_stdlib_compiles=no])]
+)
+
+AC_CACHE_CHECK(whether wcstombs_l is available if stdlib.h and xlocale.h are included, pgac_cv_wcstombs_l_xlocale_compiles,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM(
+[#include 
+#include ],
+[size_t tmp = wcstombs_l(NULL, NULL, 0, 0);])],
+[pgac_cv_wcstombs_l_xlocale_compiles=yes],
+[pgac_cv_wcstombs_l_xlocale_compiles=no])]
+)
+
+CFLAGS=$saved_cflags
+
+if test x"$pgac_cv_wcstombs_l_xlocale_compiles" = xyes ; then
+if test x"$pgac_cv_wcstombs_l_stdlib_compiles" = xno ; then
+AC_DEFINE(WCSTOMBS_L_IN_XLOCALE, 1,
+  [Define to 1 if wcstombs_l, mbstowcs_l, etc are declared in xlocale.h.])
+fi
+fi])# PGAC_C_WCSTOMBS_L_IN_XLOCALE
+
diff --git a/configure b/configure
index 0e51ac7..033224a 100755
--- a/configure
+++ b/configure
@@ -11132,6 +11132,76 @@ $as_echo "#define FLEXIBLE_ARRAY_MEMBER /**/" >>confdefs.h
 
   fi
 
+
+saved_cflags=$CFLAGS
+CFLAGS=-Werror
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether wcstombs_l is available if stdlib.h is included" >&5
+$as_echo_n "checking whether wcstom

Re: [HACKERS] Small patch: fix warnings during compilation on FreeBSD

2016-03-11 Thread Aleksander Alekseev

Hello, Tom

> I think what we need is configure logic to find out where wcstombs_l()
> is declared, and then #include  only if it's necessary to
> get that definition.  I haven't experimented but probably you could
> make such a check with nested uses of AC_CHECK_DECL.

Sounds like quite a dirty hack to me. Besides so far we have only two
procedures from xlocale.h and this requires two checks. If we go this
way someday there will be 15 checks for every procedure from xlocale.h
and logic like:

```
#if PROC1_DEFINED_IN_XLOCALE || PROC2_DEFINED_IN_XLOCALE ...
#include 
#endif
```

Perhaps we could just use __FreeBSD__ macro instead (see attachments)?
Or maybe create our own macro ALWAYS_INCLUDE_XLOCALE in configure
script which currently will depend only on used OS? Naturally this
definition could be changed in the future.

Best regards,
Aleksander

diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 2e6dba1..2bb5cbc 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -13,7 +13,8 @@
 #define _PG_LOCALE_
 
 #include 
-#ifdef LOCALE_T_IN_XLOCALE
+
+#if defined(LOCALE_T_IN_XLOCALE) || defined(__FreeBSD__)
 #include 
 #endif
 
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 2e6dba1..6363ca8 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,6 +15,10 @@
 #include 
 #ifdef LOCALE_T_IN_XLOCALE
 #include 
+#else
+#ifdef __FreeBSD__
+#include 
+#endif
 #endif
 
 #include "utils/guc.h"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Small patch: fix double variable initializations in policy.c

2016-03-16 Thread Aleksander Alekseev

Hello

I discovered a pretty weird code.

policy.c:1083

```
char   *qual_value;
ParseState *qual_pstate = make_parsestate(NULL);

/* parsestate is built just to build the range table */
qual_pstate = make_parsestate(NULL);
```

policy.c:1125

```
char   *with_check_value;
ParseState *with_check_pstate = make_parsestate(NULL);

/* parsestate is built just to build the range table */
with_check_pstate = make_parsestate(NULL);
```

I'm quite sure that there is no need to initialize these variables
twice. First patch fixes this. Also I discovered that policy.c is not
properly pgindent'ed. Second patch fixes this too.

Naturally I checked that after applying these patches PostgreSQL still
compiles and pass all regression tests.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index bb735ac..dfc6f00 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -1081,10 +1081,8 @@ AlterPolicy(AlterPolicyStmt *stmt)
 		if (!attr_isnull)
 		{
 			char	   *qual_value;
-			ParseState *qual_pstate = make_parsestate(NULL);
-
 			/* parsestate is built just to build the range table */
-			qual_pstate = make_parsestate(NULL);
+			ParseState *qual_pstate = make_parsestate(NULL);
 
 			qual_value = TextDatumGetCString(value_datum);
 			qual = stringToNode(qual_value);
@@ -1122,10 +1120,8 @@ AlterPolicy(AlterPolicyStmt *stmt)
 		if (!attr_isnull)
 		{
 			char	   *with_check_value;
-			ParseState *with_check_pstate = make_parsestate(NULL);
-
 			/* parsestate is built just to build the range table */
-			with_check_pstate = make_parsestate(NULL);
+			ParseState *with_check_pstate = make_parsestate(NULL);
 
 			with_check_value = TextDatumGetCString(value_datum);
 			with_check_qual = stringToNode(with_check_value);
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index dfc6f00..22fa344 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -496,7 +496,7 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
 
 	/* Must own relation. */
 	if (pg_class_ownercheck(relid, GetUserId()))
-		noperm = false; /* user is allowed to modify this policy */
+		noperm = false;			/* user is allowed to modify this policy */
 	else
 		ereport(WARNING,
 (errcode(ERRCODE_WARNING_PRIVILEGE_NOT_REVOKED),
@@ -511,15 +511,16 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
 	 */
 	if (!noperm && num_roles > 0)
 	{
-		int			i, j;
+		int			i,
+	j;
 		Oid		   *roles = (Oid *) ARR_DATA_PTR(policy_roles);
 		Datum	   *role_oids;
 		char	   *qual_value;
 		Node	   *qual_expr;
-		List   *qual_parse_rtable = NIL;
+		List	   *qual_parse_rtable = NIL;
 		char	   *with_check_value;
 		Node	   *with_check_qual;
-		List   *with_check_parse_rtable = NIL;
+		List	   *with_check_parse_rtable = NIL;
 		Datum		values[Natts_pg_policy];
 		bool		isnull[Natts_pg_policy];
 		bool		replaces[Natts_pg_policy];
@@ -536,15 +537,14 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
 
 		/*
 		 * All of the dependencies will be removed from the policy and then
-		 * re-added.  In order to get them correct, we need to extract out
-		 * the expressions in the policy and construct a parsestate just
-		 * enough to build the range table(s) to then pass to
-		 * recordDependencyOnExpr().
+		 * re-added.  In order to get them correct, we need to extract out the
+		 * expressions in the policy and construct a parsestate just enough to
+		 * build the range table(s) to then pass to recordDependencyOnExpr().
 		 */
 
 		/* Get policy qual, to update dependencies */
 		value_datum = heap_getattr(tuple, Anum_pg_policy_polqual,
-   RelationGetDescr(pg_policy_rel), _isnull);
+			  RelationGetDescr(pg_policy_rel), _isnull);
 		if (!attr_isnull)
 		{
 			ParseState *qual_pstate;
@@ -566,7 +566,7 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
 
 		/* Get WITH CHECK qual, to update dependencies */
 		value_datum = heap_getattr(tuple, Anum_pg_policy_polwithcheck,
-   RelationGetDescr(pg_policy_rel), _isnull);
+			  RelationGetDescr(pg_policy_rel), _isnull);
 		if (!attr_isnull)
 		{
 			ParseState *with_check_pstate;
@@ -665,7 +665,7 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id)
 
 	heap_close(pg_policy_rel, RowExclusiveLock);
 
-	return(noperm || num_roles > 0);
+	return (noperm || num_roles > 0);
 }
 
 /*
@@ -996,8 +996,8 @@ AlterPolicy(AlterPolicyStmt *stmt)
 
 	/* Get policy command */
 	polcmd_datum = heap_getattr(policy_tuple, Anum_pg_policy_polcmd,
-			 RelationGetDescr(pg_policy_rel),
-			 _isnull);
+RelationGetDescr(pg_policy_rel),
+_isnull);
 	Assert(!polcmd_isnull);
 	polcmd = DatumGetChar(polcmd_datum);
 
@@ -1029,15 +1029,15 @@ AlterPolicy(AlterPolicyStmt *stmt)
 	}
 	else
 	{
-		Oid*roles;
+		Oid		   *roles;
 		Datum		rol

Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-09 Thread Aleksander Alekseev

Hello, Michael

Thanks a lot for steps to reproduce you provided.

I tested your path on Ubuntu Linux 14.04 LTS (GCC 4.8.4) and FreeBSD
10.2 RELEASE (Clang 3.4.1). In both cases patch applies cleanly, there
are no warnings during compilation and all regression tests pass. A few
files are not properly pgindent'ed though:

```
diff --git a/src/interfaces/libpq/fe-exec.c
b/src/interfaces/libpq/fe-exec.c index c99f193..2769719 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2031,7 +2031,7 @@ PQexecFinish(PGconn *conn)
conn->status == CONNECTION_BAD)
break;
else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
- conn->asyncStatus == PGASYNC_COPY_OUT  ||
+ conn->asyncStatus == PGASYNC_COPY_OUT ||
  conn->asyncStatus == PGASYNC_COPY_BOTH) &&
 result->resultStatus == PGRES_FATAL_ERROR)
break;
diff --git a/src/interfaces/libpq/fe-protocol3.c
b/src/interfaces/libpq/fe-protocol3.c index 21a1d9b..280ca16 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -49,9 +49,9 @@ static intgetParamDescriptions(PGconn *conn, int
msgLength); static int getAnotherTuple(PGconn *conn, int msgLength);
 static int getParameterStatus(PGconn *conn);
 static int getNotify(PGconn *conn);
-static int getCopyStart(PGconn *conn,
-ExecStatusType copytype,
-int msgLength);
+static int getCopyStart(PGconn *conn,
+ExecStatusType copytype,
+int msgLength);
 static int getReadyForQuery(PGconn *conn);
 static void reportErrorPosition(PQExpBuffer msg, const char *query,
int loc, int encoding);
```

Indeed your patch solves issues you described. Still here is something
that concerns me:

```
$ gdb --args pg_receivexlog --verbose -D ./temp_data/

(gdb) b getCopyStart
Function "getCopyStart" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (getCopyStart) pending.
(gdb) r
Starting program: /usr/local/pgsql/bin/pg_receivexlog --verbose
-D ./temp_data/ [Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1". pg_receivexlog: starting log
streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610220, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398const char
*errmsg = NULL; (gdb) n
1400result = PQmakeEmptyPGresult(conn, copytype);
(gdb) 
1401if (!result)
(gdb) p result = 0
$1 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398const char
*errmsg = NULL; (gdb) n
1400result = PQmakeEmptyPGresult(conn, copytype);
(gdb) n
1401if (!result)
(gdb) p result = 0
$2 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398const char
*errmsg = NULL;
```

Granted this behaviour is a bit better then the current one. But
basically it's the same infinite loop only with pauses and warnings. I
wonder if this is a behaviour we really want. For instance wouldn't it
be better just to terminate an application in out-of-memory case? "Let
it crash" as Erlang programmers say.

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Yet another small patch - reorderbuffer.c:1099

2016-04-06 Thread Aleksander Alekseev

> > IMO the code is wrong.  There should be a single block comment
> > saying something like "Remove the node from its containing list.
> > In the FOO case, the list corresponds to BAR and therefore we
> > delete it because BAZ.  In the QUUX case the list is PLUGH and we
> > delete in because THUD." Then a single line dlist_delete(...)
> > follows.
> >
> > The current arrangement looks bizantine to me, for no reason.  If we
> > think that one of the two branches might do something additional to
> > the list deletion, surely that will be in a separate stanza with
> > its own comment; and if we ever want to remove the list deletion
> > from one of the two cases (something that strikes me as unlikely)
> > then we will need to fix the comment, too.
> 
> +1 to everything here except for the way byzantine is spelled.
> 

+1 and yet another patch.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 52c6986..360e6fd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1159,17 +1159,17 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->base_snapshot_lsn = InvalidXLogRecPtr;
 	}
 
-	/* delete from list of known subxacts */
-	if (txn->is_known_as_subxact)
-	{
-		/* NB: nsubxacts count of parent will be too high now */
-		dlist_delete(>node);
-	}
-	/* delete from LSN ordered list of toplevel TXNs */
-	else
-	{
-		dlist_delete(>node);
-	}
+	/*
+	 * Remove TXN from its containing list. Note that following line covers
+	 * two cases.
+	 *
+	 * If TXN is a subxact we delete it from list of known subxacts. NB:
+	 * nsubxacts count of parent becomes too high in this case.
+	 *
+	 * Otherwise we delete TXN from ordered list of toplevel TXNs.
+	 *
+	 */
+	dlist_delete(>node);
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Small patch for snapmgr.c

2016-04-08 Thread Aleksander Alekseev

Hello

Currently there is a following piece of code in snapmgr.c:

```
/* Copy all required fields */
snapshot = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
snapshot->satisfies = HeapTupleSatisfiesMVCC;
snapshot->xmin = serialized_snapshot->xmin;
snapshot->xmax = serialized_snapshot->xmax;
snapshot->xip = NULL;
snapshot->xcnt = serialized_snapshot->xcnt;
snapshot->subxip = NULL;
/* ... */

/* Copy XIDs, if present. */
if (serialized_snapshot->xcnt > 0) 
{
snapshot->xip = (TransactionId *) (snapshot + 1);
memcpy(snapshot->xip, serialized_xids,
   serialized_snapshot->xcnt * sizeof(TransactionId));
}

/* Copy SubXIDs, if present. */
if (serialized_snapshot->subxcnt > 0) 
{
snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
memcpy(snapshot->subxip, ...
```

It assumes that subxcnt > 0 iff xcnt > 0. As I understand this is true.
At least I hope so, otherwise this code can call memcpy passing NULL as
a first argument. But Clang Static Analyzer is not aware of all this:

http://afiskon.ru/s/db/5c956077e9_snapmgr.png

I propose a patch that makes static analyzers happy and makes intention
of this code a little more obvious.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index b88e012..329bbba 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1538,7 +1538,8 @@ RestoreSnapshot(char *start_address)
 	SerializedSnapshotData *serialized_snapshot;
 	Size		size;
 	Snapshot	snapshot;
-	TransactionId *serialized_xids;
+	TransactionId *serialized_xids,
+  *xids_start;
 
 	serialized_snapshot = (SerializedSnapshotData *) start_address;
 	serialized_xids = (TransactionId *)
@@ -1562,10 +1563,12 @@ RestoreSnapshot(char *start_address)
 	snapshot->takenDuringRecovery = serialized_snapshot->takenDuringRecovery;
 	snapshot->curcid = serialized_snapshot->curcid;
 
+	xids_start = (TransactionId *) (snapshot + 1);
+
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot->xcnt > 0)
 	{
-		snapshot->xip = (TransactionId *) (snapshot + 1);
+		snapshot->xip = xids_start;
 		memcpy(snapshot->xip, serialized_xids,
 			   serialized_snapshot->xcnt * sizeof(TransactionId));
 	}
@@ -1573,7 +1576,7 @@ RestoreSnapshot(char *start_address)
 	/* Copy SubXIDs, if present. */
 	if (serialized_snapshot->subxcnt > 0)
 	{
-		snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+		snapshot->subxip = xids_start + serialized_snapshot->xcnt;
 		memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
 			   serialized_snapshot->subxcnt * sizeof(TransactionId));
 	}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Small patch: fix comments in contrib/pg_trgm/

2016-03-19 Thread Aleksander Alekseev

Typos for the most part.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/contrib/pg_trgm/trgm_gin.c b/contrib/pg_trgm/trgm_gin.c
index ea8edef..2e41a9f 100644
--- a/contrib/pg_trgm/trgm_gin.c
+++ b/contrib/pg_trgm/trgm_gin.c
@@ -204,7 +204,7 @@ gin_trgm_consistent(PG_FUNCTION_ARGS)
 			 * And again, c (ntrue) is a lower bound of len2, but c <= len1
 			 * just by definition and, consequently, upper bound of
 			 * similarity is just c / len1.
-			 * So, independly on DIVUNION the upper bound formula is the same.
+			 * So, independently on DIVUNION the upper bound formula is the same.
 			 */
 			res = (nkeys == 0) ? false :
 ((float4) ntrue) / ((float4) nkeys))) >= similarity_threshold)
@@ -320,7 +320,7 @@ gin_trgm_triconsistent(PG_FUNCTION_ARGS)
 			else
 			{
 /*
- * As trigramsMatchGraph implements a montonic boolean function,
+ * As trigramsMatchGraph implements a monotonic boolean function,
  * promoting all GIN_MAYBE keys to GIN_TRUE will give a
  * conservative result.
  */
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index b2c1f6b..957def9 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -288,7 +288,7 @@ generate_trgm(char *str, int slen)
 }
 
 /*
- * Extract the next non-wildcard part of a search string, ie, a word bounded
+ * Extract the next non-wildcard part of a search string, i.e. a word bounded
  * by '_' or '%' meta-characters, non-word characters or string end.
  *
  * str: source string, of length lenstr bytes (need not be null-terminated)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Patch: typo, s/espaced/escaped/

2016-03-19 Thread Aleksander Alekseev

There is a typo in one word in this commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9a206d063c410df7cd5da01b169b23bff413fef5

Patch attached.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 2f5520c..a5eb42f 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -95,7 +95,7 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
 # to the characters.
 #
 # Group 1: plain "src" char. Empty if group 2 is not.
-# Group 2: unicode-espaced "src" char (e.g. "\u0110"). Empty if group 1 is not.
+# Group 2: unicode-escaped "src" char (e.g. "\u0110"). Empty if group 1 is not.
 #
 # Group 3: plain "trg" char. Empty if group 4 is not.
 # Group 4: plain "trg" char between quotes. Empty if group 3 is not.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-21 Thread Aleksander Alekseev

> This is the point where I think I am missing something about patch.
> As far as I can understand, it uses the same freelist index
> (freelist_idx) for allocating and putting back the entry, so I think
> the chance of increment in one list and decrement in another is there
> when the value of freelist_idx is calculated differently for the same
> input, is it so, or there is something else in patch which I am
> missing?

You are right, nentries _can't_ be negative unless we are using getpid()
for calculating freelist_idx, since same index of nentries[] is used
when we add (increment) and remove (decrement) element from/to hash
table. The fact that we also borrow elements from other freelists if
there is no more elements in our freelist doesn't change anything.

> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition.  So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).

This is an interesting idea. Still I strongly disagree that is should
be implemented in this concrete patch and discussed in this concrete
thread. Obviously such type of change deserves a separate research and
discussing since it has nothing to do with performance and since we
agreed that "change LOG2_NUM_LOCK_PARTITIONS value" and "change the
calling convention for ShmemInitHash()" patches should be implemented
separately. I added it to my TODO list.

Once again I suggest we merge this patch already:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrtymq5hdqdarf5j7o+abdowqhe...@mail.gmail.com

I have a strong feeling that we are just wasting our time here.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-03-22 Thread Aleksander Alekseev

> > It's possible that memset() would be more convincing.
> 
> +1

OK, here is corrected patch.


-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b48e97c..273e0b0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -337,6 +337,7 @@ gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
 	for (ptr = dist; ptr; ptr = ptr->next)
 		npage++;
 
+	memset(, 0, sizeof(xlrec));
 	xlrec.origrlink = origrlink;
 	xlrec.orignsn = orignsn;
 	xlrec.origleaf = page_is_leaf;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index f090ca5..1275b6c 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -204,6 +204,7 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 {
 	spgxlogAddLeaf xlrec;
 
+	memset(, 0, sizeof(xlrec));
 	xlrec.newPage = isNew;
 	xlrec.storesNulls = isNulls;
 
@@ -448,6 +449,8 @@ moveLeafs(Relation index, SpGistState *state,
 		i = it->nextOffset;
 	}
 
+	memset(, 0, sizeof(xlrec));
+
 	/* Find a leaf page that will hold them */
 	nbuf = SpGistGetBuffer(index, GBUF_LEAF | (isNulls ? GBUF_NULLS : 0),
 		   size, );
@@ -723,6 +726,7 @@ doPickSplit(Relation index, SpGistState *state,
 	newLeafs = (SpGistLeafTuple *) palloc(sizeof(SpGistLeafTuple) * n);
 	leafPageSelect = (uint8 *) palloc(sizeof(uint8) * n);
 
+	memset(, 0, sizeof(xlrec));
 	STORE_STATE(state, xlrec.stateSrc);
 
 	/*
@@ -1501,6 +1505,7 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	newInnerTuple = addNode(state, innerTuple, nodeLabel, nodeN);
 
 	/* Prepare WAL record */
+	memset(, 0, sizeof(xlrec));
 	STORE_STATE(state, xlrec.stateSrc);
 	xlrec.offnum = current->offnum;
 
@@ -1741,6 +1746,7 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 	postfixTuple->allTheSame = innerTuple->allTheSame;
 
 	/* prep data for WAL record */
+	memset(, 0, sizeof(xlrec));
 	xlrec.newPage = false;
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b119a47..a50a244 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4798,15 +4798,14 @@ BootStrapXLOG(void)
 	 * segment with logid=0 logseg=1. The very first WAL segment, 0/0, is not
 	 * used, so that we can use 0/0 to mean "before any valid WAL segment".
 	 */
+	memset(, 0, sizeof(checkPoint));
 	checkPoint.redo = XLogSegSize + SizeOfXLogLongPHD;
 	checkPoint.ThisTimeLineID = ThisTimeLineID;
 	checkPoint.PrevTimeLineID = ThisTimeLineID;
 	checkPoint.fullPageWrites = fullPageWrites;
-	checkPoint.nextXidEpoch = 0;
 	checkPoint.nextXid = FirstNormalTransactionId;
 	checkPoint.nextOid = FirstBootstrapObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 1ff5728..a10c078 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -669,6 +669,11 @@ destroy_tablespace_directories(Oid tablespaceoid, bool redo)
 	char	   *subfile;
 	struct stat st;
 
+	/*
+	 * Prevents MemorySanitizer's "use-of-uninitialized-value" warning
+	 */
+	memset(, 0, sizeof(st));
+
 	linkloc_with_version_dir = psprintf("pg_tblspc/%u/%s", tablespaceoid,
 		TABLESPACE_VERSION_DIRECTORY);
 
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 28f9fb5..45aa802 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -1008,14 +1008,9 @@ parse_hba_line(List *line, int line_num, char *raw_line)
 *cidr_slash = '\0';
 
 			/* Get the IP address either way */
+			memset(, 0, sizeof(hints));
 			hints.ai_flags = AI_NUMERICHOST;
 			hints.ai_family = AF_UNSPEC;
-			hints.ai_socktype = 0;
-			hints.ai_protocol = 0;
-			hints.ai_addrlen = 0;
-			hints.ai_canonname = NULL;
-			hints.ai_addr = NULL;
-			hints.ai_next = NULL;
 
 			ret = pg_getaddrinfo_all(str, NULL, , _result);
 			if (ret == 0 && gai_result)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1467355..ae7394c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -341,14 +341,10 @@ pgstat_init(void)
 	/*
 	 * Create the UDP socket for sending and receiving statistic messages
 	 */
+	memset(, 0, sizeof(hints));
 	hints.ai_flags = AI_PASSIVE;
 	hints.ai_family = AF_UNSPEC;
 	hints.ai_socktype = SOCK_DGRAM;
-	hints.ai_protocol = 0;
-	hints.ai_addrlen = 0;
-	hints.ai_addr = NULL;
-	hints.ai_canonname = NULL;
-	hints.ai_next = NULL;
 	ret = pg_getaddrinfo_all("localhost", NULL, , );
 	if (ret || !addrs)
 	{
diff --git a/src/backend/utils/cache/inval.c b/src/backen

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-23 Thread Aleksander Alekseev

> > I have a strong feeling that we are just wasting our time here.
> 
> That is possible.  However, I would like it if you would give me the
> benefit of the doubt and assume that, if I seem to be more cautious
> than you would be were you a committer, there might possibly be some
> good reasons for that.  The fact is that, despite being more cautious
> than some people think I should be, I still manage to introduce quite
> a number of bugs via the patches I commit - see the thread 'Missing
> rows with index scan when collation is not "C"' on pgsql-bugs for just
> the very latest example of that.  Nobody thinks that will happen with
> *their* patch, of course, but it does all the same.

Oh, it explains a lot! You see, I couldn't figure out whats happening.
Patch was discussed and reviewed a long time ago, everyone seems to be
reasonably happy with it, etc. But for some reason it's ignored for
weeks and no one tells why. Now it makes sense.

I should probably mention that this patch was merged to PostgresPro
build a few months ago. Several our clients are already using this code
in production environment (guess who discovered this issue and wanted
it to be fixed). There were no complains so far.

> I'd still like an answer to the question of why this helps so much
> when there must be huge amounts of false sharing between the different
> mutexes.  Maybe it doesn't matter, though.

Well, the reason is that there is no more bottleneck here. Code is
executed like 1% of the time here and 99% of the time somewhere else.
There is a false sharing but it's not as expensive as 99% of other
code. Thus optimizing 1% of the code any further doesn't give noticeable
performance improvement.

I still believe that optimizing 1% blindly without considering possible
side effects this optimization can bring (other data alignment, some
additional machine instructions - just to name a few) and having no
way to measure these side effects is a bad idea. But I also admit a
possibility that I can somehow be wrong on this. So I rewrote this
patch one again :'( the way you suggested (without that alignment
related hack I tried, it's just too ugly). I also attached original
hashhdr-rmh.patch just to have both patches in one message so it would
be easier to find both patches in this thread.

If there are any other questions or doubts regarding these patches
please don't hesitate to ask.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..06c413c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -111,6 +111,8 @@
 #define DEF_DIRSIZE			   256
 #define DEF_FFACTOR			   1	/* default fill factor */
 
+/* Number of freelists to be used for a partitioned hash table. */
+#define NUM_FREELISTS			32
 
 /* A hash bucket is a linked list of HASHELEMENTs */
 typedef HASHELEMENT *HASHBUCKET;
@@ -128,12 +130,17 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
-
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/*
+	 * The freelist can become a point of contention on high-concurrency hash
+	 * tables, so we use an array of freelist, each with its own mutex and
+	 * nentries count, instead of just a single one.
+	 *
+	 * If hash table is not partitioned only nentries[0] and freeList[0] are
+	 * used and spinlocks are not used at all.
+	 */
+	slock_t		mutex[NUM_FREELISTS];		/* array of spinlocks */
+	long		nentries[NUM_FREELISTS];	/* number of entries */
+	HASHELEMENT *freeList[NUM_FREELISTS]; /* lists of free elements */
 
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +173,9 @@ struct HASHHDR
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
 
+#define FREELIST_IDX(hctl, hashcode) \
+	(IS_PARTITIONED(hctl) ? hashcode % NUM_FREELISTS : 0)
+
 /*
  * Top control structure for a hashtable --- in a shared table, each backend
  * has its own copy (OK since no fields change a

[HACKERS] Small patch: Change calling convention for ShmemInitHash (and fix possible bug)

2016-03-24 Thread Aleksander Alekseev

Hello

I would like to continue discussion regarding changing calling
convention for ShmemInitHash procedure:

http://www.postgresql.org/message-id/CA+TgmoZm=uowt8a_xasfoogwufeelj861ntadiceopyfehv...@mail.gmail.com

Currently this procedure has two arguments --- init_size and max_size.
But since shared hash tables have fixed size there is little sense to
pass two arguments. In fact currently ShmemInitHash is always called
with init_size == max_size with only one exception, InitLocks procedure
(see lock.c), which I believe is actually a bug.

Patch is attached.

What do you think?

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 3d9b8e4..4213c3a 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-			  pgss_max, pgss_max,
+			  pgss_max,
 			  ,
 			  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-  size, size,
+  size,
   ,
   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..9df73b1 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-			   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+			   SHMEM_INDEX_SIZE,
 			   , hash_flags);
 }
 
@@ -250,14 +250,7 @@ InitShmemIndex(void)
  * table at once.  (In practice, all creations are done in the postmaster
  * process; child processes should always be attaching to existing tables.)
  *
- * max_size is the estimated maximum number of hashtable entries.  This is
- * not a hard limit, but the access efficiency will degrade if it is
- * exceeded substantially (since it's used to compute directory size and
- * the hash table buckets will get overfull).
- *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
+ * max_size is the estimated maximum number of hashtable entries.
  *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
@@ -265,7 +258,6 @@ InitShmemIndex(void)
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +291,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..df41b29 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -377,8 +377,7 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long 		max_table_size;
 	bool		found;
 
 	/*
@@ -386,7 +385,6 @@ InitLocks(void)
 	 * calculations must agree with LockShmemSize!
 	 */
 	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
@@ -398,14 +396,12 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
 	   max_table_size,
 	   ,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
 	max_table_size *= 2;
-	init_table_size *= 2;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -417,7 +413,6 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-		   init_table_size,
 		   max_table_size,
 		   ,
  HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/s

Re: [HACKERS] Small patch: Change calling convention for ShmemInitHash (and fix possible bug)

2016-03-25 Thread Aleksander Alekseev

> No, I think we left it that way on purpose.  I don't remember the
> discussion exactly, but I don't think it's hurting anything.

This was a part of original dynahash optimization patch. Since that
patch was about performance improvement and this concrete change is
about refactoring, not performance, we agreed to discuss it later as a
separate patch.

> I don't think this actually buys us anything.

For sure it doesn't make anything worse. Current code is just
confusing. I spent quite some time trying to figure out what is a
reason for passing two arguments before I realized - there is none.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Small patch: fix code duplication in heapam.c

2016-03-25 Thread Aleksander Alekseev

> I think this is a waste of time.  These functions are already very
> short; making them shorter will not significantly improve readability.
> It'll just force people who think they know what that code does to
> look at it again to see if it still does the same thing.
> 
> Let's spend our time arguing about changes that matter.  There are an
> infinite number of things like this that you could tinker with, but
> most of them are not worth tinkering with.

I must respectfully disagree. Granted, this is not a big issue and
we don't have to fix it right now. Probably next commit fest would be a
better time.

But this is not a huge patch that changing everything in unpredictable
way and requires a lot of hard thinking. We also have code review,
regression tests, alpha and beta tests to be reasonably sure that such
change doesn't break anything. (If not perhaps we should improve this
situation by introducing new ways of modular and property-based
testing, which I believe would be extremely useful say in case of
indexes, but this is a different story).

I don't believe we can afford to keep such a confusing code using
provided arguments as an excuse not to fixing it. ("OK, there are two
procedures that work differently... lets see... or not? well, thats
odd... lets make :vsplit and compare them line by line... damn, I spend
all that time to figure out that they are the same!") Otherwise such
"broken windows" will accumulate until it become a _real_ problem.
As we know from experience, to that time it's usually much harder to
fix anything than it is now.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Small patch: fix code duplication in heapam.c

2016-03-24 Thread Aleksander Alekseev

> All that code is hotspot stuff, and turning it into a pile of nested
> procedures doesn't seem like it improves either performance or
> readability.

Your concern regarding performance is understandable. But I should note
that any standard compiler supports inlining these days (probably
this statement is true for at least a decade now). Here is assembly code
of patched version of heap_open produced by GCC 4.8.4 with -O2 flag:

(lldb) disassemble 
postgres`heap_open:
0x497db0 <+0>:   pushq  %rbx
0x497db1 <+1>:   callq  0x497af0  ; relation_open(...)
0x497db6 <+6>:   movq   %rax, %rbx
->  0x497db9 <+9>:   movq   0x30(%rax), %rax
0x497dbd <+13>:  movzbl 0x6f(%rax), %eax
0x497dc1 <+17>:  cmpb   $0x69, %al; 'i', RELKIND_INDEX
0x497dc3 <+19>:  je 0x497dce
0x497dc5 <+21>:  cmpb   $0x63, %al; 'c', COMPOSITE_TYPE
0x497dc7 <+23>:  je 0x497dd7
0x497dc9 <+25>:  movq   %rbx, %rax
0x497dcc <+28>:  popq   %rbx
0x497dcd <+29>:  retq   

As you see heap_open_check_relation procedure was successfully inlined.
Just to be sure that less smart compilers will more likely apply this
optimization I updated patch with 'inline' hints (see attachment).

And even if compiler decide not to apply inlining in this case there is
much more to consider than presence or absence of one 'call' assembly
instruction. For instance compiler may believe that on this concrete
architecture it will be more beneficial to make code shorter so it
would fit to CPU cache better.

Anyway I don't believe that imaginary benchmarks are worth trusting. I
personally don't have much faith in non-imaginary benchmarks either but
it's a different story.

In the same time I'm deeply convinced that this patch will make code
more readable at least because it makes code much shorter:

 src/backend/access/heap/heapam.c | 109 +++---
 1 file changed, 39 insertions(+), 70 deletions(-)

Thus we can see more code on the screen. Besides since there is no code
duplication there is less change that somebody someday will change say
heap_openrv without updating heap_openrv_extended accordingly. 

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34ba385..c55e6a7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1188,13 +1188,17 @@ try_relation_open(Oid relationId, LOCKMODE lockmode)
 }
 
 /* 
- *		relation_openrv - open any relation specified by a RangeVar
+ *		relation_openrv_extended - open any relation specified by a RangeVar
  *
- *		Same as relation_open, but the relation is specified by a RangeVar.
+ *		Same as relation_openrv, but with an additional missing_ok argument
+ *		allowing a NULL return rather than an error if the relation is not
+ *		found.  (Note that some other causes, such as permissions problems,
+ *		will still result in an ereport.)
  * 
  */
-Relation
-relation_openrv(const RangeVar *relation, LOCKMODE lockmode)
+inline Relation
+relation_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
+		 bool missing_ok)
 {
 	Oid			relOid;
 
@@ -1213,43 +1217,26 @@ relation_openrv(const RangeVar *relation, LOCKMODE lockmode)
 		AcceptInvalidationMessages();
 
 	/* Look up and lock the appropriate relation using namespace search */
-	relOid = RangeVarGetRelid(relation, lockmode, false);
+	relOid = RangeVarGetRelid(relation, lockmode, missing_ok);
+
+	/* Return NULL on not-found */
+	if (!OidIsValid(relOid))
+		return NULL;
 
 	/* Let relation_open do the rest */
 	return relation_open(relOid, NoLock);
 }
 
 /* 
- *		relation_openrv_extended - open any relation specified by a RangeVar
+ *		relation_openrv - open any relation specified by a RangeVar
  *
- *		Same as relation_openrv, but with an additional missing_ok argument
- *		allowing a NULL return rather than an error if the relation is not
- *		found.  (Note that some other causes, such as permissions problems,
- *		will still result in an ereport.)
+ *		Same as relation_open, but the relation is specified by a RangeVar.
  * 
  */
 Relation
-relation_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
-		 bool missing_ok)
+relation_openrv(const RangeVar *relation, LOCKMODE lockmode)
 {
-	Oid			relOid;
-
-	/*
-	 * Check for shared-cache-inval messages before trying to open the
-	 * relation.  See comments in relation_openrv().
-	 */
-	if (lockmode != NoLock)
-		AcceptInvalidationMessages();
-
-	/* Look up and lock the appropriate relation using namespace search */
-	relOid = RangeVarGetRelid(relation, lockmode, missing_ok);
-
-	/* Return NULL on not-found */
-	if (!OidIsValid(relOid))
-		return NULL;
-
-	/* Let relation_open do the rest */
-	return relation_open(relOid, NoLock);
+	return relation_openrv_extended(relation, lockmode, false

Re: [HACKERS] Small patch: Change calling convention for ShmemInitHash (and fix possible bug)

2016-03-25 Thread Aleksander Alekseev

> In short: the error in Aleksander's argument is the assumption that
> shared hashtables have fixed size.  That's simply false.

Well this is a bit embarrassing but I have to admit that you are right.
Dynahash code is a bit non-trivial to say the least (let me guess -
there is no point of suggesting a patch that splits it into two or
three separate implementations for each use case, right? :) and I
misunderstood how it actually works. My apologies.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-03-21 Thread Aleksander Alekseev

> Well, the documentation already says to avoid it:
> 
> http://www.postgresql.org/docs/current/static/xfunc-c.html
> 
>Another important point is to avoid leaving any uninitialized
>bits within data type values; for example, take care to zero out
>any alignment padding bytes that might be present in structs.
> 
> so I don't think what you're suggesting would be controversial
> at all; it looks like what you've done is found a(t least one)
> bug where the documented practice wasn't followed, and it's good
> to find any such places.

Well in this case here is a patch that fixes "use of uninitialized
value" reports by MemorySanitizer I managed to catch so far.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b48e97c..3e81865 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -328,7 +328,7 @@ gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
 			  BlockNumber origrlink, GistNSN orignsn,
 			  Buffer leftchildbuf, bool markfollowright)
 {
-	gistxlogPageSplit xlrec;
+	gistxlogPageSplit xlrec = {0};
 	SplitedPageLayout *ptr;
 	int			npage = 0;
 	XLogRecPtr	recptr;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index f090ca5..7bab290 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -202,7 +202,7 @@ static void
 addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 		   SPPageDesc *current, SPPageDesc *parent, bool isNulls, bool isNew)
 {
-	spgxlogAddLeaf xlrec;
+	spgxlogAddLeaf xlrec = {0};
 
 	xlrec.newPage = isNew;
 	xlrec.storesNulls = isNulls;
@@ -400,7 +400,7 @@ moveLeafs(Relation index, SpGistState *state,
 	OffsetNumber *toDelete;
 	OffsetNumber *toInsert;
 	BlockNumber nblkno;
-	spgxlogMoveLeafs xlrec;
+	spgxlogMoveLeafs xlrec = {0};
 	char	   *leafdata,
 			   *leafptr;
 
@@ -701,7 +701,7 @@ doPickSplit(Relation index, SpGistState *state,
 	int			currentFreeSpace;
 	int			totalLeafSizes;
 	bool		allTheSame;
-	spgxlogPickSplit xlrec;
+	spgxlogPickSplit xlrec = {0};
 	char	   *leafdata,
 			   *leafptr;
 	SPPageDesc	saveCurrent;
@@ -1492,7 +1492,7 @@ spgAddNodeAction(Relation index, SpGistState *state,
  int nodeN, Datum nodeLabel)
 {
 	SpGistInnerTuple newInnerTuple;
-	spgxlogAddNode xlrec;
+	spgxlogAddNode xlrec = {0};
 
 	/* Should not be applied to nulls */
 	Assert(!SpGistPageStoresNulls(current->page));
@@ -1699,7 +1699,7 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 	BlockNumber postfixBlkno;
 	OffsetNumber postfixOffset;
 	int			i;
-	spgxlogSplitTuple xlrec;
+	spgxlogSplitTuple xlrec = {0};
 	Buffer		newBuffer = InvalidBuffer;
 
 	/* Should not be applied to nulls */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b119a47..50d3123 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4755,7 +4755,7 @@ XLOGShmemInit(void)
 void
 BootStrapXLOG(void)
 {
-	CheckPoint	checkPoint;
+	CheckPoint	checkPoint = {0};
 	char	   *buffer;
 	XLogPageHeader page;
 	XLogLongPageHeader longpage;
@@ -4802,11 +4802,9 @@ BootStrapXLOG(void)
 	checkPoint.ThisTimeLineID = ThisTimeLineID;
 	checkPoint.PrevTimeLineID = ThisTimeLineID;
 	checkPoint.fullPageWrites = fullPageWrites;
-	checkPoint.nextXidEpoch = 0;
 	checkPoint.nextXid = FirstNormalTransactionId;
 	checkPoint.nextOid = FirstBootstrapObjectId;
 	checkPoint.nextMulti = FirstMultiXactId;
-	checkPoint.nextMultiOffset = 0;
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 1ff5728..e27a18f 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -667,7 +667,7 @@ destroy_tablespace_directories(Oid tablespaceoid, bool redo)
 	DIR		   *dirdesc;
 	struct dirent *de;
 	char	   *subfile;
-	struct stat st;
+	struct stat st = {0};
 
 	linkloc_with_version_dir = psprintf("pg_tblspc/%u/%s", tablespaceoid,
 		TABLESPACE_VERSION_DIRECTORY);
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 28f9fb5..123f068 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -823,7 +823,7 @@ parse_hba_line(List *line, int line_num, char *raw_line)
 {
 	char	   *str;
 	struct addrinfo *gai_result;
-	struct addrinfo hints;
+	struct addrinfo hints = {0};
 	int			ret;
 	char	   *cidr_slash;
 	char	   *unsupauth;
@@ -1010,12 +1010,6 @@ parse_hba_line(List *line, int line_num, char *raw_line)
 			/* Get the IP address either way */
 			hints.ai_flags = AI_NUMERICHOST;
 			hints.ai_family = AF_UNSPEC;
-			hints.ai_socktype = 0;
-			hints.ai_protocol = 0;
-			hints.ai_addrlen = 0;
-			hints.ai_canonname = NULL;
-

[HACKERS] PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-03-21 Thread Aleksander Alekseev

Hello

I was playing with CLang sanitizers[1][2][3][4] recently and discovered
something disturbing regarding how PostgreSQL works. Here is an example.

Lets create a breakpoint right before filling a CheckPoint structure:

(gdb) b xlog.c:4772
Breakpoint 1 at 0x7ffbad0556d4: file xlog.c, line 4772.
(gdb) c
Continuing.

Fill checkpoint with some random data:

(gdb) p memset(, 0xEA, sizeof(checkPoint))
$1 = 1110817376
(gdb) x/80xb 
0x7ffc4235ba60: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235ba68: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235ba70: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235ba78: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235ba80: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235ba88: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235ba90: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235ba98: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235baa0: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7ffc4235baa8: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea

Wait until checkPoint will be initialized:

4805 checkPoint.redo = XLogSegSize + SizeOfXLogLongPHD;
(gdb) 
4806 checkPoint.ThisTimeLineID = ThisTimeLineID;
(gdb) 
4807 checkPoint.PrevTimeLineID = ThisTimeLineID;
(gdb)
...

Lets see what's in memory:

(gdb) x/80xb 
0x7ffc4235ba60: 0x28 0x00 0x00 0x01 0x00 0x00 0x00 0x00
0x7ffc4235ba68: 0x01 0x00 0x00 0x00 0x01 0x00 0x00 0x00
0x7ffc4235ba70: 0x01 0xea 0xea 0xea 0x00 0x00 0x00 0x00
0x7ffc4235ba78: 0x03 0x00 0x00 0x00 0x10 0x27 0x00 0x00
0x7ffc4235ba80: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7ffc4235ba88: 0x03 0x00 0x00 0x00 0x01 0x00 0x00 0x00
0x7ffc4235ba90: 0x01 0x00 0x00 0x00 0x01 0x00 0x00 0x00
0x7ffc4235ba98: 0x3d 0x0a 0xec 0x56 0x00 0x00 0x00 0x00
0x7ffc4235baa0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7ffc4235baa8: 0x00 0x00 0x00 0x00 0xea 0xea 0xea 0xea

As you may see there are "holes" that were in fact not filled. Under
normal conditions they will be filled with data previously stored on
stack which could be anything including passwords and other private
data. Afterwards this structure is written to disk where potentially
someone who not supposed to see this data may see it.

I realize this is not a big problem in practice. But from security
nerd's point o few this is a pretty severe security issue. I think we
should fix this and all other cases where MemorySanitizer reports
"use-of-uninitialized-value" error.

Also I suggest we make PostgreSQL pass _all_ sanitizers checks. They
are able to find quite nasty bugs including but not limited to memory
leaks, array out of bound accesses, data races, etc. Imagine we could
find all these bugs during regression tests run on buildfarms.

What do you think?

[1] http://clang.llvm.org/docs/MemorySanitizer.html
[2] http://clang.llvm.org/docs/AddressSanitizer.html
[3] http://clang.llvm.org/docs/LeakSanitizer.html
[4] http://clang.llvm.org/docs/ThreadSanitizer.html

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-03-21 Thread Aleksander Alekseev

> I'm new here so someone more experienced would have to weigh in,
> but I would wonder a couple of things:
> 
> a. whether a braced struct assignment is supported in every
>C compiler that PostgreSQL still intends to support
> 
> b. whether such a struct assignment is guaranteed to initialize
>padding spaces as well as declared fields (in all supported
>C versions/compilers).
> 
> It's possible that memset() would be more convincing.

Frankly I'm not sure regarding all supported C versions/compilers. But
it seems to be a valid ANSI C. Here is a test program:

```
#include 

typedef struct {
  int i;
  char c;
  long l;
  short s;
} MyStruct;

int main()
{
  int i, sum = 0;
  char *c;
  MyStruct s = {0};

  s.i = 11;
  s.c = 22;
  s.l = 33;
  s.s = 44;

  c = (char*)
  for(i = 0; i < sizeof(s); i++) {
sum += *c;
c++;
  }

  printf("Sum: %d\n", sum);

  return 0;
}
```

I compiled it with various versions of GCC and CLang with different
optimization flags:

clang38 -O3 -ansi -g t.c -o t
gcc -O0 -ansi -g t.c -o t

In all cases running a program under debugger shows that structure is
properly initialized:

(gdb) b main
Breakpoint 1 at 0x4007ae: file t.c, line 12.
(gdb) r
Starting program: /usr/home/eax/temp/t 

Breakpoint 1, main () at t.c:12
12   int i, sum = 0;
(gdb) p memset(, 0xEA, sizeof(MyStruct))
$1 = -5376
(gdb) x/24xb 
0x7fffeb00: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7fffeb08: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
0x7fffeb10: 0xea 0xea 0xea 0xea 0xea 0xea 0xea 0xea
(gdb) n
14   MyStruct s = {0};
(gdb) 
16   s.i = 11;
(gdb) x/24xb 
0x7fffeb00: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fffeb08: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x7fffeb10: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
(gdb) quit

Naturally we could use memset() as well. But I personally find it a bit
less readable. And in theory it doesn't prevent some _very_ "smart" C
compiler from not cleaning the whole structure anyway.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Add something in struct PGPROC

2016-03-23 Thread Aleksander Alekseev

> Hi
> I just add something in struct PGPORC, such as "int level", But I
> cann't seet what I added in MyProc while debugging. That's confused, 
> 
> 
> zhangxiaojian

I think you forgot to run `make clean` after changing proc.h. When you
change header files dependent files are not recompiled automatically. 

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-23 Thread Aleksander Alekseev

> Thanks!  I can't think of anything else to worry about with regards to
> that version, so I have committed it.
> 

Thanks, Robert. And thanks everyone for contributing to this patch.


-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Small patch: fix code duplication in heapam.c

2016-03-24 Thread Aleksander Alekseev

Hello

I discovered that there is a lot of code duplication in heapam.c.
In particular relation_openrv and relation_openrv_extended procedures
and also heap_openrv and heap_openrv_extended procedures are almost the
same. Here is a patch that fixes this.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34ba385..e9db6ad 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1188,13 +1188,17 @@ try_relation_open(Oid relationId, LOCKMODE lockmode)
 }
 
 /* 
- *		relation_openrv - open any relation specified by a RangeVar
+ *		relation_openrv_extended - open any relation specified by a RangeVar
  *
- *		Same as relation_open, but the relation is specified by a RangeVar.
+ *		Same as relation_openrv, but with an additional missing_ok argument
+ *		allowing a NULL return rather than an error if the relation is not
+ *		found.  (Note that some other causes, such as permissions problems,
+ *		will still result in an ereport.)
  * 
  */
 Relation
-relation_openrv(const RangeVar *relation, LOCKMODE lockmode)
+relation_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
+		 bool missing_ok)
 {
 	Oid			relOid;
 
@@ -1213,43 +1217,26 @@ relation_openrv(const RangeVar *relation, LOCKMODE lockmode)
 		AcceptInvalidationMessages();
 
 	/* Look up and lock the appropriate relation using namespace search */
-	relOid = RangeVarGetRelid(relation, lockmode, false);
+	relOid = RangeVarGetRelid(relation, lockmode, missing_ok);
+
+	/* Return NULL on not-found */
+	if (!OidIsValid(relOid))
+		return NULL;
 
 	/* Let relation_open do the rest */
 	return relation_open(relOid, NoLock);
 }
 
 /* 
- *		relation_openrv_extended - open any relation specified by a RangeVar
+ *		relation_openrv - open any relation specified by a RangeVar
  *
- *		Same as relation_openrv, but with an additional missing_ok argument
- *		allowing a NULL return rather than an error if the relation is not
- *		found.  (Note that some other causes, such as permissions problems,
- *		will still result in an ereport.)
+ *		Same as relation_open, but the relation is specified by a RangeVar.
  * 
  */
 Relation
-relation_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
-		 bool missing_ok)
+relation_openrv(const RangeVar *relation, LOCKMODE lockmode)
 {
-	Oid			relOid;
-
-	/*
-	 * Check for shared-cache-inval messages before trying to open the
-	 * relation.  See comments in relation_openrv().
-	 */
-	if (lockmode != NoLock)
-		AcceptInvalidationMessages();
-
-	/* Look up and lock the appropriate relation using namespace search */
-	relOid = RangeVarGetRelid(relation, lockmode, missing_ok);
-
-	/* Return NULL on not-found */
-	if (!OidIsValid(relOid))
-		return NULL;
-
-	/* Let relation_open do the rest */
-	return relation_open(relOid, NoLock);
+	return relation_openrv_extended(relation, lockmode, false);
 }
 
 /* 
@@ -1275,6 +1262,24 @@ relation_close(Relation relation, LOCKMODE lockmode)
 		UnlockRelationId(, lockmode);
 }
 
+/*
+ * Check Relation after opening. Internal procedure used by heap_open and
+ * heap_openrv_* procedures.
+ */
+static void
+heap_open_check_relation(Relation r)
+{
+	if (r->rd_rel->relkind == RELKIND_INDEX)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is an index",
+		RelationGetRelationName(r;
+	else if (r->rd_rel->relkind == RELKIND_COMPOSITE_TYPE)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is a composite type",
+		RelationGetRelationName(r;
+}
 
 /* 
  *		heap_open - open a heap relation by relation OID
@@ -1291,17 +1296,7 @@ heap_open(Oid relationId, LOCKMODE lockmode)
 	Relation	r;
 
 	r = relation_open(relationId, lockmode);
-
-	if (r->rd_rel->relkind == RELKIND_INDEX)
-		ereport(ERROR,
-(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("\"%s\" is an index",
-		RelationGetRelationName(r;
-	else if (r->rd_rel->relkind == RELKIND_COMPOSITE_TYPE)
-		ereport(ERROR,
-(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("\"%s\" is a composite type",
-		RelationGetRelationName(r;
+	heap_open_check_relation(r);
 
 	return r;
 }
@@ -1316,22 +1311,7 @@ heap_open(Oid relationId, LOCKMODE lockmode)
 Relation
 heap_openrv(const RangeVar *relation, LOCKMODE lockmode)
 {
-	Relation	r;
-
-	r = relation_openrv(relation, lockmode);
-
-	if (r->rd_rel->relkind == RELKIND_INDEX)
-		ereport(ERROR,
-(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("\"%s\" is an index",
-		RelationGetRelationName(r;
-	else if (r->rd_rel->relkind == RELKIND_COMPOSITE_TYPE)
-		ereport(ERROR,
-(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg(&

Re: [HACKERS] pgbench small bug fix

2016-03-04 Thread Aleksander Alekseev

> Attached is a v3 which test integers more logically. I'm a lazy
> programmer who tends to minimize the number of key strokes.

Well. From what I can tell this patch is Ready for Committer. 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-03 Thread Aleksander Alekseev

> Won't it always use the same freelist to remove and add the entry from
> freelist as for both cases it will calculate the freelist_idx in same
> way?

No. If "our" freelist is empty when we try to remove an item from it we
borrow item from another freelist. Then this borrowed item will be
returned to "our" freelist instead of original. Without some sort of
additional logic there is no way to figure out what freelist was
original.

Generally speaking negative nentries value is not something that
couldn't be solved. But I would like to remind that in this context we
are discussing a quick and dirty solution created for benchmark
purposes in a first place.

> You are not convinced, then lets leave it to committer unless
> somebody else wants to try that suggestion.

Agree. Frankly I'm tired of rewriting this patch over and over and over
again. So I would like to avoid rewriting it once again unless there is
a clear indication that this way we would gain something. Benchmarks
shows that this is not a case thus it's only a matter of taste and
intuition. We know from experience that both are bad advisers in
optimization matter.

I suggest we merge this patch already:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrtymq5hdqdarf5j7o+abdowqhe...@mail.gmail.com

... and get done with it. Clearly this patch is good enough ---
reviewed, tested, discussed, has proven performance extremum. When and
if dynahash will become a bottleneck again nothing will stop us from
optimizing it one more time for hardware we will have by then. Who knows
what this hardware will be like in 5-10 years? Don't we have better
things to do now than arguing about imaginary benchmarks and taste?

Lets see what is committer's opinion on this.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-03 Thread Aleksander Alekseev

Hello, Michael

> The easiest way to perform tests with this patch is to take a debugger
> and enforce the malloc'd pointers to NULL in the code paths.

I see. Still I don't think it's an excuse to not provide clear steps to
reproduce an issue. As I see it anyone should be able to easily check
your patch locally without having deep understanding of how libpq is
implemented or reading thread which contains 48 e-mails.

For instance you cold provide a step-by-step instruction like:

1. run gdb --args some-prog arg1 arg2 ...
2. b some_file.c:123
3. c
4. ...
5. we expect ... but actually see ...

Or you cold automate these steps using gdb batch file:

gdb --batch --command=gdb.script --args some-prog arg1 arg2

... where gdb.script is:

b some_file.c:123
c
... etc ...

Naturally all of this is optional. But it will simplify both
understanding and reviewing of this path. Thus chances that this patch
will be accepted will be increased and time required for this will be
reduced significantly.

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-03 Thread Aleksander Alekseev

Hello, Michael

I didn't checked your patch in detail yet but here is a thought I would
like to share.

In my experience usually it takes number of rewrites before patch will
be accepted. To make sure that after every rewrite your patch still
solves an issue you described you should probably provide a few test
programs too. If it's possible to make these programs part of regression
tests suite it would be just great.

Without such test programs I personally find it difficult to verify
that your patch fixes something. If such examples are provided your
patch would be much more likely to accepted. "See, this program crashes
everything. Now we apply a patch and everything works! Cool, heh?"

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgbench small bug fix

2016-03-03 Thread Aleksander Alekseev

> time pgbench -T 5 -R 0.1 -P 1 -c 2 -j 2

On my laptop this command executes 25 seconds instead of 5. I'm pretty
sure it IS a bug. Probably a minor one though.

I tested this patch on Ubuntu 14.04 LTS with GCC 4.8. It applies
cleanly on master branch (c7111d11) and solves a described problem.
No compilation warnings. Everything else works as before.

Still I wonder if code could be patched more cleanly. Instead of:

if(someint)
if(somebool)

... you should probably write:

if(someint > 0)
if(somebool == TRUE)

Also I suggest to introduce a few new boolean variables with meaningful
names instead of writing all these long expressions right inside of
if( ... ). 

As a side note I noticed that pgbench.c is not pgindent'ed. Since you
are modifying this file anyway probably you cold solve this issue too?
As a separate patch perhaps.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] PROPOSAL: Fast temporary tables

2016-03-01 Thread Aleksander Alekseev

Hello

There are applications that create and delete a lot of temporary
tables. Currently PostgreSQL doesn't handle such a use case well.
Consider the following benchmark/example.

postgresql.conf:

```
autovacuum = on
log_min_messages = debug2
```

temp-table.pgbench:

```
create temporary table tt1(x jsonb);
drop table tt1;
```

Benchmark:

# pgbench-server:

pgbench -h (ip) -f temp-table.pgbench -T 600 -P 1 -c 40 -j 12 test

# postgresql-server:

tail -f path/to/logfile | grep 'DEBUG: vacuuming'

At first everything is OK, PostgreSQL handles ~ 970 TPS. But after some
time this value starts to drop to 10-100 TPS, then return to normal. In
logfile we see:

```
DEBUG: vacuuming "pg_catalog.pg_class"
DEBUG: vacuuming "pg_catalog.pg_type"
DEBUG: vacuuming "pg_catalog.pg_index"
DEBUG: vacuuming "pg_catalog.pg_class"
DEBUG: vacuuming "pg_catalog.pg_type"
DEBUG: vacuuming "pg_catalog.pg_index"
DEBUG: vacuuming "pg_catalog.pg_depend"
DEBUG: vacuuming "pg_catalog.pg_type"
DEBUG: vacuuming "pg_catalog.pg_index"
DEBUG: vacuuming "pg_catalog.pg_class"
...

```

Catalog tables are bloating. But there was no real reason to write
anything to these tables in the first place since temporary tables
could be seen only from one session. Except to make possible for
autovacuum to find these tables.

I propose to solve described issue by introducing a new entity - fast
temporary tables (or maybe lightweight temporary tables? - name is
discussable):

create fast temporary table tt1(x jsonb);

Fast temporary tables work almost as usual temporary tables but they
are not present in the catalog. Information about tables is stored in
shared memory instead. This way we solve a bloating problem.

We should use *shared* memory so autovacuum could find these tables.
Note that state should be restored properly and acquired locks should
be released if one of backends terminates abnormally while accessing
shared memory.

Usually memory is considered an expensive resource. For these reason we
can't just change current behaviour of temporary tables. It would cause
a lot of problems for existing users. Also introducing a new type of
tables allows us to make some other changes. For instance, we could
drop trigger support for these tables if it would give us some sort of
benefit, e.g. better performance.

As I understand this feature is not too hard to implement. Basically
all usages (read and write) of catalog in context of temporary tables
should be found and carefully modified as described above. It reminds
me how interception of system API works. All procedures should receive
and return exact the same types of values as before, but implementation
should be changed a little bit.

Frankly so far I don't have a clear understanding which files exactly
would be modified, but I believe it would be at least:

* relcache.c
* genam.c  --- systable_* procedures in particular
* heapam.c --- I would like to avoid this, but as I understand \d will
  not see temporary tables otherwise

A few hints from people more experienced in this area would be
appreciated. Then we carefully check that everything works as expected
(indexes, autovacuum of temporary tables, etc), write regression tests
and we are done.

Here is what makes suggested approach in particular so interesting. I
think that using similar method in the future we could implement
writable temporary tables on replicas. This feature is very helpful in
OLAP tasks.

What do you think regarding described problem and proposed method of
solving it?

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-01 Thread Aleksander Alekseev

Hello, Amit

> I am not sure, if this is exactly what has been suggested by Robert,
> so it is not straightforward to see if his suggestion can allow us to
> use NUM_FREELISTS as 8 rather than 32.  I think instead of trying to
> use FREELISTBUFF, why not do it as Robert has suggested and try with
> different values of NUM_FREELISTS?

Well, what I did is in fact a bit more general solution then Robert
suggested. At first I just joined freeList and mutex arrays into one
array and tried different NUM_FREELISTS values (see FREELISTS column).
That was Robert's idea. Unfortunately it didn't give any considerable
speedup comparing with previous approach.

Then I tried to manually control sizeof every array item (see
different SIZE values). By making it larger we can put every array item
into a separate cache line. This approach helped a bit in terms of TPS
(before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
same (32).

So answering your question - it turned out that we _can't_ reduce
NUM_FREELISTS this way. 

Also I don't believe that +0.3% TPS according to synthetic benchmark
that doesn't represent any possible workload worth it considering
additional code complexity.

> In which case, do you think entries can go negative?  I think the
> nentries is incremented and decremented in the way as without patch,
> so I am not getting what can make it go negative.

In this context we are discussing a quick and dirty fix "what would
happen if FREELIST_IDX would be implemented as getpid() % NUM_FREE_LIST
instead of hash % NUM_FREELIST". The code is:

int freelist_idx = FREELIST_IDX(hctl, hashvalue); 

// ...

switch (action)
{

// ...

  case HASH_REMOVE:
if (currBucket != NULL)
{
  if (IS_PARTITIONED(hctl))
SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));

  // Assert(FREELIST(hctl, freelist_idx).nentries > 0); 
  FREELIST(hctl, freelist_idx).nentries--;

  /* remove record from hash bucket's chain. */
  *prevBucketPtr = currBucket->link;

// ...

No matter what freelist was used when element was added to a hash table.
We always try to return free memory to the same freelist number getpid()
% FREELIST_ITEMS and decrease number of elements in a hash table using
corresponding nentries field. For this reason nentries could be
negative. Still sum of all nentries remains valid.




I realize that this thread is quite long already and that its been a
while since previous commitfest ended. So I would like to provide a
short summery of this thread from my perspective:

1. Last version of a path is here:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrtymq5hdqdarf5j7o+abdowqhe...@mail.gmail.com

Its reviewed, tested, well-commented and apparently nobody has strong
objections against it.

2. Robert suggested a few possible improvements. Experiments showed
that in best case resulting patches are as good as path from item 1,
but code become considerably more complex.

3. Number of further changes (changing calling convention for
ShmemInitHash(), changing LOG2_NUM_LOCK_PARTITIONS constant) will be
discussed separately when and if path from this thread will be applied.

> 
> 
> 
> >
> > > I am however wondering if it to set the freelist affinity based on
> > > something other than the hash value, like say the PID, so that the
> > > same process doesn't keep switching to a different freelist for
> > > every buffer eviction.
> >
> > Also I tried to use PID to determine freeList number:
> >
> > ```
> > #include 
> > #include 
> >
> > ...
> >
> > #define FREELIST_IDX(hctl, hashcode) \
> >   (IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)
> >
> >
> ...
> >
> > // now nentries could be negative in this case
> > // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
> >
> >
> In which case, do you think entries can go negative?  I think the
> nentries is incremented and decremented in the way as without patch,
> so I am not getting what can make it go negative.
> 
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Tiny patch: sigmask.diff

2016-04-04 Thread Aleksander Alekseev

Hello

sigmask macro is defined in win32.h like this:

```
#define sigmask(sig) ( 1 << ((sig)-1) )
```

And used in signal.c in this fashion:

```
for (i = 0; i < PG_SIGNAL_COUNT; i++)
{
if (exec_mask & sigmask(i))
{
```

Thus during first iteration we are doing `<< -1`. I think it's a bug.

Patch attached.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index 36c6ebd..3724aa3 100644
--- a/src/backend/port/win32/signal.c
+++ b/src/backend/port/win32/signal.c
@@ -115,14 +115,14 @@ pgwin32_dispatch_queued_signals(void)
 
 		for (i = 0; i < PG_SIGNAL_COUNT; i++)
 		{
-			if (exec_mask & sigmask(i))
+			if (exec_mask & sigmask(i+1))
 			{
 /* Execute this signal */
 pqsigfunc	sig = pg_signal_array[i];
 
 if (sig == SIG_DFL)
 	sig = pg_signal_defaults[i];
-pg_signal_queue &= ~sigmask(i);
+pg_signal_queue &= ~sigmask(i+1);
 if (sig != SIG_ERR && sig != SIG_IGN && sig != SIG_DFL)
 {
 	LeaveCriticalSection(_signal_crit_sec);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Yet another small patch - reorderbuffer.c:1099

2016-04-05 Thread Aleksander Alekseev

> I recall discussing this code with Andres, and I think that he has
> mentioned me this is intentional, because should things be changed for
> a reason or another in the future, we want to keep in mind that a list
> of TXIDs and a list of sub-TXIDs should be handled differently.

I see. If this it true I think there should be a comment that explains
it. When you read such a code you suspect a bug. Not mentioning that
static code analyzers (I'm currently experimenting with Clang and PVS
Studio) complain about code like this.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-30 Thread Aleksander Alekseev

> I think this patch needs to be looked upon by committer now.  I have
> done review and added some code in this patch as well long back, just
> see the e-mail [1], patch is just same as it was in October 2015.  I
> think myself and Michael are in agreement that this patch solves the
> reported problem. There is one similar problem [2] reported by Heikki
> which I think can be fixed separately.
> [...]
> Let me know if you think anything more from myside can help in moving
> patch.

+1, same here. Lets see whats committer's opinion.


-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Access method extendability

2016-03-30 Thread Aleksander Alekseev

Hello

I did a brief review of bloom contrib and I don't think I like it much.
Here are some issues I believe should be fixed before committing it to
PostgreSQL.

1) Most of the code is not commented. Every procedure should at least
have a breif description of what it does, what arguments it receives
and what it returns. Same for structures and enums.

2) There are only 3 Asserts. For sure there are much more invariants in
this contrib.

3) I don't like this code (blinsert.c):

```
typedef struct
{
BloomState  blstate;
MemoryContext   tmpCtx;
chardata[BLCKSZ];
int64   count;
} BloomBuildState;

/* ... */

/*
 * (Re)initialize cached page in BloomBuildState.
 */
static void
initCachedPage(BloomBuildState *buildstate)
{
memset(buildstate->data, 0, BLCKSZ);
BloomInitPage(buildstate->data, 0); 
buildstate->count = 0;
}
```

It looks wrong because 1) blkstate and tmpCtx are left uninitialized 2)
as we discussed recently [1] you should avoid leaving "holes" with
uninitialized data in structures. Please fix this or provide a comment
that describes why things are done here the way they are done.

Perhaps there are also other places like this that I missed.

4) I don't think I like such a code either:

```
/* blutuls.c */

static BloomOptions *
makeDefaultBloomOptions(BloomOptions *opts)
{
int i;

if (!opts)
opts = palloc0(sizeof(BloomOptions));

/* ... */

/* see also blvacuum.c and other places I could miss */
``` 

Sometimes we create a new zero-initialized structure and sometimes we
use a provided one filled with some data. If I'll use this procedure
multiple times in a loop memory will leak. Thus sometimes we need
pfree() returned value manually and sometimes we don't. Such a code is
hard to reason about. You could do it much simpler choosing only one
thing to do --- either allocate a new structure or use a provided one.

5) Code is not pgindent'ed and has many trailing spaces.

[1] http://www.postgresql.org/message-id/56eff347.20...@anastigmatix.net

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Access method extendability

2016-03-30 Thread Aleksander Alekseev

> > http://www.postgresql.org/message-id/56eff347.20...@anastigmatix.net
> That discussion is about SQL-level types which could be stored on
> disk, not about in-memory structs

I must respectfully disagree. That discussion is also about memory
sanitizers and using them on buildfarms. Lets say you initialize a
structure like this:

st->f1 = 111;
st->f2 = 222;

... without using memset, so there could be a "hole" with uninitialized
data somewhere in between of f1 and f2.

Than some code calculates a hash of this structure or does memcpy - and
1) You get unreproducible behavior - hash is always different for the
same structure, thus it is stored in different hash buckets, etc, and as
a result you got bugs that sometimes reproduce and sometimes do not
2) There is one more place where sanitizers could report accesses to
uninitialized values and thus they still can't be used on buildfarms
where they could find a lot of serious bugs automatically. I believe
MemorySanitizer is smart enough to recognize trivial memcpy case, but
it could be confused in more complicated cases.

Anyway I suggest continue discussion of whether we should make
PostgreSQL sanitizers-friendly or not in a corresponding thread. So far
no one spoke against this idea. Thus I don't think that new patches
should complicate implementing it. Especially considering that it's
very simple to do and even is considered a good practice according to
PostgreSQL documentation.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Small patch: --disable-setproctitle flag

2016-04-01 Thread Aleksander Alekseev

Hello, Andres

> Seems more appropriate to simply manually add a #undef
> HAVE_SETPROCTITLE to pg_config_manual.h in that case. Adding
> configure flags for ephemeral debugger issues seems like a high churn
> activity.

I think you are right. OK.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Access method extendability

2016-04-01 Thread Aleksander Alekseev

Hello, Alexander

> Hi!
> 
> New revision of patches is attached.

Code looks much better now, thanks. Still I believe it could be improved.

I don't think that using srand() / rand() in signValue procedure the
way you did is such a good idea. You create a side affect (changing
current randseed) which could cause problems in some cases. And there
is no real need for that. For instance you could use following formula
instead:

hash(attno || hashVal || j)

And a few more things.

> + memset(opaque, 0, sizeof(BloomPageOpaqueData));
> + opaque->maxoff = 0;

This looks a bit redundant.

> + for (my $i = 1; $i <= 10; $i++)

More idiomatic Perl would be `for my $i (1..10)`.

> + UnlockReleaseBuffer(buffer);
> + ReleaseBuffer(metaBuffer);
> + goto away;

In general I don't have anything against goto. But are you sure that
using it here is really justified?

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Two division by 0 errors in optimizer/plan/planner.c and optimizer/path/costsize.c

2016-03-28 Thread Aleksander Alekseev

Hello, Piotr.

Thanks for report. But I'm having some difficulties reproducing issues
you described.

I compiled PostgreSQL from master branch on FreeBSD 10.2 using this
command:

```
CC=/usr/local/bin/gcc49 CFLAGS="-O0 -g" \
  ./configure --enable-cassert --enable-debug \
  --prefix=/home/eax/postgresql-install \
  && gmake clean && gmake -j2 -s
```

Then I run reinit.sh:

```
#!/usr/bin/env bash

P=~/postgresql-install

pkill -9 postgres
make install

rm -rf $P/data
$P/bin/initdb -D $P/data

echo "max_prepared_transactions = 100" >> $P/data/postgresql.conf
echo "wal_level = hot_standby" >> $P/data/postgresql.conf
echo "wal_keep_segments = 128" >> $P/data/postgresql.conf
echo "max_connections = 10" >> $P/data/postgresql.conf
echo "listen_addresses = '*'" >> $P/data/postgresql.conf

echo '' > $P/data/logfile

echo "host all all 0.0.0.0/0 trust" >> $P/data/pg_hba.conf
echo "host replication all 0.0.0.0/0 trust" >> $P/data/pg_hba.conf
echo "local replication all trust" >> $P/data/pg_hba.conf

$P/bin/pg_ctl -w -D $P/data -l $P/data/logfile start
$P/bin/createdb `whoami`
$P/bin/psql -c "create table test(k int primary key, v text);"
```

..., connected to PostgreSQL using psql, in second terminal I attached
to the backend process using gdb710 and input `c`. Now in psql:

```
eax=# create table tt5(x int);
CREATE TABLE
eax=# create table b_star(x int);
CREATE TABLE
eax=# insert into b_star values (1), (2), (3);
INSERT 0 3
eax=# insert into tt5 values (2), (3), (4), (5);
INSERT 0 4
eax=# select 1
eax-# from public.tt5 as subq_0
eax-# where EXISTS (
eax(#select 1
eax(#from public.b_star as ref_0
eax(#where false
eax(# );
 ?column? 
--
(0 rows)

eax=# select 1
eax-# from unnest('{}'::boolean[]) a (x)
eax-# left join (
eax(#select *
eax(#from unnest('{}'::boolean[])
eax(#where false
eax(# ) b (x) on a.x = b.x;
 ?column? 
--
(0 rows)

```

Everything seems to work, no stacktraces in gdb.

Could you please provide more concrete steps to reproduce these
issues i.e, OS and compiler version, compilation flags (or package
version), cluster config, database schema, etc? These steps are required
at least to make sure that fixed code really fixes a problem. Also it
would be a good idea to include these steps to regression tests.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Access method extendability

2016-04-01 Thread Aleksander Alekseev

Hello

> Fixed.

Thanks. I don't have any other suggestions at the moment. Let see whats
committers opinion on this.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Yet another small patch - reorderbuffer.c:1099

2016-04-04 Thread Aleksander Alekseev

Hello

There is weired peace of code in reorderbuffer.c:

```
/* delete from list of known subxacts */
if (txn->is_known_as_subxact)
{
/* NB: nsubxacts count of parent will be too high now */
dlist_delete(>node);
}
/* delete from LSN ordered list of toplevel TXNs */
else 
{
dlist_delete(>node);
}
```

As you see `if` an `else` branches are exactly the same. I wonder
whether this is a bug or code just requires a bit of cleaning. In the
latter case here is a patch.

According to `git log` both branches were introduced in one commit
b89e1510. I added author and committer of this code to CC since they
have much better understanding of it than I do.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9d78c8c..454b309 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1097,16 +1097,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* delete from list of known subxacts */
-	if (txn->is_known_as_subxact)
-	{
-		/* NB: nsubxacts count of parent will be too high now */
-		dlist_delete(>node);
-	}
-	/* delete from LSN ordered list of toplevel TXNs */
-	else
-	{
-		dlist_delete(>node);
-	}
+	dlist_delete(>node);
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Tiny patch: sigmask.diff

2016-04-04 Thread Aleksander Alekseev

> Surely this fix is completely wrong?  You'd have to touch every use of
> signum() to do it like that.  You'd also be introducing similarly-
> undefined behavior at the other end of the loop, where this coding
> would be asking to compute 1<<31, hence shifting into the sign bit,
> which is undefined unless you make the computation explicitly
> unsigned.

Oh, I didn't think about that...

> I think better is just to change the for-loop to iterate from 1 not 0.
> Signal 0 is invalid anyway, and is rejected in pg_queue_signal for
> example, so there can't be any event waiting there.

Agreed.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index 36c6ebd..0ba2817 100644
--- a/src/backend/port/win32/signal.c
+++ b/src/backend/port/win32/signal.c
@@ -113,7 +113,7 @@ pgwin32_dispatch_queued_signals(void)
 		/* One or more unblocked signals queued for execution */
 		int			exec_mask = UNBLOCKED_SIGNAL_QUEUE();
 
-		for (i = 0; i < PG_SIGNAL_COUNT; i++)
+		for (i = 1; i < PG_SIGNAL_COUNT; i++)
 		{
 			if (exec_mask & sigmask(i))
 			{

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Small patch: --disable-setproctitle flag

2016-03-31 Thread Aleksander Alekseev

Hello

Recently I discovered that renaming processes using setproctitle() call
on BSD systems may sometimes cause problems. For instance there is
currently a bug in all versions of LLDB which makes it impossible to
debug a process that called setproctitle():

https://llvm.org/bugs/show_bug.cgi?id=26924#c3

Since LLVM stack is used by default in FreeBSD I believe it's quite a
severe problem. In case there is other software that doesn't handle
stproctitle() well and for users of LLDB <= 3.8 (most recent version)
I propose to add a --disable-setproctitle flag to configure script.
Corresponding patch is attached.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/configure b/configure
index 24655dc..87e845e 100755
--- a/configure
+++ b/configure
@@ -819,6 +819,7 @@ with_wal_segsize
 with_CC
 enable_depend
 enable_cassert
+enable_setproctitle
 enable_thread_safety
 with_tcl
 with_tclconfig
@@ -1485,6 +1486,8 @@ Optional Features:
   --enable-tap-tests  enable TAP tests (requires Perl and IPC::Run)
   --enable-depend turn on automatic dependency tracking
   --enable-cassertenable assertion checks (for debugging)
+  --disable-setproctitle  do not use setproctitle() procedure for changing
+  process name
   --disable-thread-safety disable thread-safety in client libraries
   --disable-largefile omit support for large files
   --disable-float4-byval  disable float4 passed by value
@@ -5283,6 +5286,47 @@ done
 IFS=$ac_save_IFS
 
 
+#
+# setproctitle()
+#
+
+
+# Check whether --enable-setproctitle was given.
+if test "${enable_setproctitle+set}" = set; then :
+  enableval=$enable_setproctitle;
+  case $enableval in
+yes)
+  :
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --enable-setproctitle option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  enable_setproctitle=yes
+
+fi
+
+
+# calling setproctitle() makes it impossible to connect to PostgreSQL backend
+# using LLDB <= 3.8, see https://llvm.org/bugs/show_bug.cgi?id=26924#c3
+if test "$enable_setproctitle" = yes; then
+  for ac_func in setproctitle
+do :
+  ac_fn_c_check_func "$LINENO" "setproctitle" "ac_cv_func_setproctitle"
+if test "x$ac_cv_func_setproctitle" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_SETPROCTITLE 1
+_ACEOF
+
+fi
+done
+
+fi
 
 #
 # Library directories
@@ -12425,7 +12468,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
+for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index c564a76..40daa32 100644
--- a/configure.in
+++ b/configure.in
@@ -580,6 +579,16 @@ done
 IFS=$ac_save_IFS
 AC_SUBST(INCLUDES)
 
+#
+# setproctitle()
+#
+PGAC_ARG_BOOL(enable, setproctitle, yes,
+  [do not use setproctitle() procedure for changing process name])
+# calling setproctitle() makes it impossible to connect to PostgreSQL backend
+# using LLDB <= 3.8, see https://llvm.org/bugs/show_bug.cgi?id=26924#c3
+if test "$enable_setproctitle" = yes; then
+  AC_CHECK_FUNCS([setproctitle])
+fi
 
 #
 # Library directories
@@ -1432,7 +1440,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parser extensions (maybe for 10?)

2016-04-22 Thread Aleksander Alekseev

> > And the above is called an ad-hominem.
> 
> An "ad hominem" attack means against the person rather than on the
> topic of the issue, but I don't think Aleksander did that.  I'm not
> sure why you think what he wrote was out of line.  It reads OK to me.

Frankly when I re-read my own e-mails sometimes I find them a little
bit dry to say the least. But it's not intentional. I'm just having some
difficulties expressing myself on my second language. I should probably
use more words like "please" and "thank you" to smooth this effect. My
sincere apologies to anyone who was offended in any way. 

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parser extensions (maybe for 10?)

2016-04-19 Thread Aleksander Alekseev

> On 2016-04-19 12:20:03 +0300, Aleksander Alekseev wrote:
> > Can we guarantee that extensions don't conflict? In fact we can
> > since we already do it. If all tests pass there is no conflict.
> 
> How does that follow? Even if you were to test all possible extensions
> together - obviously not possible - how do passing tests prove the
> grammar to be conflict free?

Do we currently test that all existing extensions work together? No.
And in fact it doesn't matter whether they work together or not. What
matters that concrete subset of extensions chosen by given user work
together. We don't guarantee that extensions are bug free either. In
fact I'm quite sure there are many bugs in PostgreSQL extensions and
PostgreSQL itself. But if `make check` pass probably extension doesn't
have more bugs than usual. Why syntax extension should suddenly be an
exception of these rules?

Also I would like to remind that suggested approach is only about
syntax sugar. The resulting desugared query would be the same as usual.
If it's invalid we just discard it.

For the record - I'm not telling that this SQL extending feature should
necessarily be implemented. Frankly I'm personally quite against it.
I can't think of any real cases when it would be very useful and I don't
think that this feature is worth an effort, not mentioning further
support. All I'm telling is that it could be done using methods that are
well-known for decades.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Should we remove unused code?

2016-04-18 Thread Aleksander Alekseev

Hello

Today I've read this post:

http://blog.2ndquadrant.com/code-coverage/

I think its great and that everyone in this mailing list should
familiarize themselves with it.

Reading the coverage report, I discovered that some code is never
executed. Sometimes its OK (error reporting in out-of-memory cases),
sometimes its not (not enough tests). But there are also cases I'm not
sure about.

For instance there are two flags - HASH_SEGMENT and HASH_FFACTOR that
are in fact never used (see attachment). I wonder whether we should
keep code like this or not. 

What do you think?

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 1a9f70c..9b32ceb 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -452,16 +452,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		hctl->num_partitions = info->num_partitions;
 	}
 
-	if (flags & HASH_SEGMENT)
-	{
-		hctl->ssize = info->ssize;
-		hctl->sshift = my_log2(info->ssize);
-		/* ssize had better be a power of 2 */
-		Assert(hctl->ssize == (1L << hctl->sshift));
-	}
-	if (flags & HASH_FFACTOR)
-		hctl->ffactor = info->ffactor;
-
 	/*
 	 * SHM hash tables have fixed directory size passed by the caller.
 	 */
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 007ba2c..0646a72 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -81,9 +81,7 @@ typedef struct HASHCTL
 
 /* Flags to indicate which parameters are supplied */
 #define HASH_PARTITION	0x0001	/* Hashtable is used w/partitioned locking */
-#define HASH_SEGMENT	0x0002	/* Set segment size */
 #define HASH_DIRSIZE	0x0004	/* Set directory size (initial and max) */
-#define HASH_FFACTOR	0x0008	/* Set fill factor */
 #define HASH_ELEM		0x0010	/* Set keysize and entrysize */
 #define HASH_BLOBS		0x0020	/* Select support functions for binary keys */
 #define HASH_FUNCTION	0x0040	/* Set user defined hash function */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parser extensions (maybe for 10?)

2016-04-18 Thread Aleksander Alekseev

> I cannot to imagine extensible parser based on bison. But the parser
> can be replaced by custom parser.
> 
> Some like pgpool or pgbouncer does. The extension can assign own
> parser. Custom parser will be called first, and the integrated parser
> will be used from extension or as fallback. This can helps with new
> statements for background workers, theoretically it can helps with
> extending PostgreSQL SQL. Custom parser can do translation from SQL1
> to SQL2 dialect, or can do translation from SQL1 to internal calls.
> The custom parser usually should not implement full SQL - only few
> statements.
> 
> Is it this idea more workable?

What if there are two or more contribs that extend the parser? Can we
be sure that these contribs will not conflict?

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Wire protocol compression

2016-04-21 Thread Aleksander Alekseev

> I guess since the usual answer for compression was "use what SSL
> provides you for free", it's rather unlikely that someone bothered to
> make a proxy just for that purpose, and really, a proxy is just
> another moving part in your setup: not everyone will be thrilled to
> add that.

It just doesn't sound like a feature that should be implemented
separately for every single application that uses TCP. Granted TCP proxy
is not the most convenient way to solve a task. Maybe it could be
implemented in OpenVPN or on Linux TCP/IP stack level.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Wire protocol compression

2016-04-21 Thread Aleksander Alekseev

> > or on Linux TCP/IP stack level.
> >
> 
> Yes, but if you want to have both compression and encryption it is
> crucial to apply compression *before* encryption and I don't see how
> this can happen with this approach.

If I'm not mistaken IPSec gives you both compression and encryption.
Just as an example. 

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parser extensions (maybe for 10?)

2016-04-19 Thread Aleksander Alekseev

> no - it is not possible. Not with Bison parser - it cannot work with
> unknown syntax - so isn't possible implement one part by parser A, and
> second part by parser B.
> 
> But we can parsers P1 and P2. P1 knows string XX, P2 knows YY. Buildin
> parser (BP) knows SQL
> 
> We can have registered parsers P1, P2, BP.
> 
> for string SELECT
> 
> P1 fails,
> P2 fails,
> BP processes it
> 
> for string YY
> 
> P1 fails,
> P2 process it,
> BP is not called
> 
> But transformations can be allowed too (but it is slower)
> 
> for string 
> 
> P1 does transformation to YYY
> P2 does transformation to SELECT
> BP processes it

I look on this a little bit differently.

Current pipeline is approximately like this:

```
query string -> LEX -> [lexemes] -> SYNT -> QueryAST -> PLANNER
```

Or in Haskell-like notation:

```
lex :: String -> [Lexeme]
synt :: [Lexeme] -> AST
```

Its reasonably simple to extend a lexer. Lets say that AST type doesn't
change, i.e. extensions provide only syntax sugar. After desugaring
query transforms to old-good SELECT, UPDATE, procedures calls, etc. In
this case what extension does is actually:

```
type Parser = [Lexeme] -> AST
extendParser :: Parser -> Parser
```

Can we guarantee that extensions don't conflict? In fact we can since
we already do it. If all tests pass there is no conflict.

The only tricky part I see is that sometimes we want:

```
extendParser1 ( extendParser2 ( default ))
```

... and sometimes:

```
extendParser2 ( extendParser1 ( default ))
```

I don't think that order of extension will matter most of the time. But
we still should provide a mechanism to change this order. For instance,
contribs could provide a default priority of parser extension.
Extensions with higher priority are applied first. Also user can
resolve conflicts by manually overriding these priorities:

```
select pg_parser_extension_priorities();
select pg_override_parser_extension_priority('some_extension', 100500);
```

I think it should work.

Thought?

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Coverage report

2016-04-19 Thread Aleksander Alekseev

Hello, Alvaro.

> Some people have already heard about this.

Yes, we did :) Nice job!

> * Should we run something other than "make check-world"  As far as I
> know, that covers all or almost all the tests we have; are there
> things that we should have and are not running?  If so, how do we go
> about enabling them?

I'm personally quite concerned that currently everything that is in .h
files is considered a public API and thus should not be changed, or some
third-party extensions can break. Can we run `make check` on some (or
maybe all?) extensions from PGXN? This way we could guarantee that if
changing API breaks some extensions, this are extensions that actually
nobody uses or knows about.

Not mentioning that in this case we will check code coverage of these
extensions too.

> * buildfarm doesn't run make check-world for reasons of its own.
> Maybe we should run exactly what a typical buildfarm animal runs?
> That way, the coverage report would be closer to what we're actually
> verifying.

How about something like `make check; make -C contrib check` ?

> * Should we keep historical reports to study how numbers change in
> time? Maybe save one report a week or something like that?

Yes, I think we should. One report a day and a history for a week or two
should be enough.

> * Any other comments?

I noticed that output of genhtml looks like this:

```
Writing directory view page.
Overall coverage rate:
  lines..: 63.2% (179603 of 284239 lines)
  functions..: 69.5% (10550 of 15183 functions)
```

Maybe we should send an e-mail to pgsql-hackers@ automatically if these
numbers drops.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Should we remove unused code?

2016-04-18 Thread Aleksander Alekseev

Hello, Tom.

Thanks for your reply.

> The fact that make check-world doesn't run any code that uses some
> feature isn't a great argument for removing it.  Consider third-party
> extensions, for starters.

What is current policy regarding such sort of things? Is everything
that is in .h files considered a public API and should be backward
compatible? In this case - for how long, forever or until next major
release? Or lets say there is some API that is not used neither
internally nor by any package available on PGXN. Can we break API then?

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parser extensions (maybe for 10?)

2016-04-18 Thread Aleksander Alekseev

> It depends - can be allowed only one - like plpgsql extensions, or
> can be serialized like pg log extensions

OK, I understand "can be allowed only one". I doubt it would be a very
useful feature though.

I'm not sure what do you mean by "serialized like pg log extensions".
Lets say extension A allows "SELECT ASAP ..." queries and extension B
--- "... ORDER BY RANK". What happens when user executes "SELECT
ASAP ... ORDER BY RANK" query?

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parser extensions (maybe for 10?)

2016-04-19 Thread Aleksander Alekseev

> As Tom says, we can't easily break it down into multiple co-operating
> pieces, so lets forget that as unworkable.

I'm sorry but didn't I just demonstrate the opposite? If so it's very
easy to prove - give a counterexample. As I understand approach I
described handles cases named by Tom just fine. In fact the idea of
transforming ASTs (a.k.a metaprogramming) is successfully used by
programmers for about 50 years now.

(As a side note - I'm not a native English speaker but I believe such
type of logic is known as "argument from authority".)

> What is possible is a whole new grammar... for example if we imagine
> 
>  SET client_language_path = 'foo, postgresql'
> 
> Works similar to search_path, but not userset. We try to parse
> incoming statements against the foo parser first, if that fails we
> try postgresql. The default setting would be simply 'postgresql', so
> no match -> syntax error.
> 
> We could make that easier by making the postgresql parser a plugin
> itself. So to produce a new one you just copy the files, modify them
> as needed then insert a new record into pg_language as an extension.
> 

I think its not an extension but a replacement of a grammar. This
approach implies that every extension implements a parser from scratch.
Not sure if anyone will do it in practice to change SQL syntax a little
bit.

I'm not telling that such a feature will be completely worthless. But
why not to make a step further and not to implement plugable protocols?
E.g. make PostgreSQL compatible with MySQL and/or MongoDB? Or maybe
implement SQL dialect that forbids implicit type conversion. Or add
build-in connection pooling mechanism. I wonder though if all of this
could already be implemented as an extension without any changes in
PostgreSQL core. 

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Wire protocol compression

2016-04-21 Thread Aleksander Alekseev

> Does it make sense to you guys to discuss compression outside of TLS?
> There are potentially huge bandwidth savings which could benefit both
> WAN and non-WAN scenarios, and decoupling this problem from TLS would
> make it both accessible to everyone (assuming PostgreSQL clients
> follow). It would be a protocol change though.

I personally don't think it's something that should be implemented in
PostgreSQL core. As a third-party TCP-proxy (on both client and server
sides) with gzip/lz4 support perhaps. I'll be not surprised if it turns
out that such projects already exist.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] [Patch] RBTree iteration interface improvement

2016-07-27 Thread Aleksander Alekseev

Hello

While working on some new feature for PostgreSQL (which is still in
development and is irrelevant in this context) I noticed that current
RBTree implementation interface is following:

```
extern void rb_begin_iterate(RBTree *rb, RBOrderControl ctrl);
extern RBNode *rb_iterate(RBTree *rb);
```

As you see it doesn't allow to do multiple iterations over a single
tree. I believe it's a major flaw for a general-purpose container.

Proposed patch introduces a new iteration interface that doesn't has
such a flaw. Naturally I wrote some unit tests, but I was not sure where
exactly to place them in PostgreSQL source code. For now I've just
uploaded them to GitHub:

https://github.com/afiskon/c-algorithms-examples/blob/master/rbtree_example.c

According to these tests new implementation works as fast as current
implementation and produces exactly same results.

I didn't dare to remove current interface since in theory some
extensions can use it. Still I believe such a move is worth considering.

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/backend/access/gin/ginbulk.c b/src/backend/access/gin/ginbulk.c
index d6422ea..eac76c4 100644
--- a/src/backend/access/gin/ginbulk.c
+++ b/src/backend/access/gin/ginbulk.c
@@ -255,7 +255,7 @@ qsortCompareItemPointers(const void *a, const void *b)
 void
 ginBeginBAScan(BuildAccumulator *accum)
 {
-	rb_begin_iterate(accum->tree, LeftRightWalk);
+	rb_begin_left_right_walk(accum->tree, >tree_walk);
 }
 
 /*
@@ -271,7 +271,7 @@ ginGetBAEntry(BuildAccumulator *accum,
 	GinEntryAccumulator *entry;
 	ItemPointerData *list;
 
-	entry = (GinEntryAccumulator *) rb_iterate(accum->tree);
+	entry = (GinEntryAccumulator *) rb_left_right_walk(>tree_walk);
 
 	if (entry == NULL)
 		return NULL;			/* no more entries */
diff --git a/src/backend/lib/rbtree.c b/src/backend/lib/rbtree.c
index 4fa8a1d..cbe256f 100644
--- a/src/backend/lib/rbtree.c
+++ b/src/backend/lib/rbtree.c
@@ -871,3 +871,278 @@ rb_iterate(RBTree *rb)
 
 	return rb->iterate(rb);
 }
+
+/*
+ * Begin left right walk.
+ */
+void
+rb_begin_left_right_walk(RBTree *rb, RBTreeLeftRightWalk * lrw)
+{
+	lrw->rb = rb;
+	lrw->last_visited = NULL;
+	lrw->is_over = (lrw->rb->root == RBNIL);
+}
+
+/*
+ * Left right walk: get next node. Returns NULL if there is none.
+ */
+RBNode *
+rb_left_right_walk(RBTreeLeftRightWalk * lrw)
+{
+	if (lrw->is_over)
+		return NULL;
+
+	if (lrw->last_visited == NULL)
+	{
+		lrw->last_visited = lrw->rb->root;
+		while (lrw->last_visited->left != RBNIL)
+			lrw->last_visited = lrw->last_visited->left;
+
+		return lrw->last_visited;
+	}
+
+	if (lrw->last_visited->right != RBNIL)
+	{
+		lrw->last_visited = lrw->last_visited->right;
+		while (lrw->last_visited->left != RBNIL)
+			lrw->last_visited = lrw->last_visited->left;
+
+		return lrw->last_visited;
+	}
+
+	for (;;)
+	{
+		RBNode	   *came_from = lrw->last_visited;
+
+		lrw->last_visited = lrw->last_visited->parent;
+		if (lrw->last_visited == NULL)
+		{
+			lrw->is_over = true;
+			break;
+		}
+
+		if (lrw->last_visited->left == came_from)
+			break;/* came from left sub-tree, return current
+ * node */
+
+		/* else - came from right sub-tree, continue to move up */
+	}
+
+	return lrw->last_visited;
+}
+
+/*
+ * Begin right left walk.
+ */
+void
+rb_begin_right_left_walk(RBTree *rb, RBTreeRightLeftWalk * rlw)
+{
+	rlw->rb = rb;
+	rlw->last_visited = NULL;
+	rlw->is_over = (rlw->rb->root == RBNIL);
+}
+
+/*
+ * Right left walk: get next node. Returns NULL if there is none.
+ */
+RBNode *
+rb_right_left_walk(RBTreeRightLeftWalk * rlw)
+{
+	if (rlw->is_over)
+		return NULL;
+
+	if (rlw->last_visited == NULL)
+	{
+		rlw->last_visited = rlw->rb->root;
+		while (rlw->last_visited->right != RBNIL)
+			rlw->last_visited = rlw->last_visited->right;
+
+		return rlw->last_visited;
+	}
+
+	if (rlw->last_visited->left != RBNIL)
+	{
+		rlw->last_visited = rlw->last_visited->left;
+		while (rlw->last_visited->right != RBNIL)
+			rlw->last_visited = rlw->last_visited->right;
+
+		return rlw->last_visited;
+	}
+
+	for (;;)
+	{
+		RBNode	   *came_from = rlw->last_visited;
+
+		rlw->last_visited = rlw->last_visited->parent;
+		if (rlw->last_visited == NULL)
+		{
+			rlw->is_over = true;
+			break;
+		}
+
+		if (rlw->last_visited->right == came_from)
+			break;/* came from right sub-tree, return current
+ * node */
+
+		/* else - came from left sub-tree, continue to move up */
+	}
+
+	return rlw->last_visited;
+}
+
+/*
+ * Begin direct walk (a.k.a pre-order traversal).
+ *
+ * Pre-order traversal while duplicating nodes and edges can make a complete
+ * duplicate of a binary tree. It can also be used to make a prefix expression
+ * (Polish notation) from expression trees: traverse the expression tree

Re: [HACKERS] [Patch] RBTree iteration interface improvement

2016-07-27 Thread Aleksander Alekseev

And as always, I didn't re-check a patch before sending it :) Here is a
corrected version without any fprintf's.

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/backend/access/gin/ginbulk.c b/src/backend/access/gin/ginbulk.c
index d6422ea..eac76c4 100644
--- a/src/backend/access/gin/ginbulk.c
+++ b/src/backend/access/gin/ginbulk.c
@@ -255,7 +255,7 @@ qsortCompareItemPointers(const void *a, const void *b)
 void
 ginBeginBAScan(BuildAccumulator *accum)
 {
-	rb_begin_iterate(accum->tree, LeftRightWalk);
+	rb_begin_left_right_walk(accum->tree, >tree_walk);
 }
 
 /*
@@ -271,7 +271,7 @@ ginGetBAEntry(BuildAccumulator *accum,
 	GinEntryAccumulator *entry;
 	ItemPointerData *list;
 
-	entry = (GinEntryAccumulator *) rb_iterate(accum->tree);
+	entry = (GinEntryAccumulator *) rb_left_right_walk(>tree_walk);
 
 	if (entry == NULL)
 		return NULL;			/* no more entries */
diff --git a/src/backend/lib/rbtree.c b/src/backend/lib/rbtree.c
index 4fa8a1d..eb645a2 100644
--- a/src/backend/lib/rbtree.c
+++ b/src/backend/lib/rbtree.c
@@ -871,3 +871,278 @@ rb_iterate(RBTree *rb)
 
 	return rb->iterate(rb);
 }
+
+/*
+ * Begin left right walk.
+ */
+void
+rb_begin_left_right_walk(RBTree *rb, RBTreeLeftRightWalk * lrw)
+{
+	lrw->rb = rb;
+	lrw->last_visited = NULL;
+	lrw->is_over = (lrw->rb->root == RBNIL);
+}
+
+/*
+ * Left right walk: get next node. Returns NULL if there is none.
+ */
+RBNode *
+rb_left_right_walk(RBTreeLeftRightWalk * lrw)
+{
+	if (lrw->is_over)
+		return NULL;
+
+	if (lrw->last_visited == NULL)
+	{
+		lrw->last_visited = lrw->rb->root;
+		while (lrw->last_visited->left != RBNIL)
+			lrw->last_visited = lrw->last_visited->left;
+
+		return lrw->last_visited;
+	}
+
+	if (lrw->last_visited->right != RBNIL)
+	{
+		lrw->last_visited = lrw->last_visited->right;
+		while (lrw->last_visited->left != RBNIL)
+			lrw->last_visited = lrw->last_visited->left;
+
+		return lrw->last_visited;
+	}
+
+	for (;;)
+	{
+		RBNode	   *came_from = lrw->last_visited;
+
+		lrw->last_visited = lrw->last_visited->parent;
+		if (lrw->last_visited == NULL)
+		{
+			lrw->is_over = true;
+			break;
+		}
+
+		if (lrw->last_visited->left == came_from)
+			break;/* came from left sub-tree, return current
+ * node */
+
+		/* else - came from right sub-tree, continue to move up */
+	}
+
+	return lrw->last_visited;
+}
+
+/*
+ * Begin right left walk.
+ */
+void
+rb_begin_right_left_walk(RBTree *rb, RBTreeRightLeftWalk * rlw)
+{
+	rlw->rb = rb;
+	rlw->last_visited = NULL;
+	rlw->is_over = (rlw->rb->root == RBNIL);
+}
+
+/*
+ * Right left walk: get next node. Returns NULL if there is none.
+ */
+RBNode *
+rb_right_left_walk(RBTreeRightLeftWalk * rlw)
+{
+	if (rlw->is_over)
+		return NULL;
+
+	if (rlw->last_visited == NULL)
+	{
+		rlw->last_visited = rlw->rb->root;
+		while (rlw->last_visited->right != RBNIL)
+			rlw->last_visited = rlw->last_visited->right;
+
+		return rlw->last_visited;
+	}
+
+	if (rlw->last_visited->left != RBNIL)
+	{
+		rlw->last_visited = rlw->last_visited->left;
+		while (rlw->last_visited->right != RBNIL)
+			rlw->last_visited = rlw->last_visited->right;
+
+		return rlw->last_visited;
+	}
+
+	for (;;)
+	{
+		RBNode	   *came_from = rlw->last_visited;
+
+		rlw->last_visited = rlw->last_visited->parent;
+		if (rlw->last_visited == NULL)
+		{
+			rlw->is_over = true;
+			break;
+		}
+
+		if (rlw->last_visited->right == came_from)
+			break;/* came from right sub-tree, return current
+ * node */
+
+		/* else - came from left sub-tree, continue to move up */
+	}
+
+	return rlw->last_visited;
+}
+
+/*
+ * Begin direct walk (a.k.a pre-order traversal).
+ *
+ * Pre-order traversal while duplicating nodes and edges can make a complete
+ * duplicate of a binary tree. It can also be used to make a prefix expression
+ * (Polish notation) from expression trees: traverse the expression tree
+ * pre-orderly.
+ *
+ * https://en.wikipedia.org/wiki/Tree_traversal#Applications
+ */
+void
+rb_begin_direct_walk(RBTree *rb, RBTreeDirectWalk * dw)
+{
+	dw->rb = rb;
+	dw->last_visited = NULL;
+	dw->is_over = (dw->rb->root == RBNIL);
+}
+
+/*
+ * Direct walk: get next node. Returns NULL if there is none.
+ */
+RBNode *
+rb_direct_walk(RBTreeDirectWalk * dw)
+{
+	if (dw->is_over)
+		return NULL;
+
+	if (dw->last_visited == NULL)
+	{
+		dw->last_visited = dw->rb->root;
+		return dw->last_visited;
+	}
+
+	if (dw->last_visited->left != RBNIL)
+	{
+		dw->last_visited = dw->last_visited->left;
+		return dw->last_visited;
+	}
+
+	do
+	{
+		if (dw->last_visited->right != RBNIL)
+		{
+			dw->last_visited = dw->last_visited->right;
+			break;
+		}
+
+		/* go up and one step right */
+		for (;;)
+		{
+			RBNode	   *came_from = dw->la

Re: [HACKERS] [Patch] RBTree iteration interface improvement

2016-07-28 Thread Aleksander Alekseev

> Can you explain use case where you need it?

Sure. You can consider RBTree as a container that always keeps its
elements in sorted order.  Now imagine you would like to write a code
like this:

```
/* iterate over items in sorted order */
while(item1 = left_right_walk(tree))
{

  /* another iteration, probably even in different procedure */
  while(item2 = left_right_walk(tree))
  {
/* ... some logic ... */
  }

}
```

Currently you can't do it.

Or maybe you have different objects, e.g. IndexScanDesc's, that should
iterate over some tree's independently somewhere in indexam.c
procedures. Exact order may depend on user's query so you don't even
control it.

-- 
Best regards,
Aleksander Alekseev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-07-29 Thread Aleksander Alekseev

Hello

Some time ago we discussed an idea of "fast temporary tables":

https://www.postgresql.org/message-id/20160301182500.2c81c3dc%40fujitsu

In two words the idea is following.



PostgreSQL stores information about all relations in pg_catalog. Some
applications create and delete a lot of temporary tables. It causes a
bloating of pg_catalog and running auto vacuum on it. It's quite an
expensive operation which affects entire database performance.

We could introduce a new type of temporary tables. Information about
these tables is stored not in a catalog but in backend's memory. This
way user can solve a pg_catalog bloating problem and improve overall
database performance.



I took me a few months but eventually I made it work. Attached patch
has some flaws. I decided not to invest a lot of time in documenting
it or pgindent'ing all files yet. In my experience it will be rewritten
entirely 3 or 4 times before merging anyway :) But it _works_ and
passes all tests I could think of, including non-trivial cases like
index-only or bitmap scans of catalog tables.

Usage example:

```
CREATE FAST TEMP TABLE fasttab_test1(x int, s text);

INSERT INTO fasttab_test1 VALUES (1, 'aaa'), (2, 'bbb'), (3, 'ccc');

UPDATE fasttab_test1 SET s = 'ddd' WHERE x = 2;

DELETE FROM fasttab_test1 WHERE x = 3;

SELECT * FROM fasttab_test1 ORDER BY x;

DROP TABLE fasttab_test1;
```

More sophisticated examples could be find in regression tests:

./src/test/regress/sql/fast_temp.sql

Any feedback on this patch will be much appreciated!

-- 
Best regards,
Aleksander Alekseev
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 0689cc9..940210b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1693,7 +1693,7 @@
   
   
p = permanent table, u = unlogged table,
-   t = temporary table
+   t = temporary table, f = fast temporary table
   
  
 
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index 1fa6de0..56de4dc 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/common
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heaptuple.o indextuple.o printtup.o reloptions.o scankey.o \
+OBJS = fasttab.o heaptuple.o indextuple.o printtup.o reloptions.o scankey.o \
 	tupconvert.o tupdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/fasttab.c b/src/backend/access/common/fasttab.c
new file mode 100644
index 000..0a20247
--- /dev/null
+++ b/src/backend/access/common/fasttab.c
@@ -0,0 +1,1678 @@
+/* TODO TODO comment the general idea - in-memory tuples and indexes, hooks principle, FasttabSnapshots, etc */
+
+#include "c.h"
+#include "postgres.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "access/amapi.h"
+#include "access/fasttab.h"
+#include "access/relscan.h"
+#include "access/valid.h"
+#include "access/sysattr.h"
+#include "access/htup_details.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_depend.h"
+#include "catalog/pg_inherits.h"
+#include "catalog/pg_statistic.h"
+#include "storage/bufmgr.h"
+#include "utils/rel.h"
+#include "utils/inval.h"
+#include "utils/memutils.h"
+
+/*
+		  TYPEDEFS, MACRO DECLARATIONS AND CONST STATIC VARIABLES
+ */
+
+/* #define FASTTAB_DEBUG 1 */
+
+#ifdef FASTTAB_DEBUG
+static int32 fasttab_scan_tuples_counter = -1;
+#endif
+
+/* list of in-memory catalog tuples */
+typedef struct
+{
+	dlist_node	node;
+	HeapTuple	tup;
+}	DListHeapTupleData;
+
+typedef DListHeapTupleData *DListHeapTuple;
+
+#define FasttabSnapshotIsAnonymous(sn) ( !PointerIsValid((sn)->name) )
+#define FasttabSnapshotIsRoot(sn) ( !PointerIsValid((sn)->prev) )
+#define FasttabTransactionInProgress() \
+	( PointerIsValid(FasttabSnapshotGetCurrent()->prev))
+/* like strcmp but for integer types --- int, uint32, Oid, etc */
+#define FasttabCompareInts(x, y) ( (x) == (y) ? 0 : ( (x) > (y) ? 1 : -1 ))
+
+struct FasttabSnapshotData;		/* forward declaration required for
+ * relation_is_inmem_tuple_function typedef */
+typedef struct FasttabSnapshotData *FasttabSnapshot;
+
+typedef bool (*relation_is_inmem_tuple_function)
+			(Relation relation, HeapTuple tup, FasttabSnapshot fasttab_snapshot,
+		 int tableIdx);
+
+#define FasttabRelationMaxOidAttributes 2
+
+typedef const struct
+{
+	Oid			relationId;
+	relation_is_inmem_tuple_function is_inmem_tuple_fn;
+	AttrNumber	noidattr;
+	AttrNumber	attrNumbers[FasttabRelationMaxOidAttributes];
+}	FasttabRelat

Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-01 Thread Aleksander Alekseev

Hello.

Thanks everyone for great comments!

> Well, jokes aside, that's a pretty lousy excuse for not writing any 
> docs

I think maybe I put it in a wrong way. There are currently a lot of
comments in a code, more then enough to understand how this feature
works. What I meant is that this is not a final version of a patch and
a few paragraphs are yet to be written. At least it's how I see it. If
you believe that some parts of the code are currently hard to understand
and some comments could be improved, please name it and I will be happy
to fix it.

> IMHO if this patch gets in, we should use it as the only temp table 
> implementation (Or can you think of cases where keeping rows in
> pg_class has advantages?). That'd also eliminate the need for FAST
> keyword in the CREATE TABLE command.

> Probably has zero value to have slow and fast temp tables (from
> catalogue cost perspective). So the FAST implementation should be used
> everywhere.

If there are no objections I see no reason no to do it in a next
version of a patch.

> I'm getting failures in the regression suite

I've run regression suite like 10 times in a row in different
environments with different build flags but didn't manage to reproduce
it. Also our DBAs are testing this feature for weeks now on real-world
applications and they didn't report anything like this. Could you
please describe how to reproduce this issue?

> This should to work on slaves - it is one of ToDo

Glad you noticed! In fact I'm currently researching a possibility of
using the same approach for creating writable temporary tables on
replicas.

> The key reason why I don't think that's negotiable is that if there
> aren't (apparently) catalog entries corresponding to the temp tables,
> that will almost certainly break many things in the backend and
> third-party extensions, not only user code patterns like this one.

> In short, I think that the way to make something like this work is to
> figure out how to have "virtual" catalog rows describing a temp table.

I'm afraid once again I put it in a wrong way. What I meant by
"Information about these tables is stored not in a catalog but in
backend's memory" is in fact that _records_ of pg_class, pg_type and
other catalog relations are stored in-memory. Naturally this records
are visible to the user (otherwise \d or \d+ would not work) and you
can do queries like ` select * from pg_class where relname = 'tt1' `.
In other words part of the catalog is indeed "virtual".

> I didn't see support for memory store for column's statistics. Some
> separate questions is about production statistics -
> pg_stat_user_table, ..

> That seems to work (both analyze and pg_stat_user_tables). Not sure 
> where it's in the code, and I'm not willing to reverse engineer it.

Right, `ANALYZE temp_table;` and everything else works. Besides
pg_class, pg_type, pg_attribute and other relations pg_statistic
records for temp tables are stored in-memory as well. IIRC a lot of
pg_stat* relations are in fact views and thus don't require any special
support. If you see that some statistics are broken please don't
hesitate to report it and I will fix it.

Hope I answered all questions so far. I look forward to receive more
comments and questions regarding this patch!

-- 
Best regards,
Aleksander Alekseev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-04 Thread Aleksander Alekseev

Thanks everyone for your remarks and comments!

Here is an improved version of a patch.

Main changes:
* Patch passes `make installcheck`
* Code is fully commented, also no more TODO's

I wish I sent this version of a patch last time. Now I realize it was
really hard to read and understand. Hope I managed to correct this
flaw. If you believe that some parts of the code are still poorly
commented or could be improved in any other way please let me know.

And as usual, any other comments, remarks or questions are highly
appreciated!

-- 
Best regards,
Aleksander Alekseev
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 8f5332a..ac272e1 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1693,7 +1693,7 @@
   
   
p = permanent table, u = unlogged table,
-   t = temporary table
+   t = temporary table, f = fast temporary table
   
  
 
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index 1fa6de0..56de4dc 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/common
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heaptuple.o indextuple.o printtup.o reloptions.o scankey.o \
+OBJS = fasttab.o heaptuple.o indextuple.o printtup.o reloptions.o scankey.o \
 	tupconvert.o tupdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/fasttab.c b/src/backend/access/common/fasttab.c
new file mode 100644
index 000..610790d
--- /dev/null
+++ b/src/backend/access/common/fasttab.c
@@ -0,0 +1,1884 @@
+/*-
+ *
+ * fasttab.c
+ *	  virtual catalog and fast temporary tables
+ *
+ * This file contents imlementation of special type of temporary tables ---
+ * fast temporary tables (FTT). From user perspective they work exactly as
+ * regular temporary tables. However there are no records about FTTs in
+ * pg_catalog. These records are stored in backend's memory instead and mixed
+ * with regular records during scans of catalog tables. We refer to
+ * corresponding tuples of catalog tables as "in-memory" or "virtual" tuples
+ * and to all these tuples together --- as "in-memory" or "virtual" catalog.
+ *
+ * Note that since temporary tables are visiable only in one session there is
+ * no need to use shared memory or locks for FTTs. Transactions support is
+ * very simple too. There is no need to track xmin/xmax, etc.
+ *
+ * FTTs are designed to to solve pg_catalog bloating problem. The are
+ * applications that create and delete a lot of temporary tables. It causes
+ * bloating of pg_catalog and running auto vacuum on it. It's quite an
+ * expensive operation that affects entire database performance.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/common/fasttab.c
+ *
+ *-
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "access/amapi.h"
+#include "access/fasttab.h"
+#include "access/relscan.h"
+#include "access/valid.h"
+#include "access/sysattr.h"
+#include "access/htup_details.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_depend.h"
+#include "catalog/pg_inherits.h"
+#include "catalog/pg_statistic.h"
+#include "storage/bufmgr.h"
+#include "utils/rel.h"
+#include "utils/inval.h"
+#include "utils/memutils.h"
+
+/*
+		  TYPEDEFS, MACRO DECLARATIONS AND CONST STATIC VARIABLES
+ */
+
+/* #define FASTTAB_DEBUG 1 */
+
+#ifdef FASTTAB_DEBUG
+static int32 fasttab_scan_tuples_counter = -1;
+#endif
+
+/* List of in-memory tuples. */
+typedef struct
+{
+	dlist_node	node;
+	HeapTuple	tup;
+}	DListHeapTupleData;
+
+typedef DListHeapTupleData *DListHeapTuple;
+
+/* Like strcmp but for integer types --- int, uint32, Oid, etc. */
+#define FasttabCompareInts(x, y) ( (x) == (y) ? 0 : ( (x) > (y) ? 1 : -1 ))
+
+/* Forward declaration is required for relation_is_inmem_tuple_function */
+struct FasttabSnapshotData;
+typedef struct FasttabSnapshotData *FasttabSnapshot;
+
+/* Predicate that determines whether given tuple should be stored in-memory */
+typedef bool (*relation_is_inmem_tuple_function)
+			(Relation relation, HeapTuple tup, FasttabSnapshot fasttab_snapshot,
+		 int tableIdx);
+
+/* Capacity of FasttabRelati

1 2 3 >

1 - 100 of 229 matches

Mail list logo