Re: [HACKERS] pg_background contrib module proposal

2017-01-06 Thread amul sul
Hi all,

Attaching the latest pg_background patch for review, as per the design
proposed on 22 Dec '16, with the following minor changes in the API.

Changes:
1. pg_background_launch renamed to pg_background_start
2. pg_background_detach renamed to pg_background_close
3. Added a new API to display previously launched background workers
and the count of their results waiting to be read.

Todo:
1. Documentation.

Thanks to Andrew Borodin & David Fetter for the regression test
script, which is included in the attached version of the patch.

Note that the attached patches need to be applied on top of Peter
Eisentraut's v2 patch [1].
Patch 0001 makes some changes in the BackgroundSession code required by
pg_background, which we are currently discussing on the BackgroundSession
thread.

References:
[1] 
https://www.postgresql.org/message-id/attachment/48403/v2-0001-Add-background-sessions.patch

Regards,
Amul


0001-Changes-to-Peter-Eisentraut-s-bgsession-v2-patch.patch
Description: Binary data


0002-pg_background-as-client-of-BackgroundSession-v1.patch
Description: Binary data



Re: [HACKERS] rewrite HeapSatisfiesHOTAndKey

2017-01-06 Thread Amit Kapila
On Sat, Jan 7, 2017 at 11:27 AM, Mithun Cy  wrote:
> On Thu, Jan 5, 2017 at 6:15 PM, Amit Kapila  wrote:
>>
>> Your test and results look good; what kind of m/c have you used to
>> test this?  Let me see if I or one of my colleagues can do this and
>> similar tests on some high-end m/c.
>
> As discussed with Amit, I have tried to run the same tests with the
> below modifications:
> 1. Increased the total rows to 10 million.
> 2. Set fsync off.
> 3. Changed the tests as below, updating all rows at a time.
>
> VACUUM FULL;
> BEGIN;
> UPDATE testtab SET col2 = md5(random()::text);
> ROLLBACK;
>
> I have run these tests on IBM power2, which has sufficient RAM. I have
> set shared_buffers=32GB.
>
> My results show that after this patch there is a slight increase in
> response time (psql \timing was used) for the above update statement,
> around 5 to 10%.
>

I would like to add here that the intention of the test was to stress
the changes in the patch to see the overhead the patch can incur.  Now,
surely this is a synthetic test prepared to test this patch, but I
think it indicates that the changes have some overhead, which might or
might not be ignorable depending on how important it is to get this
patch in.  I think if WARM tuples or indirect indexes need this patch
and can't do without it, then it is worth considering this patch along
with those patches.  OTOH, if we can reduce the overhead, then it might
be okay to proceed with this patch on the basis of the simplicity it
can bring.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] rewrite HeapSatisfiesHOTAndKey

2017-01-06 Thread Mithun Cy
On Sat, Jan 7, 2017 at 11:27 AM, Mithun Cy  wrote:
Sorry, the auto plain-text setting has disturbed the table indentation.
Attaching the spreadsheet for the same.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


HeapSatisfiesHOTAndKey_perf_results.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] rewrite HeapSatisfiesHOTAndKey

2017-01-06 Thread Mithun Cy
On Thu, Jan 5, 2017 at 6:15 PM, Amit Kapila  wrote:
>
> Your test and results look good; what kind of m/c have you used to
> test this?  Let me see if I or one of my colleagues can do this and
> similar tests on some high-end m/c.

As discussed with Amit, I have tried to run the same tests with the
below modifications:
1. Increased the total rows to 10 million.
2. Set fsync off.
3. Changed the tests as below, updating all rows at a time.

VACUUM FULL;
BEGIN;
UPDATE testtab SET col2 = md5(random()::text);
ROLLBACK;

I have run these tests on IBM power2, which has sufficient RAM. I have
set shared_buffers=32GB.

My results show that after this patch there is a slight increase in
response time (psql \timing was used) for the above update statement,
around 5 to 10%.


Runs   Response time (ms),   Response time (ms),      %INC
       update (base)         update (new patch)

1      158863.501            167443.767                5.4010304104
2      151061.793            168908.536               11.8142004312
3      153750.577            164071.994                6.7130915548
4      153639.165            168364.225                9.5841838245
5      149607.139            166498.44                11.2904378179


Under the same conditions, running the original tests (that is,
updating rows which satisfy the condition col1 = :value1), I did not
see any regression.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum WIP

2017-01-06 Thread Amit Kapila
On Fri, Jan 6, 2017 at 11:08 PM, Masahiko Sawada  wrote:
> On Mon, Oct 3, 2016 at 11:00 AM, Michael Paquier
>  wrote:
>> On Fri, Sep 16, 2016 at 6:56 PM, Masahiko Sawada  
>> wrote:
>>> Yeah, I don't have a good solution for this problem so far.
>>> We might need to improve the group locking mechanism for the updating
>>> operation or come up with another approach to resolve this problem.
>>> For example, one possible idea is that the launcher process allocates
>>> the vm and fsm sufficiently in advance in order to avoid relation fork
>>> extension by parallel workers, but that doesn't resolve the fundamental
>>> problem.
>>
>
> I got some advice at PGConf.ASIA 2016 and started to work on this again.
>
> The biggest problem so far is the group locking. As I mentioned
> before, parallel vacuum workers could try to extend the same visibility
> map page at the same time. So we need to make group locking conflict
> in some cases, or eliminate the necessity of acquiring the extension
> lock. The attached 000 patch uses the former idea, which makes group
> locking conflict between parallel workers when a parallel worker tries
> to acquire the extension lock on the same page.
>

How are you planning to ensure the same in the deadlock detector?
Currently, the deadlock detector considers members of the same lock
group as non-blocking.  If you think we don't need to make any changes
in the deadlock detector, then please explain why.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] _hash_addovflpage has a bug

2017-01-06 Thread Amit Kapila
On Sat, Jan 7, 2017 at 2:33 AM, Robert Haas  wrote:
> It looks to me like the recent hash index changes have left
> _hash_addovflpage slightly broken.  I think that if that function
> reaches the point where it calls _hash_getbuf() to fetch the next page
> in the bucket chain, we also need to clear retain_pin.  Otherwise,
> we'll erroneously think that we're supposed to retain a pin even when
> the current page is an overflow page rather than the primary bucket
> page, which is wrong.
>

How?  I think we are ensuring that we retain the pin only if it is a
bucket page.  The check is as below:
if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)

> A larger question, I suppose, is why this function wants to be certain
> of adding a new page even if someone else has already done so.  It
> seems like it might be smarter for it to just return the newly added
> page to the caller and let the caller try to use it.  _hash_doinsert
> has an Assert() that the tuple fits on the returned page, but that
> could be deleted.  As it stands, if two backends try to insert a tuple
> into the same full page at the same time, both of them will add an
> overflow page and we'll end up with 2 overflow pages each containing 1
> tuple.
>

I think it is because in the current algorithm we first get an
overflow page and then add it to the end.  We could change it so that
we first acquire the lock on the tail page and check whether we still
need an overflow page; if so, proceed, else return the already-added
overflow page.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal

2017-01-06 Thread Bruce Momjian
On Wed, Jan  4, 2017 at 09:38:42AM -0500, Stephen Frost wrote:
> > I think we've been far too cavalier lately about unnecessarily breaking
> > admin and monitoring tools. There have been backward-incompatible
> > pg_stat_activity changes in most of the recent releases. It's a *PAIN*
> > to develop monitoring / admin tools that work with a number of releases.
> > It's fine to cause that pain if there's some considerable benefit (e.g.
> > not triggering data loss seems like a case for that, as imo is unifying
> > configuration), but I don't see how that justifies breaking things
> > gratuitously.  Just renaming well-known functions for a minor bit of
> > cleanliness seems not to survive a cost/benefit analysis.
> 
> I agree that we have been breaking monitoring tools regularly for a
> while as we continue to add capabilities and improve things, which is
> part of the reason that I don't see this particular break as a very big
> deal -- serious monitoring tools are going to need to be updated anyway.
> 
> I don't see that changing any time soon either -- we are woefully far
> behind in this area compared to other databases, and we should be trying
> to encourage people to continue improving these areas, not making things
> unnecessarily difficult by requiring backwards-compatibility hacks.

I agree.  I know these changes are painful, and I will take
responsibility for perhaps the most cavalier breakage of
pg_stat_activity in renaming column procpid to pid.  (I still stand
behind that change.)

As painful as these breakages are *at* the time of the breakage,
long-term they keep our user API clean.  Frankly, because changes are
reviewed by so many people, we don't normally make dumb API mistakes
that require changes --- rather, new features or complex interactions
require user API changes.  We often feel we have too many users to make
the change, but every few years we double our user base, and all those
new users see a nice clean API that previous users had to adjust to.

If we don't make changes like this, our API becomes nonintuitive very
quickly, and let's face it, we expose a lot of internal interfaces to
our users, so clarity is extra important.

I don't think anyone is arguing that these API breakages are cost-free,
but I think long-term, the costs are minor compared to the clean API we
provide to users.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Add support to COMMENT ON CURRENT DATABASE

2017-01-06 Thread Peter Eisentraut
On 1/3/17 11:52 PM, Ashutosh Bapat wrote:
> We will need to make CURRENT_DATABASE a reserved keyword. But I like
> this idea more than COMMENT ON CURRENT DATABASE.

We already have the reserved key word CURRENT_CATALOG, which is the
standard spelling.  But I wouldn't be bothered if we made
CURRENT_DATABASE somewhat reserved as well.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Replication/backup defaults

2017-01-06 Thread Peter Eisentraut
On 1/5/17 12:01 PM, Andres Freund wrote:
> On 2017-01-05 08:38:32 -0500, Peter Eisentraut wrote:
>> I also suggest making the defaults for both 20 instead of 10.  That
>> leaves enough room that almost nobody ever has to change them, whereas
>> 10 can be a bit tight for some not-outrageous installations (8 standbys
>> plus backup?).
> 
> I'm afraid we need to add initdb integration / testing for those. I mean
> we have initdb test down to 10 connections to deal with limited
> resources...

Those initdb defaults were last touched in 2005, before the use of
System V shared memory was reduced to a minimum.  It might be worth
revisiting that.  The only way to end up with a low number of connection
slots would seem to be a very low semaphore configuration.

In the build farm, I have found 6 critters that do not end up with the
100/128MB setting: sidewinder, curculio, coypu, brolga, lorikeet,
opossum.  I wonder what limitations initdb is bumping against.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Replication/backup defaults

2017-01-06 Thread Peter Eisentraut
On 1/5/17 4:56 PM, Michael Banck wrote:
>> You can't actually change the other two without changing wal_level.
> That actually goes both ways: I recently saw a server not start because
> we were experimenting with temporarily setting wal_level to minimal for
> initial bulk loading, but did not reduce max_wal_senders back to zero.
> So it failed at startup with 'FATAL:  WAL streaming (max_wal_senders >
> 0) requires wal_level "replica" or "logical"'.

I think that was the point:  You can't change the default of
max_wal_senders without also changing the default of wal_level.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] merging some features from plpgsql2 project

2017-01-06 Thread Joel Jacobson
On Thu, Jan 5, 2017 at 7:03 AM, Robert Haas  wrote:
>
> I think it would be a good idea to lock all the people who really care
> about PL/pgsql in a room until they agree on what changes should be
> made for the next version of the language.  If they don't agree
> quickly enough, we can resort to the techniques described in
> https://en.wikipedia.org/wiki/Papal_election,_1268%E2%80%9371

I think that's a very good idea, and I'm happy to be locked into such a room.

I think such a discussion will be very fruitful, given that the others
in the room have also already decided they want a new language and are
there to discuss "the next version of the language", instead of
debating why they don't think we need a new language.

It would also be good if those people could bring laptops with all
their plpgsql code bases, to check whether any of the proposed,
possibly non-backwards-compatible syntax changes would break nothing,
just a few functions, or a lot of functions in their code bases.




Re: [HACKERS] Support for pg_receivexlog --format=plain|tar

2017-01-06 Thread Michael Paquier
On Fri, Jan 6, 2017 at 11:07 PM, Magnus Hagander  wrote:
> A few further notes:

Thanks for the review.

> You are using the filemode to gzopen and the mode_compression variable to
> set the compression level. The pre-existing code in pg_basebackup uses
> gzsetparams(). Is there a particular reason you didn't do it the same way?
>
> Small comment:
> -   if (pad_to_size)
> +   if (pad_to_size && dir_data->compression == 0)
> {
> /* Always pre-pad on regular files */
>
>
> That "always" is not true anymore. Commit-time cleanup can be done of that.
>
> The rest of this looks good to me, but please comment on the gzopen part
> before we proceed to commit :)

Yes, using gzsetparams() looks cleaner.  I actually thought about using
the same logic as pg_dump.  Attached is an updated patch.

There is something I forgot.  With this patch,
FindStreamingStart()@pg_receivexlog.c is actually broken.  In short, it
forgets to consider files that have been compressed at the last run of
pg_receivexlog and will try to stream changes from the beginning.  I
can see that gzip -l provides this information... but I have yet to
find something in zlib that allows a cheap lookup, as startup of
streaming should be fast.  Looking at how gzip -l does it may be faster
than looking at the docs.
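
For reference, gzip stores the uncompressed input length (ISIZE, the
length modulo 2^32, little-endian) in the last four bytes of the file
per RFC 1952, and that is what gzip -l reports.  A self-contained
sketch of that cheap lookup (illustrative only, not part of the
attached patch; sufficient for WAL segments, which are single-member
and far below 4GB):

#include <stdio.h>

int
main(int argc, char **argv)
{
	FILE	   *f;
	unsigned char t[4];

	if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
		return 1;
	/* ISIZE is the trailing 4 bytes of the gzip trailer */
	if (fseek(f, -4L, SEEK_END) != 0 || fread(t, 1, 4, f) != 4)
	{
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("uncompressed size: %u\n",
		   (unsigned) t[0] | ((unsigned) t[1] << 8) |
		   ((unsigned) t[2] << 16) | ((unsigned) t[3] << 24));
	return 0;
}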
-- 
Michael
diff --git a/doc/src/sgml/ref/pg_receivexlog.sgml b/doc/src/sgml/ref/pg_receivexlog.sgml
index bfa055b58b..8c1ea9a2e2 100644
--- a/doc/src/sgml/ref/pg_receivexlog.sgml
+++ b/doc/src/sgml/ref/pg_receivexlog.sgml
@@ -180,6 +180,19 @@ PostgreSQL documentation

   
  
+
+ 
+  -Z level
+  --compress=level
+  
+   
+Enables gzip compression of transaction logs, and specifies the
+compression level (0 through 9, 0 being no compression and 9 being best
+compression).  The suffix .gz will
+automatically be added to all filenames.
+   
+  
+ 
 
 

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 3f83d87e50..87ab1bb92f 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -472,7 +472,7 @@ LogStreamerMain(logstreamer_param *param)
 	stream.partial_suffix = NULL;
 
 	if (format == 'p')
-		stream.walmethod = CreateWalDirectoryMethod(param->xlog, do_sync);
+		stream.walmethod = CreateWalDirectoryMethod(param->xlog, 0, do_sync);
 	else
 		stream.walmethod = CreateWalTarMethod(param->xlog, compresslevel, do_sync);
 
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index b6f57a878c..9d18f22c59 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -34,6 +34,7 @@
 /* Global options */
 static char *basedir = NULL;
 static int	verbose = 0;
+static int	compresslevel = 0;
 static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static volatile bool time_to_abort = false;
@@ -75,6 +76,7 @@ usage(void)
 	printf(_("  --synchronous  flush transaction log immediately after writing\n"));
 	printf(_("  -v, --verbose  output verbose messages\n"));
 	printf(_("  -V, --version  output version information, then exit\n"));
+	printf(_("  -Z, --compress=0-9 compress logs with given compression level\n"));
 	printf(_("  -?, --help show this help, then exit\n"));
 	printf(_("\nConnection options:\n"));
 	printf(_("  -d, --dbname=CONNSTR   connection string\n"));
@@ -338,7 +340,8 @@ StreamLog(void)
 	stream.synchronous = synchronous;
 	stream.do_sync = true;
 	stream.mark_done = false;
-	stream.walmethod = CreateWalDirectoryMethod(basedir, stream.do_sync);
+	stream.walmethod = CreateWalDirectoryMethod(basedir, compresslevel,
+stream.do_sync);
 	stream.partial_suffix = ".partial";
 
 	ReceiveXlogStream(conn, &stream);
@@ -389,6 +392,7 @@ main(int argc, char **argv)
 		{"status-interval", required_argument, NULL, 's'},
 		{"slot", required_argument, NULL, 'S'},
 		{"verbose", no_argument, NULL, 'v'},
+		{"compress", required_argument, NULL, 'Z'},
 /* action */
 		{"create-slot", no_argument, NULL, 1},
 		{"drop-slot", no_argument, NULL, 2},
@@ -419,7 +423,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nwWv",
+	while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nwWvZ:",
 			long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -469,6 +473,15 @@ main(int argc, char **argv)
 			case 'v':
 verbose++;
 break;
+			case 'Z':
+compresslevel = atoi(optarg);
+if (compresslevel < 0 || compresslevel > 9)
+{
+	fprintf(stderr, _("%s: invalid compression level \"%s\"\n"),
+			progname, optarg);
+	exit(1);
+}
+break;
 /* action */
 			case 1:
 do_create_slot = true;
@@ -535,6 +548,16 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+#ifndef HAVE_LIBZ
+	if (compresslevel != 0)
+	{
+		fprintf(stderr,
+_("%s: this build does 

[HACKERS] Subtle bug in "Simplify tape block format" commit

2017-01-06 Thread Peter Geoghegan
It seems that commit 01ec2563 has a subtle bug, which stems from the
fact that logtape.c no longer follows the rule described above
ltsGetFreeBlock():

/*
 * Select a currently unused block for writing to.
 *
 * NB: should only be called when writer is ready to write immediately,
 * to ensure that first write pass is sequential.
 */
static long
ltsGetFreeBlock(LogicalTapeSet *lts)
{
...
}

Previously, all calls to ltsGetFreeBlock() were immediately followed
by a corresponding call to ltsWriteBlock(); we wrote out the
newly-allocated-in-first-pass block there and then. It's a good idea
for a corresponding ltsWriteBlock() call to happen immediately, and
it's *essential* for it to happen before any later block is written
during the first write pass (when tuples are initially dumped out to
form runs), since, as noted above ltsWriteBlock():

/*
 * Write a block-sized buffer to the specified block of the underlying file.
 *
 * NB: should not attempt to write beyond current end of file (ie, create
 * "holes" in file), since BufFile doesn't allow that.  The first write pass
 * must write blocks sequentially.
...

However, a "write beyond current end of file" now seems to actually be
attempted at times, resulting in an arcane error during sorting. This
is more or less because LogicalTapeWrite() doesn't heed the warning
from ltsGetFreeBlock(), as seen here:

while (size > 0)
{
if (lt->pos >= TapeBlockPayloadSize)
{
...

/*
 * First allocate the next block, so that we can store it in the
 * 'next' pointer of this block.
 */
nextBlockNumber = ltsGetFreeBlock(lts);

/* set the next-pointer and dump the current block. */
TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
...
}
...

}

Note that LogicalTapeWrite() now allocates each block (calls
ltsGetFreeBlock()) one block in advance of current block, immediately
before *current* block is written (not the next block/block just
allocated). So, the corresponding ltsWriteBlock() call *must* happen
at an unspecified time in the future, typically the next time control
ends up at exactly the same point, where the block that was "next"
becomes "current".

I'm not about to argue that we should go back to following this
ltsGetFreeBlock() rule, though; I can see why Heikki refactored
LogicalTapeWrite() to allocate blocks a block in advance. Still, we
need to be more careful in avoiding the underlying problem that the
ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
has logtape.c be sure to write out a tape's buffer every time
tuplesort ends a run.

I have a test case. Build Postgres with only 0002-* applied, which
makes MAX_PHYSICAL_FILESIZE far smaller than its current value
(current value is 2^30, while this reduces it to BLCKSZ). Then:

postgres=# set replacement_sort_tuples = 0; set work_mem = 64;
SET
SET
postgres=# with a as (select repeat('a', 7000) i from
generate_series(1, 7)) select * from a order by i;
ERROR:  XX000: could not write block 5 of temporary file: Success
LOCATION:  ltsWriteBlock, logtape.c:204

This "block 5" corresponds to an fd.c file/BufFile segment that does
not in fact exist when this error is raised. Since BufFiles multiplex
BLCKSZ-sized segment files in this build of Postgres, it's quite
likely, in general, that writes marking the end of a run will straddle
"segment boundaries", which is what it takes for buffile.c/logtape.c
to throw this error. (The BufFile contract does not allow "holes" in
files, but segment boundaries must be crossed as the critical point
(just before a tape switch) to actually see this error -- you'd have
to be very unlucky to have things line up in exactly the wrong way
with 1GiB BufFile segments, but it can still happen.)

-- 
Peter Geoghegan
From d4a611e94a3b4504f034fdb31a8ab4a72955b887 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Thu, 5 Jan 2017 11:29:53 -0800
Subject: [PATCH 2/2] Reduce MAX_PHYSICAL_FILESIZE value to BLCKSZ

This constant controls the size of underlying fd.c managed files that
the BufFile abstraction multiplexes for its callers.  A very low value
for the constant can be useful for flushing out bugs.
---
 src/backend/storage/file/buffile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 7ebd636..ac3e400 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -47,7 +47,7 @@
  * The reason is that we'd like large temporary BufFiles to be spread across
  * multiple tablespaces when available.
  */
-#define MAX_PHYSICAL_FILESIZE	0x40000000
+#define MAX_PHYSICAL_FILESIZE	BLCKSZ
 #define BUFFILE_SEG_SIZE		(MAX_PHYSICAL_FILESIZE / BLCKSZ)
 
 /*
-- 
2.7.4

From 389859e1a31fab7fe906176d33660583c12e4277 Mon Sep 

Re: [HACKERS] Block level parallel vacuum WIP

2017-01-06 Thread Claudio Freire
On Fri, Jan 6, 2017 at 2:38 PM, Masahiko Sawada  wrote:
>  table_size | indexes | parallel_degree |   time
> +-+-+--
>  6.5GB  |   0 |   1 | 00:00:14
>  6.5GB  |   0 |   2 | 00:00:02
>  6.5GB  |   0 |   4 | 00:00:02

Those numbers look highly suspect.

Are you sure you're not experiencing caching effects? (ie: maybe you
ran the second and third vacuums after the first, and didn't flush the
page cache, so the table was cached)

>  6.5GB  |   2 |   1 | 00:02:18
>  6.5GB  |   2 |   2 | 00:00:38
>  6.5GB  |   2 |   4 | 00:00:46
...
>  13GB   |   0 |   1 | 00:03:52
>  13GB   |   0 |   2 | 00:00:49
>  13GB   |   0 |   4 | 00:00:50
..
>  13GB   |   2 |   1 | 00:12:42
>  13GB   |   2 |   2 | 00:01:17
>  13GB   |   2 |   4 | 00:02:12

These would also be consistent with caching effects.




Re: [HACKERS] ALTER TABLE .. ALTER COLUMN .. ERROR: attribute .. has wrong type

2017-01-06 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Hmm.  The bespoke code for constructing the attno map bothers me;
>> surely there is existing code that does that?  If not, it'd still
>> make more sense to factor it out, I think, because there will be
>> other needs for it in future.

> There isn't any that I could find -- all the existing callers of
> map_variable_attnos build their map in other ways (while walking an
> attribute array at construction time).

[ pokes around... ]  The code I was thinking of is convert_tuples_by_name
in access/common/tupconvert.c.  There's a bit of an API mismatch in that
it wants to wrap the mapping array in a TupleConversionMap struct; but
maybe we could refactor tupconvert.c to offer a way to get just the map
array.
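
Something like this, say (hypothetical name and signature, sketched
from what convert_tuples_by_name() already computes internally):

/*
 * Sketch only: return just the attribute-number map that
 * convert_tuples_by_name() builds, without the TupleConversionMap
 * wrapper; result indexed (zero-based) by output column number.
 */
extern AttrNumber *convert_tuples_by_name_map(TupleDesc indesc,
											  TupleDesc outdesc,
											  const char *msg);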

> I also modified the algorithm to use the relcache instead of walking the
> child's attribute list for each parent attribute (that was silly).

Hmm.  That might be better in a big-O sense but I doubt it's faster for
reasonable numbers of columns.

> My rationale when writing the event trigger code was that each command
> would only be published once, for the parent table, not recursively for
> each child.  So only the original expression should be seen.

Oh good; then we're just talking about a localized bug fix and not a
protocol break for event triggers.

regards, tom lane




Re: [HACKERS] Cluster wide option to control symbol case folding

2017-01-06 Thread Bruce Momjian
On Mon, Jan  2, 2017 at 01:25:35PM -0500, Robert Haas wrote:
> > But, I can easily imagine a good number of people deciding they want
> > mixed case on the server, and so quoting their identifiers. And, then
> > deciding PostgreSQL is defective, rather than deciding their favorite
> > administration or query tool is defective. Almost all of the tools I
> > tried worked fine when I had all lower case symbols on the server. Based
> > on observing the generated SQL, most of the tools that failed for me
> > when I had mixed case symbols on the server would work against a case
> > preserving mode in PostgreSQL. The tools generally pass through the
> > catalog reported symbols without manipulation.
> 
> Right.  Tom's argument makes a lot of sense when you think about this
> from the point of view of someone writing extensions or tools that are
> designed to work with anybody's PostgreSQL instance.  When you think
> about it from the point of view of a user wanting to write an
> application that can work with any of a number of databases, then your
> argument has a lot of merit to it.  I'm not sure there's any way to
> split the baby here: tool authors will obviously prefer that
> PostgreSQL's behavior in this area be invariable, while people trying
> to develop portable database applications will prefer configurability.
> As far as I can see, this is a zero sum game that is bound to have one
> winner and one loser.

Please let me restate the above.  For those working only in the Postgres
ecosystem, the rules are pretty clear --- quote nothing and use only
lowercase, or quote everything.  The reason "quote nothing" is
insufficient is that tools like pgAdmin will quote mixed-case
identifiers during object creation, so technically it is difficult to
have pure "quote nothing" behavior in Postgres unless you control all
the tooling.

Now, clearly, we have users coming from non-Postgres databases where the
behavior is different, i.e. Oracle might be "quote nothing and use only
uppercase".  It seems SQLAnywhere is "quote nothing and case is
preserved".

The problem with opening Postgres up to user-modifiable case folding is
that Postgres ecosystem queries will have to adjust to the fact that
case folding is no longer predictable.  For database applications the
fix might be as easy as changing the session state, but for extensions
and libraries, they would need to quote _everything_ to handle
uncontrollable case folding behavior.  So, in a way, the crazy quoting
we are trying to avoid for migrated queries now happens in the Postgres
ecosystem, and we might even have more of it.

This basically pushes the quoting overhead from users moving to Postgres
from other databases to Postgres ecosystem tooling.  Whether that is
better or worse overall is a judgement call, as Robert stated.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] ALTER TABLE .. ALTER COLUMN .. ERROR: attribute .. has wrong type

2017-01-06 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera  writes:
> > Alvaro Herrera wrote:
> >> Here's a first attempt at fixing this.  It makes the test pass, but I
> >> have the feeling that more complex ones might need more work.
> 
> > Here's another one with three main differences:
> 
> Hmm.  The bespoke code for constructing the attno map bothers me;
> surely there is existing code that does that?  If not, it'd still
> make more sense to factor it out, I think, because there will be
> other needs for it in future.

There isn't any that I could find -- all the existing callers of
map_variable_attnos build their map in other ways (while walking an
attribute array at construction time).

So I did as you suggest, 'cause it sounds like a good idea, but the
problem crops up of where to put the new function.  The obvious
candidate is rewriteManip.c next to map_variable_attnos itself, but the
include creep is a bit bothersome -- maybe it indicates that the new
function should be elsewhere.  But then, the whole of rewriteManip seems
not terribly well delimited to the rewriter itself but just an assorted
collection of walkers, mutators, and similar utilities used by code all
over the place, so perhaps this is not a problem.

I also modified the algorithm to use the relcache instead of walking the
child's attribute list for each parent attribute (that was silly).

Here's the new version.

> Otherwise, this seems sound in terms of fixing the observed problem,
> but what are the implications for event triggers exactly?  Does a
> trigger see only the original expression, or only the modified expression,
> or ???

My rationale when writing the event trigger code was that each command
would only be published once, for the parent table, not recursively for
each child.  So only the original expression should be seen.  I have not
yet verified the actual behavior in the differing attnums case.  One
problem at a time ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e217f57..2977b59 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7999,12 +7999,68 @@ ATPrepAlterColumnType(List **wqueue,
ReleaseSysCache(tuple);
 
/*
-* The recursion case is handled by ATSimpleRecursion.  However, if we are
-* told not to recurse, there had better not be any child tables; else the
-* alter would put them out of step.
+* Recurse manually by queueing a new command for each child, if
+* necessary. We cannot apply ATSimpleRecursion here because we need to
+* remap attribute numbers in the USING expression, if any.
+*
+* If we are told not to recurse, there had better not be any child
+* tables; else the alter would put them out of step.
 */
if (recurse)
-   ATSimpleRecursion(wqueue, rel, cmd, recurse, lockmode);
+   {
+   Oid relid = RelationGetRelid(rel);
+   ListCell   *child;
+   List   *children;
+
+   children = find_all_inheritors(relid, lockmode, NULL);
+
+   /*
+* find_all_inheritors does the recursive search of the 
inheritance
+* hierarchy, so all we have to do is process all of the relids 
in the
+* list that it returns.
+*/
+   foreach(child, children)
+   {
+   Oid childrelid = lfirst_oid(child);
+   Relation childrel;
+
+   if (childrelid == relid)
+   continue;
+
+   /* find_all_inheritors already got lock */
+   childrel = relation_open(childrelid, NoLock);
+   CheckTableNotInUse(childrel, "ALTER TABLE");
+
+   /*
+* Remap the attribute numbers.  If no USING expression was
+* specified, there is no need for this step.
+*/
+   if (def->cooked_default)
+   {
+   AttrNumber *attmap;
+   int maplength;
+   bool found_whole_row;
+
+   /* create a copy to scribble on */
+   cmd = copyObject(cmd);
+
+   attmap = build_attno_map(rel, childrel, &maplength);
+   ((ColumnDef *) cmd->def)->cooked_default =
+   map_variable_attnos(def->cooked_default,
+   
1, 0,
+  

[HACKERS] _hash_addovflpage has a bug

2017-01-06 Thread Robert Haas
It looks to me like the recent hash index changes have left
_hash_addovflpage slightly broken.  I think that if that function
reaches the point where it calls _hash_getbuf() to fetch the next page
in the bucket chain, we also need to clear retain_pin.  Otherwise,
we'll erroneously think that we're supposed to retain a pin even when
the current page is an overflow page rather than the primary bucket
page, which is wrong.

A larger question, I suppose, is why this function wants to be certain
of adding a new page even if someone else has already done so.  It
seems like it might be smarter for it to just return the newly added
page to the caller and let the caller try to use it.  _hash_doinsert
has an Assert() that the tuple fits on the returned page, but that
could be deleted.  As it stands, if two backends try to insert a tuple
into the same full page at the same time, both of them will add an
overflow page and we'll end up with 2 overflow pages each containing 1
tuple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Add support for SRF and returning composites to pl/tcl

2017-01-06 Thread Tom Lane
Jim Nasby  writes:
> On 11/8/16 8:33 AM, Tom Lane wrote:
>> As things stand in HEAD, the behavior is about the same, but the error
>> messages are not --- in one case they mention triggers and of course the
>> other doesn't.  There are a couple of other minor things in the way of
>> unifying the two hunks of code, so I concluded it probably wasn't worth
>> the trouble.  But feel free to take another look if it bugs you.

> I had to add a bit of cruft to pltcl_build_tuple_result but it's not 
> that bad. tg_tupdesc could potentially be eliminated, but I don't know 
> if it's really worth it.

> Note that this does change some of the trigger error messages, but I 
> don't think that's really an issue?

The only thing that seems significant is that we'd change the SQLSTATE
for the "odd number of list items" error: pltcl_trigger_handler has

ereport(ERROR,
(errcode(ERRCODE_E_R_I_E_TRIGGER_PROTOCOL_VIOLATED),
 errmsg("trigger's return list must have even number of elements")));

and that would become

ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("column name/value list must have even number of elements")));

But that's probably fine; it's hard to believe anyone is depending on this
particular case --- and I think the argument that this is legitimately
a TRIGGER_PROTOCOL_VIOLATED case is a bit thin anyway.

While I was checking the patch to verify that it didn't change any
behavior, I noticed that it did, and there's a pre-existing bug here:
pltcl_build_tuple_result is applying utf_e2u to the Tcl_GetString results,
but surely that should be utf_u2e instead.  Fortunately we haven't shipped
this code yet.

regards, tom lane




Re: [HACKERS] Logical Replication WIP

2017-01-06 Thread Peter Eisentraut
0005-Add-separate-synchronous-commit-control-for-logical--v16.patch.gz

This looks a little bit hackish.  I'm not sure how this would behave
properly when either synchronous_commit or
logical_replication_synchronous_commit is changed at run time with a reload.

I'm thinking maybe this and perhaps some other WAL receiver settings
should be properties of a subscription, like ALTER SUBSCRIPTION ...
SET/RESET.

Actually, maybe I'm a bit confused what this is supposed to achieve.
synchronous_commit has both a local and a remote meaning.  What behavior
are the various combinations of physical and logical replication
supposed to accomplish?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Logical Replication WIP

2017-01-06 Thread Peter Eisentraut
Comments on 0004-Add-logical-replication-workers-v16.patch.gz:

I didn't find any major problems.  At times while I was testing strange
things it was not clear why "nothing is happening".  I'll do some more
checking in that direction.

Fixup patch attached that enhances some error messages, fixes some
typos, and other minor changes.  See also comments below.

---

The way check_max_logical_replication_workers() is implemented creates
potential ordering dependencies in postgresql.conf.  For example,

max_logical_replication_workers = 100
max_worker_processes = 200

fails, but if you change the order, it works.  The existing
check_max_worker_processes() has the same problem, but I suspect because
it only checks against MAX_BACKENDS, nobody has ever seriously hit that
limit.

I suggest just removing the check.  If you set
max_logical_replication_workers higher than max_worker_processes and you
hit the lower limit, then whatever is controlling max_worker_processes
should complain with its own error message.

---

The default for max_logical_replication_workers is 4, which seems very
low.  Maybe it should be more like 10 or 20.  The "Quick setup"
section recommends changing it to 10.  We should at least be
consistent there: if you set a default value that is not 0, then it
should be enough that we don't need to change it again in the Quick
setup.  (Maybe the default max_worker_processes should also be
raised?)

+max_logical_replication_workers = 10 # one per subscription + one per instance needed on subscriber

I think this is incorrect (copied from max_worker_processes?).  The
launcher does not count as one of the workers here.

On a related note, should the minimum not be 0 instead of 1?

---

About the changes to libpqrcv_startstreaming().  The timeline is not
really an option in the syntax.  Just passing in a string that is
pasted in the final command creates too much coupling, I think.  I
would keep the old timeline (TimeLineID tli) argument, and make the
options const char * [], and let startstreaming() assemble the final
string, including commas and parentheses.  It's still not a perfect
abstraction, because you need to do the quoting yourself, but much
better.  (Alternatively, get rid of the startstreaming call and just
have callers use libpqrcv_PQexec directly.)
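
As a sketch of what I mean (illustrative only: the helper name is made
up, StringInfo comes from lib/stringinfo.h, and quoting of option
values would remain the caller's job):

static void
appendStreamingOptions(StringInfo cmd, const char **options, int noptions)
{
	int			i;

	/* assemble "(opt, opt, ...)", commas and parentheses included */
	if (noptions <= 0)
		return;
	appendStringInfoChar(cmd, '(');
	for (i = 0; i < noptions; i++)
	{
		if (i > 0)
			appendStringInfoString(cmd, ", ");
		appendStringInfoString(cmd, options[i]);
	}
	appendStringInfoChar(cmd, ')');
}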

---

Some of the header files are named inconsistently with their .c files.
I think src/include/replication/logicalworker.h should be split into
logicalapply.h and logicallauncher.h.  Not sure about
worker_internal.h.  Maybe rename apply.c to worker.c?

(I'm also not fond of throwing publicationcmds.h and
subscriptioncmds.h together into replicationcmds.h.  Maybe that could
be changed, too.)

---

Various FATAL errors in logical/relation.c when the target relation is
not in the right state.  Could those not be ERRORs?  The behavior is
the same at the moment because background workers terminate on
uncaught exceptions, but that should eventually be improved.

A FATAL error will lead to a

LOG:  unexpected EOF on standby connection

on the publisher, because the process just dies without protocol
shutdown.  (And then it reconnects and tries again.  So we might as
well not die and just retry again.)

---

In LogicalRepRelMapEntry, rename rel to localrel, so it's clearer in
the code using this struct.  (Maybe reloid -> localreloid)

---

Partitioned tables are not supported in either publications or as
replication targets.  This is expected but should be fixed before the
final release.

---

In apply.c:

The comment in apply_handle_relation() makes a point that the schema
validation is done later, but does not tell why.  The answer is
probably because it doesn't matter and it's more convenient, but it
should be explained in the comment.

See XXX comment in logicalrep_worker_stop().

The get_flush_position() return value is not intuitive from the
function name.  Maybe make that another pointer argument for clarity.

reread_subscription() complains if the subscription name was changed.
I don't know why that is a problem.

---

In launcher.c:

pg_stat_get_subscription should hold LogicalRepWorkerLock around the
whole loop, so that it doesn't get inconsistent results when workers
change during the loop.

---

In relation.c:

Inconsistent use of uint32 vs LogicalRepRelId.  Pick one. :)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 30e7b9d70b5c6488dd74eb648c62372911096544 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 5 Jan 2017 12:00:00 -0500
Subject: [PATCH] fixup! Add logical replication workers

---
 doc/src/sgml/catalogs.sgml |  6 ++
 doc/src/sgml/ref/create_subscription.sgml  |  2 +-
 src/backend/executor/execReplication.c |  6 +-
 src/backend/replication/logical/apply.c| 88 --
 src/backend/replication/logical/launcher.c | 34 ++--
 

Re: [HACKERS] Increase pltcl test coverage

2017-01-06 Thread Tom Lane
Jim Nasby  writes:
> On 10/31/16 3:24 PM, Jim Nasby wrote:
>> This patch increases test coverage for pltcl, from 70% to 83%. Aside
>> from that, the work on this uncovered 2 new bugs (the trigger return one
>> I just submitted, as well as a bug in the SRF/composite patch).

> Rebased patch attached. Test coverage is now at 85% (by line count).

This is in a format that neither patch(1) nor "git apply" recognize.
Please resubmit in a more usual format, diff -c or diff -u perhaps.

regards, tom lane




Re: [HACKERS] ALTER TABLE .. ALTER COLUMN .. ERROR: attribute .. has wrong type

2017-01-06 Thread Tom Lane
Alvaro Herrera  writes:
> Alvaro Herrera wrote:
>> Here's a first attempt at fixing this.  It makes the test pass, but I
>> have the feeling that more complex ones might need more work.

> Here's another one with three main differences:

Hmm.  The bespoke code for constructing the attno map bothers me;
surely there is existing code that does that?  If not, it'd still
make more sense to factor it out, I think, because there will be
other needs for it in future.

Otherwise, this seems sound in terms of fixing the observed problem,
but what are the implications for event triggers exactly?  Does a
trigger see only the original expression, or only the modified expression,
or ???

regards, tom lane




Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-06 Thread Thomas Munro
On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro
 wrote:
> On Tue, Jan 3, 2017 at 10:53 PM, Thomas Munro
>  wrote:
>> I will post a new rebased version soon with that and
>> some other nearby problems fixed.
>
> Here is a new WIP patch.

I forgot to mention: this applies on top of barrier-v5.patch, over here:

https://www.postgresql.org/message-id/CAEepm%3D3g3EC734kgriWseiJPfUQZeoMWdhAfzOc0ecewAa5uXg%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] ALTER TABLE .. ALTER COLUMN .. ERROR: attribute .. has wrong type

2017-01-06 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Tom Lane wrote:
> 
> > We could probably fix the specific issue being seen here by passing the
> > expression tree through a suitable attno remapping,
> 
> Here's a first attempt at fixing this.  It makes the test pass, but I
> have the feeling that more complex ones might need more work.

Here's another one with three main differences:

1. Make the whole-row check an ereport() not elog().  You can use a
whole-row expression in USING, which makes it fire, so better make it
translatable.  An artificial example is in the new regression tests,
  ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int4 USING (pg_column_size(test_type_diff2));
but I suppose somebody with more imagination could come up with
something actually interesting.

2. The foreign table case was broken, as evidenced by the foreign_table
regression test.

3. If there is no USING expression, there is no need to do the whole
map_variable_attnos() dance.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e217f57..77b4bf5 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7999,12 +7999,98 @@ ATPrepAlterColumnType(List **wqueue,
ReleaseSysCache(tuple);
 
/*
-* The recursion case is handled by ATSimpleRecursion.  However, if we are
-* told not to recurse, there had better not be any child tables; else the
-* alter would put them out of step.
+* Recurse manually by queueing a new command for each child, if
+* necessary. We cannot apply ATSimpleRecursion here because we need to
+* remap attribute numbers in the USING expression, if any.
+*
+* If we are told not to recurse, there had better not be any child
+* tables; else the alter would put them out of step.
 */
if (recurse)
-   ATSimpleRecursion(wqueue, rel, cmd, recurse, lockmode);
+   {
+   Oid relid = RelationGetRelid(rel);
+   ListCell   *child;
+   List   *children;
+
+   children = find_all_inheritors(relid, lockmode, NULL);
+
+   /*
+* find_all_inheritors does the recursive search of the inheritance
+* hierarchy, so all we have to do is process all of the relids in the
+* list that it returns.
+*/
+   foreach(child, children)
+   {
+   Oid childrelid = lfirst_oid(child);
+   Relation childrel;
+
+   if (childrelid == relid)
+   continue;
+
+   /* find_all_inheritors already got lock */
+   childrel = relation_open(childrelid, NoLock);
+   CheckTableNotInUse(childrel, "ALTER TABLE");
+
+   /*
+* Remap the attribute numbers.  If no USING expression was
+* specified, there is no need for this step.
+*/
+   if (def->cooked_default)
+   {
+   TupleDesc   ptdesc;
+   TupleDesc   ctdesc;
+   AttrNumber *attmap;
+   AttrNumber  pnum;
+   bool found_whole_row;
+
+   /*
+* Build an attribute map for map_variable_attnos.  This is
+* O(N^2) on the number of attributes ...
+*/
+   ptdesc = RelationGetDescr(rel);
+   ctdesc = RelationGetDescr(childrel);
+   attmap = (AttrNumber *) palloc0(sizeof(AttrNumber) * ptdesc->natts);
+   for (pnum = 1; pnum <= ptdesc->natts; pnum++)
+   {
+   bool found = false;
+   AttrNumber  cnum;
+
+   for (cnum = 1; cnum <= ctdesc->natts; cnum++)
+   {
+   if (strcmp(NameStr(ptdesc->attrs[pnum - 1]->attname),
+              NameStr(ctdesc->attrs[cnum - 1]->attname)) == 0)
+   {
+   attmap[pnum - 1] = cnum;
+   found = true;
+

Re: [HACKERS] WIP: Barriers

2017-01-06 Thread Thomas Munro
On Wed, Nov 23, 2016 at 2:28 PM, Robert Haas  wrote:
> The code here looks OK.  A few thoughts:
>
> - I'm a little unsure whether it's a good idea to remove the existing
> barrier.h and immediately add a new barrier.h that does something
> totally different.  It's tempting to try to find another name for
> these things just to avoid that.  Then again, I bet there's not much
> non-core code that is relying on the existing barrier.h, so maybe it's
> OK.  On the third hand, the risk of confusing developers may be
> non-nil even if we don't confuse any code.  I think I'll go commit
> remove-useless-barrier-v4.patch now and see if anyone shows up to
> complain...

Here is a new version with updated copyright messages and some dtrace
probes which provide an easy way to measure the time spent waiting at
barriers.

-- 
Thomas Munro
http://www.enterprisedb.com


barrier-v5.patch
Description: Binary data



Re: [HACKERS] Odd behavior with PG_TRY

2017-01-06 Thread Tom Lane
Kevin Grittner  writes:
> On Fri, Jan 6, 2017 at 5:43 AM, Michael Paquier
>  wrote:
>> If a variable is modified within PG_TRY and then referenced in
>> PG_CATCH it needs to be marked as volatile to be strictly in
>> conformance with POSIX. This also ensures that any compiler does not
>> do any stupid optimizations with those variables in the way they are
>> referenced and used.

> That sort of begs the question of why PG_exception_stack is not
> marked as volatile, since the macros themselves modify it within
> the PG_TRY block and reference it within the PG_CATCH block.  Is
> there some reason this variable is immune to the problem?

It's not a local variable.
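
PG_exception_stack has static storage duration; the setjmp() clobber
rule (C99 7.13.2.1) covers only non-volatile *automatic* variables
modified between setjmp() and longjmp().  A self-contained sketch of
the hazard for locals, in plain C with nothing Postgres-specific:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf buf;

int
main(void)
{
	int			n = 0;		/* clobberable: indeterminate after longjmp */
	volatile int v = 0;		/* well-defined after longjmp */

	if (setjmp(buf) == 0)
	{
		n = 1;
		v = 1;
		longjmp(buf, 1);
	}
	printf("n=%d v=%d\n", n, v);	/* v is 1; n may legally be 0 or 1 */
	return 0;
}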

regards, tom lane




Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2017-01-06 Thread Tom Lane
Etsuro Fujita  writes:
> On 2017/01/06 21:25, Ashutosh Bapat wrote:
>> On Thu, Jan 5, 2017 at 2:49 PM, Etsuro Fujita
>>  wrote:
>>> On 2017/01/03 15:57, Ashutosh Bapat wrote:
>>>> The patch looks good to me, but I feel there are too many test cases.

>>> I don't object to that, but (1) the tests I added wouldn't be that
>>> time-consuming, and (2) they would be more expected to help find bugs, in
>>> general, so I'd vote for keeping them.  How about leaving that for the
>>> committer's judgment?

>> Ok. Marking this as ready for committer.

> Thanks!

Pushed.  I ended up simplifying the tests some, partly because I agreed it
seemed like overkill, but mostly because they weren't testing the bug.
The prepared statements that had parameters would have been replanned
anyway, because plancache.c wouldn't have generated enough plans to decide
if a generic plan would be ok.

regards, tom lane




[HACKERS] Off-by-one oddity in minval for decreasing sequences

2017-01-06 Thread Daniel Verite
 Hi,

When testing the patch at https://commitfest.postgresql.org/12/768/
("sequence data type" by Peter E.), I noticed a preexisting
oddity: sequences created with a negative increment
in current releases initialize the minval to -(2^63)+1 instead of -2^63,
the actual lowest value for a bigint.

postgres=# CREATE SEQUENCE s INCREMENT BY -1;
CREATE SEQUENCE

postgres=# SELECT seqmin,seqmin+pow(2::numeric,63)
   FROM pg_sequence where seqrelid='s'::regclass;
        seqmin        |      ?column?      
----------------------+--------------------
 -9223372036854775807 | 1.0000000000000000

But it's still possible to set it to -2^63 manually, either by
altering the sequence or by specifying it explicitly at CREATE time
with CREATE SEQUENCE s MINVALUE -9223372036854775808,
so it's inconsistent with the potential argument that we couldn't
store this value for some reason.

postgres=# ALTER SEQUENCE s minvalue -9223372036854775808;
ALTER SEQUENCE
postgres=# select seqmin,seqmin+pow(2::numeric,63)
   from pg_sequence where seqrelid='s'::regclass;
        seqmin        |      ?column?      
----------------------+--------------------
 -9223372036854775808 | 0.0000000000000000


The defaults come from these definitions in include/pg_config_manual.h:

/*
 * Set the upper and lower bounds of sequence values.
 */
#define SEQ_MAXVALUE	PG_INT64_MAX
#define SEQ_MINVALUE	(-SEQ_MAXVALUE)

with no comment as to why SEQ_MINVALUE is not PG_INT64_MIN.
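
Concretely, for the bigint case the change would amount to just this
(a sketch; PG_INT64_MIN already exists alongside PG_INT64_MAX in c.h):

/*
 * Set the upper and lower bounds of sequence values.
 */
#define SEQ_MAXVALUE	PG_INT64_MAX
#define SEQ_MINVALUE	PG_INT64_MIN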

When using types other than bigint, Peter's patch fixes the
inconsistency, but ISTM it also makes it worse by applying the rule
that the lowest value is forbidden for int2 and int4 in addition to
int8.

I'd like to suggest that we not do that starting with HEAD, by
setting seqmin to the real minimum of the supported range, because
wasting that particular value seems silly, and it is a hazard if
someone wants to use a sequence to store any integer,
as opposed to just calling nextval().

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] Odd behavior with PG_TRY

2017-01-06 Thread Kevin Grittner
On Fri, Jan 6, 2017 at 5:43 AM, Michael Paquier
 wrote:

> If a variable is modified within PG_TRY and then referenced in
> PG_CATCH it needs to be marked as volatile to be strictly in
> conformance with POSIX. This also ensures that any compiler does not
> do any stupid optimizations with those variables in the way they are
> referenced and used.

That sort of begs the question of why PG_exception_stack is not
marked as volatile, since the macros themselves modify it within
the PG_TRY block and reference it within the PG_CATCH block.  Is
there some reason this variable is immune to the problem?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] merging some features from plpgsql2 project

2017-01-06 Thread Pavel Stehule
Hi


>
> >
> > some examples based on Ada doc
> >
> > FUNCTION xxx RETURN int AS
> >   PRAGMA yyy -- pragma has function scope
> > BEGIN
> >
> > FUNCTION xxx RETURN int AS
> > BEGIN
> >   DECLARE
> > PRAGMA yyy -- pragma has block scope
>
> ok, sub-block makes sense over statement level IMO.
>

I am sending a proof of concept (parser-only implementation) - it allows
controlling query plan use at function level and at block level

Examples

CREATE OR REPLACE FUNCTION fx()
RETURNS int AS $$
PRAGMA use_query_plan_cache(off); -- disable query plan cache at function level
DECLARE r record;
BEGIN
  FOR r IN SELECT ... -- some complex query, where we prefer a one-shot plan
  LOOP
DECLARE
  PRAGMA use_query_plan_cache(on); -- enable query plan cache for block
BEGIN
  ... statements inside the cycle reuse their query plans
END;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

or

BEGIN
  ...
  DECLARE
PRAGMA use_query_plan_cache(off);
  BEGIN
    -- these queries get a fresh plan only
SELECT ...
SELECT ...
  END; -- end of PRAGMA scope
  ...
  -- usual behavior
END;

The behavior is static - controlled at compile time only - so the controlled
feature can be enabled/disabled with zero runtime impact.

* the syntax is verbose but readable - I prefer a strong, clear signal to
readers that something internal is different
* consistent with Ada, PL/SQL
* removes one reason for dynamic SQL
* allows mixing queries with and without the query plan cache - interesting
for the pattern FOR IN slow query LOOP fast query; END LOOP;

* there is a small risk of compatibility breakage - if somebody uses
variables named PRAGMA - because a new reserved keyword is necessary; it
fails with a syntax error, so it is easily identified.
* this syntax can be reused - e.g. for autonomous_transaction as in PL/SQL.
I read the GNU Ada manual - there pragmas are often used to implement
legacy (obsolete) behavior and functionality.

Notes, comments?

Regards

Pavel
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index 3c52d71..a5fd040 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -288,6 +288,7 @@ do_compile(FunctionCallInfo fcinfo,
 	int		   *in_arg_varnos = NULL;
 	PLpgSQL_variable **out_arg_variables;
 	MemoryContext func_cxt;
+	PLpgSQL_settings settings;
 
 	/*
 	 * Setup the scanner input and error info.  We assume that this function
@@ -373,6 +374,11 @@ do_compile(FunctionCallInfo fcinfo,
 	plpgsql_DumpExecTree = false;
 	plpgsql_start_datums();
 
+	/* Prepare default for PRAGMA directives */
+	settings.prev = NULL;
+	settings.use_query_plan_cache = true;
+	plpgsql_settings_init(&settings);
+
 	switch (function->fn_is_trigger)
 	{
 		case PLPGSQL_NOT_TRIGGER:
@@ -796,6 +802,7 @@ plpgsql_compile_inline(char *proc_source)
 	PLpgSQL_variable *var;
 	int			parse_rc;
 	MemoryContext func_cxt;
+	PLpgSQL_settings settings;
 
 	/*
 	 * Setup the scanner input and error info.  We assume that this function
@@ -851,6 +858,11 @@ plpgsql_compile_inline(char *proc_source)
 	plpgsql_DumpExecTree = false;
 	plpgsql_start_datums();
 
+	/* Prepare default for PRAGMA directives */
+	settings.prev = NULL;
+	settings.use_query_plan_cache = true;
+	plpgsql_settings_init(&settings);
+
 	/* Set up as though in a function returning VOID */
 	function->fn_rettype = VOIDOID;
 	function->fn_retset = false;
diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c
index 906fe01..c7ee968 100644
--- a/src/pl/plpgsql/src/pl_funcs.c
+++ b/src/pl/plpgsql/src/pl_funcs.c
@@ -33,6 +33,7 @@
  * --
  */
 static PLpgSQL_nsitem *ns_top = NULL;
+static PLpgSQL_settings *settings_top = NULL;
 
 
 /* --
@@ -226,6 +227,66 @@ plpgsql_ns_find_nearest_loop(PLpgSQL_nsitem *ns_cur)
 
 
 /*
+ * Compilator settings routines
+ */
+
+void
+plpgsql_settings_init(PLpgSQL_settings *defval)
+{
+	settings_top = defval;
+}
+
+/*
+ * Creates new settings based on previous settings
+ */
+void
+plpgsql_settings_clone(void)
+{
+	PLpgSQL_settings *new = palloc(sizeof(PLpgSQL_settings));
+
+	Assert(settings_top != NULL);
+
+	memcpy(new, settings_top, sizeof(PLpgSQL_settings));
+	new->prev = settings_top;
+	settings_top = new;
+}
+
+/*
+ * apply a pragma to current settings
+ */
+void
+plpgsql_settings_pragma(PLpgSQL_pragma_type typ, bool value)
+{
+	Assert(settings_top != NULL);
+
+	switch (typ)
+	{
+		case PLPGSQL_PRAGMA_QUERY_PLAN_CACHE:
+			settings_top->use_query_plan_cache = value;
+	}
+}
+
+/*
+ * restore previous compiler settings
+ */
+void
+plpgsql_settings_pop(void)
+{
+	PLpgSQL_settings *prev;
+
+	Assert(settings_top != NULL);
+	prev = settings_top->prev;
+	pfree(settings_top);
+	settings_top = prev;
+}
+
+PLpgSQL_settings *
+plpgsql_settings_top(void)
+{
+	return settings_top;
+}
+
+/*
  * Statement type as a string, for use in error messages etc.
  */
 const char *
@@ -1534,7 +1595,7 @@ dump_getdiag(PLpgSQL_stmt_getdiag *stmt)
 static void
 dump_expr(PLpgSQL_expr *expr)
 {
-	printf("'%s'", expr->query);
+	printf("%s'%s'", expr->use_query_plan_cache ? "*" : "", 

[HACKERS] Placement of InvokeObjectPostAlterHook calls

2017-01-06 Thread Tom Lane
While reviewing Etsuro-san's patch to force replanning after FDW option
changes, I noticed that there is a great lack of consistency about where
InvokeObjectPostAlterHook calls have been placed relative to other actions
such as forced relcache invals.  I wonder exactly what expectations a
post-alter hook function could have about cache coherency, or indeed if
there's any clarity at all about what such a hook might do.  For instance,
it looks to me like the hook would generally need to do a
CommandCounterIncrement in order to be able to "see" the update it's being
called for, and I'm unsure that that would be safe at every call site.
Is that supposed to be allowed, or are we expecting that object access
hooks are only going to do open-loop actions that don't rely on the
details of the change?
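
For concreteness, a hook that wants to inspect the altered object would
have to do something like this (a sketch; the function name is made up,
but the signature is the object_access_hook one from objectaccess.h):

static void
my_object_access(ObjectAccessType access, Oid classId, Oid objectId,
				 int subId, void *arg)
{
	if (access == OAT_POST_ALTER)
	{
		/* without this, a syscache lookup may still see the pre-ALTER row */
		CommandCounterIncrement();
		/* ... fetch and examine the just-altered catalog row ... */
	}
}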

I suppose this fits in well with our grand tradition of not documenting
hooks at all, but for a set of hooks as invasive as these are, I think
we ought to do better.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Block level parallel vacuum WIP

2017-01-06 Thread Masahiko Sawada
On Mon, Oct 3, 2016 at 11:00 AM, Michael Paquier wrote:
> On Fri, Sep 16, 2016 at 6:56 PM, Masahiko Sawada wrote:
>> Yeah, I don't have a good solution for this problem so far.
>> We might need to improve group locking mechanism for the updating
>> operation or came up with another approach to resolve this problem.
>> For example, one possible idea is that the launcher process allocates
>> vm and fsm enough in advance in order to avoid extending fork relation
>> by parallel workers, but it doesn't resolve the fundamental problem.
>

I got some advice at PGConf.ASIA 2016 and started to work on this again.

The biggest problem so far is group locking. As I mentioned before,
parallel vacuum workers could try to extend the same visibility map
page at the same time. So we need to make group locking conflict in
some cases, or we need to eliminate the necessity of acquiring the
extension lock. The attached 000 patch uses the former idea, which
makes group locking conflict between parallel workers when a parallel
worker tries to acquire the extension lock on the same page. I'm not
sure this is the best idea, but it's very simple and enough to support
parallel vacuum. A smarter idea could be needed when we want to
support parallel DML and so on.

The 001 patch adds a parallel option to the VACUUM command. As Robert
suggested before, the parallel option is set with a parallel degree.

=# VACUUM (PARALLEL 4) table_name;

..means 4 background processes are launched, and each background process
executes lazy_scan_heap while the launcher (leader) process waits for
all vacuum workers to finish. If N = 1, or without the parallel option,
the leader process itself executes lazy_scan_heap.

Internal Design
=
I changed the parallel vacuum internal design. Collecting garbage on
the table is parallelized at the block level. For tables with indexes,
each index on the table is assigned to a vacuum worker, and all garbage
in an index is processed by its assigned vacuum worker.

All the space for the arrays of dead tuple TIDs used by the vacuum
workers is allocated in dynamic shared memory by the launcher process.
A vacuum worker stores dead tuple locations into its own dead tuple
array without locking; the TIDs within a dead tuple array are ordered.
Note that the entire dead tuple space, that is, the collection of dead
tuple arrays, is not ordered as a whole.

If the table has indexes, all dead tuple TIDs need to be shared with
all vacuum workers before the actual reclaiming of dead tuples starts,
and this data should be cleared after all vacuum workers have finished
using it. So I put two synchronization points: one before reclaiming
dead tuples and one after they have been reclaimed. At these points,
each parallel vacuum worker waits for all other workers to reach the
same point. Once all vacuum workers have reached it, each worker
resumes the next operation.
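
To illustrate, each vacuum worker conceptually does the equivalent of
the following (sketched with plain pthreads rather than the backend's
own synchronization primitives; reclaim_dead_tuples() is a placeholder):

    pthread_barrier_wait(&sync_point);  /* all dead tuple arrays filled */
    reclaim_dead_tuples();              /* table and assigned indexes */
    pthread_barrier_wait(&sync_point);  /* now safe to clear shared arrays */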

For example, if a table has five indexes and we execute parallel lazy
vacuum on that table with three vacuum workers, two of the three vacuum
workers are assigned two indexes each and the remaining vacuum worker
is assigned one index. After the combined dead tuples of all vacuum
workers reach maintenance_work_mem, the workers start to reclaim dead
tuples from the table and indexes. The vacuum worker that is assigned
one index finishes (probably first) and sleeps until the other two
vacuum workers finish vacuuming. If the table has no indexes, each
parallel vacuum worker simply vacuums each page as we go.

Performance
===
I measured the execution time of vacuum on dirty table with several
parallel degree in my poor environment.

 table_size | indexes | parallel_degree |   time
------------+---------+-----------------+----------
 6.5GB      |       0 |               1 | 00:00:14
 6.5GB      |       0 |               2 | 00:00:02
 6.5GB      |       0 |               4 | 00:00:02
 6.5GB      |       1 |               1 | 00:00:13
 6.5GB      |       1 |               2 | 00:00:15
 6.5GB      |       1 |               4 | 00:00:18
 6.5GB      |       2 |               1 | 00:02:18
 6.5GB      |       2 |               2 | 00:00:38
 6.5GB      |       2 |               4 | 00:00:46
 13GB       |       0 |               1 | 00:03:52
 13GB       |       0 |               2 | 00:00:49
 13GB       |       0 |               4 | 00:00:50
 13GB       |       1 |               1 | 00:01:41
 13GB       |       1 |               2 | 00:01:59
 13GB       |       1 |               4 | 00:01:24
 13GB       |       2 |               1 | 00:12:42
 13GB       |       2 |               2 | 00:01:17
 13GB       |       2 |               4 | 00:02:12

In my measurements, vacuum execution time got better in some cases but
didn't improve in the cases where indexes = 1. I'll investigate the
cause.

ToDo
==
* Vacuum progress support.
* Storage parameter support, perhaps parallel_vacuum_workers parameter
which allows autovacuum to use parallel vacuum on specified table.

I register this to next CF.

Regards,

--
Masahiko Sawada

Re: [HACKERS] Performance degradation in Bitmapscan (commit 75ae538bc3168bf44475240d4e0487ee2f3bb376)

2017-01-06 Thread Robert Haas
On Fri, Jan 6, 2017 at 11:06 AM, Andres Freund  wrote:
> On 2017-01-06 11:01:32 -0500, Robert Haas wrote:
>> On Fri, Jan 6, 2017 at 10:43 AM, Andres Freund  wrote:
>> > On 2016-12-16 09:34:31 -0800, Andres Freund wrote:
>> >> > To fix his issue, we need something like your 0001.  Are you going to
>> >> > polish that up soon here?
>> >>
>> >> Yes.
>> >
>> > I've two versions of a fix for this. One of them basically increases the
>> > "spread" of buckets when the density goes up too much. It does so by
>> > basically shifting the bucket number to the left (e.g. only every second
>> > bucket can be the "primary" bucket for a hash value).  The other
>> > basically just replaces the magic constants in my previous POC patch
>> > with slightly better documented constants.  For me the latter works just
>> > as well as the former, even though aesthetically/theoretically the
>> > former sounds better.  I'm inclined to commit the latter, at least for
>> > now.
>>
>> Did you intend to attach the patches?
>
> No, I hadn't. You're interested in the "spreading" version?

I honestly have no opinion.  If you're confident that you know what
you want to do, it's fine with me if you just do it.  If you want my
opinion I probably need to see both patches and compare.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance degradation in Bitmapscan (commit 75ae538bc3168bf44475240d4e0487ee2f3bb376)

2017-01-06 Thread Andres Freund
On 2017-01-06 11:01:32 -0500, Robert Haas wrote:
> On Fri, Jan 6, 2017 at 10:43 AM, Andres Freund  wrote:
> > On 2016-12-16 09:34:31 -0800, Andres Freund wrote:
> >> > To fix his issue, we need something like your 0001.  Are you going to
> >> > polish that up soon here?
> >>
> >> Yes.
> >
> > I've two versions of a fix for this. One of them basically increases the
> > "spread" of buckets when the density goes up too much. It does so by
> > basically shifting the bucket number to the left (e.g. only every second
> > bucket can be the "primary" bucket for a hash value).  The other
> > basically just replaces the magic constants in my previous POC patch
> > with slightly better documented constants.  For me the latter works just
> > as well as the former, even though aesthetically/theoretically the
> > former sounds better.  I'm inclined to commit the latter, at least for
> > now.
> 
> Did you intend to attach the patches?

No, I hadn't. You're interested in the "spreading" version?

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance degradation in Bitmapscan (commit 75ae538bc3168bf44475240d4e0487ee2f3bb376)

2017-01-06 Thread Robert Haas
On Fri, Jan 6, 2017 at 10:43 AM, Andres Freund  wrote:
> On 2016-12-16 09:34:31 -0800, Andres Freund wrote:
>> > To fix his issue, we need something like your 0001.  Are you going to
>> > polish that up soon here?
>>
>> Yes.
>
> I've two versions of a fix for this. One of them basically increases the
> "spread" of buckets when the density goes up too much. It does so by
> basically shifting the bucket number to the left (e.g. only every second
> bucket can be the "primary" bucket for a hash value).  The other
> basically just replaces the magic constants in my previous POC patch
> with slightly better documented constants.  For me the latter works just
> as well as the former, even though aesthetically/theoretically the
> former sounds better.  I'm inclined to commit the latter, at least for
> now.

Did you intend to attach the patches?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2017-01-06 Thread Robert Haas
On Fri, Dec 30, 2016 at 5:35 PM, Alvaro Herrera wrote:
> Also, vacuuming: my answer continues to be that the killtuple
> interface should be good enough, ...

How deeply do you believe in that answer?  I mean, I grant you that
there are many use cases for which that will work fine, but
continuously-advancing keyspace is an example of a use case where the
index will grow without bound unless you REINDEX periodically, and
that sucks.  It's not clear to me that it's 100% unacceptable to
commit the feature with no other provision to remove dead tuples, but
if you do, I think it's likely to be a fairly major operational
problem for people who actually try to use this in production.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_activity.waiting_start

2017-01-06 Thread Andres Freund
On 2017-01-06 10:43:32 -0500, Bruce Momjian wrote:
> On Thu, Jan  5, 2017 at 06:48:17PM -1000, Joel Jacobson wrote:
> > On Thu, Jan 5, 2017 at 4:59 PM, Bruce Momjian  wrote:
> > > Agreed.  No need in adding overhead for short-lived locks because the
> > > milli-second values are going to be meaningless to users. I would be
> > > happy if we could find some weasel value for non-heavyweight locks.
> > 
> > To avoid a NULL value for waiting_start, and since non-heavyweight
> > locks don't exceed order-of-milliseconds, I think it would be
> > acceptable to just return now() whenever something wants to know
> > waiting_start i.e. when something selects from pg_stat_activity.
> > 
> > The exact value would only be within orders-of-milliseconds away from
> > now() anyway, so one can argue it's not that important, as long as the
> > documentation is clear on that point.
> 
> I don't think now() is a good value as it doesn't indicate to the user
> which values are real measurements and which are not.  NULL is probably
> the best.  +/-infinity is odd too.

Yea. If one wants to make NULL into now() it's trivial enough with a
single coalesce().

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_activity.waiting_start

2017-01-06 Thread Bruce Momjian
On Thu, Jan  5, 2017 at 06:48:17PM -1000, Joel Jacobson wrote:
> On Thu, Jan 5, 2017 at 4:59 PM, Bruce Momjian  wrote:
> > Agreed.  No need in adding overhead for short-lived locks because the
> > milli-second values are going to be meaningless to users. I would be
> > happy if we could find some weasel value for non-heavyweight locks.
> 
> To avoid a NULL value for waiting_start, and since non-heavyweight
> locks don't exceed order-of-milliseconds, I think it would be
> acceptable to just return now() whenever something wants to know
> waiting_start i.e. when something selects from pg_stat_activity.
> 
> The exact value would only be within orders-of-milliseconds away from
> now() anyway, so one can argue it's not that important, as long as the
> documentation is clear on that point.

I don't think now() is a good value as it doesn't indicate to the user
which values are real measurements and which are not.  NULL is probably
the best.  +/-infinity is odd too.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance degradation in Bitmapscan (commit 75ae538bc3168bf44475240d4e0487ee2f3bb376)

2017-01-06 Thread Andres Freund
On 2016-12-16 09:34:31 -0800, Andres Freund wrote:
> > To fix his issue, we need something like your 0001.  Are you going to
> > polish that up soon here?
>
> Yes.

I've two versions of a fix for this. One of them basically increases the
"spread" of buckets when the density goes up too much. It does so by
basically shifting the bucket number to the left (e.g. only every second
bucket can be the "primary" bucket for a hash value).  The other
basically just replaces the magic constants in my previous POC patch
with slightly better documented constants.  For me the latter works just
as well as the former, even though aesthetically/theoretically the
former sounds better.  I'm inclined to commit the latter, at least for
now.
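
(For reference, the "spreading" variant amounts to something like this -
a conceptual sketch, not the actual patch:

    /*
     * Only every second slot can be a primary bucket; the odd slots are
     * left for displaced entries, so dense regions have room to grow.
     */
    bucket = (hash & (mask >> 1)) << 1;

at the cost of effectively halving the number of primary buckets.)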

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Questionable tag usage

2017-01-06 Thread Tom Lane
Magnus Hagander  writes:
> On Thu, Jan 5, 2017 at 5:50 AM, Tom Lane  wrote:
>> Well ... that will read nicely in output formats that have hyperlinks,
>> but not so well on plain dead trees where the cross-reference is either
>> invisible or an explicit footnote.  Our typical convention for this sort
>> of thing has been more like "... file for user name mapping (see <xref linkend="auth-username-maps">)".  That used to expand like
>> ...
>> You could argue that nobody reads the PG docs on dead trees anymore
>> and we should embrace the hyperlink style with enthusiasm.  I wouldn't
>> be against that personally, but there are a lot of places to change if
>> we decide that parenthetical "(see Section M.N)" hotlinks are passé.

> I don't think there are a lot of people who use dead tree editions anymore,
> but they certainly do exist. A lot of people use the PDFs though,
> particularly for offline reading or loading them in ebook readers. So it
> still has to be workable there.

PDFs do have hyperlinks, so that in itself isn't an argument for keeping
the dead-tree-friendly approach.  However, I've noticed some variation
among tools in whether a PDF hyperlink is visibly distinct, or whether
you have to mouse over it to find out that it would take you somewhere.
Not sure if that's enough of a usability fail to argue for keeping the
old way.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread David Steele

On 1/6/17 10:15 AM, Feike Steenbergen wrote:
> On 6 January 2017 at 15:42, Magnus Hagander wrote:
>> Is there actual value in providing both %p and %f? It's not like it's
>> really hard to do, but since the path will be specified on the same
>> commandline, you could just put it in the command?
>
> As %f can be determined from %p I don't mind that much. However, having
> a single static --whatever command may be very useful for configuration
> management or backup tools that want to use static commands.
>
> The other reason why I'd offer both is to have some uniformity with
> archive_command, possibly allowing some reuse of code.

Agreed.  Currently pgBackRest only accepts %p for archive_command
because it already knows where $PGDATA is.  Of course I can change that,
but I think it makes sense to keep command options uniform anyway.


--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread Feike Steenbergen
On 6 January 2017 at 15:42, Magnus Hagander  wrote:

> Is there actual value in providing both %p and %f? It's not like it's
> really hard to do, but since the path will be specified on the same
> commandline, you could just put it in the command?

As %f can be determined from %p I don't mind that much. However, having a
single static --whatever command may be very useful for configuration
management or backup tools that want to use static commands.

The other reason why I'd offer both is to have some uniformity with
archive_command, possibly allowing some reuse of code.


Re: [HACKERS] Make pg_basebackup -x stream the default

2017-01-06 Thread Magnus Hagander
On Wed, Jan 4, 2017 at 10:43 AM, Magnus Hagander wrote:

>
>
> On Sun, Jan 1, 2017 at 12:47 AM, Michael Paquier <michael.paqu...@gmail.com> wrote:
>
>> On Sat, Dec 31, 2016 at 9:24 PM, Magnus Hagander wrote:
>> > On Tue, Dec 20, 2016 at 11:53 PM, Michael Paquier wrote:
>> >> Recovery tests are broken by this patch, the backup() method in
>> >> PostgresNode.pm uses pg_basebackup -x:
>> >> sub backup
>> >> {
>> >> my ($self, $backup_name) = @_;
>> >> my $backup_path = $self->backup_dir . '/' . $backup_name;
>> >> my $port= $self->port;
>> >> my $name= $self->name;
>> >>
>> >> print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
>> >> TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-p',
>> >> $port,
>> >> '-x', '--no-sync');
>> >> print "# Backup finished\n";
>> >> }
>> >
>> >
>> > Oh bleh. That's what I get for just running the tests for pg_basebackup
>> > itself.
>> >
>> > I removed it completely in this patch, making it use streaming. Or is
>> there
>> > a particular reason we want it to use fetch, that I'm not aware of?
>>
>> Not really. Both should be equivalent in the current tests. It may
>> actually be a good idea to add an argument in PostgresNode->backup to
>> define the WAL fetching method.
>>
>> > Attached is a new patch with this fix, and those suggested by Fujii as
>> well.
>>
>> -the backup. If this option is specified, it is possible to start
>> -a postmaster directly in the extracted directory without the need
>> -to consult the log archive, thus making this a completely
>> standalone
>> -backup.
>> +the backup. Unless the method none is
>> specified,
>> +it is possible to start a postmaster directly in the extracted
>> +directory without the need to consult the log archive, thus
>> +making this a completely standalone backup.
>> 
>> I find a bit weird to mention a method value in a paragraph before
>> listing them. Perhaps it should be a separate paragraph after the list
>> of methods available?
>>
>> -static bool includewal = false;
>> -static bool streamwal = false;
>> +static bool includewal = true;
>> +static bool streamwal = true;
>> Two booleans are used to define 3 states. It may be cleaner to use an
>> enum instead. The important point is that streaming WAL is enabled
>> only if "stream" method is used.
>>
>> Other than that the patch looks good to me. Tests pass.
>>
>
> Meh, just as I was going to respond "committed" I noticed this second
> round of review comments. Apologies, pushed without that.
>
> I agree on the change with includewal/streamwal. That was already the case
> in the existing code of course, but that doesn't mean it couldn't be made
> better. I'll take a look at doing that as a separate patch.
>
>
Here's a patch that does this. Does this match what you were thinking?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 3f83d87..8ebf24e 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -61,6 +61,16 @@ typedef struct TablespaceList
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	10
 
+/*
+ * Different ways to include WAL
+ */
+typedef enum
+{
+	NO_WAL,
+	FETCH_WAL,
+	STREAM_WAL
+} IncludeWal;
+
 /* Global options */
 static char *basedir = NULL;
 static TablespaceList tablespace_dirs = {NULL, NULL};
@@ -71,8 +81,7 @@ static bool noclean = false;
 static bool showprogress = false;
 static int	verbose = 0;
 static int	compresslevel = 0;
-static bool includewal = true;
-static bool streamwal = true;
+static IncludeWal includewal = STREAM_WAL;
 static bool fastcheckpoint = false;
 static bool writerecoveryconf = false;
 static bool do_sync = true;
@@ -1697,7 +1706,7 @@ BaseBackup(void)
 	 * If WAL streaming was requested, also check that the server is new
 	 * enough for that.
 	 */
-	if (streamwal && !CheckServerVersionForStreaming(conn))
+	if (includewal == STREAM_WAL && !CheckServerVersionForStreaming(conn))
 	{
 		/*
 		 * Error message already written in CheckServerVersionForStreaming(),
@@ -1731,9 +1740,9 @@ BaseBackup(void)
 		psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
  escaped_label,
  showprogress ? "PROGRESS" : "",
- includewal && !streamwal ? "WAL" : "",
+ includewal == FETCH_WAL ? "WAL" : "",
  fastcheckpoint ? "FAST" : "",
- includewal ? "NOWAIT" : "",
+ includewal == NO_WAL ? "" : "NOWAIT",
  maxrate_clause ? maxrate_clause : "",
  format == 't' ? "TABLESPACE_MAP" : "");
 
@@ -1776,7 +1785,7 @@ BaseBackup(void)
 	PQclear(res);
 	MemSet(xlogend, 0, sizeof(xlogend));
 
-	if (verbose && includewal)
+	if (verbose && includewal != NO_WAL)
 		fprintf(stderr, _("transaction log start point: %s on timeline 

Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

2017-01-06 Thread Jonathon Nelson
On Fri, Jan 6, 2017 at 8:52 AM, Kevin Grittner  wrote:

> On Thu, Jan 5, 2017 at 7:32 PM, Jonathon Nelson  wrote:
> > On Thu, Jan 5, 2017 at 1:01 PM, Andres Freund wrote:
> >> On 2017-01-05 12:55:44 -0600, Jonathon Nelson wrote:
>
> >>> In our lab environment and with a 16MiB setting, we saw substantially
> >>> better network utilization (almost 2x!), primarily over high bandwidth
> >>> delay product links.
> >>
> >> That's a bit odd - shouldn't the OS network stack take care of this in
> >> both cases?  I mean either is too big for TCP packets (including jumbo
> >> frames).  What type of OS and network is involved here?
> >
> > In our test lab, we make use of multiple flavors of Linux. No jumbo
> > frames. We simulated anything from 0 to 160ms RTT (with varying
> > degrees of jitter, packet loss, etc.) using tc. Even with everything
> > fairly clean, at 80ms RTT there was a 2x improvement in performance.
>
> Is there compression and/or encryption being performed by the
> network layers?  My experience with both is that they run faster on
> bigger chunks of data, and that might happen before the data is
> broken into packets.
>

There is no compression or encryption. The testing was with and without
various forms of hardware offload, etc. but otherwise there is no magic up
these sleeves.
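
(For reference, guc-ifying the constant would amount to an entry along
these lines in guc.c's ConfigureNamesInt - a sketch; the name, bounds,
and context shown here are illustrative only:

	{
		{"max_wal_send", PGC_SIGHUP, REPLICATION_SENDING,
			gettext_noop("Sets the maximum amount of WAL read and sent "
						 "in a single walsender cycle."),
			NULL,
			GUC_UNIT_KB
		},
		&max_wal_send_kb,
		128, 8, MAX_KILOBYTES,
		NULL, NULL, NULL
	},

with walsender.c reading max_wal_send_kb instead of MAX_SEND_SIZE.)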

-- 
Jon Nelson
Dyn / Principal Software Engineer


Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

2017-01-06 Thread Kevin Grittner
On Thu, Jan 5, 2017 at 7:32 PM, Jonathon Nelson  wrote:
> On Thu, Jan 5, 2017 at 1:01 PM, Andres Freund  wrote:
>> On 2017-01-05 12:55:44 -0600, Jonathon Nelson wrote:

>>> In our lab environment and with a 16MiB setting, we saw substantially
>>> better network utilization (almost 2x!), primarily over high bandwidth
>>> delay product links.
>>
>> That's a bit odd - shouldn't the OS network stack take care of this in
>> both cases?  I mean either is too big for TCP packets (including jumbo
>> frames).  What type of OS and network is involved here?
>
> In our test lab, we make use of multiple flavors of Linux. No jumbo frames.
> We simulated anything from 0 to 160ms RTT (with varying degrees of jitter,
> packet loss, etc.) using tc. Even with everything fairly clean, at 80ms RTT
> there was a 2x improvement in performance.

Is there compression and/or encryption being performed by the
network layers?  My experience with both is that they run faster on
bigger chunks of data, and that might happen before the data is
broken into packets.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread Magnus Hagander
On Fri, Jan 6, 2017 at 2:55 PM, David Steele  wrote:

> On 1/6/17 8:49 AM, Feike Steenbergen wrote:
>> On Fri, Jan 6, 2017 at 2:30 PM, David Steele wrote:
>>> For my part I still prefer an actual command to be executed so it will
>>> start/restart the archiver if it is not already running or died.  This
>>> reduces the number of processes that I need to ensure are running.
>>>
>>> If the consensus is that a signal is better then I'll make that work.
>>> I will say this raises the bar on what is required to write a good
>>> archive command and we already know it is quite a difficult task.
>>
>> On 6 January 2017 at 14:37, Magnus Hagander wrote:
>>> I like the idea of a command as well, for flexibility. If you want a
>>> signal, you can write a trivial command that sends the signal... Maximum
>>> flexibility, as long as we don't create a lot of caveats for users.
>>
>> Agreed, I think it is also easier to understand the mechanism (instead
>> of a signal), and would allow for some reuse of already existing scripts.
>>
>> If we do use a full command (vs a signal), I propose we do also offer
>> the %p and %f placeholders for the command.
>
> Agreed.  It shouldn't be that hard and could be very useful.  If nothing
> else it will eliminate the need to configure path to the pg_receivexlog
> queue in the archiver.


Is there actual value in providing both %p and %f? It's not like it's
really hard to do, but since the path will be specified on the same
commandline, you could just put it in the command?

E.g.
pg_receivexlog -D /blah/foo --whatever /some/where/myscript %p
if you want the path in it could just be written
pg_receivexlog -D /blah/foo --whatever /some/where/myscript /blah/foo/%f

And you have that path in the *same place* already? (unless you want to
teach the script about it, in which case you want just %f anyway)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Support for pg_receivexlog --format=plain|tar

2017-01-06 Thread David Steele

On 1/6/17 9:07 AM, Magnus Hagander wrote:

On Fri, Dec 30, 2016 at 6:41 AM, Michael Paquier
> wrote:
Attached is a simplified new version, I have kept the file descriptor
as originally done. Note that tests are actually difficult to work
out, there is no way to run in batch pg_receivexlog..


A few further notes:

You are using the filemode to gzopen and the mode_compression variable
to set the compression level. The pre-existing code in pg_basebackup
uses gzsetparams(). Is there a particular reason you didn't do it the
same way?

Small comment:
-   if (pad_to_size)
+   if (pad_to_size && dir_data->compression == 0)
{
/* Always pre-pad on regular files */


That "always" is not true anymore. Commit-time cleanup can be done of that.

The rest of this looks good to me, but please comment on the gzopen part
before we proceed to commit :)


I had planned to review this patch but have removed my name since it 
seems to be well in hand and likely to commit very soon and I won't have 
time to look at it until next week.


I will say that I'm happy to have this feature *and* eventually the 
post_command.


--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use strict in all Perl programs

2017-01-06 Thread Michael Paquier
On Fri, Jan 6, 2017 at 11:13 PM, David Steele  wrote:
> With regard to warnings, I prefer to use:
>
> use warnings FATAL => qw(all);
>
> This transforms all warnings into errors rather than just printing a message
> to stderr, which is very easy to miss among the other output.

Interesting. A couple of warnings have slipped a couple of times in
some TAP tests like those of pg_rewind, so it could be useful to
switch to that at least for the tests by default.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use strict in all Perl programs

2017-01-06 Thread David Steele

On 1/5/17 12:37 PM, Peter Eisentraut wrote:
> On 12/31/16 1:34 AM, Michael Paquier wrote:
>> On Sat, Dec 31, 2016 at 3:07 PM, Peter Eisentraut wrote:
>>> Here is a patch to add 'use strict' to all Perl programs (that I could
>>> find), or move it to the right place where it was already there.  I
>>> think that is a pretty standard thing to do nowadays.
>
> committed that
>
>> What about adding as well "use warnings"? That's standard in all the TAP tests.
>
> 'use strict' can be statically checked using perl -c, but 'use warnings'
> is run-time behavior, so one would have to extensively test the involved
> programs.  Some cursory checking already reveals that this is going to
> need more investigation.  So in principle yes, but maybe later.


With regard to warnings, I prefer to use:

use warnings FATAL => qw(all);

This transforms all warnings into errors rather than just printing a 
message to stderr, which is very easy to miss among the other output.


--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNDO and in-place update

2017-01-06 Thread Robert Haas
On Fri, Jan 6, 2017 at 6:28 AM, Amit Kapila  wrote:
>> Also, I'm thinking the bit could be stored in the line pointer rather
>> than the tuple, because with this design we don't need
>> LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
>> bit to indicate dead or not-dead and the second bit to indicate
>> recently-modified or not-recently-modified.  With that approach,
>> clearing the bits only requires iterating over the line pointer array,
>> not the tuples themselves.
>
> I think this can help in mitigating the overhead.  However, now
> another question is if we just unset it when there is no other active
> transaction operating on the current page except for current
> transaction, then will that tuple be considered all-visible?  I think
> the absence of transactions operating on a page can't be taken as a
> guarantee for the tuple to be marked as all-visible. If that is true,
> then what is the advantage of clearing the bit?

That's kind of a strange question.  The mission of this proposed bit
is to tell you whether the transaction(s) referenced in the page's
UNDO pointer have modified that tuple.  If you didn't clear it when
you reset the UNDO pointer to a new transaction, then the bit would
always be set for every tuple on the page and it would be completely
useless.  (I mean, the tuples had to be inserted originally, right?
So the bit was set then.  And if you never clear it, it will just stay
set.)

As to your other questions: If the page's UNDO pointer isn't valid,
then the whole page is all-visible.  If the page's UNDO pointer is
valid, then any tuple for which the bit is not set is all-visible.
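
To make that concrete: the two lp_flags bits in ItemIdData, which today
encode LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD, would be reinterpreted
roughly like this (a sketch of the idea, not a worked-out proposal):

	typedef struct ItemIdData
	{
		unsigned	lp_off:15,	/* offset to tuple (from start of page) */
					lp_flags:2,	/* bit 1: dead?  bit 2: modified by the
								 * transaction(s) in the page's UNDO
								 * pointer? */
					lp_len:15;	/* byte length of tuple */
	} ItemIdData;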

> Okay, I think we could clean undo and heap without caring for the
> index.  Basically, the idea works on the premise that we won't allow
> same value rows in the index for same TID and using rechecks we can
> identify the deleted tuple.

Right, seems we are on the same page now.  Just to be clear, I'm not
saying there aren't other possible designs, and one of those designs
might be better.  But at least now we both seem to have the same
understanding of what I proposed, which seems like progress.

> On rethinking about the working of vacuum
> in this system,  it seems we can clean the heap independently and then
> for index we crosscheck each of the entry in the heap that is marked
> as deleted.

Right.

> Now this has the advantage that we don't need to do two
> passes of the heap, but for the index, we might need to re-fetch heap
> pages randomly to detect if the delete marked row is actually deleted.

Yes.  There are some performance trade-offs here no matter what
decision you make.  If you decide to try to remove delete-marked index
entries during UNDO, then (1) you have to somehow make sure that those
index entries aren't still needed by some previous transaction that
isn't yet all-visible (e.g. imagine key is updated 1->2->1, first
update commits but second one rolls back before first is all-visible)
which might be expensive and (2) if those index pages have been
evicted from cache you will incur additional I/O to bring them back
into cache, which might be more expensive than just cleaning them
later.  On the other hand, if you don't try to remove index pointers
during UNDO, then you're leaving bloat in your index.  We might want
to consider some combination of these things - e.g. if the page has
only one recent transaction (no TPD) and the index pages are still in
memory, have UNDO try to remove them, otherwise clean them later.  Or
some other rule.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for pg_receivexlog --format=plain|tar

2017-01-06 Thread Magnus Hagander
On Fri, Dec 30, 2016 at 6:41 AM, Michael Paquier wrote:

> On Thu, Dec 29, 2016 at 6:13 PM, Magnus Hagander wrote:
> > On Thu, Dec 29, 2016 at 12:35 AM, Michael Paquier wrote:
> >> On Wed, Dec 28, 2016 at 9:31 PM, Magnus Hagander wrote:
> >> >> - I have switched the directory method to use a file pointer instead
> >> >> of a file descriptor as gzwrite returns int as the number of
> >> >> uncompressed bytes written.
> >> >
> >> > I don't really follow that reasoning :) Why does the directory method
> >> > have
> >> > to change to use a filepointer because of that?
> >>
> >> The only reason is that write() returns size_t and fwrite returns int,
> >> while gzwrite() returns int. It seems more consistent to use fwrite()
> >> in this case. Or we don't bother about my nitpicking and just cast
> >> stuff.
> >
> >
> > I can at least partially see that argument, but your patch doesn't
> actually
> > use fwrite(), it uses write() with fileno()...
>
> That was part of the one/two things I wanted to change before sending
> a fresh patch.
>

OK.



> > But also, on my platform (debian jessie), fwrite() returns size_t, and
> > write() returns ssize_t. So those are apparently both different from what
> > your platform does - which one did you get that from?
>
> It looks like I misread the macos man pages previously. They actually
> list ssize_t. I find the way gzwrite is designed a bit surprising. It
> takes as input an unsigned integer and returns to the caller a signed
> integer, so this will never work with uncompressed buffers of sizes
> higher than 2GB. There's little point in worrying about that in
> pg_receivexlog though, so let's just cast to ssize_t.
>
> Attached is a simplified new version, I have kept the file descriptor
> as originally done. Note that tests are actually difficult to work
> out, there is no way to run in batch pg_receivexlog..
>

A few further notes:

You are using the filemode to gzopen and the mode_compression variable to
set the compression level. The pre-existing code in pg_basebackup uses
gzsetparams(). Is there a particular reason you didn't do it the same way?
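
(For reference, the two zlib idioms being compared - both are standard
zlib calls; level 5 is just an example:

	/* (a) encode the compression level in the mode string at open time */
	gzFile		fa = gzopen(path, "wb5");

	/* (b) open first, then set the level separately */
	gzFile		fb = gzopen(path, "wb");

	gzsetparams(fb, compresslevel, Z_DEFAULT_STRATEGY);
)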

Small comment:
-   if (pad_to_size)
+   if (pad_to_size && dir_data->compression == 0)
{
/* Always pre-pad on regular files */


That "always" is not true anymore. Commit-time cleanup can be done of that.

The rest of this looks good to me, but please comment on the gzopen part
before we proceed to commit :)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Questionable tag usage

2017-01-06 Thread Magnus Hagander
On Thu, Jan 5, 2017 at 5:50 AM, Tom Lane  wrote:

> Tatsuo Ishii  writes:
> > In:
> > https://www.postgresql.org/docs/devel/static/runtime-config-file-locations.html
> > "Specifies the configuration file for Section 20.2, “User Name
> > Maps”
> > user name mapping" looks pretty strange to me because a raw section
> > name appears.
>
> Yeah, it's definitely duplicative.  It was less so before the recent
> docs-toolchain changeover, because in the old toolchain the <xref>
> tag only printed "Section M.N" and didn't include a section title;
> see the same page in prior versions.  I'm sure the markup was written
> with that in mind.  Not that that makes it good style necessarily.
>
> > Shouldn't we use a link tag instead of the xref tag here? Attached is
> > a patch to fix this.
>
> - Specifies the configuration file for
> -  <xref linkend="auth-username-maps"> user name mapping
> - (customarily called <filename>pg_ident.conf</filename>).
> - This parameter can only be set at server start.
> + Specifies the configuration file
> + for <link linkend="auth-username-maps">user name mapping</link>
> + (customarily called <filename>pg_ident.conf</filename>).  This parameter can
> + only be set at server start.
>
> Well ... that will read nicely in output formats that have hyperlinks,
> but not so well on plain dead trees where the cross-reference is either
> invisible or an explicit footnote.  Our typical convention for this sort
> of thing has been more like "... file for user name mapping (see <xref linkend="auth-username-maps">)".  That used to expand like
>
> file for user name mapping (see Section 20.2).
>
> and now it expands like
>
> file for user name mapping (see Section 20.2, "User Name Mapping").
>
> In either case the text after "see" is a hotlink if supported.
>
> I complained previously that this seems a bit duplicative now,
> but didn't get any responses:
> https://www.postgresql.org/message-id/31278.1479587695%40sss.pgh.pa.us
>
> You could argue that nobody reads the PG docs on dead trees anymore
> and we should embrace the hyperlink style with enthusiasm.  I wouldn't
> be against that personally, but there are a lot of places to change if
> we decide that parenthetical "(see Section M.N)" hotlinks are passé.
>

I don't think there are a lot of people who use dead tree editions anymore,
but they certainly do exist. A lot of people use the PDFs though,
particularly for offline reading or loading them in ebook readers. So it
still has to be workable there.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread David Steele

On 1/6/17 8:49 AM, Feike Steenbergen wrote:
> On Fri, Jan 6, 2017 at 2:30 PM, David Steele wrote:
>> For my part I still prefer an actual command to be executed so it will
>> start/restart the archiver if it is not already running or died.  This
>> reduces the number of processes that I need to ensure are running.
>>
>> If the consensus is that a signal is better then I'll make that work.
>> I will say this raises the bar on what is required to write a good
>> archive command and we already know it is quite a difficult task.
>
> On 6 January 2017 at 14:37, Magnus Hagander wrote:
>> I like the idea of a command as well, for flexibility. If you want a
>> signal, you can write a trivial command that sends the signal... Maximum
>> flexibility, as long as we don't create a lot of caveats for users.
>
> Agreed, I think it is also easier to understand the mechanism (instead
> of a signal), and would allow for some reuse of already existing scripts.
>
> If we do use a full command (vs a signal), I propose we do also offer
> the %p and %f placeholders for the command.

Agreed.  It shouldn't be that hard and could be very useful.  If nothing
else it will eliminate the need to configure path to the pg_receivexlog
queue in the archiver.


--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread Feike Steenbergen
On Fri, Jan 6, 2017 at 2:30 PM, David Steele  wrote:
> For my part I still prefer an actual command to be executed so it will
> start/restart the archiver if it is not already running or died.  This
> reduces the number of processes that I need to ensure are running.
>
> If the consensus is that a signal is better then I'll make that work.  I
> will say this raises the bar on what is required to write a good archive
> command and we already know it is quite a difficult task.

On 6 January 2017 at 14:37, Magnus Hagander  wrote:
> I like the idea of a command as well, for flexibility. If you want a
> signal, you can write a trivial command that sends the signal... Maximum
> flexibility, as long as we don't create a lot of caveats for users.

Agreed, I think it is also easier to understand the mechanism (instead of a
signal), and would allow for some reuse of already existing scripts.

If we do use a full command (vs a signal), I propose we do also offer the
%p and %f placeholders for the command.
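
To illustrate, the substitution itself is simple enough (a sketch using
PQExpBuffer, since pg_receivexlog is frontend code; the function name is
made up):

static char *
expand_post_segment_command(const char *cmd, const char *path,
							const char *fname)
{
	PQExpBuffer buf = createPQExpBuffer();
	const char *p;

	for (p = cmd; *p; p++)
	{
		if (p[0] == '%' && p[1] == 'p')
		{
			appendPQExpBufferStr(buf, path);	/* full path to segment */
			p++;
		}
		else if (p[0] == '%' && p[1] == 'f')
		{
			appendPQExpBufferStr(buf, fname);	/* file name only */
			p++;
		}
		else
			appendPQExpBufferChar(buf, *p);
	}
	return buf->data;	/* caller frees; the struct itself is leaked in this sketch */
}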


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread Magnus Hagander
On Fri, Jan 6, 2017 at 2:30 PM, David Steele  wrote:

> On 1/6/17 8:09 AM, Feike Steenbergen wrote:
>> On 6 January 2017 at 13:50, Magnus Hagander wrote:
>>> I think we're better off clearly documenting that we don't care about
>>> it. And basically let the external command be responsible for that part.
>>>
>>> So for example, your typical backup manager would listen to this
>>> signal or whatever to react quickly. But it would *also* have some sort
>>> of fallback. For example, whenever it's triggered it also checks if
>>> there are any previous segments it missed (this would also cover the
>>> startup sequence).
>
> I'm fine with the backup manager doing all the work of keeping track of
> what has been compressed, moved to archive, etc.  No need to reinvent the
> wheel here.
>
> For my part I still prefer an actual command to be executed so it will
> start/restart the archiver if it is not already running or died.  This
> reduces the number of processes that I need to ensure are running.
>
> If the consensus is that a signal is better then I'll make that work.  I
> will say this raises the bar on what is required to write a good archive
> command and we already know it is quite a difficult task.


I like the idea of a command as well, for flexibility. If you want a
signal, you can write a trivial command that sends the signal... Maximum
flexibility, as long as we don't create a lot of caveats for users.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread David Steele

On 1/6/17 8:09 AM, Feike Steenbergen wrote:
> On 6 January 2017 at 13:50, Magnus Hagander wrote:
>> I think we're better off clearly documenting that we don't care about
>> it. And basically let the external command be responsible for that part.
>>
>> So for example, your typical backup manager would listen to this
>> signal or whatever to react quickly. But it would *also* have some sort
>> of fallback. For example, whenever it's triggered it also checks if
>> there are any previous segments it missed (this would also cover the
>> startup sequence).

I'm fine with the backup manager doing all the work of keeping track of
what has been compressed, moved to archive, etc.  No need to reinvent
the wheel here.

For my part I still prefer an actual command to be executed so it will
start/restart the archiver if it is not already running or died.  This
reduces the number of processes that I need to ensure are running.

If the consensus is that a signal is better then I'll make that work.  I
will say this raises the bar on what is required to write a good archive
command and we already know it is quite a difficult task.

> For me this works fine. I just want to ensure that if there is any work
> to be done, my backup manager will do the work quickly. My
> implementation might be very simply a process that checks every n
> seconds or when signalled.
>
>> Since we never actually remove anything (unlike archive_command which
>> has the integration with wal segment rotation), I think this can be done
>> perfectly safe.
>>
>> Looking at the usecases where you have been doing it, are there any
>> where this would not work?
>
> This would work for all usecases I've come across.

Agreed.

--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.

2017-01-06 Thread Ashutosh Sharma
Hi,

> I think the calculation for max available space is wrong here. You
> should subtract the page header and special area from the total page size.
> A check for non-zero denominator should be added while calculating the
> percentage.
> There can be multiple bitmap pages. Right?

Yes, we can have multiple bitmap pages. I have adjusted the
calculation for free space percentage accordingly. Please check the
attached v2 patch.
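
For the archives, the corrected computation is along these lines (a
sketch; variable names differ from the actual patch):

	/* usable space per page excludes the header and the special area */
	Size		usable = BLCKSZ - SizeOfPageHeaderData -
						 MAXALIGN(sizeof(HashPageOpaqueData));
	double		free_percent = 0.0;

	if (total_pages > 0)
		free_percent = 100.0 * free_space /
					   ((double) usable * total_pages);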

>
> +values[10] = Float8GetDatum(free_percent);
> Some precision should be added.

Corrected. Please refer v2 patch.

>
> +   
> +ffactor
> +integer
> +Average number of tuples per bucket
> +   
> I feel that either the column name should be changed or it should
> just output how full the index method is in percentage.

Fixed. Refer to v2 patch.

>
> +   
> +
> +
> +   
> Please remove extra spaces.

Done. Please refer to v2 patch.

>
> And, please add some test cases for regression tests.
>
Added a test-case. Please check v2 patch attached with this mail.

--
With Regards,
Ashutosh Sharma.
EnterpriseDB: http://www.enterprisedb.com


0001-Add-pgstathashindex-to-pgstattuple-extension-v2.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread Feike Steenbergen
On 6 January 2017 at 13:50, Magnus Hagander  wrote:
> I think we're better off clearly documenting that we don't care about it.
> And basically let the external command be responsible for that part.
>
> So for example, your typical backup manager would listen to this signal
> or whatever to react quickly. But it would *also* have some sort of
> fallback. For example, whenever it's triggered it also checks if there are
> any previous segments it missed (this would also cover the startup
> sequence).

For me this works fine. I just want to ensure that if there is any work to
be done, my backup manager will do the work quickly. My implementation
might be very simply a process that checks every n seconds or when
signalled.

> Since we never actually remove anything (unlike archive_command which has
> the integration with wal segment rotation), I think this can be done
> perfectly safe.
>
> Looking at the usecases where you have been doing it, are there any where
> this would not work?

This would work for all usecases I've come across.


Re: [HACKERS] postgres_fdw bug in 9.6

2017-01-06 Thread Etsuro Fujita

On 2017/01/05 12:10, Etsuro Fujita wrote:
> On 2016/12/28 17:34, Ashutosh Bapat wrote:
>> Hmm. If I understand the patch correctly, it does not return any path
>> when merge join is allowed and there are merge clauses but no hash
>> clauses. In this case we will not create a foreign join path, losing
>> some optimization. If we remove GetExistingLocalJoinPath, which
>> returns a path in those cases as well, we have a regression in
>> performance.
>
> Ok, will revise, but as I mentioned upthread, I'm not sure it's a good
> idea to search the pathlist to get a merge join even in this case.  I'd
> vote for creating a merge join path from the inner/outer paths in this
> case as well.


Done.  Attached is the new version of the patch.

* I'm still not sure the search approach is the right way to go, so I 
modified CreateLocalJoinPath so that it creates a mergejoin path that 
explicitly sorts both the outer and inner relations as in 
sort_inner_and_outer, by using the information saved in that function. 
I think we could try to create a sort-free mergejoin as in 
match_unsorted_outer, but I'm not sure it's worth complicating the code.
* I modified CreateLocalJoinPath so that it handles the cheapest-total 
paths for the outer/inner relations that are parameterized if possible.

* I adjusted the code and revised some comments.

As I said upthread, we could skip costing for a local join path in 
CreateLocalJoinPath, for efficiency, but I'm not sure we should do that.


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 1519,1540  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST FOR UPDATE OF r1
!->  Merge Join
   Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!  Merge Cond: (t1.c1 = t2.c1)
!  ->  Sort
 Output: t1.c1, t1.c3, t1.*
!Sort Key: t1.c1
!->  Foreign Scan on public.ft1 t1
!  Output: t1.c1, t1.c3, t1.*
!  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!  ->  Sort
 Output: t2.c1, t2.*
!Sort Key: t2.c1
!->  Foreign Scan on public.ft2 t2
!  Output: t2.c1, t2.*
!  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1"
! (23 rows)
  
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR UPDATE OF t1;
   c1  | c1  
--- 1519,1534 
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST FOR UPDATE OF r1
!->  Nested Loop
   Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!  Join Filter: (t1.c1 = t2.c1)
!  ->  Foreign Scan on public.ft1 t1
 Output: t1.c1, t1.c3, t1.*
!Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!  ->  Foreign Scan on public.ft2 t2
 Output: t2.c1, t2.*
!Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1"
! (17 rows)
  
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR UPDATE OF t1;
   c1  | c1  
***
*** 1563,1584  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, 

Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2017-01-06 Thread Etsuro Fujita

On 2017/01/06 21:25, Ashutosh Bapat wrote:

On Thu, Jan 5, 2017 at 2:49 PM, Etsuro Fujita
 wrote:

On 2017/01/03 15:57, Ashutosh Bapat wrote:

The patch looks good to me, but I feel there are too many test cases.
Now that we have changed the approach to invalidate caches in all
cases, should we just include cases for SELECT or UPDATE or INSERT or
DELETE instead of each statement?



I don't object to that, but (1) the tests I added wouldn't be that
time-consuming, and (2) they would be more likely to help find bugs in
general, so I'd vote for keeping them.  How about leaving that to the
committer's judgment?



Ok. Marking this as ready for committer.


Thanks!

Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread Magnus Hagander
On Fri, Jan 6, 2017 at 1:45 PM, Feike Steenbergen <
feikesteenber...@gmail.com> wrote:

> Hi all,
>
> When reading through "Support for pg_receivexlog --format=plain|tar"[1], I
> came across a notion from Magnus Hagander that has crossed my mind a few
> times as well in the past years. As the feature proposed here is not
> directly related to that thread, I thought it best to start a new thread to
> discuss.
>
> > I have been talking to David about it a couple of times, and he agreed
> that it'd be useful to have a post-segment command. We haven't discussed it
> in much detail though. I'll add him to direct-cc here to see if he has any
> further input :)
>
> I'm coming across a few use cases where this would seem very useful. I'm
> looking to push finished segments to some API as soon as possible. Having
> the post-segment command would allow me to get there. I've tried the
> following approaches, but none of them is very satisfying:
>
> - periodic checking of new files (cron)
> - using inotify
> - tailing verbose pg_receivexlog output to see when a segment was switched
>

Yeah, these are all kind of ugly. Or platform specific. Or both.

You can make them work, but they feel hackish.
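
For reference, here is roughly what the inotify variant looks like on Linux
(the directory path is a placeholder; it relies on pg_receivexlog renaming
the .partial file to its final name when a segment completes):

#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

int
main(void)
{
    char    buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    int     fd = inotify_init();
    ssize_t len;

    if (fd < 0)
        return 1;
    /* a completed segment shows up as a rename() into its final name */
    inotify_add_watch(fd, "/var/lib/postgresql/received_wal", IN_MOVED_TO);

    while ((len = read(fd, buf, sizeof(buf))) > 0)
    {
        char   *p = buf;

        while (p < buf + len)
        {
            struct inotify_event *ev = (struct inotify_event *) p;

            if (ev->len > 0 && strstr(ev->name, ".partial") == NULL)
                printf("segment finished: %s\n", ev->name); /* trigger upload */
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    return 0;
}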


> > It could be that the best idea is to just notify some other process of
> what's happening. But making it an external command would give that a lot
> of flexibility. Of course, we need to be careful not to put ourselves back
> in the position we are in with archive_command, in that it's very difficult
> to write a good one.
>
> A signal for  would be good enough for my use case, I don't necessarily
> need all the bookkeeping to ensure the post-segment command was finished
> successfully. However I can see people expecting similar behaviour for the
> post-segment command as for the archive_command. If we would use an
> external command, does this also imply that we need some bookkeeping around
> which segments are ready and done, similar to what is done on the server in
> the archive_status directory?
>

I'd really like to avoid having to re-implement that functionality in
pg_receivexlog. Given the experiences we have with how difficult it is to
get right in archive_command...

I think we're better off clearly documenting that we don't care about it.
And basically let the external command be responsible for that part.

So for example, your typical backup manager would listen to this signal or
whatever to react quickly. But it would *also* have some sort of fallback.
For example, whenever it's triggered it also checks if there are any
previous segments it missed (this would also cover the startup sequence).

Since we never actually remove anything (unlike archive_command which has
the integration with wal segment rotation), I think this can be done
perfectly safe.

Looking at the use cases where you have been doing it, are there any where
this would not work?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] [COMMITTERS] pgsql: Fix possible crash reading pg_stat_activity.

2017-01-06 Thread Robert Haas
On Thu, Jan 5, 2017 at 5:07 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Jan 5, 2017 at 4:33 PM, Tom Lane  wrote:
>>> Better documentation seems required, but really the whole design seems
>>> rather wacko.  Backends must agree on numeric tranche IDs, but every
>>> backend has its own copy of the tranche name?  How do we even know what
>>> the agreement is?  And everyone has to "register" every tranche ID for
>>> itself?  Why in the world isn't registration done *once* and the tranche
>>> name stored in shared memory?
>
>> Well, with the original design, that wasn't feasible, because each
>> backend had to store not only the name (which was constant across all
>> backends) but also the array_base (which might not be, if the locks
>> were stored in DSM) and array_stride.  Since commit
>> 3761fe3c20bb040b15f0e8da58d824631da00caa, it would be much easier to
>> do what you're proposing.  It's still not without difficulties,
>> though.  There are 65,536 possible tranche IDs, and allocating an
>> array of 64k pointers in shared memory would consume half a megabyte
>> of shared memory the vast majority of which would normally be
>> completely unused.  So I'm not very enthused about that solution, but
>> you aren't the first person to propose it.
>
> So, um, how do we know that backend A and backend B have the same idea
> about what tranche id 37 means?

Well, if they just call exposed C functions at random with arguments
picked out of a hat, then we don't.  But if they use the APIs in the
manner documented in lwlock.h, then we do:

/*
 * There is another, more flexible method of obtaining lwlocks. First, call
 * LWLockNewTrancheId just once to obtain a tranche ID; this allocates from
 * a shared counter.  Next, each individual process using the tranche should
 * call LWLockRegisterTranche() to associate that tranche ID with a name.
 * Finally, LWLockInitialize should be called just once per lwlock, passing
 * the tranche ID as an argument.
 *
 * It may seem strange that each process using the tranche must register it
 * separately, but dynamic shared memory segments aren't guaranteed to be
 * mapped at the same address in all coordinating backends, so storing the
 * registration in the main shared memory segment wouldn't work for that case.
 */
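
To make that concrete, a minimal sketch of the documented pattern (using
the post-3761fe3c20bb signature; the structure and all names are invented
for illustration):

#include "postgres.h"
#include "storage/lwlock.h"

typedef struct MyExtensionShared
{
    int     lock_tranche_id;    /* shared, so all backends see the same ID */
    LWLock  lock;
} MyExtensionShared;

void
my_extension_shmem_init(MyExtensionShared *shared, bool first_time)
{
    if (first_time)
    {
        /* done once, system-wide: allocate an ID from the shared counter */
        shared->lock_tranche_id = LWLockNewTrancheId();
        LWLockInitialize(&shared->lock, shared->lock_tranche_id);
    }

    /* done in every backend that attaches: map the ID to a name locally */
    LWLockRegisterTranche(shared->lock_tranche_id, "my_extension");
}

After this, any backend can do LWLockAcquire(&shared->lock, LW_EXCLUSIVE)
and LWLockRelease(&shared->lock), and the wait_event shown in
pg_stat_activity can resolve the tranche ID back to "my_extension" in any
backend that has registered it.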

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Support for pg_receivexlog --post-segment command

2017-01-06 Thread Feike Steenbergen
Hi all,

When reading through "Support for pg_receivexlog --format=plain|tar"[1], I
came across a notion from Magnus Hagander that has crossed my mind a few
times as well in the past years. As the feature proposed here is not
directly related to that thread, I thought it best to start a new thread to
discuss.

> I have been talking to David about it a couple of times, and he agreed
that it'd be useful to have a post-segment command. We haven't discussed it
in much detail though. I'll add him to direct-cc here to see if he has any
further input :)

I'm coming across a few use cases where this would seem very useful. I'm
looking to push finished segments to some API as soon as possible. Having
the post-segment command would allow me to get there. I've tried the
following approaches, but none of them is very satisfying:

- periodic checking of new files (cron)
- using inotify
- tailing verbose pg_receivexlog output to see when a segment was switched

> It could be that the best idea is to just notify some other process of
what's happening. But making it an external command would give that a lot
of flexibility. Of course, we need to be careful not to put ourselves back
in the position we are in with archive_command, in that it's very difficult
to write a good one.

A signal for  would be good enough for my use case, I don't necessarily
need all the bookkeeping to ensure the post-segment command was finished
successfully. However I can see people expecting similar behaviour for the
post-segment command as for the archive_command. If we would use an
external command, does this also imply that we need some bookkeeping around
which segments are ready and done, similar to what is done on the server in
the archive_status directory?

Thanks,

Feike

[1]
https://www.postgresql.org/message-id/CAB7nPqTEVXjtH+fehcCbP791H71cfLN_p9rrd-h=ymjfshz...@mail.gmail.com


Re: [HACKERS] Potential data loss of 2PC files

2017-01-06 Thread Ashutosh Bapat
Marking this as ready for committer.

On Wed, Jan 4, 2017 at 12:16 PM, Michael Paquier
 wrote:
> On Wed, Jan 4, 2017 at 1:23 PM, Ashutosh Bapat
>  wrote:
>> I don't have anything more to review in this patch. I will leave that
>> commitfest entry in "needs review" status for few days in case anyone
>> else wants to review it. If none is going to review it, we can mark it
>> as "ready for committer".
>
> Thanks for the input!
> --
> Michael



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2017-01-06 Thread Ashutosh Bapat
On Thu, Jan 5, 2017 at 2:49 PM, Etsuro Fujita
 wrote:
> On 2017/01/03 15:57, Ashutosh Bapat wrote:
>>
>> The patch looks good to me, but I feel there are too many test cases.
>> Now that we have changed the approach to invalidate caches in all
>> cases, should we just include cases for SELECT or UPDATE or INSERT or
>> DELETE instead of each statement?
>
>
> I don't object to that, but (1) the tests I added wouldn't be that
> time-consuming, and (2) they would be more likely to help find bugs in
> general, so I'd vote for keeping them.  How about leaving that to the
> committer's judgment?
>

Ok. Marking this as ready for committer.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Odd behavior with PG_TRY

2017-01-06 Thread Michael Paquier
On Thu, Jan 5, 2017 at 7:10 PM, Amit Kapila  wrote:
> Memory contexts used in the catch block also don't seem to be marked as
> volatile; you might want to try marking them as volatile.  Also, it
> might be worth trying it on some other system to see if you are by any
> chance hitting a problem similar to what Tom has faced.

If a variable is modified within PG_TRY and then referenced in
PG_CATCH, it needs to be marked as volatile to strictly conform to the
setjmp() rules in POSIX and the C standard. This also prevents the
compiler from applying optimizations to those variables that would be
unsafe given how they are referenced and used after a longjmp().
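
For example (a minimal sketch; the work inside PG_TRY is a placeholder):

#include "postgres.h"
#include "utils/memutils.h"

static void
do_guarded_work(void)
{
    MemoryContext volatile tmpctx = NULL;   /* volatile: read in PG_CATCH */

    PG_TRY();
    {
        tmpctx = AllocSetContextCreate(CurrentMemoryContext,
                                       "tmp work context",
                                       ALLOCSET_DEFAULT_SIZES);
        /* ... work that may elog(ERROR) ... */
        MemoryContextDelete(tmpctx);
        tmpctx = NULL;
    }
    PG_CATCH();
    {
        /* without volatile, tmpctx could hold a stale value here */
        if (tmpctx != NULL)
            MemoryContextDelete(tmpctx);
        PG_RE_THROW();
    }
    PG_END_TRY();
}
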
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-01-06 Thread Ashutosh Bapat
On Fri, Dec 23, 2016 at 10:51 AM, Amit Khandekar  wrote:
> Currently an Append plan node does not execute its subplans in
> parallel. There is no distribution of workers across its subplans. The
> second subplan starts running only after the first subplan finishes,
> although the individual subplans may be running parallel scans.
>
> Secondly, we create a partial Append path for an appendrel, but we do
> that only if all of its member subpaths are partial paths. If one or
> more of the subplans is a non-parallel path, there will be only a
> non-parallel Append. So whatever node is sitting on top of Append is
> not going to do a parallel plan; for example, a select count(*) won't
> divide it into partial aggregates if the underlying Append is not
> partial.
>
> The attached patch removes both of the above restrictions.  There has
> already been a mail thread [1] that discusses an approach suggested by
> Robert Haas for implementing this feature. This patch uses this same
> approach.

The first goal requires some kind of synchronization which will allow workers
to be distributed across the subplans. The second goal requires some kind of
synchronization to prevent multiple workers from executing non-parallel
subplans. The patch uses different mechanisms to achieve the goals. If
we create two different patches addressing each goal, those may be
easier to handle.

We may want to think about a third goal: preventing too many workers
from executing the same plan. As per the comment in
get_parallel_divisor(), we do not see any benefit if more than 4
workers execute the same node. So, an Append node allotted more than 4
workers could distribute them equally among the available subplans. It
might be better to do that as a separate patch.

>
> Attached is pgbench_create_partition.sql (derived from the one
> included in the above thread) that distributes pgbench_accounts table
> data into 3 partitions pgbench_account_[1-3]. The below queries use
> this schema.
>
> Consider a query such as :
> select count(*) from pgbench_accounts;
>
> Now suppose, these two partitions do not allow parallel scan :
> alter table pgbench_accounts_1 set (parallel_workers=0);
> alter table pgbench_accounts_2 set (parallel_workers=0);
>
> On HEAD, due to some of the partitions having non-parallel scans, the
> whole Append would be a sequential scan :
>
>  Aggregate
>->  Append
>  ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
>  ->  Seq Scan on pgbench_accounts_1
>  ->  Seq Scan on pgbench_accounts_2
>  ->  Seq Scan on pgbench_accounts_3
>
> Whereas, with the patch, the Append looks like this :
>
>  Finalize Aggregate
>->  Gather
>  Workers Planned: 6
>  ->  Partial Aggregate
>->  Parallel Append
>  ->  Parallel Seq Scan on pgbench_accounts
>  ->  Seq Scan on pgbench_accounts_1
>  ->  Seq Scan on pgbench_accounts_2
>  ->  Parallel Seq Scan on pgbench_accounts_3
>
> Above, Parallel Append is generated, and it executes all these
> subplans in parallel, with 1 worker executing each of the sequential
> scans, and multiple workers executing each of the parallel subplans.
>
>
> === Implementation details 
>
> --- Adding parallel-awareness ---
>
> In a given worker, this Append plan node will be executing just like
> the usual partial Append node. It will run a subplan until completion.
> The subplan may or may not be a partial parallel-aware plan like
> parallelScan. After the subplan is done, Append will choose the next
> subplan. This is where it will differ from the current
> partial Append plan: it is parallel-aware. The Append nodes in the
> workers will be aware that there are other Append nodes running in
> parallel. The partial Append will have to coordinate with other Append
> nodes while choosing the next subplan.
>
> --- Distribution of workers 
>
> The coordination info is stored in a shared array, each element of
> which describes the per-subplan info. This info contains the number of
> workers currently executing the subplan, and the maximum number of
> workers that should be executing it at the same time. For non-partial
> subplans, max workers would always be 1. For choosing the next subplan,
> the Append executor will sequentially iterate over the array to find a
> subplan having the least number of workers currently being executed,
> AND which is not already being executed by the maximum number of
> workers assigned for the subplan. Once it gets one, it increments
> current_workers, and releases the Spinlock, so that other workers can
> choose their next subplan if they are waiting.
>
> This way, workers would be fairly distributed across subplans.
>
> The shared array needs to be initialized and made available to
> workers. For this, we can do exactly what sequential scan does for
> being parallel-aware : 
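
For illustration only, a rough sketch of the selection logic described
above (this is not the actual patch; all structure and field names are
invented, and initialization of the spinlock and array is omitted):

#include "postgres.h"
#include "storage/spin.h"

typedef struct PASubplanInfo
{
    int     current_workers;    /* workers running this subplan now */
    int     max_workers;        /* 1 for non-partial subplans */
} PASubplanInfo;

typedef struct PASharedState
{
    slock_t mutex;
    int     nsubplans;
    PASubplanInfo subplan[FLEXIBLE_ARRAY_MEMBER];
} PASharedState;

/* Returns the chosen subplan's index, or -1 if every subplan is saturated. */
static int
pa_choose_next_subplan(PASharedState *state)
{
    int     best = -1;
    int     i;

    SpinLockAcquire(&state->mutex);
    for (i = 0; i < state->nsubplans; i++)
    {
        PASubplanInfo *sp = &state->subplan[i];

        if (sp->current_workers >= sp->max_workers)
            continue;           /* at its cap (or finished, if cap is 0) */
        if (best < 0 ||
            sp->current_workers < state->subplan[best].current_workers)
            best = i;           /* least-loaded eligible subplan so far */
    }
    if (best >= 0)
        state->subplan[best].current_workers++;
    SpinLockRelease(&state->mutex);
    return best;
}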

Re: [HACKERS] UNDO and in-place update

2017-01-06 Thread Amit Kapila
On Thu, Jan 5, 2017 at 9:25 PM, Robert Haas  wrote:
> On Wed, Jan 4, 2017 at 6:05 AM, Amit Kapila  wrote:
>> Okay, so this optimization can work only after all the active
>> transactions operating on a page are finished.  If that is true, in
>> some cases such a design can consume a lot of CPU traversing all the
>> tuples in a page to un-set the bit, especially when there are few
>> such tuples.
>
> I suppose.  I didn't think that cost was likely to be big enough to
> worry about, but I might be wrong.  The worst case would be when you
> modify one tuple on a page, let the transaction that did the
> modification become all-visible, modify one tuple on the page again,
> etc. and, at the same time, the page is entirely full of tuples.  So
> you keep having to loop over all the bits to clear them (they are all
> clear except one, but you don't know that) and then re-set just one of
> them.  That's not free, but keep in mind that the existing system
> would be forced to perform non-HOT updates in that situation, which
> isn't free either.
>
> Also, I'm thinking the bit could be stored in the line pointer rather
> than the tuple, because with this design we don't need
> LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
> bit to indicate dead or not-dead and the second bit to indicate
> recently-modified or not-recently-modified.  With that approach,
> clearing the bits only requires iterating over the line pointer array,
> not the tuples themselves.
>

I think this can help in mitigating the overhead.  However, now
another question is: if we unset the bit only when no active
transaction other than the current one is operating on the page, will
that tuple then be considered all-visible?  I think the absence of
other transactions operating on a page can't be taken as a guarantee
that a tuple can be marked all-visible. If that is true, then what is
the advantage of clearing the bit?

>>> We don't necessarily need UNDO to clean up the indexes, although it
>>> might be a good idea.  It would *work* to just periodically scan the
>>> index for delete-marked items.  For each one, visit the heap and see
>>> if there's a version of that tuple present in the heap or current UNDO
>>> that matches the index entry.  If not, remove the index entry.
>>
>> I think this is somewhat similar to how we clean the index now and
>> seems to be a workable design.  However, don't you think that it will
>> have somewhat similar characteristics for index-bloat as we have now?
>
> Yes, it would have similar characteristics.  Thus we might want to do better.
>
>> OTOH for heap, it will help us to take away old versions away from the
>> main heap, but still, we can't get rid of that space or reuse that
>> space till we can clean corresponding index entries.
>
> I don't think that's true.  If in-place update is ever allowed in
> cases where indexed columns have been modified, then the index already
> has to cope with the possibility that the heap tuple it can see
> doesn't match the index.  And if it can cope with that, then why do we
> have to remove the index entry before reusing the heap TID?
>

No need. I think I can see what you are saying.

>> UNDO has to be kept till heap page is marked as all visible.  This is
>> required to check the visibility of index.  Now, I think the page can
>> be marked as all visible when we removed corresponding dead entries in
>> heap.   I think the main point of discussion here is how vacuum will
>> clean heap and index entries in this new system.  Another idea could
>> be that we allow undo entries to be removed (we can allow heap entry
>> to be removed) based on if they are visible or not and then while
>> traversing index we cross check in heap if the entry can be removed.
>> Basically,  check the TID and traverse undo contents if any to verify
>> if the index entry can be removed.  I think this sounds to be more
>> complicated and less efficient than what I suggested earlier.
>
> In my proposal, when a tuple gets updated or deleted in the heap, we
> go find the corresponding index entries and delete-mark them.  When an
> index tuple is delete-marked, we have to cross-check it against the
> heap, because the index tuple might not match the version of the heap
> tuple that our snapshot can see.  Therefore, it's OK to discard the
> heap page's UNDO before cleaning the indexes, because the index tuples
> are already delete-marked and rechecks will therefore do the right
> thing.
>

Okay, I think we could clean undo and heap without caring about the
index.  Basically, the idea works on the premise that we won't allow
duplicate index entries with the same value for the same TID, and that
using rechecks we can identify the deleted tuples.  On rethinking how
vacuum would work in this system, it seems we can clean the heap
independently and then, for the index, cross-check each entry against
the heap to see whether it is marked as deleted.  Now this has the
advantage that we don't need 

Re: [HACKERS] Declarative partitioning - another take

2017-01-06 Thread Amit Langote
On 2017/01/05 3:26, Robert Haas wrote:
> On Tue, Dec 27, 2016 at 8:41 PM, Amit Langote
>  wrote:
>> On 2016/12/27 19:07, Amit Langote wrote:
>>> Attached should fix that.
>>
>> Here are the last two patches with additional information like other
>> patches.  Forgot to do that yesterday.
> 
> 0001 has the disadvantage that get_partition_for_tuple() acquires a
> side effect.  That seems undesirable.  At the least, it needs to be
> documented in the function's header comment.

That's true.  How about we save away the original ecxt_scantuple at entry
and restore the same just before returning from the function?  That way
there would be no side effect.  0001 implements that.

> It's unclear to me why we need to do 0002.  It doesn't seem like it
> should be necessary, it doesn't seem like a good idea, and the commit
> message you proposed is uninformative.

If a single BulkInsertState object is passed to
heap_insert()/heap_multi_insert() for different heaps corresponding to
different partitions (from one input tuple to the next), tuples might end
up going into the wrong heaps (as demonstrated in one of the reports [1]).
A simple solution is to disable bulk-insert in the case of partitioned
tables.

But my patch (or its motivations) was slightly wrongheaded, wherein I
conflated multi-insert stuff and bulk-insert considerations.  I revised
0002 to not do that.

However, if we disable bulk-insert mode, COPY's purported performance
benefit over INSERT vanishes.  Patch 0003 is a proposal to implement
bulk-insert mode even for partitioned tables: allocate separate
BulkInsertState objects for each partition and switch to the appropriate
one just before calling heap_insert()/heap_multi_insert().  Then, to be
able to use heap_multi_insert(), we must also manage buffered tuples
separately for each partition.  However, I didn't modify the limits on the
number and total size of buffered tuples that control when we pause
buffering and do heap_multi_insert() on the buffered tuples.  Maybe it
should work slightly differently for the partitioned-table case, for
example by increasing the overall limits on both the number of tuples and
the tuple size (I observed that increasing them 10x or 100x helped to some
degree).  Thoughts on this?
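
A rough sketch of the idea in 0003 (the names below are invented for
illustration and are not the ones used in the patch):

#include "postgres.h"
#include "access/heapam.h"

typedef struct PartitionInsertState
{
    int              nparts;
    BulkInsertState *bistates;  /* one per partition, NULL until first use */
} PartitionInsertState;

static void
insert_routed_tuple(PartitionInsertState *pis, Relation *partrels,
                    int partidx, HeapTuple tuple, CommandId cid)
{
    /* lazily allocate a per-partition BulkInsertState */
    if (pis->bistates[partidx] == NULL)
        pis->bistates[partidx] = GetBulkInsertState();

    /* always pair the partition's heap with its own bistate */
    heap_insert(partrels[partidx], tuple, cid, 0, pis->bistates[partidx]);
}

static void
finish_partition_inserts(PartitionInsertState *pis)
{
    int     i;

    for (i = 0; i < pis->nparts; i++)
        if (pis->bistates[i] != NULL)
            ReleaseBulkInsertState(pis->bistates[i]);
}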

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com
From 332c67c258a0f25f76c29ced23199fe0ee8e153e Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 28 Dec 2016 10:10:26 +0900
Subject: [PATCH 1/3] Set ecxt_scantuple correctly for tuple-routing

In 2ac3ef7a01df859c62d0a02333b646d65eaec5ff, we changed things so that
it's possible for a different TupleTableSlot to be used for partitioned
tables at successively lower levels.  If we do end up changing the slot
from the original, we must update ecxt_scantuple to point to the new one
for the partition key of the tuple to be computed correctly.

Also update the regression tests so that the code manipulating
ecxt_scantuple is covered.

Reported by: Rajkumar Raghuwanshi
Patch by: Amit Langote
Reports: https://www.postgresql.org/message-id/CAKcux6%3Dm1qyqB2k6cjniuMMrYXb75O-MB4qGQMu8zg-iGGLjDw%40mail.gmail.com
---
 src/backend/catalog/partition.c  | 29 ++---
 src/backend/executor/execMain.c  |  2 --
 src/test/regress/expected/insert.out |  2 +-
 src/test/regress/sql/insert.sql  |  2 +-
 4 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index f54e1bdf3f..0de1cf245a 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1643,7 +1643,10 @@ get_partition_for_tuple(PartitionDispatch *pd,
 	bool		isnull[PARTITION_MAX_KEYS];
 	int			cur_offset,
 cur_index;
-	int			i;
+	int			i,
+				result;
+	ExprContext *ecxt = GetPerTupleExprContext(estate);
+	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 
 	/* start with the root partitioned table */
 	parent = pd[0];
@@ -1672,7 +1675,14 @@ get_partition_for_tuple(PartitionDispatch *pd,
 			slot = myslot;
 		}
 
-		/* Extract partition key from tuple */
+		/*
+		 * Extract partition key from tuple; FormPartitionKeyDatum() expects
+		 * ecxt_scantuple to point to the correct tuple slot (which might be
+		 * different from the slot we received from the caller if the
+		 * partitioned table of the current level has different tuple
+		 * descriptor from its parent).
+		 */
+		ecxt->ecxt_scantuple = slot;
 		FormPartitionKeyDatum(parent, slot, estate, values, isnull);
 
 		if (key->strategy == PARTITION_STRATEGY_RANGE)
@@ -1727,16 +1737,21 @@ get_partition_for_tuple(PartitionDispatch *pd,
 		 */
 		if (cur_index < 0)
 		{
+			result = -1;
 			*failed_at = RelationGetRelid(parent->reldesc);
-			return -1;
+			break;
 		}
-		else if (parent->indexes[cur_index] < 0)
-			parent = 

Re: [HACKERS] logical decoding of two-phase transactions

2017-01-06 Thread Simon Riggs
On 5 January 2017 at 12:43, Stas Kelvich  wrote:
>
>> On 5 Jan 2017, at 13:49, Simon Riggs  wrote:
>>
>> Surely in this case the master server is acting as the Transaction
>> Manager, and it knows the mapping, so we are good?
>>
>> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>>
>> Please explain precisely how you expect to use this, to check that GID
>> is required.
>>
>
> For example, if we are using logical replication just for failover/HA and
> allowing the user to be the transaction manager itself.  Then suppose the
> user prepared a tx on server A and server A crashed.  After that, the
> client may want to reconnect to server B and commit/abort that tx.  But
> the user only has the GID that was used during prepare.

I don't think that's the case you're trying to support, and I don't think
it's a common enough case that we want to pay the price of putting it into
core in a non-optional way.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-01-06 Thread Michael Paquier
On Fri, Jan 6, 2017 at 6:32 PM, Beena Emerson  wrote:
> I see the point. I will change the SHOW_WAL_SEGSZ to a general SHOW command
> in the next version of the patch.

Could you split things? There could be one patch to introduce the SHOW
command, and one on top of it for your patch to be able to change the
WAL segment size with initdb.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-01-06 Thread Beena Emerson
On Fri, Jan 6, 2017 at 11:36 AM, Michael Paquier 
wrote:

> On Thu, Jan 5, 2017 at 8:39 PM, Beena Emerson 
> wrote:
> > On Tue, Jan 3, 2017 at 5:46 PM, Michael Paquier <
> michael.paqu...@gmail.com>
> > wrote:
> >> Actually, why not just having an equivalent of the SQL
> >> command and be able to query parameter values?
> >
> > This patch only needed the wal_segment_size and hence I made this
> specific
> > command.
> > How often and why would we need other parameter values in the replication
> > connection?
> > Making it a more general command to fetch any parameter can be a separate
> > topic. If it gets consensus, maybe it could be done and used here.
>
> I concur that for this patch it may not be necessary. But let's not
> back ourselves into a corner when designing things. Being able to query
> the value of parameters is something that I think is actually useful for
> cases where custom GUCs are loaded via the server's
> shared_preload_libraries to do validation checks (one case is a
> logical decoder on the backend, with a streaming receiver as the client
> expecting the logical decoder to do a minimum). This can allow a
> client to do checks using only a replication stream. Another case that
> I have in mind is for utilities like pg_rewind: we have been
> discussing being able to avoid requiring a superuser when querying the
> target server. Having such a command would allow, for example, pg_rewind
> to do a 'SHOW full_page_writes' without the need for an extra
> connection.
>
>
I see the point. I will change the SHOW_WAL_SEGSZ to a general SHOW command
in the next version of the patch.
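
For illustration, here is what a client could do once a general SHOW
command exists in the replication protocol (hypothetical until such a
patch lands; connection parameters are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("host=primary user=rep replication=true");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    /* the walsender grammar would accept SHOW alongside IDENTIFY_SYSTEM */
    res = PQexec(conn, "SHOW wal_segment_size");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("wal_segment_size = %s\n", PQgetvalue(res, 0, 0));

    PQclear(res);
    PQfinish(conn);
    return 0;
}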

Thank you,

Beena Emerson

Have a Great Day!