[HACKERS] Re: Learning from other open source databases

2001-05-14 Thread peter

On Sun, 29 Apr 2001 20:04:19 +0000 (UTC), [EMAIL PROTECTED]
(Bruce Momjian) wrote:

>Here is a general call for people to review other open-source database
>software and report back on things PostgreSQL can learn from them.
>
>I can see Interbase, MySQL, and SAP DB as being three databases that
>would be worth researching.  I am willing to assist anyone who wants to
>give it a try.  I have all the sources here myself.  I even have old
>University Ingres, Mariposa, and Postgres 4.2.

Ideas that could be used from other databases:

DB2
DB2 and others made their native API ODBC compatible so they can plug
in anywhere. If PostgreSQL moved to an ODBC compatible API, PostgreSQL
could be a plug compatible replacement for DB2. As DB2 has 33% of the
commercial market, 1% ahead of Oracle, you will get more exposure and
support from large corporations if you can replace DB2 for the smaller
projects that do not require DB2's multi-tier capabilities.

Companies writing applications for DB2 can instantly plug their
software into open source environments.

People without ODBC would continue using the native API, then suddenly
find they are using ODBC functions anyway.

phpMyAdmin
phpPgAdmin is based on phpMyAdmin but uses some complicated SQL to
provide the same views of databases. I think PostgreSQL should
continue adding predefined views to the point where phpPgAdmin can use
the same simple SQL as phpMyAdmin because that covers a huge amount of
what people write as soon as they have a few databases and lots of
tables.

NT
MySQL and others install ODBC support as standard in NT. It is one of
the standard things to do on NT. Starting services, like Postmaster,
as a service is another. 

phpPgAdmin
Recommend phpPgAdmin as the interface instead of psql as phpPgAdmin is
far closer to what NT users already use. Even cheap little routers are
now using web interfaces instead of telnet because web interfaces make
the products accessible to about 100 times more people.

Documentation
MySQL has its documentation as one big install instead of 5 separate
documents.  Even if PostgreSQL just had one big index into the 5
separate documents, that would help.

Peter

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Sanity checking for ./configure options?

2016-02-26 Thread Peter Eisentraut

On 2/22/16 6:24 PM, Jim Nasby wrote:
> On 2/5/16 10:08 AM, David Fetter wrote:
>> On Wed, Feb 03, 2016 at 06:02:57PM -0600, Jim Nasby wrote:
>>> I just discovered that ./configure will happily accept
>>> '--with-pgport=' (I
>>> was actually doing =$PGPORT, and didn't realize $PGPORT was empty).
>>> What you
>>> end up with is a compile error in guc.c, with no idea why it's
>>> broken. Any
>>> reason not to have configure or at least make puke if pgport isn't
>>> valid?
>>
>> That seems like a good idea.
> 
> Patch attached. I've verified it with --with-pgport=, =0, =7 and =1.
> It catches what you'd expect it to.

Your code and comments suggest that you can specify the port to
configure by setting PGPORT, but that is not the case.

test == is not portable (bashism).

Error messages should have consistent capitalization.

Indentation in configure is two spaces.

> As the comment states, it doesn't catch things like --with-pgport=1a in
> configure, but the compile error you get with that isn't too hard to
> figure out, so I think it's OK.

Passing a non-integer as argument will produce an error message like
(depending on shell)

./configure: line 3107: test: 11a: integer expression expected

but will not actually abort configure.

It would work more robustly if you did something like this

elif test "$default_port" -ge "1" -a "$default_port" -le "65535"; then
  :
else
  AC_MSG_ERROR([port must be between 1 and 65535])
fi

but that still leaks the shell's error message.

There is also the risk of someone specifying a number with a leading
zero, which C would interpret as octal but the shell would not.

To make this really robust, you might need to do pattern matching on the
value.
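
To illustrate the pattern-matching idea, here is a minimal sketch in C
rather than in configure's shell. It only shows the rule being checked
(plain decimal digits, no leading zero, range 1 to 65535); the function
names are hypothetical and nothing here is taken from the patch under
review. The second function shows why a leading zero matters once the
value is read with C literal rules.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: accept only a plain decimal port in 1..65535.
 * Rejects empty strings, non-digits, and leading zeros.
 */
bool
port_string_is_valid(const char *s)
{
    size_t      len = strlen(s);

    if (len == 0 || len > 5)
        return false;
    if (s[0] == '0')
        return false;           /* "0" itself, or an octal-looking "0123" */
    if (strspn(s, "0123456789") != len)
        return false;           /* e.g. "11a" */

    /* At most five digits, so the value fits comfortably in a long. */
    long        port = strtol(s, NULL, 10);

    return port >= 1 && port <= 65535;
}

/* For contrast: base 0 follows C literal rules, so "010" parses as 8. */
long
parse_like_c_literal(const char *s)
{
    return strtol(s, NULL, 0);
}
```

A check written this way never hands a non-numeric string to an integer
comparison, so there is no "integer expression expected" message to leak.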



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] syslog configurable line splitting behavior

2016-02-26 Thread Peter Eisentraut

Writing log messages to syslog caters to ancient syslog implementations
in two ways:

- sequence numbers
- line splitting

While these are arguably reasonable defaults, I would like a way to turn
them off, because they get in the way of doing more interesting things
with syslog (e.g., logging somewhere that is not just a text file).

So I propose the two attached patches that introduce new configuration
Boolean parameters syslog_sequence_numbers and syslog_split_lines that
can toggle these behaviors.
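
As a rough standalone illustration of the prefix choices made in the
first patch below, the following sketch formats one syslog line into a
buffer instead of calling syslog(). The helper name and the convention
that chunk_nr == 0 means "not split" are assumptions made for this
example only; the format strings are the ones that appear in the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper mirroring the prefix choices in the patch.
 * chunk_nr == 0 means the message was short enough to send unsplit.
 * Returns the snprintf() result (length of the formatted line).
 */
int
format_syslog_line(char *out, size_t outlen, bool use_seq,
                   unsigned long seq, int chunk_nr, const char *text)
{
    if (chunk_nr > 0)
    {
        /* Split message: the chunk number is always kept. */
        if (use_seq)
            return snprintf(out, outlen, "[%lu-%d] %s", seq, chunk_nr, text);
        return snprintf(out, outlen, "[%d] %s", chunk_nr, text);
    }

    /* Unsplit message: the prefix disappears entirely without sequence numbers. */
    if (use_seq)
        return snprintf(out, outlen, "[%lu] %s", seq, text);
    return snprintf(out, outlen, "%s", text);
}
```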
From e6a17750956e3e6950683bad397a74adb30f30a2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 26 Feb 2016 22:34:30 -0500
Subject: [PATCH 1/2] Add syslog_sequence_numbers parameter

---
 doc/src/sgml/config.sgml  | 28 +++
 src/backend/utils/error/elog.c| 12 ++--
 src/backend/utils/misc/guc.c  | 10 ++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/utils/elog.h  |  1 +
 5 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..0d1ae4b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4218,6 +4218,34 @@ Where To Log

   
 
+  
+   syslog_sequence_numbers (boolean)
+
+ syslog_sequence_numbers configuration parameter
+
+   
+
+   
+
+ When logging to syslog and this is on (the
+ default), then each message will be prefixed by an increasing
+ sequence number (such as [2]).  This circumvents
+ the --- last message repeated N times --- suppression
+ that many syslog implementations perform by default.  In more modern
+ syslog implementations, repeat message suppression can be configured
+ (for example, $RepeatedMsgReduction
+ in rsyslog), so this might not be
+ necessary.  Also, you could turn this off if you actually want to
+ suppress repeated messages.
+
+
+
+ This parameter can only be set in the postgresql.conf
+ file or on the server command line.
+
+  
+ 
+
  
   event_source (string)
   
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 9005b26..0bc96b4 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -106,6 +106,7 @@ int			Log_error_verbosity = PGERROR_VERBOSE;
 char	   *Log_line_prefix = NULL;		/* format for extra log line info */
 int			Log_destination = LOG_DESTINATION_STDERR;
 char	   *Log_destination_string = NULL;
+bool		syslog_sequence_numbers = true;
 
 #ifdef HAVE_SYSLOG
 
@@ -2008,7 +2009,11 @@ write_syslog(int level, const char *line)
 
 			chunk_nr++;
 
-			syslog(level, "[%lu-%d] %s", seq, chunk_nr, buf);
+			if (syslog_sequence_numbers)
+				syslog(level, "[%lu-%d] %s", seq, chunk_nr, buf);
+			else
+				syslog(level, "[%d] %s", chunk_nr, buf);
+
 			line += buflen;
 			len -= buflen;
 		}
@@ -2016,7 +2021,10 @@ write_syslog(int level, const char *line)
 	else
 	{
 		/* message short enough */
-		syslog(level, "[%lu] %s", seq, line);
+		if (syslog_sequence_numbers)
+			syslog(level, "[%lu] %s", seq, line);
+		else
+			syslog(level, "%s", line);
 	}
 }
 #endif   /* HAVE_SYSLOG */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea5a09a..bc8faa9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1632,6 +1632,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
+			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
+			NULL
+		},
+		&syslog_sequence_numbers,
+		true,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee3d378..a85ba36 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 # These are relevant when logging to syslog:
 #syslog_facility = 'LOCAL0'
 #syslog_ident = 'postgres'
+#syslog_sequence_numbers = true
 
 # This is only relevant when logging to eventlog (win32):
 #event_source = 'PostgreSQL'
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 326896f..bfbcf96 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -396,6 +396,7 @@ extern int	Log_error_verbosity;
 extern char *Log_line_prefix;
 extern int	Log_destination;
 extern char *Log_destination_string;
+extern bool syslog_sequence_numbers;
 
 /* Log destination bitmap */
 #define LOG_DESTINATION_STDERR	 1
-- 
2.7.2

From 72ea7dc222f41ab8246c0

Re: [HACKERS] pg_ctl promote wait

2016-02-28 Thread Peter Eisentraut

On 2/19/16 3:09 PM, Tom Lane wrote:
> I see no need for an additional mechanism.  Just watch pg_control until
> you see DB_IN_PRODUCTION state there, then switch over to the same
> connection probing that "pg_ctl start -w" uses.

Here is a patch set around that idea.

The subsequent discussion mentioned that there might still be a window
between end of waiting and when read-write queries would be accepted.  I
don't know how big that window would be in practice and would be
interested in some testing and feedback.
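
For readers who want the shape of the waiting logic without reading the
patches, here is a minimal sketch of a bounded polling loop. It is not
the code from the patch set: the enum, the callback types, and the
function names are all hypothetical stand-ins, and the state reader is
passed in as a callback so the loop itself does not depend on how
pg_control is actually read.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real states live in pg_control. */
typedef enum
{
    STATE_IN_RECOVERY,
    STATE_IN_PRODUCTION
} cluster_state;

typedef cluster_state (*read_state_fn) (void *arg);
typedef void (*nap_fn) (void);

/*
 * Poll read_state() until it reports STATE_IN_PRODUCTION or max_polls
 * attempts have been made.  nap(), if given, is called between attempts.
 */
bool
wait_for_production(read_state_fn read_state, void *arg,
                    int max_polls, nap_fn nap)
{
    for (int i = 0; i < max_polls; i++)
    {
        if (read_state(arg) == STATE_IN_PRODUCTION)
            return true;
        if (nap != NULL && i + 1 < max_polls)
            nap();
    }
    return false;
}

/* Demo reader: reports production once a countdown reaches zero. */
cluster_state
countdown_reader(void *arg)
{
    int        *remaining = arg;

    if (*remaining > 0)
    {
        (*remaining)--;
        return STATE_IN_RECOVERY;
    }
    return STATE_IN_PRODUCTION;
}
```

A loop like this only establishes that the control state has changed; the
window before read-write queries are accepted, mentioned above, would
still need connection probing afterwards.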
From cb5d4a63620636d4043d1a85acf7fcfdace73b1d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 28 Feb 2016 20:21:54 -0500
Subject: [PATCH 1/3] pg_ctl: Add tests for promote action

---
 src/bin/pg_ctl/t/003_promote.pl | 62 +
 src/test/perl/TestLib.pm| 11 
 2 files changed, 73 insertions(+)
 create mode 100644 src/bin/pg_ctl/t/003_promote.pl

diff --git a/src/bin/pg_ctl/t/003_promote.pl b/src/bin/pg_ctl/t/003_promote.pl
new file mode 100644
index 0000000..64d72e0
--- /dev/null
+++ b/src/bin/pg_ctl/t/003_promote.pl
@@ -0,0 +1,62 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 13;
+
+my $tempdir   = TestLib::tempdir;
+#my $tempdir_short = TestLib::tempdir_short;
+
+command_fails_like([ 'pg_ctl', 'promote', '-D', "$tempdir/nonexistent" ],
+   qr/directory .* does not exist/,
+   'pg_ctl promote with nonexistent directory');
+
+my $node_primary = get_new_node('primary');
+$node_primary->init;
+$node_primary->append_conf(
+	"postgresql.conf", qq(
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+hot_standby = on
+)
+	);
+
+command_fails_like([ 'pg_ctl', 'promote', '-D', $node_primary->data_dir ],
+   qr/PID file .* does not exist/,
+   'pg_ctl promote of not running instance fails');
+
+$node_primary->start;
+
+command_fails_like([ 'pg_ctl', 'promote', '-D', $node_primary->data_dir ],
+   qr/not in standby mode/,
+   'pg_ctl promote of primary instance fails');
+
+my $node_standby = get_new_node('standby');
+$node_primary->backup('my_backup');
+$node_standby->init_from_backup($node_primary, 'my_backup');
+my $connstr_primary = $node_primary->connstr('postgres');
+
+$node_standby->append_conf(
+	"recovery.conf", qq(
+primary_conninfo='$connstr_primary'
+standby_mode=on
+recovery_target_timeline='latest'
+)
+	);
+
+$node_standby->start;
+
+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
+			qr/^t$/,
+			'standby is in recovery');
+
+command_ok([ 'pg_ctl', 'promote', '-D', $node_standby->data_dir ],
+		   'pg_ctl promote of standby runs');
+
+sleep 3;  # needs a moment to react
+
+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
+			qr/^f$/,
+			'promoted standby is not in recovery');
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 3d11cbb..dd275cf 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -34,6 +34,7 @@ our @EXPORT = qw(
   program_version_ok
   program_options_handling_ok
   command_like
+  command_fails_like
 
   $windows_os
 );
@@ -262,4 +263,14 @@ sub command_like
 	like($stdout, $expected_stdout, "$test_name: matches");
 }
 
+sub command_fails_like
+{
+	my ($cmd, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+	ok(!$result, "@$cmd exit code not 0");
+	like($stderr, $expected_stderr, "$test_name: matches");
+}
+
 1;
-- 
2.7.2

From 7c1b6e94f5e3beb9e558d2af7098940d4475fe11 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 28 Feb 2016 20:21:54 -0500
Subject: [PATCH 2/3] pg_ctl: Detect current standby state from pg_control

pg_ctl used to determine whether a server was in standby mode by looking
for a recovery.conf file.  With this change, it instead looks into
pg_control, which is potentially more accurate.  There are also
occasional discussions about removing recovery.conf, so this removes one
dependency.
---
 src/bin/pg_ctl/pg_ctl.c | 60 ++---
 1 file changed, 57 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index bae6c22..c38c479 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -19,6 +19,7 @@
 
 #in

[HACKERS] pg_resetxlog reference page reorganization

2016-02-29 Thread Peter Eisentraut

The pg_resetxlog reference page has grown over the years into an
unnavigable jungle, so here is a patch that reorganizes it to be more in
the style of the other ref pages, with a normal options list.
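
As an example of the kind of recipe the page documents, the text for -x
in the patch below says to take the numerically largest file name in
pg_clog (which is hexadecimal), add one, and multiply by 1048576. A
small sketch of that arithmetic, with a hypothetical helper name that is
not part of pg_resetxlog:

```c
#include <stdlib.h>

/*
 * Hypothetical helper: turn the largest pg_clog file name (hexadecimal)
 * into a safe value for -x, following the documented recipe.
 */
unsigned long
safe_next_xid(const char *largest_clog_name)
{
    unsigned long seg = strtoul(largest_clog_name, NULL, 16);

    /* 1048576 is 0x100000, hence "five trailing zeroes" in hexadecimal. */
    return (seg + 1) * 1048576UL;
}
```

With 0011 as the largest entry this gives 0x1200000, matching the
documented example.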
From a9024195e9f7a0b47e592f39938bdc9743152a70 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 29 Feb 2016 18:48:34 -0500
Subject: [PATCH] doc: Reorganize pg_resetxlog reference page

The pg_resetxlog reference page didn't have a proper options list, only
running text listing the options and some explanations of them.  This
might have worked when there were only a few options, but the list has
grown over the releases, and now it's hard to find an option and its
associated explanation.  So write out the options list as on other
reference pages.
---
 doc/src/sgml/ref/pg_resetxlog.sgml | 222 -
 1 file changed, 144 insertions(+), 78 deletions(-)

diff --git a/doc/src/sgml/ref/pg_resetxlog.sgml b/doc/src/sgml/ref/pg_resetxlog.sgml
index 1bcc5a7..fd9d0be 100644
--- a/doc/src/sgml/ref/pg_resetxlog.sgml
+++ b/doc/src/sgml/ref/pg_resetxlog.sgml
@@ -22,15 +22,9 @@
  
   
pg_resetxlog
-   -c xid,xid
-f
-n
-   -o oid
-   -x xid
-   -e xid_epoch
-   -m mxid,mxid
-   -O mxoff
-   -l xlogfile
+   option
-D datadir
   
  
@@ -76,78 +70,108 @@ Description
execute any data-modifying operations in the database before you dump,
as any such action is likely to make the corruption worse.
   
+ 
 
-  
-   The -o, -x, -e,
-   -m, -O,
-   -c
-   and -l
-   options allow the next OID, next transaction ID, next transaction ID's
-   epoch, next and oldest multitransaction ID, next multitransaction offset,
-   oldest and newest transaction IDs for which the commit time can be retrieved,
-   and WAL
-   starting address values to be set manually.  These are only needed when
-   pg_resetxlog is unable to determine appropriate values
-   by reading pg_control.  Safe values can be determined as
-   follows:
+ 
+  Options
 
-   
+  
+   
+-f
 
  
-  A safe value for the next transaction ID (-x)
-  can be determined by looking for the numerically largest
-  file name in the directory pg_clog under the data directory,
-  adding one,
-  and then multiplying by 1048576.  Note that the file names are in
-  hexadecimal.  It is usually easiest to specify the option value in
-  hexadecimal too. For example, if 0011 is the largest entry
-  in pg_clog, -x 0x1200000 will work (five
-  trailing zeroes provide the proper multiplier).
+  Force pg_resetxlog to proceed even if it cannot determine
+  valid data for pg_control, as explained above.
  
 
+   
 
+   
+-n
 
  
-  A safe value for the next multitransaction ID (first part of -m)
-  can be determined by looking for the numerically largest
-  file name in the directory pg_multixact/offsets under the
-  data directory, adding one, and then multiplying by 65536.
-  Conversely, a safe value for the oldest multitransaction ID (second part of
-  -m)
-  can be determined by looking for the numerically smallest
-  file name in the same directory and multiplying by 65536.
-  As above, the file names are in hexadecimal, so the easiest way to do
-  this is to specify the option value in hexadecimal and append four zeroes.
+  The -n (no operation) option instructs
+  pg_resetxlog to print the values reconstructed from
+  pg_control and values about to be changed, and then exit
+  without modifying anything. This is mainly a debugging tool, but can be
+  useful as a sanity check before allowing pg_resetxlog
+  to proceed for real.
  
 
+   
 
+   
+-V
+--version
+Display version information, then exit.
+   
+
+   
+-?
+--help
+Show help, then exit.
+   
+  
+
+  
+   The following options are only needed when
+   pg_resetxlog is unable to determine appropriate values
+   by reading pg_control.  Safe values can be determined as
+   described below.  For values that take numeric arguments, hexadecimal
+   values can be specified by using the prefix 0x.
+  
+
+  
+   
+-c xid,xid
 
  
-  A safe value for the next multitransaction offset (-O)
-  can be determined by looking for the numerically largest
-  file name in the directory pg_multixact/members under the
-  data directory, adding one, and then multiplying by 52352.  As above,
-  the file names are in hexadecimal.  There is no simple recipe such as
-  the ones above of appending zeroes.
+  Manually set the oldest and newest transaction IDs for which the commit
+  time can be retrieved.
  
-
 
-
  
   A safe value for the oldest transaction ID for which the commit time can
-  be retrieved (first part of -c) can be determined by looking
+  be retrieved (first part) can be determined by looking
   for the numerically smallest f

[HACKERS] amcheck (B-Tree integrity checking tool)

2016-02-29 Thread Peter Geoghegan

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

Invariants
==========

nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it makes sense to
check the last item where we can.
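
The three checks above can be sketched on a toy model. This is not code
from amcheck: it uses plain int keys in place of opclass comparisons,
and the struct and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a B-Tree page: int keys, optional high key. */
typedef struct
{
    const int  *items;
    size_t      nitems;
    bool        has_high_key;
    int         high_key;
} toy_page;

/* Items must be in non-decreasing order on the page. */
bool
items_in_order(const toy_page *p)
{
    for (size_t i = 1; i < p->nitems; i++)
        if (p->items[i - 1] > p->items[i])
            return false;
    return true;
}

/* The high key, if present, must bound every item on the page. */
bool
high_key_bounds_items(const toy_page *p)
{
    if (!p->has_high_key)
        return true;
    for (size_t i = 0; i < p->nitems; i++)
        if (p->items[i] > p->high_key)
            return false;
    return true;
}

/* The last item on a page must be <= the first item on its right sibling. */
bool
last_item_le_right_first(const toy_page *left, const toy_page *right)
{
    if (left->nitems == 0 || right->nitems == 0)
        return true;
    return left->items[left->nitems - 1] <= right->items[0];
}
```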

With an ExclusiveLock + ShareLock, some additional invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

Interface
=========

There are only two SQL-callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.

Locking
=======

We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

Performance
===========

Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lig

Re: [HACKERS] remove wal_level archive

2016-02-29 Thread Peter Eisentraut

On 2/8/16 9:36 AM, David Steele wrote:
> -#define XLogIsNeeded() (wal_level >= WAL_LEVEL_ARCHIVE)
> +#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
> <...>
> -#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_HOT_STANDBY)
> +#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
> 
> Since these are identical now shouldn't one be removed?  I searched the
> code and I couldn't find anything that looked dead (i.e. XLogIsNeeded()
> && !XLogStandbyInfoActive()) but it still seems like having both could
> cause confusion.

I think this should eventually be cleaned up, but it doesn't seem
necessary in the first patch.



Re: [HACKERS] remove wal_level archive

2016-02-29 Thread Peter Eisentraut

On 2/8/16 7:34 AM, Michael Paquier wrote:
> Shouldn't backup.sgml be updated as well? Here is the portion that I
> am referring to:
> To enable WAL archiving, set the 
> configuration parameter to archive or higher,
>  to on,
> 
>  But minimal WAL does not contain enough information to reconstruct 
> the
> -data from a base backup and the WAL logs, so archive or
> +data from a base backup and the WAL logs, so replica or
>  higher must be used to enable WAL archiving
>  () and streaming replication.
> 

Checked for leftovers again and fixed one.

> 
> -In hot_standby level, the same information is logged as
> -with archive, plus information needed to reconstruct
> -the status of running transactions from the WAL. To enable read-only
> As the paragraph about the difference between hot_standby and archive
> is removed, I think that it would be better to mention that setting
> wal_level to replica allows to reconstruct data from a base backup and
> the WAL logs, *and* to run read-only queries when hot_standby is
> enabled.

Well, I think that is really only of historical interest.  The
assumption is, as long as hot_standby = on, you can run read-only
queries.  The WAL level is taken completely out of the mental
consideration, because if you have replica at all, it's good enough.
That is part of the point of this patch.

> 
> -   if (ControlFile->wal_level < WAL_LEVEL_HOT_STANDBY)
> +   if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
> Upthread it was mentioned that switching to an approach where enum
> values are directly listed would be better. The target of an extra
> patch on top of this one?

I'm not sure what is meant by that.

> 
> -   if (wal_level < WAL_LEVEL_ARCHIVE)
> -   ereport(ERROR,
> -
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> -errmsg("replication slots can only be
> used if wal_level >= archive")));
> We should still forbid the creation of replication slots if wal_level = 
> minimal.

I think I took this out because you actually can't get to that check,
but I put it back in because it seems better for clarity.
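
The commit message below notes that the old values are still accepted
but converted internally. A minimal sketch of what such a name mapping
can look like follows. The enum is a toy copy for illustration, and the
table and parse_wal_level() are hypothetical; the patch itself does this
through the server's own configuration machinery, which is not
reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy copy of the levels that remain after the merge. */
typedef enum
{
    WAL_LEVEL_MINIMAL = 0,
    WAL_LEVEL_REPLICA,
    WAL_LEVEL_LOGICAL
} WalLevel;

/* Name table: the two old spellings are aliases for "replica". */
static const struct
{
    const char *name;
    WalLevel    level;
} wal_level_names[] = {
    {"minimal", WAL_LEVEL_MINIMAL},
    {"replica", WAL_LEVEL_REPLICA},
    {"archive", WAL_LEVEL_REPLICA},      /* deprecated alias */
    {"hot_standby", WAL_LEVEL_REPLICA},  /* deprecated alias */
    {"logical", WAL_LEVEL_LOGICAL},
};

/* Hypothetical parser: returns false for an unknown name. */
bool
parse_wal_level(const char *name, WalLevel *out)
{
    size_t      n = sizeof(wal_level_names) / sizeof(wal_level_names[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(name, wal_level_names[i].name) == 0)
        {
            *out = wal_level_names[i].level;
            return true;
        }
    }
    return false;
}
```

Because both old names land on the same value, checks such as
wal_level >= WAL_LEVEL_REPLICA no longer need to distinguish them.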

From 574dd447b4a077267200d2ca9b8b4e185d4bb052 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 29 Feb 2016 20:01:54 -0500
Subject: [PATCH] Merge wal_level "archive" and "hot_standby" into new name
 "replica"

The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability.  This is now a long time ago.  We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.

Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses.  The old values
are still accepted but are converted internally.
---
 doc/src/sgml/backup.sgml  |  4 ++--
 doc/src/sgml/config.sgml  | 30 +++
 doc/src/sgml/high-availability.sgml   |  2 +-
 doc/src/sgml/ref/alter_system.sgml|  2 +-
 doc/src/sgml/ref/pgupgrade.sgml   |  2 +-
 src/backend/access/rmgrdesc/xlogdesc.c|  5 +++--
 src/backend/access/transam/xact.c |  2 +-
 src/backend/access/transam/xlog.c | 20 --
 src/backend/access/transam/xlogfuncs.c|  2 +-
 src/backend/postmaster/postmaster.c   |  2 +-
 src/backend/replication/slot.c|  2 +-
 src/backend/utils/misc/postgresql.conf.sample |  2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |  2 +-
 src/bin/pg_controldata/pg_controldata.c   |  6 ++
 src/include/access/xlog.h | 11 +-
 src/include/catalog/pg_control.h  |  2 +-
 src/test/perl/PostgresNode.pm |  2 +-
 17 files changed, 44 insertions(+), 54 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 7413666..9092cf8 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -592,7 +592,7 @@ Setting Up WAL Archiving
 

 To enable WAL archiving, set the 
-configuration parameter to archive or higher,
+configuration parameter to replica or higher,
  to on,
 and specify the shell command to use in the  configuration parameter.  In practice
@@ -1285,7 +1285,7 @@ Standalone Hot Backups
   If more flexibility in copying the backup files is needed, a lower
   level process can be used for standalone hot backups a

Re: Commitfest Bug (was: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

2016-03-01 Thread Peter Geoghegan

On Tue, Mar 1, 2016 at 7:27 AM, Tom Lane  wrote:
> +1 for not moving such patches to the new CF until the author does
> something --- at which point they'd change to "Needs Review" state.
> But we should not change them into that state without author input.
> And I don't see the value of having them in a new CF until the
> author does something.

To be clear: My position was always that it's good that the author has
to do *something* to get their patch into the next CF. It's bad that
this change in state can easily be missed, though. I've now been on
both sides of this, as a patch author and patch reviewer. If the patch
was left as "Waiting on Author", as my review of Alexander's patch
was, then it ought to not change to "Needs Review" silently. That
makes absolutely no sense.

-- 
Peter Geoghegan


Re: Commitfest Bug (was: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

2016-03-02 Thread Peter Geoghegan

On Wed, Mar 2, 2016 at 5:41 AM, Magnus Hagander  wrote:
> Ok, I've pushed a code that does that.

Thank you.


-- 
Peter Geoghegan



Re: [HACKERS] Freeze avoidance of very large table.

2016-03-02 Thread Peter Geoghegan

On Tue, Mar 1, 2016 at 6:51 PM, Robert Haas  wrote:
> I removed the pgstat stuff.  I'm not sure we want that stuff in that
> form; it doesn't seem to fit with the rest of what's in that view, and
> it wasn't reliable in my testing.  I did however throw together a
> little contrib module for testing, which I attach here.  I'm not sure
> we want to commit this, and at the least someone would need to write
> documentation.  But it's certainly handy for checking whether this
> works.

I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.

It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.

-- 
Peter Geoghegan


Re: [HACKERS] GinPageIs* don't actually return a boolean

2016-03-02 Thread Peter Eisentraut

On 2/11/16 9:30 PM, Michael Paquier wrote:
>> Well, Yury was saying upthread that some MSVC versions have a problem
>> with the existing coding, which would be a reason to back-patch ...
>> but I'd like to see a failing buildfarm member first.  Don't particularly
>> want to promise to support compilers not represented in the farm.
> 
> Grmbl. Forgot to attach the rebased patch upthread. Here is it now.
> 
> As of now the only complaint has been related to MS2015 and MS2013. If
> we follow the pattern of cec8394b and [1], support to compile on newer
> versions of MSVC would be master and REL9_5_STABLE, but MS2013 is
> supported down to 9.3. Based on this reason, we would want to
> backpatch down to 9.3 the patch of this thread.

After reviewing this thread and relevant internet lore, I think this
might be the wrong way to address this problem.  It is in general not
guaranteed in C that a Boolean-sounding function or macro returns 0 or
1.  Prime examples are ctype.h functions such as isupper().  This is
normally not a problem because built-in conditionals such as if, &&, ||
handle this just fine.  So code like

-   Assert(!create || !!txn);
+   Assert(!create || txn != NULL);

is arguably silly either way.  There is no risk in writing just

Assert(!create || txn);

The problem only happens if you compare two "Boolean" values directly
with each other; and so maybe you shouldn't do that, or at least place
the extra care there, instead of fighting a permanent battle
with the APIs you're using.  (This isn't an outrageous requirement: You
can't compare floats or strings either without extra care.)

A quick look through the code based on the provided patch shows that
approximately the only place affected by this is

if (isLeaf != GinPageIsLeaf(page) || isData != GinPageIsData(page))
elog(ERROR, "right sibling of GIN page is of different type");

and that's not actually a problem because isLeaf and isData are earlier
populated by the same macros.  It would still be worth fixing, but a
localized fix seems better.
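
To make the point about direct comparisons concrete, here is a minimal
sketch. It assumes a C99 bool from stdbool.h; PAGE_FLAG_LEAF,
PageIsLeafRaw() and the three functions are hypothetical stand-ins, not
the actual GIN macros.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit and macro, loosely in the style of GinPageIsLeaf(). */
#define PAGE_FLAG_LEAF  (1 << 1)
#define PageIsLeafRaw(flags)  ((flags) & PAGE_FLAG_LEAF)	/* yields 0 or 2 */

/* Fragile: compares a 0/1 bool directly with a 0/2 macro result. */
bool
leaf_matches_raw(bool isLeaf, int flags)
{
	return isLeaf == PageIsLeafRaw(flags);
}

/* Localized fix: normalize the macro result at the comparison site. */
bool
leaf_matches(bool isLeaf, int flags)
{
	return isLeaf == (PageIsLeafRaw(flags) != 0);
}

/* Plain conditionals need no such care; any nonzero value counts as true. */
int
count_if_leaf(int flags)
{
	assert(flags >= 0);
	return PageIsLeafRaw(flags) ? 1 : 0;
}
```

With these definitions, leaf_matches_raw(true, PAGE_FLAG_LEAF) reports a
mismatch even though both sides are "true", while leaf_matches() does
not.  With a char-based bool that was populated from the same macro, both
sides would hold 2 and the raw comparison would happen to work.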

Now on the matter of stdbool, I tried putting an #include <stdbool.h>
near the top of c.h and compiling that to see what would happen.  This is
the first warning I see:

ginlogic.c: In function 'shimTriConsistentFn':
ginlogic.c:171:24: error: comparison of constant '2' with boolean
expression is always false [-Werror=bool-compare]
   if (key->entryRes[i] == GIN_MAYBE)
^

and then later on something related:

../../../../src/include/tsearch/ts_utils.h:107:13: note: expected '_Bool
(*)(void *, QueryOperand *) {aka _Bool (*)(void *, struct 
*)}' but argument is of type 'GinTernaryValue (*)(void *, QueryOperand
*) {aka char (*)(void *, struct  *)}'

So the compiler is actually potentially helpful, but as it stands,
PostgreSQL code is liable to break if you end up with stdbool.h somehow.

(plperl also fails to compile because of a hot-potato game about who is
actually responsible for defining bool.)

So one idea would be to actually get ahead of the game, include
stdbool.h if available, fix the mentioned issues, and maybe get more
robust code that way.

But the lore on the internet casts some doubt on that: There is no
guarantee that bool is 1 byte, that bool can be passed around like char,
or even that bool arrays are laid out like char arrays.  Maybe this all
works out okay, just like it has worked out so far that int is 4 bytes,
but we don't know enough about it.  We could probably add some configure
tests around that.
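
As a rough illustration of what such checks could assert, here is a
sketch using C11 static assertions.  This is not an actual configure
test (that would be an Autoconf probe); bool_from_int() is a hypothetical
helper, and the assertions only encode the assumptions named above.

```c
#include <assert.h>		/* provides static_assert in C11 */
#include <stdbool.h>

/* Assumption under test: bool occupies exactly one byte, like char. */
static_assert(sizeof(bool) == 1, "bool is not 1 byte on this platform");

/* Assumption under test: a bool array has the same size as a char array. */
static_assert(sizeof(bool[8]) == sizeof(char[8]),
			  "bool arrays are not laid out like char arrays");

/*
 * Converting a nonzero int to a C99 bool always stores 1.  A char-based
 * bool would keep the original value (for example 2) instead.
 */
int
bool_from_int(int v)
{
	bool		b = v;

	return (int) b;
}
```

If either assumption fails, the build stops with the given message
instead of misbehaving at run time.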

We could also go the other way and forcibly undefine an existing bool
type (since stdbool.h is supposed to use macros, not typedefs).  But
that might not work well if a header that is included later pulls in
stdbool.h on top of that.

My proposal on this particular patch is to do nothing.  The stdbool
issues should be looked into, for the sake of Windows and other
future-proofness.  But that will likely be an entirely different patch.


Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-03 Thread Peter Eisentraut

On 2/5/16 5:04 AM, Peter Geoghegan wrote:
> As Heikki goes into on that thread, the appropriate action seems to be
> to constantly reset the error queue, and to make sure that we
> ourselves clear the queue consistently. (Note that we might not have
> consistently called ERR_get_error() in the event of an OOM within
> SSLerrmessage(), for example). I have not changed backend code in the
> patch, though; I felt that we had enough control of the general
> situation there for it to be unnecessary to lock everything down.

I think clearing the error after a call is not necessary.  The API
clearly requires that you should clear the error queue before a call, so
clearing it afterwards does not accomplish anything, except maybe make
broken code work sometimes, for a while.  Also, there is nothing that
says that an error produces exactly one entry in the error queue; it
could be multiple.  Or that errors couldn't arise at random times
between the reset and whatever happens next.

I think this is analogous to clearing errno before a C library call.
You could clear it afterwards as well, to be nice to the next guy, but
the next guy should really take care of that themselves, and we can't
rely on what happens in between anyway.
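
A minimal sketch of the errno analogy, using strtol().  The helper name
parse_long() is hypothetical; the pattern is to clear the error state
immediately before the call whose result you are about to inspect.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal long, detecting range errors. */
bool
parse_long(const char *s, long *out)
{
	char	   *end;

	assert(s != NULL && out != NULL);

	errno = 0;					/* clear stale state before the call */
	*out = strtol(s, &end, 10);

	if (errno == ERANGE)		/* can only have been set by this strtol() */
		return false;

	/* require at least one digit and no trailing garbage */
	return end != s && *end == '\0';
}
```

Even if some earlier code left errno set to ERANGE, parse_long("42", &v)
succeeds, because the reset happens right before the call rather than
relying on the previous caller to clean up.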

The places that you identified for change look correct as far as libpq
goes.  I do think that the backend should be updated in the same way,
because it's a) correct, b) easy enough, and c) there could well be
interactions with postgres_fdw, plproxy, plperl, or who knows what.


Re: [HACKERS] Fix handling of invalid sockets returned by PQsocket()

2016-03-05 Thread Peter Eisentraut

On 2/17/16 10:52 PM, Michael Paquier wrote:
> On Thu, Feb 18, 2016 at 1:58 AM, Alvaro Herrera
>  wrote:
>> Michael Paquier wrote:
>>> Hi all,
>>>
>>> After looking at Alvaro's message mentioning the handling of
>>> PQsocket() for invalid sockets, I just had a look by curiosity at
>>> other calls of this routine, and found a couple of issues:
>>> 1) In vacuumdb.c, init_slot() does not check for the return value of 
>>> PQsocket():
>>> slot->sock = PQsocket(conn);
>>> 2) In isolationtester.c, try_complete_step() should do the same.
>>> 3) In pg_recvlogical.c for StreamLogicalLog() I am spotting the same 
>>> problem.
>>> I guess those ones should be fixed as well, no?
>>
>> I patched pgbench to use PQerrorMessage rather than strerror(errno).  I
>> think your patch should do the same.
> 
> OK, this looks like a good idea. I would suggest doing the same in
> receivelog.c then.

Let's make the error messages consistent as "invalid socket".  "bad
socket" isn't really our style, and pg_basebackup saying "socket not
open" is just plain incorrect.




Re: [HACKERS] The plan for FDW-based sharding

2016-03-06 Thread Peter Geoghegan

On Fri, Mar 4, 2016 at 4:41 PM, Robert Haas  wrote:
> Yeah, I agree with that.  I am utterly mystified by why Bruce keeps
> beating this drum, and am frankly pretty annoyed about it.  In the
> first place, he seems to think that he invented the idea of using FDWs
> for sharding in PostgreSQL, but I don't think that's true.  I think it
> was partly my idea, and partly something that the NTT folks have been
> working on for years (cf, e.g.,
> cb1ca4d800621dcae67ca6c799006de99fa4f0a5).  As far as I understand it,
> Bruce came in near the end of that conversation and now wants to claim
> credit for something that doesn't really exist yet and, to the extent
> that it does exist, wasn't even his idea.

I think that it's easy to have the same idea as someone else
independently. I've had that happen several times myself; ideas that
other people had that I felt I could have easily had myself, or did in
fact have. Most of the ideas that I have are fairly heavily based on
known techniques. I don't think that I've ever created a PostgreSQL
feature that was in some way truly original, except perhaps for some
aspects of how UPSERT works.

Who cares whose idea FDW sharding was? It matters not a whit. It
probably independently occurred to several people that the FDW
interface could be built to support horizontal sharding more directly.
The idea almost suggests itself.

> EnterpriseDB *does* have a plan to try to continue enhancing foreign
> data wrappers so that you can run queries against foreign tables and
> get reasonable plans, something that currently isn't true.  I haven't
> heard anybody objecting to that, and I don't expect to hear anybody
> objecting to that, because it's hard to imagine why you wouldn't want
> queries against foreign data wrappers to produce better plans than
> they do today.  At worst, you might think it doesn't matter either
> way, but actually, I think there are a substantial number of people
> who are pretty happy about join pushdown and I expect that when and if
> we get aggregate pushdown working there will be even more people who
> are happy about that.

I think that that's Bruce's point, to a large degree.

>> Alternately, you can just work on the individual FDW features, which
>> *everyone* thinks are a good idea, and when most of them are done, FDW-based
>> scaleout will be such an obvious solution that nobody will argue with it.
>
> That's exactly what the people at EnterpriseDB who are actually doing
> work in this area are attempting to do.  Meanwhile, there's also
> Bruce, who is neither doing nor planning to do any work in this area,
> nor advising either EnterpriseDB or the PostgreSQL community to
> undertake any particular project, but who *is* making it sound like
> there is a super sekret plan that nobody else gets to see.

Is he? I didn't get that impression.

I think Bruce is trying to facilitate discussion, which can sometimes
require being a bit provocative. I think you're being quite unfair,
and mischaracterizing his words. I've heard Bruce talk about
horizontal scaling on several occasions, including at a talk in San
Francisco about a year ago, and I just thought it was Bruce being
Bruce -- primarily, a facilitator. I think that he is not especially
motivated by taking credit either here or in general, and not at all
by taking credit for other people's work.

It's not hard to get agreement about something abstract, like the
general idea of a distributed transaction manager. I fear that any
particular detailed interpretation of what that phrase means will be
very hard to get accepted into PostgreSQL.

-- 
Peter Geoghegan


Re: [HACKERS] WIP: Upper planner pathification

2016-03-06 Thread Peter Geoghegan

On Sun, Mar 6, 2016 at 9:59 PM, Tom Lane  wrote:
> Perhaps it was intentional when written, but if Robert's advice is correct
> that the new upper-planner path nodes should copy up parallel_degree from
> their children, then it cannot be the case that parallel_degree>0 in a
> node above the scan level implies that that node type has any special
> behavior for parallelism.
>
> I continue to bemoan the lack of documentation about what these fields
> mean.  As far as I can find, the sum total of the documentation about
> this field is
>
> int parallel_degree; /* desired parallel degree; 0 = not parallel */

While it doesn't particularly relate to parallel joins, I've expressed
a general concern about the max_parallel_degree GUC that I think is
worth considering again:

http://www.postgresql.org/message-id/cam3swzrs1mtvrkkasy1xbshgzxkd6-hnxx3gq7x-p-dz0zt...@mail.gmail.com

In summary, I think it's surprising that a max_parallel_degree of 1
doesn't disable parallel workers entirely.

-- 
Peter Geoghegan



Re: [HACKERS] ExecGather() + nworkers

2016-03-07 Thread Peter Geoghegan

On Mon, Mar 7, 2016 at 4:04 AM, Amit Kapila  wrote:
> Your point is genuine, but OTOH let us say if max_parallel_degree = 1 means
> parallelism is disabled then when somebody sets max_parallel_degree = 2,
> then it looks somewhat odd to me that, it will mean that 1 worker process
> can be used for parallel query.

I'm not sure that that has to be true.

What is the argument for only using one worker process, say in the
case of parallel seq scan? I understand that parallel seq scan can
consume tuples itself, which seems like a good principle, but how far
does it go, and how useful is it in the general case? I'm not
suggesting that it isn't, but I'm not sure.

How common is it for the leader process to do anything other than
coordinate and consume from worker processes?

-- 
Peter Geoghegan


[HACKERS] Minor bug affecting ON CONFLICT lock wait log messages

2016-03-07 Thread Peter Geoghegan

Attached patch fixes a bug reported privately by Stephen this morning.
He complained about deadlocking ON CONFLICT DO NOTHING statements.
There were no exclusion constraints involved, and yet they were
incorrectly indicated as being involved in log messages that related
to these deadlocks.

-- 
Peter Geoghegan
From bc481af77994057cb1ffe4a0e471b38bb00dc228 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 7 Mar 2016 13:16:24 -0800
Subject: [PATCH] Avoid incorrectly indicating exclusion constraint wait

INSERT ... ON CONFLICT's precheck may have to wait on the outcome of
another insertion, which may or may not itself be a speculative
insertion.  This wait is not necessarily associated with an exclusion
constraint, but was always reported that way in log messages if the wait
happened to involve a tuple that had no speculative token.

Bug reported privately by Stephen Frost.  His case involved ON CONFLICT
DO NOTHING, where spurious references to exclusion constraints in log
messages were more likely.
---
 src/backend/executor/execIndexing.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 838cee7..5d553d5 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -725,6 +725,7 @@ retry:
 	{
 		TransactionId xwait;
 		ItemPointerData ctid_wait;
+		XLTW_Oper		reason_wait;
 		Datum		existing_values[INDEX_MAX_KEYS];
 		bool		existing_isnull[INDEX_MAX_KEYS];
 		char	   *error_new;
@@ -783,13 +784,14 @@ retry:
 			  TransactionIdPrecedes(GetCurrentTransactionId(), xwait
 		{
 			ctid_wait = tup->t_data->t_ctid;
+			reason_wait = indexInfo->ii_ExclusionOps ?
+				XLTW_RecheckExclusionConstr : XLTW_InsertIndex;
 			index_endscan(index_scan);
 			if (DirtySnapshot.speculativeToken)
 				SpeculativeInsertionWait(DirtySnapshot.xmin,
 										 DirtySnapshot.speculativeToken);
 			else
-				XactLockTableWait(xwait, heap, &ctid_wait,
-								  XLTW_RecheckExclusionConstr);
+				XactLockTableWait(xwait, heap, &ctid_wait, reason_wait);
 			goto retry;
 		}
 
-- 
1.9.1



Re: [HACKERS] Freeze avoidance of very large table.

2016-03-07 Thread Peter Geoghegan

On Mon, Mar 7, 2016 at 4:50 PM, Robert Haas  wrote:
> Here's an updated patch with an API that I think is much more
> reasonable to expose to users, and documentation!  It assumes that the
> patch I posted a few hours ago to remove PD_ALL_FROZEN will be
> accepted; if that falls apart for some reason, I'll update this.  I
> plan to push this RSN if nobody objects.

Thanks for making the effort to make the tool generally available.

-- 
Peter Geoghegan



Re: [HACKERS] Using quicksort for every external sort run

2016-03-07 Thread Peter Geoghegan

On Mon, Feb 15, 2016 at 3:45 PM, Greg Stark  wrote:
> I was thinking about this over the past couple weeks. I'm starting to
> think the quicksort runs gives at least the beginnings of a way
> forward on this front.

As I've already pointed out several times, I wrote a tool that makes
it easy to load sortbenchmark.org data into a PostgreSQL table:

https://github.com/petergeoghegan/gensort

(You should use the Python script that invokes the "gensort" utility
-- see its "--help" display for details).

This seems useful as a standard benchmark, since it's perfectly
deterministic, allowing the user to create arbitrarily large tables to
use for sort benchmarks. Still, it doesn't produce data that is in any
way organic; sort data is uniformly distributed. Also, it produces a
table that really only has one attribute to sort on, a text attribute.

I suggest looking at real world data, too. I have downloaded UK land
registry data, which is a freely available dataset about property
sales in the UK since the 1990s, of which there have apparently been
about 20 million (I started with a 20 million line CSV file). I've
used COPY to load the data into one PostgreSQL table.

I attach instructions on how to recreate this, and some suggested
CREATE INDEX statements that seemed representative to me. There are a
variety of Postgres data types in use, including UUID, numeric, and
text. The final Postgres table is just under 3GB. I will privately
make available a URL that those CC'd here can use to download a custom
format dump of the table, which comes in at 1.1GB (ask me off-list if
you'd like to get that URL, but weren't CC'd here). This URL is
provided as a convenience for reviewers, who can skip my detailed
instructions.

An expensive rollup() query on the "land_registry_price_paid_uk" table
is interesting. Example:

select date_trunc('year', transfer_date), county, district, city,
sum(price) from land_registry_price_paid_uk group by rollup (1,
county, district, city);

Performance is within ~5% of an *internal* sort with the patch series
applied, even though ~80% of time is spent copying and sorting
SortTuples overall in the internal sort case (the internal case cannot
overlap sorting and aggregate processing, since it has no final merge
step). This is a nice demonstration of how this work has significantly
blurred the line between internal and external sorts.

-- 
Peter Geoghegan
Instructions

CSV File

The land registry file from http://data.gov.uk is 3.2GB. It is a CSV file with
organic data that can be loaded into PostgreSQL. No registration is required.
See https://theodi.org/blog/the-status-of-csvs-on-datagovuk for details on
downloading the file pp-complete.csv.

SQL
---

begin;
create table land_registry_price_paid_uk(
  transaction uuid,
  price numeric,
  transfer_date date,
  postcode text,
  property_type char(1),
  newly_built boolean,
  duration char(1),
  paon text,
  saon text,
  street text,
  locality text,
  city text,
  district text,
  county text,
  ppd_category_type char(1));

copy land_registry_price_paid_uk FROM '/home/pg/Downloads/pp-complete.csv' with 
(format csv, freeze true, encoding 'win1252', header false, null '', quote '"', 
force_null (postcode, saon, paon, street, locality, city, district));
commit;

Resulting table
---

postgres=# \dt+
                                  List of relations
 Schema │            Name             │ Type  │ Owner │  Size   │ Description 
────────┼─────────────────────────────┼───────┼───────┼─────────┼─────────────
 public │ land_registry_price_paid_uk │ table │ pg    │ 2779 MB │ 
(1 row)

Interesting Indexes
===

Many attribute index (Low cardinality leading attributes):

postgres=# create index on land_registry_price_paid_uk (county, district,
city, locality, street);

UUID pk index (UUID type, high cardinality):

postgres=# create index on land_registry_price_paid_uk (transaction);

Price index (numeric, moderate cardinality):

postgres=# create index on land_registry_price_paid_uk (price);

Preview
===

pg@hamster:~$ head ~/Downloads/pp-complete.csv
"{0C7ADEF5-878D-4066-B785-003ED74A}","163000","2003-02-21 00:00","UB5 4PJ","T","N","F","106","","READING ROAD","NORTHOLT","NORTHOLT","EALING","GREATER LONDON","A"
"{35F67271-ABD4-40DA-AB09-0085B9D3}","247500","2005-07-15 00:00","TA19 9DD","D","N","F","58","","ADAMS MEADOW","ILMINSTER",

Re: [HACKERS] GCC 6 warning fixes

2016-03-08 Thread Peter Eisentraut

On 3/8/16 4:44 PM, Robert Haas wrote:
> On Mon, Feb 29, 2016 at 4:50 PM, Thomas Munro
>  wrote:
>> On Sat, Feb 20, 2016 at 5:14 PM, Peter Eisentraut  wrote:
>>> Here are three patches to fix new warnings in GCC 6.
>>>
>>> 0001 is apparently a typo.
>>
>> Right, looks like it.  Builds and tests OK with this change (though I
>> didn't get any warning from GCC6.0.0 -Wall for this one).
>>
>>> 0002 was just (my?) stupid code to begin with.
>>
>> Right, it makes sense to define QL_HELP in just one translation unit
>> with external linkage.  Builds and works fine.  I got the 'defined but
>> not used' warning from GCC6 and it went away with this patch.
>>
>>> 0003 is more of a workaround.  There could be other ways to address this, too.
>>
>> This way seems fine to me (you probably want the function to continue
>> to exist rather than, say, becoming a macro evaluating to false on
>> non-WIN32, if this gets backpatched).  I got this warning from GCC6
>> and it went away with this patch.
> 
> Peter, are you going to commit this?

done




Re: [HACKERS] syslog configurable line splitting behavior

2016-03-08 Thread Peter Eisentraut

On 3/4/16 11:01 AM, Alexander Korotkov wrote:
> On Sat, Feb 27, 2016 at 6:49 AM, Peter Eisentraut wrote:
> 
> Writing log messages to syslog caters to ancient syslog implementations
> in two ways:
> 
> - sequence numbers
> - line splitting
> 
> While these are arguably reasonable defaults, I would like a way to turn
> them off, because they get in the way of doing more interesting things
> with syslog (e.g., logging somewhere that is not just a text file).
> 
> So I propose the two attached patches that introduce new configuration
> Boolean parameters syslog_sequence_numbers and syslog_split_lines that
> can toggle these behaviors.
> 
> 
> Would it have any usage if we make PG_SYSLOG_LIMIT configurable (-1 for
> disable) instead of introducing boolean?

That would work, too.  But then we'd need another setting to disable
splitting on newlines.  That way we'd have more settings, but they
actually mirror the corresponding settings on the rsyslogd side better.
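
For illustration, here is a rough sketch of the behavior under
discussion: one Boolean that controls whether a message is split at
newlines and capped at a length limit.  This is not the patch itself;
emit_log_message(), emit_fn and count_bytes() are hypothetical names, and
a real sink would call syslog(3).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical sink; a real implementation would call syslog(3) here. */
typedef void (*emit_fn) (const char *chunk, size_t len, void *arg);

/* Example sink that only adds up the bytes it receives. */
void
count_bytes(const char *chunk, size_t len, void *arg)
{
	(void) chunk;
	*(size_t *) arg += len;
}

/*
 * Emit msg in pieces.  With split_lines, break at every newline and cap
 * each piece at limit bytes; otherwise emit the whole message as one
 * piece.  Returns the number of pieces emitted.
 */
int
emit_log_message(const char *msg, bool split_lines, size_t limit,
				 emit_fn emit, void *arg)
{
	size_t		len = strlen(msg);
	int			pieces = 0;

	assert(limit > 0);

	if (!split_lines)
	{
		emit(msg, len, arg);
		return 1;
	}

	while (len > 0)
	{
		const char *nl = memchr(msg, '\n', len);
		size_t		line_len = nl ? (size_t) (nl - msg) : len;
		size_t		chunk = line_len < limit ? line_len : limit;

		if (chunk > 0)
		{
			emit(msg, chunk, arg);
			pieces++;
		}
		msg += chunk;
		len -= chunk;

		/* once a line is fully consumed, skip its terminating newline */
		if (chunk == line_len && len > 0 && *msg == '\n')
		{
			msg++;
			len--;
		}
	}
	return pieces;
}
```

With a limit of 3, the message "ab\ncdefg" goes out as "ab", "cde" and
"fg" when splitting is on, and as a single 8-byte message when it is off.
Making the limit and the newline handling independent would need a second
switch, which is the trade-off mentioned above.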




Re: [HACKERS] syslog configurable line splitting behavior

2016-03-08 Thread Peter Eisentraut

On 3/8/16 9:12 PM, Andreas Karlsson wrote:
> As someone who uses syslog for my servers I find both of these GUCs
> useful, especially when used in combination, and I do not think a
> compile time option like the one suggested by Alexander would be a suitable
> substitute because then I would need a custom build of PostgreSQL just
> to change this which seems too much effort just for this.

I think he was suggesting to take the existing compile-time constant and
make a run-time setting out of it.



Re: [HACKERS] Crash with old Windows on new CPU

2016-03-08 Thread Peter Eisentraut

On 2/12/16 11:24 AM, Christian Ullrich wrote:
> Otherwise, it may be time to update the manual (15.6 Supported
> Platforms) where it says PostgreSQL "can be expected to work on these
> operating systems: [...] Windows (Win2000 SP4 and later), [...]".
> Perhaps we could add "except Windows before 7 SP1/2008R2 SP1 when
> running in x64 mode on Intel CPUs introduced after May 2013 (Haswell and
> later)"?

Wouldn't the fix be for users to upgrade their service packs?



Re: [HACKERS] Using quicksort for every external sort run

2016-03-10 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 5:40 AM, Tomas Vondra
 wrote:
> I was thinking about running some benchmarks on this patch, but the
> thread is pretty huge so I want to make sure I'm not missing something
> and this is indeed the most recent version.

Wait 24 - 48 hours, please. Big update coming.


-- 
Peter Geoghegan



Re: [HACKERS] Using quicksort for every external sort run

2016-03-10 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 10:39 AM, Greg Stark  wrote:
> I want to rerun these on a dedicated machine and with trace_sort
> enabled so that we can see how many merge passes were actually
> happening and how much I/O was actually happening.

Putting the results in context, by keeping trace_sort output with the
results is definitely a good idea here. Otherwise, it's almost
impossible to determine what happened after the fact. I have had
"trace_sort = on" in my dev postgresql.conf for some time now. :-)

When I produce my next revision, we should focus on regressions at the
low end, like the 4MB work_mem for multiple GB table size cases you
show here. So, I ask that any benchmarks that you or Tomas do look at
that first and foremost. It's clear that in high memory environments
the patch significantly improves performance, often by as much as
2.5x, and so that isn't really a concern anymore. I think we may be
able to comprehensively address Robert's concerns about regressions
with very little work_mem and lots of data by fixing a problem with
polyphase merge. More to come soon.

-- 
Peter Geoghegan


Re: [HACKERS] WIP: Upper planner pathification

2016-03-10 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 11:45 AM, Robert Haas  wrote:
> On Thu, Mar 10, 2016 at 2:41 PM, Andres Freund  wrote:
>> ISTM, that there's good enough reasons to go either way; I don't see
>> what we're gaining by making these private. That just encourages
>> copy-paste coding.
>
> +1.  Frustrating Citus's attempt to open-source their stuff is not in
> the project's interest.

I agree.


-- 
Peter Geoghegan



Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-10 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 2:50 PM, Robert Haas  wrote:
> So what's the next step here?  Peter G, are you planning to update the
> patch based on this review from Peter E?  If not, Peter E, do you want
> to update the patch and commit?  If neither, I'm going to mark this
> Returned with Feedback in the CF and move on, which seems a bit of a
> shame since this appears to be a bona fide bug, but if nobody's
> willing to work on it, it ain't gettin' fixed.

Getting to it very soon. Just really busy right this moment.


-- 
Peter Geoghegan



Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-10 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 3:09 PM, Peter Geoghegan  wrote:
> Getting to it very soon. Just really busy right this moment.

That said, I agree with Peter's remarks about doing this frontend and
backend. So, while I'm not sure, I think we're in agreement on all
issues. I would have no problem with Peter E following through with
final steps + commit as Robert outlined, if that works for him.


-- 
Peter Geoghegan



Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-10 Thread Peter Eisentraut

On 3/10/16 6:10 PM, Peter Geoghegan wrote:
> On Thu, Mar 10, 2016 at 3:09 PM, Peter Geoghegan  wrote:
>> Getting to it very soon. Just really busy right this moment.
> 
> That said, I agree with Peter's remarks about doing this frontend and
> backend. So, while I'm not sure, I think we're in agreement on all
> issues. I would have no problem with Peter E following through with
> final steps + commit as Robert outlined, if that works for him.

My proposal is the attached patch.


From 697b4f75fccbb5eda530211f3ef58c2b226c5461 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 10 Mar 2016 20:59:30 -0500
Subject: [PATCH] Clear OpenSSL error queue before OpenSSL calls

OpenSSL requires that the error queue be cleared before certain OpenSSL
API calls, so that you can be sure that the error you are checking
afterwards actually comes from you and was not left over from other
activity.  We had never done that, which appears to have worked as long
as we are the only users of OpenSSL in the process.  But if a process
using libpq or a backend plugin uses OpenSSL as well, this can lead to
confusion and crashes.

see bug #12799 and https://bugs.php.net/bug.php?id=68276

based on patches by Dave Vitek and Peter Geoghegan
---
 src/backend/libpq/be-secure-openssl.c| 3 +++
 src/interfaces/libpq/fe-secure-openssl.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 1e3dfb6..be337f5 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -353,6 +353,7 @@ be_tls_open_server(Port *port)
 	port->ssl_in_use = true;
 
 aloop:
+	ERR_clear_error();
 	r = SSL_accept(port->ssl);
 	if (r <= 0)
 	{
@@ -501,6 +502,7 @@ be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
 	int			err;
 
 	errno = 0;
+	ERR_clear_error();
 	n = SSL_read(port->ssl, ptr, len);
 	err = SSL_get_error(port->ssl, n);
 	switch (err)
@@ -558,6 +560,7 @@ be_tls_write(Port *port, void *ptr, size_t len, int *waitfor)
 	int			err;
 
 	errno = 0;
+	ERR_clear_error();
 	n = SSL_write(port->ssl, ptr, len);
 	err = SSL_get_error(port->ssl, n);
 	switch (err)
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index 133546b..0535338 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -212,6 +212,7 @@ pgtls_read(PGconn *conn, void *ptr, size_t len)
 
 rloop:
 	SOCK_ERRNO_SET(0);
+	ERR_clear_error();
 	n = SSL_read(conn->ssl, ptr, len);
 	err = SSL_get_error(conn->ssl, n);
 	switch (err)
@@ -320,6 +321,7 @@ pgtls_write(PGconn *conn, const void *ptr, size_t len)
 	int			err;
 
 	SOCK_ERRNO_SET(0);
+	ERR_clear_error();
 	n = SSL_write(conn->ssl, ptr, len);
 	err = SSL_get_error(conn->ssl, n);
 	switch (err)
@@ -1327,6 +1329,7 @@ open_client_SSL(PGconn *conn)
 {
 	int			r;
 
+	ERR_clear_error();
 	r = SSL_connect(conn->ssl);
 	if (r <= 0)
 	{
-- 
2.7.2



Re: [HACKERS] Relaxing SSL key permission checks

2016-03-10 Thread Peter Eisentraut

On 3/4/16 3:55 PM, Alvaro Herrera wrote:
> * it failed to check for S_IXUSR, so permissions 0700 were okay, in
> contradiction with what the error message indicates.  This is a
> preexisting bug actually.  Do we want to fix it by preventing a
> user-executable file (possibly breaking compatibility with existing
> executable key files), or do we want to document what the restriction
> really is?

I think we should not check for S_IXUSR.  There is no reason for doing that.

I can imagine that key files are sometimes copied around using USB
drives with FAT file systems or other means of that sort where
permissions can get scrambled.  While I hate gratuitous executable bits as
much as the next person, insisting here would just create annoyances in
practice.


Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-10 Thread Peter Geoghegan

Looked at your proposed patch. Will respond to your original mail on the matter.

On Thu, Mar 3, 2016 at 4:15 PM, Peter Eisentraut  wrote:
> I think clearing the error after a call is not necessary.  The API
> clearly requires that you should clear the error queue before a call, so
> clearing it afterwards does not accomplish anything, except maybe make
> broken code work sometimes, for a while.

Uh, if it's so clear, then why haven't we been doing it all along? The
API doesn't require you to take *any* specific practical measure (for
example, the specific practical measure of resetting the queue before
calling an I/O function). It simply says "this exact thing cannot be
allowed to happen; the consequences are undefined", with nothing in
the way of guidance on what that means in the real world. It's a
shockingly bad API, but that's the reality.

Part of the problem is that various scripting language OpenSSL
wrappers are only very thin wrappers. They effectively pass the buck
on to PHP and Ruby devs. If we cannot get it right, what chance have
they? I've personally experienced a big uptick in complaints about
this recently. I think there are 3 separate groups within Heroku that
regularly ask me how this patch is doing.

> Also, there is nothing that
> says that an error produces exactly one entry in the error queue; it
> could be multiple.  Or that errors couldn't arise at random times
> between the reset and whatever happens next.

I think that it's kind of implied, since calling ERR_get_error() pops
the stack. But even if that isn't so, it might be worth preventing bad
things from happening to client applications only sometimes.

> I think this is analogous to clearing errno before a C library call.
> You could clear it afterwards as well, to be nice to the next guy, but
> the next guy should really take care of that themselves, and we can't
> rely on what happens in between anyway.

It sounds like you're saying "well, we cannot be expected to bend over
backwards to make broken code work". But that broken code includes
every single version of libpq + OpenSSL currently distributed. Seems
like a very high standard. I'm not saying that that means we
definitely should clear the error queue reliably ourselves, but
doesn't it give you pause? Heikki seemed to think that clearing our
own queue was important when he looked at this a year ago:

http://www.postgresql.org/message-id/54edd30d.5050...@vmware.com

Again, not conclusive, but I would like to hear a rationale for why
you think it's okay to not consistently clear our own queue for the
benefit of others. Is this informed by a concern about some specific
downside to taking that extra precaution?

Thanks
-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2016-03-10 Thread Peter Geoghegan

On Sun, Feb 14, 2016 at 8:01 PM, Peter Geoghegan  wrote:
> The query I'm testing is: "reindex index pgbench_accounts_pkey;"
>
> Now, with a maintenance_work_mem of 5MB, the most recent revision of
> my patch takes about 54.2 seconds to complete this, as compared to
> master's 44.4 seconds. So, clearly a noticeable regression there of
> just under 20%. I did not see a regression with a 5MB
> maintenance_work_mem when pgbench scale was 100, though.

I've fixed this regression, and possibly all regressions where workMem > 4MB.
I've done so without resorting to making the heap structure
more complicated, or using a heap more often than when
replacement_sort_mem is exceeded by work_mem or maintenance_work_mem
(so replacement_sort_mem becomes something a bit different to what we
discussed, Robert -- more on that later). This seems like an
"everybody wins" situation, because in this revision the patch series
is now appreciably *faster* where the amount of memory available is
only a tiny fraction of the total input size.

Jeff Janes deserves a lot of credit for helping me to figure out how
to do this. I couldn't get over his complaint about the regression he
saw a few months back. He spoke of an "anti-sweetspot" in polyphase
merge, and how he suspected that to be the real culprit (after all,
most of his time was spent merging, with or without the patch
applied). He also said that reverting the memory batch/pool patch made
things go a bit faster, somewhat ameliorating his regression (when
just the quicksort patch was applied). This made no sense to me, since
I understood the memory batching patch to be orthogonal to the
quicksort thing, capable of being applied independently, and more or
less a strict improvement on master, no matter what the variables of
the sort are. Jeff's regressed case especially made no sense to me
(and, I gather, to him) given that the regression involved no
correlation, and so clearly wasn't reliant on generating far
fewer/longer runs than the patch (that's the issue we've discussed
more than any other now -- it's a red herring, it seems). As I
suspected out loud on February 14th, replacement selection mostly just
*masked* the real problem: the problem of palloc() fragmentation.
There doesn't seem to be much of an issue with the scheduling of
polyphase merging, once you fix palloc() fragmentation. I've created a
new revision, incorporating this new insight.

New Revision

Attached revision of patch series:

1. Creates a separate memory context for tuplesort's copies of
caller's tuples, which can be reset at key points, avoiding
fragmentation. Every SortTuple.tuple is allocated there (with trivial
exception); *everything else*, including the memtuples array, is
allocated in the existing tuplesort context, which becomes the parent
of this new "caller's tuples" context. Roughly speaking, that means
that about half of total memory for the sort is managed by each
context in common cases. Even with a high work_mem memory budget,
memory fragmentation could previously get so bad that tuplesort would
in effect claim a share of memory from the OS that is *significantly*
higher than the work_mem budget allotted to its sort. And with low
work_mem settings, fragmentation previously made palloc() thrash the
sort, especially during non-final merging. In this latest revision,
tuplesort now gets to use almost 100% of the memory that was requested
from the OS by palloc() in the cases tested.

2. Loses the "quicksort with spillover" case entirely, making the
quicksort patch significantly simpler. A *lot* of code was thrown out.

This change is especially significant because it allowed me to remove
the cost model that Robert took issue with so vocally. "Quicksort with
spillover" was always far less important than the basic idea of using
quicksort for external sorts, so I'm not sad to see it go. And, I
thought that the cost model was pretty bad myself.

3. Fixes cost_sort(), making optimizer account for the fact that runs
are now about sort_mem-sized, not (sort_mem * 2)-sized.

While I was at it, I made cost_sort() more optimistic about the amount
of random I/O required relative to sequential I/O. This additional
change to cost_sort() was probably overdue.

4. Restores the ability of replacement selection to generate one run
and avoid any merging (previously, only one really long run and one
short run were possible, because at the time I conceptualized
replacement selection as being all about enabling "quicksort with
spillover", which quicksorted that second run in memory). This
only-one-run case is the case that Robert particularly cared about,
and it's fully restored when RS is in use (which can still only happen
for the first run, just never for the benefit of the now-axed
"quicksort with spillover" case).

5. Ad
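
The run-size point in item 3 comes down to simple arithmetic. The sketch
below is only an illustration: estimate_initial_runs is a hypothetical name,
not a function in cost_sort(), and it assumes runs of uniform size (about
sort_mem when runs are quicksorted, about twice that under replacement
selection on randomly ordered input).

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate how many initial runs an external sort
 * produces, given the input size and the typical size of one run.
 * Rounds up, since a partial final run still counts as a run.
 */
uint64_t
estimate_initial_runs(uint64_t input_bytes, uint64_t run_bytes)
{
	if (run_bytes == 0)
		return 0;				/* avoid division by zero */
	return (input_bytes + run_bytes - 1) / run_bytes;
}
```

For a 1000MB input and 5MB of sort memory, sort_mem-sized runs give 200
initial runs, where (sort_mem * 2)-sized runs would give 100. That doubling
is what the cost model has to account for.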

Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-10 Thread Peter Eisentraut

On 3/10/16 9:38 PM, Peter Geoghegan wrote:
> Looked at your proposed patch. Will respond to your original mail on the 
> matter.
> 
> On Thu, Mar 3, 2016 at 4:15 PM, Peter Eisentraut  wrote:
>> I think clearing the error after a call is not necessary.  The API
>> clearly requires that you should clear the error queue before a call, so
>> clearing it afterwards does not accomplish anything, except maybe make
>> broken code work sometimes, for a while.
> 
> Uh, if it's so clear, then why haven't we been doing it all along?

The issue only happens if two interleaving trains of execution, one of
which is libpq, use OpenSSL.  Not many applications do that.  And you
also need to provoke the errors in a certain order.  And even then, in
some cases you might just see a false positive error, rather than a
crash.  So it's an edge case.

> Part of the problem is that various scripting language OpenSSL
> wrappers are only very thin wrappers. They effectively pass the buck
> on to PHP and Ruby devs. If we cannot get it right, what chance have
> they? I've personally experienced a big uptick in complaints about
> this recently. I think there are 3 separate groups within Heroku that
> regularly ask me how this patch is doing.

I think they have been getting away with it for so long for the same
reasons.

Arguably, if everyone followed "my" approach, this should be very easy
to fix everywhere.  Instead, reading through the PHP bug report, they
are coming up with a fairly complex solution for clearing the error
queue afterwards so as to not leave "landmines" for the next guy.  But
their code will (AFAICT) still be wrong because they are not clearing
the error *before* the API calls where it is required per documentation.
So "everyone" (sample of 2) is scrambling to clean up for the next guy
instead of doing the straightforward fix of following the API
documentation and cleaning up before their own calls.

I also see the clean-up-afterwards approach in the Python ssl module.  I
fear there is de facto a second API specification that requires you to
clean up errors after yourself and gives an implicit guarantee that the
error queue is empty whenever you want to make any SSL calls.  I don't
think this actually works in all cases, but maybe if everyone else is
convinced of that (in plain violation of the published OpenSSL
documentation, AFAICT) we need to get on board with that for
interoperability.

>> Also, there is nothing that
>> says that an error produces exactly one entry in the error queue; it
>> could be multiple.  Or that errors couldn't arise at random times
>> between the reset and whatever happens next.
> 
> I think that it's kind of implied, since calling ERR_get_error() pops
> the stack. But even if that isn't so, it might be worth preventing bad
> things from happening to client applications only sometimes.

The lore on the internet suggests that multiple errors could definitely
happen.  So popping one error afterwards isn't going to fix it, it just
moves the edge case around.  At least what we should do is clear the
entire queue afterwards instead of just the first error.
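
The two failure modes under discussion can be shown with a toy model. This is
deliberately NOT the OpenSSL API: toy_push_error, toy_get_error,
toy_clear_error, toy_pending and toy_failing_call are hypothetical stand-ins
for a per-thread error queue in which, as with ERR_get_error(), reading
returns and removes the earliest entry.

```c
#include <stddef.h>

#define TOY_QUEUE_MAX 8

/* Toy stand-in for a per-thread error queue (not OpenSSL's). */
static unsigned long toy_queue[TOY_QUEUE_MAX];
static size_t toy_len = 0;

/* Append an error code; silently drop it if the queue is full. */
void
toy_push_error(unsigned long code)
{
	if (toy_len < TOY_QUEUE_MAX)
		toy_queue[toy_len++] = code;
}

/* Return and remove the earliest entry, or 0 if the queue is empty. */
unsigned long
toy_get_error(void)
{
	unsigned long code;
	size_t		i;

	if (toy_len == 0)
		return 0;
	code = toy_queue[0];
	for (i = 1; i < toy_len; i++)
		toy_queue[i - 1] = toy_queue[i];
	toy_len--;
	return code;
}

/* Empty the whole queue. */
void
toy_clear_error(void)
{
	toy_len = 0;
}

/* Number of entries still queued. */
size_t
toy_pending(void)
{
	return toy_len;
}

/* A failing "I/O call" that reports its failure as nentries entries. */
int
toy_failing_call(unsigned long first_code, int nentries)
{
	int			i;

	for (i = 0; i < nentries; i++)
		toy_push_error(first_code + (unsigned long) i);
	return -1;
}
```

If some other library leaves code 100 behind and our own call then fails with
code 200, reading the queue without clearing it first reports 100, which is
the wrong error. Clearing before the call fixes that. Separately, if one
failing call queues two entries, popping a single entry afterwards still
leaves one behind for the next user of the queue, whereas clearing the whole
queue leaves none.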

> doesn't it give you pause? Heikki seemed to think that clearing our
> own queue was important when he looked at this a year ago:
> 
> http://www.postgresql.org/message-id/54edd30d.5050...@vmware.com

I think that message suggests that we should clear the queue before each
call, not after.


Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-10 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut  wrote:
> Arguably, if everyone followed "my" approach, this should be very easy
> to fix everywhere.

I don't think that there is any clear indication that the OpenSSL
people would share that view. Or my view. Or anything that's sensible
or practical or actionable.

> Instead, reading through the PHP bug report, they
> are coming up with a fairly complex solution for clearing the error
> queue afterwards so as to not leave "landmines" for the next guy.  But
> their code will (AFAICT) still be wrong because they are not clearing
> the error *before* the API calls where it is required per documentation.
>  So "everyone" (sample of 2) is scrambling to clean up for the next guy
> instead of doing the straightforward fix of following the API
> documentation and cleaning up before their own calls.

It will be less wrong, though.

PostgreSQL is the project that doesn't trust a C90 standard library
function to not blithely write past the bounds of a passed buffer,
all because of a bug in certain versions of Solaris based systems that
was several years old at the time. See commit be8b06c364. My view that
that wasn't really worth worrying about was clearly the minority view
when this was discussed (a minority of 1, and a majority of 4 or 5,
IIRC). I think that this case vastly exceeds that standard for
worrying about other people's broken code; in this case, we ourselves
made the same mistake for years and years.

> I also see the clean-up-afterwards approach in the Python ssl module.  I
> fear there is de facto a second API specification that requires you to
> clean up errors after yourself and gives an implicit guarantee that the
> error queue is empty whenever you want to make any SSL calls.  I don't
> think this actually works in all cases, but maybe if everyone else is
> convinced of that (in plain violation of the published OpenSSL
> documentation, AFAICT) we need to get on board with that for
> interoperability.

I didn't know that Python's ssl module did that. That seems to lend
support to my view, which is that we should similarly clear the
thread's queue lest anyone else be affected. Yes, this approach is
fairly scatter-gun, but frankly that's just the situation we find
ourselves in.

>>> Also, there is nothing that
>>> says that an error produces exactly one entry in the error queue; it
>>> could be multiple.  Or that errors couldn't arise at random times
>>> between the reset and whatever happens next.
>>
>> I think that it's kind of implied, since calling ERR_get_error() pops
>> the stack. But even if that isn't so, it might be worth preventing bad
>> things from happening to client applications only sometimes.
>
> The lore on the internet suggests that multiple errors could definitely
> happen.  So popping one error afterwards isn't going to fix it, it just
> moves the edge case around.

Are you sure, specifically, that an I/O function is known to add more
than one error to the per-thread queue? Obviously there can be more
than one error in the queue. But I haven't seen any specific
indication, either in the docs or in the lore, that more than one
error can be added by a single call to an I/O function such as
SSL_read(). Perhaps you can share where you encountered the lore.

>> doesn't it give you pause? Heikki seemed to think that clearing our
>> own queue was important when he looked at this a year ago:
>>
>> http://www.postgresql.org/message-id/54edd30d.5050...@vmware.com
>
> I think that message suggests that we should clear the queue before each
> call, not after.

Uh, it very clearly *is* Heikki's view that we should clear the queue
*afterwards*. Certainly, I think Heikki also wanted us to clear the
queue before, so we aren't screwed, "just to be sure", as he puts it
-- but nobody disputes that that's necessary anyway. That it might not
be *sufficient* to just call ERR_get_error() is the new information in
the bug report. Heikki said:

"""

The OpenSSL manual doesn't directly require you to call
ERR_clear_error() before every SSL_* call. It just requires that you
ensure that the error queue is empty. Libpq ensures that by always
clearing the queue *after* an error happens, in SSLerrmessage().

"""

The problem with this, as Heikki goes on to say, is that we might not
get to that point in SSLerrmessage(); we may not be able to pop the
queue/call ERR_get_error(), more or less by accident (e.g. I noticed
an OOM could do that). That's why I proposed to fix that by calling
ERR_get_error() early and unambiguously. If we must rely on that
happening, it should not be from such a long distance (i.e. from
within SSLerrmessage(), which is kind of far removed from the original
I/O function calls).

Thanks
--
Peter Geoghegan


Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO  wrote:
> I can only concur!
>
> The "Performance Tips" chapter (II.14) is more user/query oriented. The
> "Server Administration" book (III) does not discuss this much.

That's definitely one area in which the docs are lacking -- I've heard
several complaints about this myself. I think we've been hesitant to
do more in part because the docs must always be categorically correct,
and must not use weasel words. I think it's hard to talk about
performance while maintaining the general tone of the documentation. I
don't know what can be done about that.

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 8:24 AM, Michael Paquier
 wrote:
> You can use for example dd in non-truncate mode to corrupt on-disk
> page data, say that for example:
> dd if=/dev/random bs=8192 count=1 \
> seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
> conv=notrunc

Sure, but that would probably fail at the first hurdle -- the page
header would be corrupt. Which is a valid test, but not all that
interesting.

One testing workflow I tried is overwriting some page in a B-Tree
relfilenode with some other page in the same file:

$:~/pgdata/base/12413$ dd if=somefile of=somefile conv=notrunc bs=8192 \
    count=1 skip=2 seek=3

That should fail due to the key space not being in order across pages,
which is slightly interesting. Or, you could selectively change one
item with a hex editor, as Anastasia did.
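
As a rough picture of why the page-overwrite test trips a key-order check,
here is a toy model. ToyPage and toy_pages_in_order are hypothetical names,
keys are plain integers, and pages are a flat array; none of this is
amcheck's actual code or page format.

```c
#include <stdbool.h>
#include <stddef.h>

#define ITEMS_PER_PAGE 4

/* Toy leaf page: a fixed number of integer keys, nothing else. */
typedef struct ToyPage
{
	int			keys[ITEMS_PER_PAGE];
} ToyPage;

/*
 * Check that keys are non-decreasing within each page, and that the last
 * key of each page does not exceed the first key of the page to its right.
 */
bool
toy_pages_in_order(const ToyPage *pages, size_t npages)
{
	size_t		p;
	size_t		i;

	for (p = 0; p < npages; p++)
	{
		for (i = 1; i < ITEMS_PER_PAGE; i++)
		{
			if (pages[p].keys[i - 1] > pages[p].keys[i])
				return false;
		}
		if (p + 1 < npages &&
			pages[p].keys[ITEMS_PER_PAGE - 1] > pages[p + 1].keys[0])
			return false;
	}
	return true;
}
```

Copying page 2 over page 3, as the dd command does, leaves each page
internally sorted, but the last key of page 2 is now greater than the first
key of its right neighbour, so the cross-page comparison fails.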

Or, you could add code like this to comparetup_index_btree(), to
simulate a broken opclass:

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 67d86ed..23712ff 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	compare = ApplySortComparator(a->datum1, a->isnull1,
 								  b->datum1, b->isnull1,
 								  sortKey);
+
+	if (random() <= (MAX_RANDOM_VALUE / 1000))
+		compare = -compare;
 	if (compare != 0)
 		return compare;

There are many options when you want to produce a corrupt B-Tree index!

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 1:09 PM, Peter Geoghegan  wrote:
> Or, you could add code like this to comparetup_index_btree(), to
> simulate a broken opclass:
>
> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
> index 67d86ed..23712ff 100644
> --- a/src/backend/utils/sort/tuplesort.c
> +++ b/src/backend/utils/sort/tuplesort.c
> @@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
>  	compare = ApplySortComparator(a->datum1, a->isnull1,
>  								  b->datum1, b->isnull1,
>  								  sortKey);
> +
> +	if (random() <= (MAX_RANDOM_VALUE / 1000))
> +		compare = -compare;
>  	if (compare != 0)
>  		return compare;


Note that this patch that I sketched would make CREATE INDEX produce
corrupt indexes, but the tool's verification itself would not be
affected. Although even if it was (even if _bt_compare() gave the same
wrong answers), it would still very likely detect corruption.

-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra
 wrote:
> I've looked at this patch today, mostly to educate myself, so this
> probably should not count as a full review. Anyway, the patch seems in
> excellent shape - it'd be great if all patches (including those written
> by me) had this level of comments/docs.

Thanks. I try. This patch is something I've been working on slowly and
steadily for over 2 years. It was time to see it through.

>> Note that no function currently checks that the index is consistent
>> with the heap, which would be very useful (that's probably how I'd
>> eventually target the heapam, actually).
>
> I'm afraid I don't understand what "target the heapam" means. Could you
> elaborate?

Just that that's how I'd make amcheck verify that the heap was sane,
if I was going to undertake that project. Or, I'd start there.

> I agree. It'd be nice to have a tool that also checks for data
> corruption at a lower level (e.g. that varlena headers are not
> corrupted etc.) but that seems like a task for some other tool.

I'm not sure that that's a task for another tool. I think that this
tool is scoped at detecting corruption, and that could work well for
the heap, too. There might be fewer interesting things to test about
the heap when indexes aren't involved. Following HOT chains through an
index, and verifying things like that about the heap as we go could
detect a lot of problematic cases.

> Can we come up with names that more clearly identify the difference
> between those two functions? I mean, _parent_ does not make it
> particularly obvious that the second function acquires exclusive lock
> and performs more thorough checks.

Dunno about that. Its defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

> This also applies to the name of the contrib module - it's not
> particularly obvious what "amcheck" means unless you're familiar with the
> concept of access methods. Which is quite unlikely for regular users.
> Maybe something like "check_index" would be more illustrative?

I think that given the heap may be targeted in the future, amcheck is
more future-proof. I think Robert might have liked that name a year
ago or something. In general, I'm not too worried about what the
contrib module is called in the end.

> Well, I wouldn't count myself as an expert here, but do we actually need
> such protection? I mean, we only do such checks when holding an
> exclusive lock on the parent relation, no? And even if we don't, vacuum
> can only remove entries from the pages - why should that cause
> violations of any invariants?

I think that a more worked out explanation for why the ExclusiveLock
is needed is appropriate. I meant to do that.

Basically, a heavier lock is needed because of page deletion by
VACUUM, which is the major source of complexity (much more than page
splits, I'd say). In general, the key space can be consolidated by
VACUUM in a way that breaks child page checking because the downlink
we followed from our target page is no longer the current downlink.
Page deletion deletes the right sibling's downlink, and the deleted
page's downlink is used for its sibling. You could have a race, where
there was a concurrent page deletion of the left sibling of the child
page, then a concurrent insertion into the newly expanded keyspace of
the parent. Therefore, the downlink in the parent (which is the
"target", to use the patch's terminology) would not be a lower bound
on items in the page.

That's a very low probability race, because it involves deleting a
page representing a range in the keyspace that has no live items, but
then immediately getting one just in time for our check. But, I'm
pretty determined that false positives like that need to be impossible
(or else it's a bug).
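
A toy version of the invariant at stake may help. downlink_is_lower_bound is
a hypothetical name, keys are plain integers, and the numbers in the note
that follows are invented for illustration; this is a sketch of the idea, not
the patch's code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy invariant: the key stored in the parent's downlink must be a lower
 * bound on every key found in the child page it points to.
 */
bool
downlink_is_lower_bound(int downlink_key, const int *child_keys, size_t nkeys)
{
	size_t		i;

	for (i = 0; i < nkeys; i++)
	{
		if (child_keys[i] < downlink_key)
			return false;
	}
	return true;
}
```

Suppose the checker read downlink key 50 for a child holding {50, 60, 70}.
If the child's empty left sibling is then deleted, the child takes over that
sibling's key range, and a concurrent insert of 45 can land in it. Comparing
the now-stale downlink 50 against {45, 50, 60, 70} reports a violation even
though the index is fine. That false positive is what the heavier lock is
there to rule out.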

I have a paranoid feeling that there is a similar very low probability
race with the left-right keyspace checks, which don't have relation
ExclusiveLock protection (IOW, I think that that might be buggy). I
need to think about that some more, but current thinking is that it
would hardly matter if we used the highkey from right page rather than
the first data item, which definitely would be correct. And it would
be simpler.

> A few minor comments:

Thanks for catching those issues. Will fix.

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 1:31 PM, Peter Geoghegan  wrote:
> You could have a race, where
> there was a concurrent page deletion of the left sibling of the child
> page, then a concurrent insertion into the newly expanded keyspace of
> the parent. Therefore, the downlink in the parent (which is the
> "target", to use the patch's terminology) would not be a lower bound
> on items in the page.

Excuse me: I meant the newly expanded keyspace of the *child*. (The
parent's keyspace would have covered everything. It's naturally far
larger than either child's keyspace, since it typically has several
hundred pages.)


-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
 wrote:
> I do hope that my patch will be accepted in 9.6, so this conflict looks
> really bad.

I hope so too. I'll have to look into this issue.

> I think that error is caused by changes in pages layout. To save some space
> nonkey attributes are truncated
> when btree copies the indexed value into inner pages or into high key. You
> can look at index_reform_tuple() calls.

That seems like the kind of problem that would be expected when things
like that change. I think it's going to be hard to add new B-Tree
features without a tool like this, which was a big reason to work on
it; to make these new projects possible to test and review. I see many
opportunities to improve the B-Tree code, just as I imagine you do.
These projects are quite strategically important, because B-Trees are
so frequently used. I think that Postgres B-Trees produce excessive
random I/O for checkpoints for various reasons, and this disadvantages
Postgres when it is compared to other major systems. As things get
better in other areas of Postgres, these hard problems become more
important to solve.

> I wonder, whether you can add some check of catalog version to your module
> or this case requires more changes?

I think that it's just going to be tied to the Postgres version. So,
if your B-Tree patches are committed first, it's on me to make sure
they're handled correctly. Or vice-versa. Not worried that that will
be a problem.

We already take special steps to avoid the "minus infinity" item on
internal pages. I think that in the future, if Postgres B-Trees get
suffix truncation for internal page items, there are new problems for
amcheck (suffix truncation removes unneeded information from internal
page items, sometimes greatly increasing B-Tree fan-out. Internal page
items need only be sufficient to guide index scans correctly.).

Specifically, with suffix truncation there might be "minus infinity"
*attributes*, too (these could make it safe to completely remove
attributes/columns past the first distinguishing/distinct attribute on
each item on internal pages). That's a case that amcheck then needs to
care about, just like it currently cares about the existing concept of
minus infinity items. That's how it goes for amcheck.

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 3:50 PM, Jim Nasby  wrote:
> I also agree that the module name isn't very clear. If this is meant to be
> the start of a generic consistency checker, let's call it that. Otherwise, it
> should be marked as being specific to btrees, because presumably we might
> eventually want similar tools for GIN, etc. (FWIW I'd vote for a general
> consistency checker).

It's a generic consistency checker -- that's the intent. Robert wanted
to go that way about a year ago, and I think it makes sense (this
module started out life as btreecheck). As I said, I don't really care
what the name ends up being. amcheck seems fine to me, but I'll go
with any reasonable suggestion that reflects the scope of the tool as
a generic consistency checker for AMs, including heapam.

I don't know that I understand the concern about the naming of
bt_index_parent_check(). I called it something that accurately
reflects what it does. That particular SQL-callable function is more
orientated towards helping hackers than towards helping users detect
routine problems, as the docs say, but it could help users working
with hackers, and it might even help users acting on their own. I
don't want to make it sound scary, because it isn't. I don't know what
problems will be detected by this tool most often, and I don't think
it would be wise to try to predict how that will work out. And so,
concealing the internal workings of each function/check by giving them
all totally non-technical names seems like a bad idea. It's not that
hard to understand that a B-Tree has multiple levels, and checking the
levels against each other requires more restrictive/stronger
heavyweight locks. I've only added 2 SQL-callable functions, each of
which has one argument, and the first of which has a generic name and
very light locking requirements, like a SELECT. It's not that hard to
get the general idea, as an ordinary user.

> I know the vacuum race condition would be very rare, but I don't think it
> can be ignored. Last thing you want out of a consistency checker is false
> negatives/positives.

That's what I said. And the docs also say that there should never be a
false positive. That's definitely an important design goal here,
because it enables routine testing.

> I do think it would be reasonable to just wholesale
> block against concurrent vacuums, but I don't think there's any reasonable
> way to do that.

Actually, that's exactly what bt_index_parent_check() does already.
VACUUM requires a SHARE UPDATE EXCLUSIVE lock on the heap relation,
which cannot be concurrently acquired due to bt_index_parent_check()'s
acquisition of a SHARE lock. The locking for bt_index_parent_check()
is almost the same as REINDEX, except that that acquires an ACCESS
EXCLUSIVE lock on the index relation; bt_index_parent_check() only
requires an EXCLUSIVE lock on the index relation.
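
The lock interactions described above can be read off the table-level lock
conflict matrix in the PostgreSQL documentation. The sketch below encodes
that matrix as bitmasks; the enum, the table and toy_lock_conflicts are
hypothetical names written for illustration, not the lock manager's own
definitions.

```c
#include <stdbool.h>

/* Table-level lock modes, weakest to strongest (illustrative names). */
typedef enum ToyLockMode
{
	LM_ACCESS_SHARE = 0,
	LM_ROW_SHARE,
	LM_ROW_EXCLUSIVE,
	LM_SHARE_UPDATE_EXCLUSIVE,
	LM_SHARE,
	LM_SHARE_ROW_EXCLUSIVE,
	LM_EXCLUSIVE,
	LM_ACCESS_EXCLUSIVE,
	LM_NUM_MODES
} ToyLockMode;

#define LM_BIT(m) (1u << (m))

/* For each mode, the set of modes it conflicts with. */
static const unsigned toy_conflicts[LM_NUM_MODES] = {
	[LM_ACCESS_SHARE] = LM_BIT(LM_ACCESS_EXCLUSIVE),
	[LM_ROW_SHARE] = LM_BIT(LM_EXCLUSIVE) | LM_BIT(LM_ACCESS_EXCLUSIVE),
	[LM_ROW_EXCLUSIVE] = LM_BIT(LM_SHARE) | LM_BIT(LM_SHARE_ROW_EXCLUSIVE) |
		LM_BIT(LM_EXCLUSIVE) | LM_BIT(LM_ACCESS_EXCLUSIVE),
	[LM_SHARE_UPDATE_EXCLUSIVE] = LM_BIT(LM_SHARE_UPDATE_EXCLUSIVE) |
		LM_BIT(LM_SHARE) | LM_BIT(LM_SHARE_ROW_EXCLUSIVE) |
		LM_BIT(LM_EXCLUSIVE) | LM_BIT(LM_ACCESS_EXCLUSIVE),
	[LM_SHARE] = LM_BIT(LM_ROW_EXCLUSIVE) | LM_BIT(LM_SHARE_UPDATE_EXCLUSIVE) |
		LM_BIT(LM_SHARE_ROW_EXCLUSIVE) | LM_BIT(LM_EXCLUSIVE) |
		LM_BIT(LM_ACCESS_EXCLUSIVE),
	[LM_SHARE_ROW_EXCLUSIVE] = LM_BIT(LM_ROW_EXCLUSIVE) |
		LM_BIT(LM_SHARE_UPDATE_EXCLUSIVE) | LM_BIT(LM_SHARE) |
		LM_BIT(LM_SHARE_ROW_EXCLUSIVE) | LM_BIT(LM_EXCLUSIVE) |
		LM_BIT(LM_ACCESS_EXCLUSIVE),
	[LM_EXCLUSIVE] = LM_BIT(LM_ROW_SHARE) | LM_BIT(LM_ROW_EXCLUSIVE) |
		LM_BIT(LM_SHARE_UPDATE_EXCLUSIVE) | LM_BIT(LM_SHARE) |
		LM_BIT(LM_SHARE_ROW_EXCLUSIVE) | LM_BIT(LM_EXCLUSIVE) |
		LM_BIT(LM_ACCESS_EXCLUSIVE),
	[LM_ACCESS_EXCLUSIVE] = 0xFFu	/* conflicts with every mode */
};

/* True if a holder of 'held' blocks a request for 'requested'. */
bool
toy_lock_conflicts(ToyLockMode held, ToyLockMode requested)
{
	return (toy_conflicts[held] & LM_BIT(requested)) != 0;
}
```

A SHARE lock held on the heap conflicts with the SHARE UPDATE EXCLUSIVE lock
that VACUUM requests, so VACUUM waits. An EXCLUSIVE lock on the index still
admits ACCESS SHARE readers, which the ACCESS EXCLUSIVE lock taken by REINDEX
would not.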

Not sure about the cost delay thing. Delays are disabled by default
for manually issued VACUUM, so I have doubts that that's useful.

If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in amcheck.c.
But having that be dynamically configurable is not really compatible
with the goal of having amcheck be run routinely.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan  wrote:
> If you want the tool to limp on when it finds an error, that can be
> done by changing the constant for the CORRUPTION macro in amcheck.c.
> But having that be dynamically configurable is not really compatible
> with the goal of having amcheck be run routinely.

Also, it's just really hard to reason about what remains OK to check
after the first problem is encountered in the general case. It's
"unreasonable" for any of the checks to ever fail. So, by that
standard, assuming that they might fail at all could be called
paranoid. Who can say what "paranoid enough" should be? I think it's
useful to have a low-impact, generic check function for B-Tree
indexes, but I don't think we need to hold back on being exhaustive
elsewhere.

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-11 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby  wrote:
> Right, but you still have the option to enable them if you don't want to
> swamp your IO system. That's why CIC obeys it too. If I was running a
> consistency check on a production system I'd certainly want the option to
> throttle it. Without that option, I don't see running this on production
> systems as being an option. If that's not a goal then fine, but if it is a
> goal I think it needs to be there.
>
> Isn't it just a few extra lines of code to support it?

I see your point.

I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too). Note that the functions already use an appropriate
buffer access strategy (it avoids blowing shared_buffers, much like
VACUUM itself).

-- 
Peter Geoghegan



Re: [HACKERS] Proposal: BSD Authentication support

2016-03-11 Thread Peter Eisentraut

On 1/7/16 9:40 PM, Marisa Emerson wrote:
> There's a port for PAM, but we would prefer to use BSD Auth as its quite
> a lot cleaner and is standard on OpenBSD.
> 
> I've attached an updated patch that includes documentation. It has been
> tested against OpenBSD 5.8. I'll add this thread to the commitfest.

(Not a BSD user, just reviewing the code.)

configure.in has "build with BSD support", which should be "build with
BSD Authentication support".

There should be some documentation of the new configure option in
installation.sgml.

The documentation in client-auth.sgml speaks of a postgresql user and an
auth group.  Maybe that's clear to users of BSD, but I don't know
whether these are OS entities or groups that I need to create or what.

The auth_userokay() call hardcodes a "type" of "pg-auth".  That seems
important and should probably be documented.  Extrapolating from PAM, I
think that should perhaps be an option in pg_hba.conf.


Re: [HACKERS] Proposal: BSD Authentication support

2016-03-11 Thread Peter Eisentraut

On 3/11/16 4:38 PM, Thomas Munro wrote:
> It looks like this needs review from an OpenBSD user specifically.
> FreeBSD and NetBSD use PAM instead of BSD auth.

FreeBSD has man pages for this stuff, so maybe they also have it now.




Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-12 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 6:33 PM, Tomas Vondra
 wrote:
> Right, but isn't there a difference between the two functions in this
> respect? Once you find corruption involving relationship between
> multiple pages, then I agree it's complicated to do any reasoning about
> what additional checks are safe.
>
> But does that problem also apply to bt_index_check, which checks pages
> independently?

I think so, yes.

> Admittedly, this also depends on the use case. If all we need to do is
> answering a question "Is the index corrupted?" then sure, bailing out
> on the first error is perfectly appropriate.
>
> But perhaps this would be useful for some recovery/forensics tasks?

Maybe, but I feel like making it possible to change the CORRUPTION
elevel macro was the right trade-off. I don't want to have to reason
about the universe of possible problems that could occur when the tool
must limp on in the event of corruption. For example, I don't want to
have to deal with infinite loops. In practice, an expert would
probably be fine to change the constant themselves if they needed to.

Indexes can always be rebuilt. The tool is for identifying and
diagnosing corruption, but if you want to diagnose a faulty opclass or
something, then I think you need to get out pageinspect. You need
human judgement for that.

> From time to time we need to investigate corruption in a database, i.e.
> see how much of the data is actually corrupted, list pages that need to
> be zeroed to get the cluster up to salvage as much as possible, etc.
> Currently this is tedious because we essentially find/fix the pages one
> by one. It'd be very useful to list all broken pages in one go and then
> fix all of them.
>
> Obviously, that's about heapam checks, but perhaps it would be useful
> for an index too?

Only insofar as it helps diagnose the underlying issue, when it is a
more subtle issue. Actually fixing the index is almost certainly a
REINDEX. Once you're into the messy business of diagnosing a
problematic opclass, you have to be an expert, and tweaking amcheck
for your requirements (i.e. rebuilding from source) becomes
reasonable. Part of the reason that the code is so heavily commented
is to make it hackable, because I do not feel optimistic that I can
get an expert-orientated interface right, but I still want to make the
tool as useful as possible to experts.

-- 
Peter Geoghegan


Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2016-03-12 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 5:26 AM, Robert Haas  wrote:
> This patch was reviewed during CF 2016-01 and has not been updated for
> CF 2016-03.  I think we should mark it Returned with Feedback.

I have a full plate at the moment, Robert, both as a reviewer and as a
patch author. This patch is basically uncontroversial, and is built to
make the AM interface clearer, and the design of speculative insertion
easier to understand. It's clear we should have it. I'll get around to
revising it before too long.

-- 
Peter Geoghegan


Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2016-03-12 Thread Peter Geoghegan

On Sat, Mar 12, 2016 at 2:43 PM, Michael Paquier
 wrote:
> Only one version of this patch has been sent at the beginning of this
> thread, and Heikki has clearly expressed his disagreement about at
> least a portion of it at the beginning of this thread, so I find it
> hard to define it as an "uncontroversial" thing and something that is
> clear to have as things stand. Seeing a new version soon would be a
> good next step I guess.

What is the point in saying this, Michael? What purpose does it serve?

I said "basically uncontroversial", not "uncontroversial". That is a
perfectly accurate characterization of the patch, and if you disagree
then I suggest you re-read the thread. Andres and Heikki were both in
favor of this patch. Heikki and I discussed one particular aspect of
it, and then it trailed off. The only thing that Heikki categorically
stated was that he disliked one narrow aspect of the style of one
thing in one function. I've already said I'm happy to do that.

As things stand, the documentation for amcanunique methods, and the
way they are described internally, is fairly misleading.

-- 
Peter Geoghegan


Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2016-03-12 Thread Peter Geoghegan

On Sat, Mar 12, 2016 at 2:53 PM, Peter Geoghegan  wrote:
> I said "basically uncontroversial", not "uncontroversial". That is a
> perfectly accurate characterization of the patch, and if you disagree
> then I suggest you re-read the thread.

In particular, note that Alvaro eventually sided with me against the
thing that Heikki argued for:

http://www.postgresql.org/message-id/20160118195643.GA117199@alvherre.pgsql

Describing what happened that way is unfair on Heikki, because I don't
think he was at all firm in what he said about making the new
UNIQUE_CHECK_SPECULATIVE "like CHECK_UNIQUE_YES, but return FALSE
instead of throwing an error on conflict". We were working through the
design, and it didn't actually come to any kind of impasse.

It's surprising and disappointing to me that this supposed
disagreement has been blown out of all proportion.

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-13 Thread Peter Geoghegan

On Sat, Mar 12, 2016 at 7:46 PM, Matt Kelly  wrote:
> You can actually pretty easily produce a test case by setting up streaming
> replication between servers running two different version of glibc.
>
> I actually wrote a tool that spins up a pair of VMs using vagrant and then
> sets them up as streaming replica's using ansible.  It provides a nice one
> liner to get a streaming replica test environment going and it will easily
> provide the cross glibc test case.  Technically, though it belongs to Trip
> because I wrote it on company time.  Let me see if I can open source a
> version of it later this week that way you can use it for testing.

That could be interesting. The earlier prototypes of this tool are
known to have detected glibc collation incompatibilities in real
production systems.

-- 
Peter Geoghegan



Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-13 Thread Peter Geoghegan

On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes  wrote:
> Would the wiki be a good place for such tips?  Not as formal as the
> documentation, and more centralized (and editable) than a collection
> of blog posts.

That general direction makes sense, but I'm not sure if the Wiki is
something that this will work for. I fear that it could become
something like the TODO list page: a page that contains theoretically
accurate information, but isn't very helpful. The TODO list needs to
be heavily pruned, but that seems like something that will never
happen.

A centralized location for performance tips will probably only work
well if there are still high standards that are actively enforced.
There still needs to be tight editorial control.

-- 
Peter Geoghegan


Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-14 Thread Peter Geoghegan

On Thu, Mar 10, 2016 at 8:16 PM, Peter Geoghegan  wrote:
> On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut  wrote:
>> Arguably, if everyone followed "my" approach, this should be very easy
>> to fix everywhere.
>
> I don't think that there is any clear indication that the OpenSSL
> people would share that view. Or my view. Or anything that's sensible
> or practical or actionable.

Ideally, we'd be able to get this into the upcoming minor release.
This bug has caused Heroku some serious problems.

-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 10:39 AM, Robert Haas  wrote:
> I don't particularly like that interface.  I also suggest that it
> would be better to leave throttling to a future commit, and focus on
> getting the basic feature in first.

Works for me. I don't think throttling is an especially compelling feature.

-- 
Peter Geoghegan



Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 3:06 PM, Tom Lane  wrote:
> Agreed, we need to deal with this one way or the other.  My proposal
> is:
>
> 1. In HEAD, do it as Peter E. suggests, ie clear error queue before calls.
>
> 2. In back branches, clear error queue before *and* after calls.  This
> will waste a few nanoseconds but will avoid any risk of breaking
> existing third-party code.
>
> I think it's reasonable to expect extensions to update to the newer
> behavior in a major release, but we're taking risks if we expect
> that to happen in minor releases.

I am concerned that users will never be able to get this right, since
I think it requires every Ruby or PHP app using some thin OpenSSL
wrapper to clear the per-thread queue. It's a big mess, but it's our
mess to some degree.

I wonder if it would be just as good if we ensured that
ERR_get_error() was definitely called in any circumstance where it
looked like we might have an error: ERR_get_error() would be
*reliably* called, as in my patch, but unlike my patch only when
SSL_get_error() indicated a problem (not all the time).

Heikki believed that clearing the error queue by calling
ERR_clear_error() before calling an I/O function like SSL_read() was
necessary, as we all do; no controversy there. But Heikki also
believed, even prior to hearing about this bug, that it was important
and necessary for ERR_get_error() to be reached when there might be an
error added to the queue following a Postgres/libpq call to an I/O
function like SSL_read() followed by SSL_get_error() indicating
trouble. He thought, as I do, that it would be a good idea to not rely
on that happening from a distance (i.e. not relying on reaching
SSLerrmessage()). Peter E. seems to believe that there is absolutely
no reason to rely on ERR_get_error() getting called at all, and that
the existing SSLerrmessage() only exists for the purposes of producing
a human-readable error message.

Anyway, thinking about it some more, perhaps the best solution is to
do the ERR_get_error() call iff SSL_get_error() seems unhappy, perhaps
even placing the two into a utility function. That's probably almost
the same as the existing behavior, as far as clearing up the queue
after we may have added to it goes. I don't know if that's any less
safe than my patch's pessimistic approach. It seems like it might be
just as safe. Under this compromise, I think we'd probably clear the
error queue after SSL_get_error() returned a value that is not
SSL_ERROR_NONE, though (including SSL_ERROR_WANT_READ, etc). What do
you think about that?
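
The compromise described above can be sketched as follows. This is only
a model: the mock_* functions stand in for OpenSSL's per-thread error
queue and are NOT the real OpenSSL API, and checked_io() is a
hypothetical utility function, not code from the attached patch. It
shows the intended order of operations: clear stale entries before the
I/O call, then fetch and drain the queue only when the result is not
"none".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a per-thread error queue (not the real OpenSSL API). */
#define MOCK_QUEUE_MAX 8
static unsigned long mock_queue[MOCK_QUEUE_MAX];
static size_t mock_queue_len = 0;

/* Stand-in for the values an SSL_get_error()-style call could report. */
enum mock_io_result { MOCK_IO_NONE = 0, MOCK_IO_WANT_READ, MOCK_IO_SYSCALL };

void mock_push_error(unsigned long code)
{
    if (mock_queue_len < MOCK_QUEUE_MAX)
        mock_queue[mock_queue_len++] = code;
}

/* Pop the earliest queued code, or return 0 when the queue is empty. */
unsigned long mock_get_error(void)
{
    unsigned long code;
    size_t i;

    if (mock_queue_len == 0)
        return 0;
    code = mock_queue[0];
    for (i = 1; i < mock_queue_len; i++)
        mock_queue[i - 1] = mock_queue[i];
    mock_queue_len--;
    return code;
}

void mock_clear_errors(void) { mock_queue_len = 0; }
size_t mock_pending(void) { return mock_queue_len; }

/*
 * Hypothetical utility: run one simulated I/O step.  `simulated` is the
 * result the I/O call should report and `simulated_code` is the error it
 * queues on failure (0 for none).  On return *ecode holds the earliest
 * error code, or 0 if there was none.
 */
enum mock_io_result checked_io(enum mock_io_result simulated,
                               unsigned long simulated_code,
                               unsigned long *ecode)
{
    assert(ecode != NULL);

    /* 1. Distrust other users of the queue: drop stale entries first. */
    mock_clear_errors();

    /* 2. The simulated I/O call; a failure may add to the queue. */
    if (simulated != MOCK_IO_NONE && simulated_code != 0)
        mock_push_error(simulated_code);

    /* 3. Only when the result looks unhappy, fetch and drain the queue. */
    *ecode = 0;
    if (simulated != MOCK_IO_NONE)
    {
        *ecode = mock_get_error();
        mock_clear_errors();
    }
    return simulated;
}
```

The common success path never touches the queue after the call, which
is the concession to avoiding useless cycles; every other result leaves
the queue empty for whoever uses it next.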

> In any case, we need a patch that touches the backend-side code as well.

I agree that the backend-side code should be covered. I quickly
changed my mind about that, and am happy to produce a revision along
those lines.

-- 
Peter Geoghegan


Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 4:05 PM, Tom Lane  wrote:
> So your proposal is basically to do #2 in all branches?  I won't fight it,
> if it doesn't bloat the code much.  The overhead should surely be trivial
> compared to network communication costs, and I'm afraid you might be right
> about the risk of latent bugs.

Yes, with one small difference: I wouldn't be calling ERR_get_error()
in the common case where SSL_get_error() returns SSL_ERROR_NONE, on
the theory that skipping that case represents no risk. I'm making a
concession to Peter E's view that calling ERR_get_error() more often
will add useless cycles.

-- 
Peter Geoghegan


Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 5:04 PM, Robert Haas  wrote:
> There hasn't been a new version of this patch in 9 months, you're
> clearly not in a hurry to produce one, and nobody else seems to feel
> strongly that this is something that needs to be done at all.  I think
> we could just let this go and be just fine, but instead of doing that
> and moving onto patches that people do feel strongly about, we're
> arguing about this.  Bummer.

I'm busy working on fixing an OpenSSL bug affecting all released
versions right at the moment, but have a number of complex 9.6 patches
to review, most of which are in need of support. I'm very busy.

I said that I'd get to this patch soon. I might be kicking the can
down the road a little with this patch; if so, I'm sorry. I suggest we
leave it at that, until the CF is almost over or until I produce a
revision.

-- 
Peter Geoghegan


Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 5:21 PM, Andres Freund  wrote:
> So? You're not the only one. I don't see why we shouldn't move this to
> 'returned with feedback' until there's a new version.

I don't see any point in that; I intend to get a revision in to the
ongoing CF. But fine.

-- 
Peter Geoghegan



Re: [HACKERS] Fix for OpenSSL error queue bug

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 4:11 PM, Peter Geoghegan  wrote:
> Yes, with one small difference: I wouldn't be calling ERR_get_error()
> in the common case where SSL_get_error() returns SSL_ERROR_NONE, on
> the theory that skipping that case represents no risk. I'm making a
> concession to Peter E's view that calling ERR_get_error() more often
> will add useless cycles.

The attached patch is what I have in mind.

I can produce a back-patchable variant of this if you and Peter E.
think this approach is okay.

-- 
Peter Geoghegan
From f7a72e36cdf2ff58857bd962e26daabdc5747fe1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 26 Jan 2016 15:11:15 -0800
Subject: [PATCH] Distrust external OpenSSL clients; clear err queue

OpenSSL has an unfortunate tendency to mix per-session state error
handling with per-thread error handling.  This can cause problems when
programs that link to libpq with OpenSSL enabled have some other use of
OpenSSL; without care, one caller of OpenSSL may cause problems for the
other caller.  Backend code might similarly be affected, for example
when a third party extension independently uses OpenSSL without taking
the appropriate precautions.

To fix, don't trust other users of OpenSSL to clear the per-thread error
queue.  Instead, clear the entire per-thread queue ahead of certain I/O
operations when it appears that there might be trouble (these I/O
operations mostly need to call SSL_get_error() to check for success,
which relies on the queue being empty).  This is slightly aggressive,
but it's pretty clear that the other callers have a very dubious claim
to ownership of the per-thread queue.  Do this in both frontend and
backend code.

Finally, be more careful about clearing our own error queue, so as to
not cause these problems ourselves.  It's possible that control previously
did not always reach SSLerrmessage(), where ERR_get_error() was supposed
to be called to clear the queue's earliest code.  Make sure
ERR_get_error() is always called, so as to spare other users of OpenSSL
the possibility of similar problems caused by libpq (as opposed to
problems caused by a third party OpenSSL library like PHP's OpenSSL
extension).  Again, do this in both frontend and backend code.

See bug #12799 and https://bugs.php.net/bug.php?id=68276

Based on patches by Dave Vitek and Peter Eisentraut.

Back-patch to all supported versions.
---
 src/backend/libpq/be-secure-openssl.c| 104 ++--
 src/interfaces/libpq/fe-secure-openssl.c | 114 ---
 2 files changed, 173 insertions(+), 45 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 1e3dfb6..8fd92ab 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -78,7 +78,7 @@ static DH  *tmp_dh_cb(SSL *s, int is_export, int keylength);
 static int	verify_cb(int, X509_STORE_CTX *);
 static void info_cb(const SSL *ssl, int type, int args);
 static void initialize_ecdh(void);
-static const char *SSLerrmessage(void);
+static const char *SSLerrmessage(unsigned long ecode);

 static char *X509_NAME_to_cstring(X509_NAME *name);

@@ -182,7 +182,7 @@ be_tls_init(void)
 		if (!SSL_context)
 			ereport(FATAL,
 	(errmsg("could not create SSL context: %s",
-			SSLerrmessage())));
+			SSLerrmessage(ERR_get_error()))));

 		/*
 		 * Disable OpenSSL's moving-write-buffer sanity check, because it
@@ -198,7 +198,7 @@ be_tls_init(void)
 			ereport(FATAL,
 	(errcode(ERRCODE_CONFIG_FILE_ERROR),
   errmsg("could not load server certificate file \"%s\": %s",
-		 ssl_cert_file, SSLerrmessage())));
+		 ssl_cert_file, SSLerrmessage(ERR_get_error()))));

 		if (stat(ssl_key_file, &buf) != 0)
 			ereport(FATAL,
@@ -228,12 +228,12 @@ be_tls_init(void)
 		SSL_FILETYPE_PEM) != 1)
 			ereport(FATAL,
 	(errmsg("could not load private key file \"%s\": %s",
-			ssl_key_file, SSLerrmessage())));
+			ssl_key_file, SSLerrmessage(ERR_get_error()))));

 		if (SSL_CTX_check_private_key(SSL_context) != 1)
 			ereport(FATAL,
 	(errmsg("check of private key failed: %s",
-			SSLerrmessage())));
+			SSLerrmessage(ERR_get_error()))));
 	}

 	/* set up ephemeral DH keys, and disallow SSL v2/v3 while at it */
@@ -262,7 +262,7 @@ be_tls_init(void)
 			(root_cert_list = SSL_load_client_CA_file(ssl_ca_file)) == NULL)
 			ereport(FATAL,
 	(errmsg("could not load root certificate file \"%s\": %s",
-			ssl_ca_file, SSLerrmessage())));
+			ssl_ca_file, SSLerrmessage(ERR_get_error()))));
 	}

 	/*--
@@ -293,7 +293,7 @@ be_tls_init(void)
 			else
 ereport(FATAL,
 		(errmsg("could not load SSL certificate revocation list file \"%s\": %s",
-ssl_crl_file, SSLerrmessage())));
+ssl_crl_f

Re: [HACKERS] Minor bug affecting ON CONFLICT lock wait log messages

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 7, 2016 at 1:46 PM, Peter Geoghegan  wrote:
> Attached patch fixes a bug reported privately by Stephen this morning.

Bump.

I would like to see this in the next point release. It shouldn't be
hard to review.

Thanks
-- 
Peter Geoghegan



Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2016-03-14 Thread Peter Geoghegan

On Wed, Sep 16, 2015 at 8:53 AM, Nicolas Barbier
 wrote:
> After thinking about it a bit more, it indeed seems never useful to
> have f3 in the internal nodes if it is not part of the columns that
> determine the UNIQUE property. It could as well be pushed out of the
> internal nodes and only appear in the leaf nodes.

Correct. That's a standard technique in B-Tree implementations like
our own; suffix truncation can remove unneeded information from the
end of values, possibly including entire attributes, which can be
truncated in a way that is similar to this patch.

The difference here is only that this patch does not dynamically
determine which attributes can be removed while re-finding the parent
downlink in the second phase of a page split (the usual place it
happens with standard suffix truncation). Rather, this patch knows "a
priori" that it can truncate attributes that are merely "included"
attributes. That means that this patch has as much to do with
increasing B-Tree fan-out for these indexes as it does for making
unique indexes more usable for index-only scans. Both of those goals
are important, IMV.

This patch seems pretty cool. I noticed some issues following a quick
read through the patch including_columns_6.0.patch that Anastasia
should look into:

* You truncate (remove suffix attributes -- the "included" attributes)
within _bt_insertonpg():

-   right_item = CopyIndexTuple(item);
+   indnatts = IndexRelationGetNumberOfAttributes(rel);
+   indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+   if (indnatts != indnkeyatts)
+   {
+   right_item = index_reform_tuple(rel, item, indnatts, indnkeyatts);
+   right_item_sz = IndexTupleDSize(*right_item);
+   right_item_sz = MAXALIGN(right_item_sz);
+   }
+   else
+   right_item = CopyIndexTuple(item);
ItemPointerSet(&(right_item->t_tid), rbkno, P_HIKEY);

I suggest that you do this within _bt_insert_parent() instead, iff
the original target page is known to be a leaf page. That's where it
needs to happen for conventional suffix truncation, which has special
considerations when determining which attributes are safe to truncate
(or even which byte in the first distinguishing attribute it is okay
to truncate past). Conventional suffix truncation (not this patch)
could truncate, say, "C" collation text past the first distinguishing
byte, where the byte distinguishes the new downlink being inserted
into the parent page from both the left downlink and the right
downlink in the parent page; only the minimum amount of information
needed to correctly guide later index scans is stored. But it isn't
correct (again, with conventional suffix truncation) to do this past
the leaf level. It must end there.
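
As a rough illustration of conventional suffix truncation on "C"
collation text, the sketch below computes how many bytes of the first
key on the right page must be kept so that the result still sorts
strictly after the last key on the left page. The function name
separator_length() is invented for this example, and it only models
byte-wise string keys; it is not nbtree code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Length of the shortest prefix of `right` that still sorts strictly
 * after `left` under byte-wise ordering.
 * Precondition: left sorts strictly before right.
 */
size_t separator_length(const char *left, const char *right)
{
    size_t i = 0;

    assert(strcmp(left, right) < 0);

    /* Skip the bytes that the two keys have in common. */
    while (left[i] != '\0' && left[i] == right[i])
        i++;

    /* Keep the common prefix plus the first distinguishing byte. */
    return i + 1;
}
```

For the keys "apple" and "apricot" this keeps three bytes ("apr"),
which is enough to send searches for "apple" left and searches for
"apricot" right; everything after the distinguishing byte is dropped.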

It isn't safe past the leaf level (by which I mean when inserting a
downlink into its parent, one level up) because applying suffix
truncation again for the next level up might guide a search to the
highest node in the left sub-tree rather than to the lowest node in
the right sub-tree, or vice versa. In general, we must be careful
about "cousin" nodes, that are beside each other but are not
"siblings" due to not sharing the same parent. It doesn't really
matter that this restriction exists, because you get almost all the
benefit at the leaf -> immediate parent level anyway. Higher levels
will reuse already truncated Index Tuples, that are typically
"truncated enough".

So, this should work in a similar way to conventional suffix
truncation (BTW, you should work on that later). And so, it should
just do it there. Besides, checking it only where it could possibly
help is clearer, since as written the code in _bt_insertonpg() will
never need to truncate following a non-leaf/internal page split.

* I think the comparison logic may have a bug.

Does this work with amcheck? Maybe it works with bt_index_check(), but
not bt_index_parent_check()? I think that you need to make sure that
_bt_compare() knows about this, too. That's because it isn't good
enough to let a truncated internal IndexTuple compare equal to a
scankey when non-truncated attributes are equal. I think you need to
have an imaginary "minus infinity" attribute past the first
non-truncated attribute (i.e. "minus infinity value" for the first
*truncated* attribute). That way, the downlinks will always be lower
bounds when the non-"included"/truncated attributes are involved. This
seems necessary. No?

It's necessary because you aren't storing any attributes, so it's not
acceptable to even attempt a comparison -- I think that will segfault
(doesn't matter that the index scan wouldn't have returned anything
anyway). I think it's also necessary because of issues with "cousin"
nodes making index scans lose their way.
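
A minimal sketch of the "minus infinity" idea, assuming integer
attributes and a hypothetical compare_truncated() helper (this is not
_bt_compare(), just a model of the rule): attributes that were
truncated away are never read, and a tuple that is missing attributes
compares as lower than any scankey that matches on the stored ones.

```c
#include <assert.h>

/*
 * Compare a scankey of nkeys attributes against a tuple that physically
 * stores only natts attributes (natts <= nkeys).  Returns <0, 0 or >0
 * depending on whether the scankey sorts before, equal to, or after
 * the tuple.  Truncated attributes are treated as "minus infinity".
 */
int compare_truncated(const int *key, int nkeys, const int *tuple, int natts)
{
    int i;

    assert(natts >= 0 && natts <= nkeys);

    /* Only attributes that are really stored are ever dereferenced. */
    for (i = 0; i < natts; i++)
    {
        if (key[i] < tuple[i])
            return -1;
        if (key[i] > tuple[i])
            return 1;
    }

    /*
     * All stored attributes are equal.  If the tuple was truncated, its
     * next attribute is minus infinity, so the scankey sorts after it.
     */
    return (natts < nkeys) ? 1 : 0;
}
```

With this rule a truncated downlink is never reported as equal to a
full scankey, so it always acts as a lower bound for its subtree.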

That's all I have right now. Nice work.
-- 
Peter Geoghegan


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 8:43 PM, Peter Geoghegan  wrote:
>
> Does this work with amcheck? Maybe it works with bt_index_check(), but
> not bt_index_parent_check()? I think that you need to make sure that
> _bt_compare() knows about this, too. That's because it isn't good
> enough to let a truncated internal IndexTuple compare equal to a
> scankey when non-truncated attributes are equal. I think you need to
> have an imaginary "minus infinity" attribute past the first
> non-truncated attribute (i.e. "minus infinity value" for the first
> *truncated* attribute). That way, the downlinks will always be lower
> bounds when the non-"included"/truncated attributes are involved. This
> seems necessary. No?

Maybe you can store information about minus infinity attributes in
"itup->t_tid.ip_posid". As you know, this is unused within
internal/non-leaf pages, whose downlink items only need a block number
(the child's block number/location on disk for that particular
downlink). That's a bit ugly, but there are plenty of bits available
from there, so use them if you need them.
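
One way this could look, purely as a sketch: the struct below is a
simplified stand-in for an item pointer (a block number plus a 16-bit
offset field), and the mask and helper names are hypothetical. It
assumes the number of truncated attributes fits in the low 6 bits,
which would be enough if an index has at most 32 columns.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for an item pointer: block number plus offset. */
struct mock_item_pointer {
    uint32_t block;   /* child block number; the part a downlink needs */
    uint16_t posid;   /* otherwise unused on internal pages */
};

/* Hypothetical layout: low 6 bits of posid hold a truncated-attribute count. */
#define TRUNC_ATTS_MASK 0x003Fu

void set_truncated_atts(struct mock_item_pointer *tid, unsigned n)
{
    assert(tid != 0);
    assert(n <= TRUNC_ATTS_MASK);
    /* Preserve the other bits so they stay free for later use. */
    tid->posid = (uint16_t) ((tid->posid & ~TRUNC_ATTS_MASK) | n);
}

unsigned get_truncated_atts(const struct mock_item_pointer *tid)
{
    assert(tid != 0);
    return tid->posid & TRUNC_ATTS_MASK;
}
```

The block number is left untouched, so the downlink still points at
the same child page regardless of what is stored in the offset bits.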

-- 
Peter Geoghegan



Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2016-03-14 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 8:43 PM, Peter Geoghegan  wrote:
> * I think the comparison logic may have a bug.
>
> Does this work with amcheck? Maybe it works with bt_index_check(), but
> not bt_index_parent_check()? I think that you need to make sure that
> _bt_compare() knows about this, too. That's because it isn't good
> enough to let a truncated internal IndexTuple compare equal to a
> scankey when non-truncated attributes are equal. I think you need to
> have an imaginary "minus infinity" attribute past the first
> non-truncated attribute (i.e. "minus infinity value" for the first
> *truncated* attribute). That way, the downlinks will always be lower
> bounds when the non-"included"/truncated attributes are involved. This
> seems necessary. No?

Oh, BTW: You probably need to worry about high key items as a special
case, too. Note that there is a special case when the ScanKey is equal
to the high key on a page during insertion. As the nbtree README puts
it:

"""
An insertion that sees the high key of its target page is equal to the key
to be inserted has a choice whether or not to move right, since the new
key could go on either page.  (Currently, we try to find a page where
there is room for the new key without a split.)

"""

Just something to watch out for if you add "minus infinity" attributes
as I suggested. Not exactly sure what to do about this other problem,
but it seems manageable.

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-14 Thread Peter Geoghegan

On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
 wrote:
> But I have some concerns about compatibility with my patches.
> I've tried to call bt_index_check() over my "including" patch [1] and caught
> a segfault.
>
> LOG:  server process (PID 31794) was terminated by signal 11: Segmentation
> fault
> DETAIL:  Failed process was running: select bt_index_check('idx');
>
> I do hope that my patch will be accepted in 9.6, so this conflict looks
> really bad.
> I think that error is caused by changes in pages layout. To save some space
> nonkey attributes are truncated

> [1] https://commitfest.postgresql.org/9/433/

I posted a review of your "Covering + unique indexes" patch, where I
made an educated guess about what the problem is here (I sort of
hinted at what I thought it was already, in this thread, actually). I
haven't actually tested this theory of mine myself just yet, but let
me know what you think of it on the thread for your patch.

-- 
Peter Geoghegan



Re: [HACKERS] jsonb array-style subscription

2016-03-14 Thread Peter Geoghegan

On Thu, Mar 3, 2016 at 2:31 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Well, actually, I agree with that. I can try to rework the patch to achieve
> this goal.

Good idea.

I wonder, having taken a quick look at the patch, how this works?:

+select * from test_jsonb_subscript where test_json['key_doesnt_exists'] = '"value"';
+ id | test_json
+----+-----------
+(0 rows)

Can this use an index, in principle? If not, do you have a plan to get
it to a place where it can? How can the expression be made to map on
to existing indexable operators, such as the containment operator @>,
or even B-Tree opclass operators like = or <=, for example?

This kind of thing was always my main problem with jsonb array-style
subscription. I think it's really quite desirable in theory, but I
also think that these problems need to be fixed first. Think that I
made this point before.

ISTM that these expressions need to be indexable in some way, which
seems like a significantly harder project, especially because the
mapping between an expression in a predicate like this and an
indexable operator like @> is completely non-obvious. Making such a
mapping itself extensible seems even more tricky, which is what it
would take, I suspect. Indexing is always of great importance for
jsonb. It's already too complicated.

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-15 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote
 wrote:
>> Dunno about that. Its defining characteristic is that it checks child
>> pages against their parent IMV. Things are not often defined in terms
>> of their locking requirements.
>
> At the risk of sounding a bit verbose, do bt_check_level() for a check
> that inspects a level at a time and bt_check_multi_level() for a check
> that spans levels sound descriptive?

Hmm. But all functions verify multiple levels. What distinguishes
bt_index_parent_check()'s verification is that the downlinks in
internal pages are checked against actual child pages (every item in
the child page, in fact). It's the parent/child relationship that is
verified in addition to the standard checks of every page on and
across (not between) every level.
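As a rough illustration of the extra parent/child verification (a sketch only, with plain integer keys and a hypothetical function name, not the amcheck implementation): the downlink key in the parent must be a lower bound on every item in the child page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if every key on the child page is >= the downlink key held
 * in the parent, i.e. the downlink is a valid lower bound for the child.
 */
static bool
sketch_downlink_is_lower_bound(int downlink_key, const int *child_keys,
                               size_t nkeys)
{
    assert(child_keys != NULL || nkeys == 0);
    for (size_t i = 0; i < nkeys; i++)
    {
        if (child_keys[i] < downlink_key)
            return false;   /* child item sorts before its parent's downlink */
    }
    return true;
}
```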

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-15 Thread Peter Geoghegan

On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
 wrote:
> Ah, I see the nuance.  Thanks for the explanation.  Maybe,
> bt_index_check() and bt_index_parent_child_check() /
> bt_index_check_parent_child().  IMHO, the latter more clearly highlights
> the fact that parent/child relationships in the form of down-links are
> checked.

Well, the downlink is in the parent, because there is no such thing as
an "uplink". So I prefer bt_index_parent_check(), since it usefully
hints at starting from the parent. It's also more concise.

> By the way, one request (as a non-native speaker of English language, who
> ends up looking up quite a few words regularly) -
>
> Could we use "conform" or "correspond" instead of "comport" in the
> following error message:
>
> "left link/right link pair in index \"%s\" don't comport"

OK. I'll do something about that.

-- 
Peter Geoghegan



Re: [HACKERS] [NOVICE] WHERE clause not used when index is used

2016-03-15 Thread Peter Geoghegan

On Thu, Mar 3, 2016 at 2:07 AM, Simon Riggs  wrote:
> Later, I will add the tests we discovered here to index scans, so that
> further optimization work is more easily possible.

Please do.

I would like to start testing the B-Tree code more exhaustively by
adding a test suite to amcheck. This test suite would indirectly test
external sorting, B-Tree page deletion, edge-cases with very large
IndexTuples, etc.

Ideas for good areas of the B-Tree code to add tests for are welcome.

-- 
Peter Geoghegan


Re: FW: [HACKERS] [NOVICE] WHERE clause not used when index is used

2016-03-15 Thread Peter Eisentraut

On 3/15/16 2:28 PM, Jernigan, Kevin wrote:
> I recently joined the product management team for AWS RDS Postgres
> (after years at Oracle in their database team), and we are very
> interested in confirming (or not) that the fix for the problem below
> will be included in 9.5.2, and in the community’s plans (likely date)
> for releasing 9.5.2.

The patch was reverted in the 9.5 branch, so assuming that that is the
end of this investigation (which it appears to be), then it will be part
of the 9.5.2 release.

> Is there an email list other than hackers where we can follow
> discussions on release plans for 9.5.2 (and future releases)?

This is a good list to follow to know about release schedules.




Re: [HACKERS] Minor bug affecting ON CONFLICT lock wait log messages

2016-03-15 Thread Peter Geoghegan

On Tue, Mar 15, 2016 at 6:18 AM, Stephen Frost  wrote:
> Agreed.  I'm going to play with it a bit more but barring objections,
> I'll commit and back-patch Peter's patch.

Thanks for taking care of this, Stephen.


-- 
Peter Geoghegan



Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2016-03-15 Thread Peter Geoghegan

On Mon, Mar 14, 2016 at 8:43 PM, Peter Geoghegan  wrote:
> Does this work with amcheck? Maybe it works with bt_index_check(), but
> not bt_index_parent_check()? I think that you need to make sure that
> _bt_compare() knows about this, too. That's because it isn't good
> enough to let a truncated internal IndexTuple compare equal to a
> scankey when non-truncated attributes are equal. I think you need to
> have an imaginary "minus infinity" attribute past the first
> non-truncated attribute (i.e. "minus infinity value" for the first
> *truncated* attribute). That way, the downlinks will always be lower
> bounds when the non-"included"/truncated attributes are involved. This
> seems necessary. No?

Actually, I now think I got this slightly wrong.

What's at issue is this (from nbtree README):

"""
Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page.  This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up).  Therefore we assume
Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page.

"""

Today, nbtree needs to check the page to the left of an equal internal
page downlink child anyway. That isn't hard-coded into _bt_compare(),
though. If it was, it would be a "positive infinity" attribute, not
"negative infinity" as I incorrectly said. This is because the equal
IndexTuples might easily not begin exactly at the beginning of the
downlink's child page (often, we should begin in the left page
instead, by following the previous downlink in the parent instead --
just in case).

Any kind of "infinity" attribute probably isn't necessary for your
patch today, since, as referenced in the README extract above, our
slightly non-standard L&Y does this in _bt_binsrch():

 /*
  * At this point we have high == low, but be careful: they could point
  * past the last slot on the page.
  *
  * On a leaf page, we always return the first key >= scan key (resp. >
  * scan key), which could be the last slot + 1.
  */
 if (P_ISLEAF(opaque))
     return low;
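A simplified sketch of that search, for the nextkey=false case only, may help. It assumes integer keys in a 0-based array and treats slot 0 of an internal page as a "minus infinity" placeholder; sketch_binsrch is a hypothetical name and this is not the real _bt_binsrch().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Leaf page: return the first slot whose key is >= scankey (may be nkeys).
 * Internal page: return the slot just before that one, so that an exact
 * match on a bounding key descends to the left of that key.
 */
static size_t
sketch_binsrch(const int *keys, size_t nkeys, int scankey, bool is_leaf)
{
    size_t low = is_leaf ? 0 : 1;   /* internal slot 0 acts as minus infinity */
    size_t high = nkeys;

    assert(keys != NULL && nkeys > 0);
    while (low < high)
    {
        size_t mid = low + (high - low) / 2;

        if (keys[mid] < scankey)
            low = mid + 1;
        else
            high = mid;
    }
    if (is_leaf)
        return low;
    return low - 1;
}
```

With internal keys {-inf, 10, 20, 30} and a scan key of 20, the search lands on the downlink for 10 rather than the one for 20, which is the "descend to the left" behavior the README describes.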

However, I think it's still a good idea to have a special integer in
the IndexTuple explicitly indicating the attribute at which the
"suffix" is truncated, even if the "suffix truncation" happens at a
consistent point based on an index in your patch. That will change in
the future, and we should be prepared.

Even though I was partially mistaken, clearly it still wasn't okay to
even try to compare non-existent attributes in internal pages, since
that segfaulted. So a (mostly imaginary) "positive infinity" attribute
can still exist, initially just to make _bt_compare() not crash. This
attribute number (stored in "itup->t_tid.ip_posid") effectively tells
the binary search code to look at the child to the left of the
compared downlink (not the downlink child itself), even though that's
already going to happen per the code above. So, thinking about it once
more (uh, sorry), _bt_compare() has to "indicate equality"/return 0,
*despite* being *logically* a "positive infinity" comparison from a
higher level, in order to let the code above handle it instead, so
it isn't handled more than once. Also, not sure if the less common
"nextkey == true" case needs some further consideration (let's forget
that detail for time being, though). Phew!
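In other words, something along these lines (a sketch under the assumption of integer attributes; sketch_compare and its parameters are hypothetical, and the real _bt_compare() works on ScanKeys and Datums):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compare a scan key against a possibly truncated tuple.  Only attributes
 * physically present in the tuple are examined; truncated ones are never
 * touched.  Returns <0, 0 or >0 like a normal comparator, and reports
 * equality when all present attributes match.
 */
static int
sketch_compare(const int *scankey, size_t nscankeys,
               const int *tuple_attrs, size_t natts_present)
{
    size_t n = nscankeys < natts_present ? nscankeys : natts_present;

    assert(scankey != NULL && tuple_attrs != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (scankey[i] != tuple_attrs[i])
            return scankey[i] < tuple_attrs[i] ? -1 : 1;
    }
    return 0;   /* equal on every attribute that actually exists */
}
```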

So, as I said, _bt_binsrch() and/or _bt_compare() can be fixed to make
sure that the scan arrives on the correct leaf page (the first leaf
page that a matching IndexTuple could be on). What then, though? What
about leaf pages, that *do* have the extra attributes ("INCLUDING"
attributes) represented in their tuples, and *don't* "return
OffsetNumberPrev(low)" at the end of _bt_binsrch() (they do the
P_ISLEAF() thing quoted above)? Are they safe? Remember:

* For nextkey=false (cmpval=1), the loop invariant is: all slots before
* 'low' are < scan key, all slots at or after 'high' are >= scan key.

I think this means that you need to be very careful about leaf pages, too.

Speculative insertion (used by UPSERT) may not even have the extra
attributes, during the precheck that starts from within
check_exclusion_or_unique_constraint() -- it needs to start from the
very start at the leaf level, without regard for the particular
details of the non-constrained extra columns. I see that you take the
number of attributes a new way, so that ultimately _bt_compare

Re: [HACKERS] pam auth - add rhost item

2016-03-15 Thread Peter Eisentraut

On 3/10/16 8:11 AM, Grzegorz Sampolski wrote:
> Attached is a new patch with updated documentation and with a small change
> to coding style as you suggested.

This patch seems fine.  I'm not sure about the name "pamusedns" for the
option, since we use the OS resolver, which might not actually use DNS.
Maybe something like "pam_use_hostname"?


Re: [HACKERS] insufficient qualification of some objects in dump files

2016-03-15 Thread Peter Eisentraut

On 2/26/16 1:30 AM, Tom Lane wrote:
> As the patch is presented, I agree with Peter that it does not really
> need a format number bump.  The question that has to be answered is
> whether this solution is good enough?  You could not trust it for
> automated processing of tags --- it's easy to think of cases in which the
> schema/object name separation would be ambiguous.  So is the tag really
> "strictly for human consumption"?  I'm not sure about that.

Well what are those tags for?  They are not used by pg_restore, so they
are for users.  My understanding is that the tags help in editing a TOC
list for use by pg_restore.  What pg_restore actually reads are the
OIDs, but the tags are there so users can edit the files.  The tags can
also be used for ad hoc automatic processing.  They are not sufficiently
delimited and escaped for robustness in all cases, but it can be done if
you control the inputs and know what to expect.  But this is the same
problem before and after my patch.

Both of these cases are helped by my patch, and both of these cases were
pretty broken (for the object classes in question) before my patch.


Re: [HACKERS] IF (NOT) EXISTS in psql-completion

2016-03-15 Thread Peter Eisentraut

On 2/5/16 3:09 AM, Kyotaro HORIGUCHI wrote:
> I considered how to make tab-completion robust for syntactical
> noises, in other words, optional words in syntax. Typically "IF
> (NOT) EXISTS", UNIQUE and TEMPORARY are words that don't affect
> further completion.

To repeat the question I raised in the previous commit fest about tab
completion: Why do you want tab completion for IF NOT EXISTS?  When you
tab complete, the completion mechanism will show you whether the item in
question exists.  What is the use case?


Re: [HACKERS] Relaxing SSL key permission checks

2016-03-15 Thread Peter Eisentraut

On 3/10/16 9:20 PM, Peter Eisentraut wrote:
> On 3/4/16 3:55 PM, Alvaro Herrera wrote:
>> * it failed to check for S_IXUSR, so permissions 0700 were okay, in
>> contradiction with what the error message indicates.  This is a
>> preexisting bug actually.  Do we want to fix it by preventing a
>> user-executable file (possibly breaking compatibility with existing
>> executable key files), or do we want to document what the restriction
>> really is?
> 
> I think we should not check for S_IXUSR.  There is no reason for doing that.
> 
> I can imagine that key files are sometimes copied around using USB
> drives with FAT file systems or other means of that sort where
> permissions can get scrambled.  While I hate gratuitous executable bits as
> much as the next person, insisting here would just create annoyances in
> practice.

I'm happy with this patch except this minor point.  Any final comments?




Re: [HACKERS] syslog configurable line splitting behavior

2016-03-15 Thread Peter Eisentraut

On 3/8/16 9:12 PM, Andreas Karlsson wrote:
> I have one nitpick: why is one of the variables "true" while the other
> is "on" in the example? I think both should be "on".
> 
> #syslog_sequence_numbers = true
> #syslog_split_lines = on
> 
> Another possible improvement would be to change "Split messages sent to
> syslog." to something more verbose like "Split messages sent to syslog,
> by lines and to fit in 1024 bytes.".

Updated patches with your suggestions.  I also renamed
syslog_split_lines to syslog_split_messages, which I think is more accurate.


From 70bacecba46eb38c02c43957c2f1812faf5684df Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 26 Feb 2016 22:34:30 -0500
Subject: [PATCH 1/2] Add syslog_sequence_numbers parameter

---
 doc/src/sgml/config.sgml  | 28 +++
 src/backend/utils/error/elog.c| 12 ++--
 src/backend/utils/misc/guc.c  | 10 ++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/utils/elog.h  |  1 +
 5 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6c73fb4..bbe87ce 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4305,6 +4305,34 @@ Where To Log

   
 
+  
+   syslog_sequence_numbers (boolean)
+
+ syslog_sequence_numbers configuration parameter
+
+   
+
+   
+
+ When logging to syslog and this is on (the
+ default), then each message will be prefixed by an increasing
+ sequence number (such as [2]).  This circumvents
+ the --- last message repeated N times --- suppression
+ that many syslog implementations perform by default.  In more modern
+ syslog implementations, repeat message suppression can be configured
+ (for example, $RepeatedMsgReduction
+ in rsyslog), so this might not be
+ necessary.  Also, you could turn this off if you actually want to
+ suppress repeated messages.
+
+
+
+ This parameter can only be set in the postgresql.conf
+ file or on the server command line.
+
+  
+ 
+
  
   event_source (string)
   
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 5b7554b..88421c7 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -106,6 +106,7 @@ int			Log_error_verbosity = PGERROR_VERBOSE;
 char	   *Log_line_prefix = NULL;		/* format for extra log line info */
 int			Log_destination = LOG_DESTINATION_STDERR;
 char	   *Log_destination_string = NULL;
+bool		syslog_sequence_numbers = true;
 
 #ifdef HAVE_SYSLOG
 
@@ -2018,7 +2019,11 @@ write_syslog(int level, const char *line)
 
 			chunk_nr++;
 
-			syslog(level, "[%lu-%d] %s", seq, chunk_nr, buf);
+			if (syslog_sequence_numbers)
+				syslog(level, "[%lu-%d] %s", seq, chunk_nr, buf);
+			else
+				syslog(level, "[%d] %s", chunk_nr, buf);
+
 			line += buflen;
 			len -= buflen;
 		}
@@ -2026,7 +2031,10 @@ write_syslog(int level, const char *line)
 	else
 	{
 		/* message short enough */
-		syslog(level, "[%lu] %s", seq, line);
+		if (syslog_sequence_numbers)
+			syslog(level, "[%lu] %s", seq, line);
+		else
+			syslog(level, "%s", line);
 	}
 }
 #endif   /* HAVE_SYSLOG */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f0d4ec1..3ef432a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1632,6 +1632,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
+			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
+			NULL
+		},
+		&syslog_sequence_numbers,
+		true,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee3d378..b72ea6d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 # These are relevant when logging to syslog:
 #syslog_facility = 'LOCAL0'
 #syslog_ident = 'postgres'
+#syslog_sequence_numbers = on
 
 # This is only relevant when logging to eventlog (win32):
 #event_source = 'PostgreSQL'
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 7d338dd..e245b2e 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -397,6 +397,7 @@ extern int	Log_error_verbosity;
 extern char *Log_line_prefix;
 extern int	Log_destination;
 extern char *Log_destination_string;
+extern

Re: [HACKERS] Minor bug affecting ON CONFLICT lock wait log messages

2016-03-16 Thread Peter Geoghegan

On Tue, Mar 15, 2016 at 8:31 AM, Stephen Frost  wrote:
>> > We wouldn't want to end up returning different error messages for the
>> > same command under the same conditions just based, which is what we'd
>> > potentially end up doing if we used XLTW_InsertIndexUnique here.
>>
>> Perhaps we need a new value in that enum, so that the context message is
>> specific to the situation at hand?
>
> Maybe, but that would require a new message and new translation and just
> generally doesn't seem like something we'd want to back-patch for a
> bugfix.

Thinking about this again, I think we should use
XLTW_InsertIndexUnique after all. The resemblance of the
check_exclusion_or_unique_constraint() code to the nbtinsert.c code
seems only superficial on second thought. So, I propose fixing the fix
by changing XLTW_InsertIndex to XLTW_InsertIndexUnique.

Basically, unlike with the similar nbtinsert.c code, we're checking
someone else's tuple in the speculative insertion
check_exclusion_or_unique_constraint() case that was changed (or it's
an exclusion constraint, where even the check for our own tuple
happens only after insertion; no change there in any case). Whereas,
in the nbtinsert.c case that I incorrectly duplicated, we're
specifically indicating that we're waiting on *our own* already
physically inserted heap tuple, and say as much in the
XLTW_InsertIndex message that makes it into the log. So, the
fd658dbb300456b393536802d1145a9cea7b25d6 fix is wrong in that we now
indicate that the wait was on our own already-inserted tuple, and not
*someone else's* already-inserted tuple, as it should (we haven't
inserted anything in the first phase of speculative insertion, this
precheck). This code is not a do-over of the check in nbtinsert.c --
rather, the code in nbtinsert.c is a second phase do-over of this code
(where we've physically inserted a heap tuple + index tuple --
"speculative" though that is).

It seems fine to characterize a wait here as "checking the uniqueness
of [somebody else's] tuple", even though technically we're checking
the would-be uniqueness were we to (speculatively, or actually)
insert. However, it does not seem fine to claim ctid_wait as a tuple
we ourselves inserted.

Sorry about that. My confusion came from the fact that historically,
when check_exclusion_or_unique_constraint() was called
check_exclusion_constraint(), it (almost) was our own tuple that was
waited on.

-- 
Peter Geoghegan


Re: [HACKERS] [WIP] speeding up GIN build with parallel workers

2016-03-16 Thread Peter Geoghegan

On Wed, Mar 16, 2016 at 2:25 AM, Constantin S. Pan  wrote:
> The backend just waits for the results from the workers and merges them
> (in case wnum > 0). So the 1-worker configuration should never be used,
> because it is as sequential as the 0-worker, but adds data transfer.

This is why I wanted an easy way of atomically guaranteeing some
number of workers (typically 2), or not using parallelism at all. I
think the parallel worker API should offer a simple way to do that in
cases like this, where having only 1 worker is never going to win.
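The policy I have in mind is roughly this (a sketch only; sketch_workers_or_none is a hypothetical helper, not part of the parallel worker API):

```c
#include <assert.h>

/*
 * All-or-nothing worker policy: given how many workers could actually be
 * launched, return that count only if it reaches min_workers; otherwise
 * return 0 so the caller falls back to a purely serial build.
 */
static int
sketch_workers_or_none(int launched, int min_workers)
{
    assert(launched >= 0 && min_workers > 0);
    return launched >= min_workers ? launched : 0;
}
```

With min_workers = 2, a build that could only get 1 worker runs serially instead of paying for data transfer with nothing gained.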

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2016-03-18 Thread Peter Geoghegan

On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas  wrote:
> OK, I have now committed 0001, and separately, some comment
> improvements - or at least, I think they are improvements - based on
> this discussion.

Thanks!

Your changes look good to me. It's always interesting to learn what
wasn't so obvious to you when you review my patches. It's probably
impossible to stare at something like tuplesort.c for as long as I
have and get that balance just right.

-- 
Peter Geoghegan


Re: [HACKERS] WIP: Covering + unique indexes.

2016-03-18 Thread Peter Geoghegan

On Fri, Mar 18, 2016 at 5:15 AM, David Steele  wrote:
> It looks like this patch should be marked "needs review" and I have done so.

Uh, no it shouldn't. I've posted an extensive review on the original
design thread. See CF entry:

https://commitfest.postgresql.org/9/433/

Marked "Waiting on Author".

-- 
Peter Geoghegan



Re: [HACKERS] Make primnodes.h gender neutral

2016-03-18 Thread Peter Geoghegan

On Thu, Mar 17, 2016 at 4:46 PM, Tom Lane  wrote:
> Alvaro's original complaint that the sentences no longer agree as to
> person is on-point.

That's reasonable. Still, there are only a few existing instances of
gendered pronouns in the code, so fixing them carefully, without
losing anything important seems like a relatively straightforward
task.

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2016-03-19 Thread Peter Geoghegan

On Wed, Mar 16, 2016 at 6:42 PM, Peter Geoghegan  wrote:
>> - I think that batchmemtuples() is somewhat weird.  Normally,
>> grow_memtuples() doubles the size of the array each time it's called.
>> So if you somehow called this function when you still had lots of
>> memory available, it would just double the size of the array.
>> However, I think the expectation is that it's only going to be called
>> when availMem is less than half of allowedMem, in which case we're
>> going to get the special "last increment of memtupsize" behavior,
>> where we expand the memtuples array by some multiple between 1.0 and
>> 2.0 based on allowedMem/memNowUsed.
>
> That's right. It might be possible for the simple doubling behavior to
> happen under artificial conditions instead, for example when we have
> enormous individual tuples, but if that does happen it's still
> correct. I just didn't think it was worth worrying about giving back
> more memory in such extreme edge-cases.

Come to think of it, maybe the pass-by-value datum sort case should
also call batchmemtuples() (or something similar). If you look at
how beginmerge() is called, you'll see that that doesn't happen.

Obviously this case is not entitled to a "memtupsize *
STANDARDCHUNKHEADERSIZE" refund, since of course there never was any
overhead like that at any point. And, obviously this case has no need
for batch memory at all. However, it is entitled to get a refund for
non-used tapes (accounted for, but, it turns out, never allocated
tapes). It should then get the benefit of that refund by way of
growing memtuples through a similar "final, honestly, I really mean it
this time" call to grow_memtuples().

So, while the "memtupsize * STANDARDCHUNKHEADERSIZE refund" part
should still be batch-specific (i.e. used for the complement of
tuplesort cases, never the datum pass-by-val case), the new
grow_memtuples() thing should always happen with external sorts.
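For anyone following along, the growth rule being discussed looks roughly like this. It is a simplified sketch, not the tuplesort.c code: sketch_grow_memtuples is a hypothetical name, memory is counted in arbitrary units, and the real function has further limits that are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the new memtuples array size.  While the memory in use is no more
 * than what is still available, simply double.  Otherwise this is the last
 * increment: scale by allowedMem / memNowUsed, a factor between 1.0 and 2.0.
 */
static size_t
sketch_grow_memtuples(size_t memtupsize, long allowed_mem, long avail_mem)
{
    long mem_now_used = allowed_mem - avail_mem;

    assert(allowed_mem > 0 && avail_mem >= 0 && avail_mem <= allowed_mem);
    if (mem_now_used <= avail_mem)
        return memtupsize * 2;

    /* mem_now_used > avail_mem >= 0 here, so the division is safe */
    double ratio = (double) allowed_mem / (double) mem_now_used;

    return (size_t) ((double) memtupsize * ratio);
}
```

Any refund (such as one for never-allocated tapes) raises avail_mem before the call, which is what makes the final increment larger.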

The more I think about it, the more I wonder if we should commit
something like the debugging patch 0004-* (enabled only when
trace_sort = on, of course). Close scrutiny of what tuplesort.c is
doing with memory is important.

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2016-03-19 Thread Peter Geoghegan

oo. I couldn't
easily make that work with the replacement selection heap, because
master's tuplesort.c never fully empties its RS heap until the last
run. I can only perform the first call to MemoryContextReset() in the
memory patch because it happens at a point memtupcount == 0 -- it's
called when a run is merged (outside a final on-the-fly merge). Notice
that the mergeonerun() loop invariant is:

while (state->memtupcount > 0)
{
...
}

So, it must be that state->memtupcount == 0 (and that we have no batch
memory) when I call MemoryContextReset() immediately afterwards.

> - I haven't yet figured out why we use batching only for the final
> on-the-fly merge pass, instead of doing it for all merges.  I expect
> you have a reason.  I just don't know what it is.

The most obvious reason, and possibly the only reason, is that I have
license to lock down memory accounting in the final on-the-fly merge
phase. Almost equi-sized runs are the norm, and code like this is no
longer obligated to work:

FREEMEM(state, GetMemoryChunkSpace(stup->tuple));

That's why I explicitly give up on "conventional accounting". USEMEM()
and FREEMEM() calls become unnecessary for this case that is well
locked down. Oh, and I know that I won't use most tapes, so I can give
myself a FREEMEM() refund before doing the new grow_memtuples() thing.

I want to make batch memory usable for runs, too. I haven't done that
either for similar reasons. FWIW, I see no great reason to worry about
non-final merges.

> - I have also not yet figured out why you chose to replace
> state->datumTypByVal with state->tuples and reverse the sense.  I bet
> there's a reason for this, too.  I don't know what it is, either.

It makes things slightly easier to make this a generic property of any
tuplesort: "Can SortTuple.tuple ever be set?", rather than allowing it
to remain a specific property of a datum tuplesort.
state->datumTypByVal often isn't initialized in master, and so cannot
be checked as things stand (unless the code is in a
datum-case-specific routine).

This new flag controls batch memory in slightly higher-level way than
would otherwise be possible. It also controls the memory prefetching
added by patch/commit 0003-*, FWIW.

-- 
Peter Geoghegan



Re: [HACKERS] [COMMITTERS] pgsql: Improve memory management for external sorts.

2016-03-19 Thread Peter Geoghegan

On Sat, Mar 19, 2016 at 6:38 AM, Robert Haas  wrote:
> It would be helpful if you could either (a) confirm that that patch
> still applies and that it has no issues of this type or (b) post an
> updated version.

I don't think that it has a problem with lacking the right int64
format specifiers. However, I had a bad feeling about integer overflow
of state->currentRun, and think I'll need to address that. After all,
if runs are now no longer 2x work_mem on average, it's not completely
ridiculous to imagine that being an issue on a misconfigured system.

-- 
Peter Geoghegan


Re: [HACKERS] [COMMITTERS] pgsql: Improve memory management for external sorts.

2016-03-19 Thread Peter Geoghegan

On Fri, Mar 18, 2016 at 11:43 AM, Andres Freund  wrote:
> Yes, that removes the warning, and looks correct.

Thanks. We should be careful to not repeat this mistake when the
quicksort patch goes in.


-- 
Peter Geoghegan


Re: [HACKERS] fd.c doesn't remove files on a crash-restart

2016-03-19 Thread Peter Geoghegan

On Wed, Mar 16, 2016 at 10:53 AM, Joshua D. Drake  
wrote:
> fd.c[1] will remove files from pgsql_tmp on a restart but not a
> crash-restart per this comment:
>
> /*
> * NOTE: we could, but don't, call this during a post-backend-crash restart
> * cycle.  The argument for not doing it is that someone might want to
> * examine the temp files for debugging purposes.  This does however mean that
> * OpenTemporaryFile had better allow for collision with an existing temp
> * file name.
> */
>
> I understand that this is designed this way. I think it is a bad idea

FWIW, I've seen this get out of hand several times myself.

-- 
Peter Geoghegan


Re: [HACKERS] Relaxing SSL key permission checks

2016-03-19 Thread Peter Eisentraut

Committed with the discussed adjustment and documentation update.

On 3/18/16 2:26 PM, Christoph Berg wrote:
> Re: Peter Eisentraut 2016-03-16 <56e8c221.1050...@gmx.net>
>>>> * it failed to check for S_IXUSR, so permissions 0700 were okay, in
>>>> contradiction with what the error message indicates.  This is a
>>>> preexisting bug actually.  Do we want to fix it by preventing a
>>>> user-executable file (possibly breaking compatibility with existing
>>>> executable key files), or do we want to document what the restriction
>>>> really is?
>>>
>>> I think we should not check for S_IXUSR.  There is no reason for doing that.
>>>
>>> I can imagine that key files are sometimes copied around using USB
>>> drives with FAT file systems or other means of that sort where
>>> permissions can be scrambled.  While I hate gratuitous executable bits as
>>> much as the next person, insisting here would just create annoyances in
>>> practice.
>>
>> I'm happy with this patch except this minor point.  Any final comments?
> 
> I'm fine with that change.
> 
> Do you want me to update the patch or do you already have a new
> version, given it's marked as Ready for Committer?
> 
> Christoph
> 



Re: [HACKERS] Make primnodes.h gender neutral

2016-03-19 Thread Peter Geoghegan

On Thu, Mar 17, 2016 at 4:09 PM, Robert Haas  wrote:
> Debating whether or not somebody is currently upset about this, and
> how upset they are, and what the value is of fixing it is missing the
> point.  When somebody sends a patch for a typographical error, we
> don't say: well, we could fix that typographical error, but let's wait
> until the next time we have cause to reword the paragraph.  We just
> commit the patch

Right. We could spend significant time debating how much this matters.
I expect that few if any contributors would consider that a policy on
gendered pronouns has negative value, though, and it really isn't that
hard to fix. So we should just fix it.

(In case it matters, I'm in favor of this proposal on its merits).

-- 
Peter Geoghegan

Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2016-03-19 Thread Peter Geoghegan

On Wed, Mar 16, 2016 at 11:25 AM, Robert Haas  wrote:
> Sure, and if everybody does that, then there will be 40 patches that
> get updated in the last 2 days of the CommitFest, and that will be
> impossible.  Come on.  You're demanding a degree of preferential
> treatment which is unsupportable.

It's unexpected that an entirely maintenance-orientated patch like
this would be received this way. I'm not demanding anything, or
applying any real pressure. Let's just get on with it.

I attach a revision, that makes all the changes that Heikki suggested,
except one. As already noted several times, following this suggestion
would have added a bug. Alvaro preferred my original approach here in
any case. I refer to my original approach of making the new
UNIQUE_CHECK_SPECULATIVE case minimally different from the existing
UNIQUE_CHECK_PARTIAL case currently used for deferred unique
constraints and speculative insertion, as opposed to making the new
UNIQUE_CHECK_SPECULATIVE "like UNIQUE_CHECK_YES, but return FALSE
instead of throwing an error on conflict". That was broken because
UNIQUE_CHECK_YES waits for the outcome of an xact, which
UNIQUE_CHECK_PARTIAL never does, and so UNIQUE_CHECK_SPECULATIVE must
never do.

Any and all waits happen in the first phase of speculative insertion,
and never the second. I could give a complicated explanation for why,
involving a deadlock scenario, but a simple explanation will do: it
has always worked that way, and was tested to work that way.
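
The distinction drawn above can be sketched as a small decision table. The
enum and function names below are hypothetical and exist only for
illustration (the real constants live in genam.h); the behavior encoded is
just what this message describes, not a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; the real constants live in genam.h. */
typedef enum SketchUniqueCheck
{
	SKETCH_CHECK_YES,			/* ordinary unique index insertion */
	SKETCH_CHECK_PARTIAL,		/* deferrable unique constraints */
	SKETCH_CHECK_SPECULATIVE	/* second phase of speculative insertion */
} SketchUniqueCheck;

/* Only the ordinary case waits for the conflicting transaction to finish. */
static bool
waits_for_conflicting_xact(SketchUniqueCheck mode)
{
	return mode == SKETCH_CHECK_YES;
}

/*
 * On an apparent conflict, only the deferrable case goes ahead and
 * physically inserts the index tuple, since the conflict may go away later.
 * The speculative case reports the conflict without inserting.
 */
static bool
inserts_tuple_despite_conflict(SketchUniqueCheck mode)
{
	return mode == SKETCH_CHECK_PARTIAL;
}
```

The point of keeping the speculative case close to the partial case is
visible in the first function: neither of them ever waits.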

Feedback from Heikki led to these changes for this revision:

* The use of arguments within ExecInsert() was simplified.

* More concise AM documentation.

The docs essentially describe two new concepts:

- What unique index insertion needs to know about speculative
insertion in general. This doesn't just apply to speculative inserters
themselves, of course.

- What speculative insertion is. Why it exists (why we don't just wait
on xact). In other words, what "unprincipled deadlocks" are, and how
they are avoided (from a relatively high level).

I feel like I have a responsibility to make sure that this mechanism
is well documented, especially given that not that many people were
involved in its design. It's possible that no more than the 3 original
authors of UPSERT fully understand speculative insertion -- it's easy
to miss some of the subtleties.

I do not pursue something like this without good reason. I'm
optimistic that the patch will be accepted if it is carefully
considered.

-- 
Peter Geoghegan
From 2b2a4c40a5e60ac1f28a75f11204ce88eb48cc73 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 2 Jun 2015 17:34:16 -0700
Subject: [PATCH] Refactor speculative insertion into unique indexes

Add a dedicated IndexUniqueCheck constant for the speculative insertion
case, UNIQUE_CHECK_SPECULATIVE, rather than reusing
UNIQUE_CHECK_PARTIAL, which should now only be used for deferrable
unique constraints.

This change allows btinsert() (and, in principle, any amcanunique
aminsert function) to avoid physically inserting an IndexTuple in the
event of detecting a conflict during speculative insertion's second
phase.  With nbtree, this avoidance now occurs at the critical point in
_bt_doinsert() immediately after establishing that there is a conflict,
but immediately before actually calling _bt_insertonpg() to proceed with
physical IndexTuple insertion.

At that point during UNIQUE_CHECK_PARTIAL insertion it makes sense to
soldier on, because the possibility remains that the conflict will later
go away and everything will happen to work out because the conflicting
insertion's transaction aborted.  Speculative inserters, in contrast,
have no chance of working out a way to proceed without first deleting
the would-be-pointed-to heap tuple already physically inserted.  For the
current row proposed for insertion, no useful progress will have been
made at this point.

This patch has nothing to do with performance; it is intended to clarify
how amcanunique AMs perform speculative insertion, and the general
theory of operation.  It is natural to avoid an unnecessary index tuple
insertion.  That that could happen before was quite misleading, because
it implied that it was necessary, and didn't acknowledge the differing
requirements in each case.
---
 doc/src/sgml/indexam.sgml  | 101 ++---
 src/backend/access/nbtree/nbtinsert.c  |  49 +---
 src/backend/executor/execIndexing.c|  34 +--
 src/backend/executor/nodeModifyTable.c |   2 +-
 src/include/access/genam.h |   8 +++
 5 files changed, 148 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 5f7befb..1b26dd0 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -271,10 +271,13 @@ aminsert (Relation indexRelation,

The function's Boolean result value

Re: [HACKERS] snapshot too old, configured by time

2016-03-20 Thread Peter Geoghegan

On Thu, Mar 17, 2016 at 2:15 PM, Kevin Grittner  wrote:
> New patch just to merge in recent commits -- it was starting to
> show some bit-rot.  Tests folded in with main patch.

I haven't read the patch, but I wonder: What are the implications here
for B-Tree page recycling by VACUUM? I know that you understand this
topic well, so I don't assume that you didn't address it.

Offhand, I imagine that there'd be some special considerations. Why is
it okay that an index scan could land on a deleted page with no
interlock against VACUUM's page recycling? Or, what prevents that from
happening in the first place?

I worry that something weird could happen there. For example, perhaps
the page LSN on what is actually a newly recycled page could be set
such that the backend following a stale right spuriously raises a
"snapshot too old" error.

I suggest you consider making amcheck [1] a part of your testing
strategy. I think that this patch is a good idea, and I'd be happy to
take feedback from you on how to make amcheck more effective for
testing this patch in particular.

[1] https://commitfest.postgresql.org/9/561/
-- 
Peter Geoghegan

Re: [HACKERS] snapshot too old, configured by time

2016-03-20 Thread Peter Geoghegan

On Sun, Mar 20, 2016 at 4:25 PM, Peter Geoghegan  wrote:
> I worry that something weird could happen there. For example, perhaps
> the page LSN on what is actually a newly recycled page could be set
> such that the backend following a stale right spuriously raises a
> "snapshot too old" error.

I mean a stale right-link, of course.

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2016-03-20 Thread Peter Geoghegan

now know that that
could be very important. Let's not forget where the useful places to
look for problems are.

6. Based on your feedback on the batch memory patch (your commit
c27033ff), I made a stylistic change. I made similar comments about
the newly added quicksort/dumpbatch() MemoryContextReset() call, since
it has its own special considerations (a big change in the pattern of
allocations occurs after batch memory is used -- we need to be careful
about how that could impact the "bucketing by size class").

Thanks
-- 
Peter Geoghegan
From b921e285ed3f22c9cab9c78c7c610fbdfee5839b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Thu, 10 Mar 2016 14:52:18 -0800
Subject: [PATCH 3/3] Add MemoryContextStats() calls for debugging

These new calls, which must be explicitly enabled, illustrate how
effective frequent MemoryContextReset() calls are in preventing palloc()
fragmentation within tuplesort.c.
---
 src/backend/utils/sort/tuplesort.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2832c86..102f78a 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2622,6 +2622,14 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	beginmerge(state, false);
 
+#ifdef SHOW_MEMORY_STATS
+#ifdef TRACE_SORT
+	/* Print mem stats before each non-final merge to track fragmentation */
+	if (trace_sort)
+		MemoryContextStats(state->sortcontext);
+#endif
+#endif
+
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
 	 * out, and replacing it with next tuple from same tape (if there is
@@ -2954,6 +2962,14 @@ mergebatch(Tuplesortstate *state, int64 spacePerTape)
 
 	state->batchUsed = true;
 	state->spacePerTape = spacePerTape;
+
+#ifdef SHOW_MEMORY_STATS
+#ifdef TRACE_SORT
+	/* Print mem stats before final merge to track fragmentation */
+	if (trace_sort)
+		MemoryContextStats(state->sortcontext);
+#endif
+#endif
 }
 
 /*
-- 
1.9.1

From da71fc6214a4906b0b1bcd6e6e96bf1d86effe50 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 2/3] Perform memory prefetching when writing memtuples

This patch is based on, but quite distinct to a separately submitted,
more general version which performs prefetching in several places [1].
This version now only performs prefetching of each "tuple proper" during
the writing of batches of tuples (an entire run, written following a
quicksort).  The case for prefetching each "tuple proper" at several
sites now seems weak due to difference in CPU microarchitecture.
However, it might still be that there is a consistent improvement
observable when writing out tuples, because that involves a particularly
tight inner loop, with relatively predictable processing to hide memory
latency behind.  A helpful generic prefetch hint may be possible for
this case, even if it proves impossible elsewhere.

This has been shown to appreciably help on both a POWER7 server
processor [2], and an Intel Mobile processor.

[1] https://commitfest.postgresql.org/6/305/
[2] CAM3SWZR5rv3+F3FOKf35=dti7otmmcdfoe2vogur0pddg3j...@mail.gmail.com
---
 config/c-compiler.m4   | 17 +
 configure  | 31 +++
 configure.in   |  1 +
 src/backend/utils/sort/tuplesort.c | 14 ++
 src/include/c.h| 14 ++
 src/include/pg_config.h.in |  3 +++
 src/include/pg_config.h.win32  |  3 +++
 src/include/pg_config_manual.h | 10 ++
 8 files changed, 93 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 550d034..8be2122 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -271,6 +271,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+  [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index a45be67..75453d2 100755
--- a/configure
+++ b/configure
@@ -11398,6 +11398,37 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetc
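
The prefetching idea described in the commit message for patch 2/3 can be
sketched in isolation as follows. This is a minimal illustration, assuming a
GCC-compatible compiler for __builtin_prefetch(); the macro and function
names are hypothetical and the loop is only a stand-in for writing out
tuples, not code from tuplesort.c.

```c
#include <assert.h>
#include <stddef.h>

/* Fall back to a no-op where the GCC/Clang builtin is unavailable. */
#if defined(__GNUC__) || defined(__clang__)
#define sketch_prefetch(addr) __builtin_prefetch((addr), 0, 3)
#else
#define sketch_prefetch(addr) ((void) 0)
#endif

/*
 * Hypothetical stand-in for "writing out tuples": walk an array of pointers
 * and sum the pointed-to values, hinting the next pointee into cache while
 * the current one is processed.
 */
static long
sum_with_prefetch(int **items, size_t nitems)
{
	long		total = 0;

	assert(items != NULL || nitems == 0);

	for (size_t i = 0; i < nitems; i++)
	{
		if (i + 1 < nitems)
			sketch_prefetch(items[i + 1]);
		total += *items[i];
	}
	return total;
}
```

The hint is purely advisory: the result is the same with or without it, which
is why a configure-time fallback to a no-op is safe.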

Re: [HACKERS] Minor bug affecting ON CONFLICT lock wait log messages

2016-03-21 Thread Peter Geoghegan

XLTW_InsertIndexUnique.

> One thing I can say is that the XLTW_InsertIndex at least matches the
> *action* we're taking, which is that we're trying to INSERT.

Right, but I don't think that XLTW_InsertIndexUnique specifically
implies that we're not inserting, just as XLTW_RecheckExclusionConstr
does not specifically imply that we're not inserting (actually, we're
usually or always inserting with XLTW_RecheckExclusionConstr, so it
better not).

> I don't feel terribly strongly about that position and so if others
> feel the XLTW_InsertIndexUnique message really would be better, I'd be
> happy to commit the change.

I'd also like to hear other opinions, if any are to be had. Sorry that
I changed my mind, but it's a subtle issue, I'm sure you'll agree. I'm
not going to push on this, but I want to be sure that we're happy with
this.

To reiterate, I think it boils down to: Is it okay that this new
XLTW_InsertIndex case reports someone else's TID, while the only other
XLTW_InsertIndex case has always reported our own TID?

Discussing these sorts of "ontological" questions reminds me just how
painful UPSERT was as a project.  :-)

-- 
Peter Geoghegan


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-22 Thread Peter Geoghegan

On Tue, Mar 22, 2016 at 9:33 AM, David Steele  wrote:
> It looks like an updated patch is expected here, though it seems that
> the only requests are for updates to comments.

That's right - I have a small number of feedback items to work
through. I also determined myself that there could be a very low
probability race condition when checking the key space across sibling
pages, and will work to address that. If I'm right about that then
it's not a lot of work to fix; I'm probably just going to use the
right page's high key rather than its first data item.

-- 
Peter Geoghegan

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-03-22 Thread Peter Geoghegan

On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan  wrote:
> That's right - I have a small number of feedback items to work
> through. I also determined myself that there could be a very low
> probability race condition when checking the key space across sibling
> pages, and will work to address that. If I'm right about that then
> it's not a lot of work to fix; I'm probably just going to use the
> right page's high key rather than its first data item.

I also want to use amcheck to test sorting, especially external
sorting which is currently totally untested. It would be nice to see
testing of strxfrm() on the buildfarm, too -- amcheck provides a nice
way to make sure strxfrm() and strcoll() are in agreement for at least
those cases that are tested, without having to worry about portability
in the same way as a simple pg_regress approach would require.

Note that there are reportedly systems in mainstream use where
strxfrm() is broken; it doesn't agree with strcoll().
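
As a standalone illustration of what "in agreement" means here, the sketch
below compares the ordering given by strcoll() with the ordering given by
strcmp() on strxfrm() output, using only standard C library calls. The helper
names are hypothetical, and this is not how amcheck itself performs the test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int
sign_of(int v)
{
	return (v > 0) - (v < 0);
}

/* Return a malloc'd strxfrm() transformation of s, or NULL on failure. */
static char *
transform(const char *s)
{
	size_t		needed = strxfrm(NULL, s, 0) + 1;
	char	   *buf = malloc(needed);

	if (buf != NULL)
		strxfrm(buf, s, needed);
	return buf;
}

/*
 * True when strcoll(a, b) and strcmp() on the strxfrm() blobs order the
 * two strings the same way under the current LC_COLLATE setting.
 */
static bool
collation_agrees(const char *a, const char *b)
{
	char	   *xa = transform(a);
	char	   *xb = transform(b);
	bool		ok;

	assert(xa != NULL && xb != NULL);
	ok = sign_of(strcoll(a, b)) == sign_of(strcmp(xa, xb));
	free(xa);
	free(xb);
	return ok;
}
```

A real check would call setlocale(LC_COLLATE, ...) first and run this over
many string pairs; on a system with a broken strxfrm() some pair would return
false.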

-- 
Peter Geoghegan

[HACKERS] problem with precendence order in JSONB merge operator

2016-03-22 Thread Peter Krauss

It seems that the parser is not using the ideal precedence order, and that the
obligation to cast costs performance.

The first problem is self-evident in this example:

SELECT '{"x":1}'::jsonb || (('{"A":{"y":2}}'::jsonb)->'A')
  -- it is ok, expected result with (x,y)
SELECT '{"x":1}'::jsonb || '{"A":{"y":2}}'::jsonb->'A'
  -- non-expected result (y).

Higher precedence must be given to the -> operator, which is like an
object-oriented *path* operator, always higher than algebric ones.


The other problem is using this operation in an SQL function,

  CREATE FUNCTION term_lib.junpack(jsonb,text) RETURNS JSONB AS $f$
SELECT ($1-$2)::JSONB || ($1->>$2)::JSONB;
  $f$ LANGUAGE SQL IMMUTABLE;

Without the casts this produces an error. It would perhaps be "more friendly"
not to require them,

and it is also a performance problem: the excessive use of casts costs
performance.

Re: [HACKERS] Using quicksort for every external sort run

2016-03-22 Thread Peter Geoghegan

On Tue, Mar 22, 2016 at 2:27 PM, Tomas Vondra
 wrote:
> Each query was executed 5x for each work_mem value (between 8MB and 1GB),
> and then a median of the runs was computed and that's what's on the
> "comparison". This compares a414d96ad2b without (master) and with the
> patches applied (patched). The last set of columns is simply a "speedup"
> where "<1.0" means the patched code is faster, while >1.0 means it's slower.
> Values below 0.9 or above 1.1 are using green or red background, to make the most
> significant improvements or regressions clearly visible.
>
> For the smaller data set (1M rows), things work pretty well. There are
> pretty much no red cells (so no significant regressions), but quite a few
> green ones (with duration reduced by up to 50%). There are some results in
> the 1.0-1.05 range, but considering how short the queries are, I don't think
> this is a problem. Overall the total duration was reduced by ~20%, which is
> nice.
>
> For the 10M data sets, total speedup is also almost ~20%, and the speedups
> for most queries are also very nice (often ~50%).

To be clear, you seem to mean that ~50% of the runtime of the query
was removed. In other words, the quicksort version is twice as fast.
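
The arithmetic behind "~50% removed means twice as fast" can be spelled out
with a small sketch. The function names are hypothetical and no numbers from
the benchmark are used; it only mirrors the method described in the quoted
text (a median over an odd number of runs, then patched divided by master).

```c
#include <assert.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
	double		x = *(const double *) a;
	double		y = *(const double *) b;

	return (x > y) - (x < y);
}

/* Median of an odd number of timings; sorts the array in place. */
static double
median_of_runs(double *timings, size_t n)
{
	assert(n > 0 && n % 2 == 1);
	qsort(timings, n, sizeof(double), cmp_double);
	return timings[n / 2];
}

/* patched / master: below 1.0 means the patched build was faster. */
static double
duration_ratio(double master, double patched)
{
	return patched / master;
}

/* How many times faster the patched build is, e.g. ratio 0.5 -> 2.0. */
static double
speedup_factor(double ratio)
{
	return 1.0 / ratio;
}
```

A ratio of 0.5 in the "speedup" columns therefore corresponds to a factor of
2.0, while a ratio of 0.8 (the ~20% overall reduction) is a factor of 1.25.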

> But the number of
> regressions is considerably higher - there's a small number of queries that
> got significantly slower for multiple data sets, particularly for smaller
> work_mem values.

No time to fully consider these benchmarks right now, but: Did you
make sure to set replacement_sort_mem very low so that it was never
used when patched? And, was this on the latest version of the patch,
where memory contexts were reset (i.e. the version that got committed
recently)? You said something about memory batching, so ISTM that you
should set that to '64', to make sure you don't get one longer run.
That might mess with merging.

Note that the master branch has the memory batching patch as of a few
days back, so if that's the problem at the low end, then that's bad.
But I don't think it is: I think that the regressions at the low end
are about abbreviated keys, particularly the numeric cases. There is a
huge gulf in the cost of those comparisons (abbreviated vs
authoritative), and it is legitimately a weakness of the patch that it
reduces the number in play. I think it's still well worth it, but it
is a downside. There is no reason why the authoritative numeric
comparator has to allocate memory, but right now that case isn't
optimized.

I find it weird that the patch is exactly the same as master in a lot
of cases. ISTM that with a case where you use 1GB of memory to sort 1
million rows, you're so close to an internal sort that it hardly
matters (master will not need a merge step at all, most likely). The
patch works best with sorts that take tens of seconds, and I don't
think I see any here, nor any high memory tests where RS flops. Now, I
think you focused on regressions because that was what was
interesting, which is good. I just want to put that in context.

Thanks
-- 
Peter Geoghegan

Re: [HACKERS] problem with precendence order in JSONB merge operator

2016-03-22 Thread Peter Krauss

Subjective notes to give context (an attempt to explain, in my bad English) for
my views on "precedence order" and JSONB:

The JSON datatype is perfect as a workaround, and for many simple and less
demanding applications.
JSONB is the "first class" datatype for the user community; we waited years
(!) for it ... It needs some "first class" and friendly behaviour.

In this context JSONB is not "any other" datatype, it is the bridge between
relational data and flexible data...
It is the Holy Grail and the Rosetta Stone :-)

I think JSONB operators need some more attention, in terms of semantics and
usability.  If you want to add some friendliness and orthogonality to the
JSONB operators, it would be natural to see the -> operator as a kind of
object-oriented *path* operator...
On the other hand, of course, you could have done just the simplest thing to
implement JSONB... But you did a lot
<http://www.postgresql.org/docs/9.5/static/functions-json.html> (!), it was
not easy to get here, and only a little bit more is needed to reach
perfection ;-)



2016-03-22 18:42 GMT-03:00 David G. Johnston :

> On Tue, Mar 22, 2016 at 1:52 PM, Peter Krauss  wrote:
>
>> Seems that parser not using precedence ideal order, and that casting
>> obligation losts performance.
>>
>> The first problem is self-evident in this example:
>>
>> SELECT '{"x":1}'::jsonb || (('{"A":{"y":2}}'::jsonb)->'A')
>>   -- it is ok, expected result with (x,y)
>> SELECT '{"x":1}'::jsonb || '{"A":{"y":2}}'::jsonb->'A'
>>   -- non-expected result (y).
>>
>> Higher precedence <https://en.wikipedia.org/wiki/Order_of_operations> most
>> be for -> operator, that is like an object-oriented *path* operator,
>> always higher than algebric ones.
>>
> There is presently no formal concept of "path operator" in PostgreSQL.
>  "->" is a user-defined operator, as is "||" and thus have equal
> precedence and left associativity.
>
> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html
>
> Regardless, "||" is not an "algebric" [sic] operator...I'm curious what
> source you are using to back your claim of operator precedence between
> different so-called "operator types".
>
> It's highly undesirable to make changes to operator precedence.
>
> Operators are simply symbols to the parser - there is no context involved
> that would allow making their precedence dynamic.  So all PostgreSQL sees
> is "||", not a "JSONB merge operator".
>
> Other problem is using this operation as SQL function
>>
>>   CREATE FUNCTION term_lib.junpack(jsonb,text) RETURNS JSONB AS $f$
>> SELECT ($1-$2)::JSONB || ($1->>$2)::JSONB;
>>   $f$ LANGUAGE SQL IMMUTABLE;
>>
>> without casting produce error. Perhaps will be "more friendly" without
>> cast obligation,
>>
>> and it is a performance problem, the abusive use of castings losts
>> performance.
>>
> I cannot make this work...
>
> version
> PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> 4.8.2-19ubuntu1) 4.8.2, 64-bit
>
> SELECT ('{"a":1,"b":2}'::jsonb - 'b'::text)::jsonb ||
> ('{"a":1,"b":2}'::jsonb #> 'b'::text)::jsonb
>
> > SQL Error: ERROR: invalid concatenation of jsonb objects
> 
> This seems like user error but without a self-contained test case
> exercising the query (the use of a function in this context should be
> immaterial) I'm finding it hard to explain why.  My simple case returns a
> non-object which rightly cannot be appended to an object.
>
> In isolation you can avoid casting the RHS of the || operator by using the
> "#>(jsonb,text[])" operator
>
> SELECT pg_typeof('{"a":1,"b":{"c":2}}'::jsonb #> array['b']::text[])
> --jsonb
>
> JSON, IME, still needs some fleshing out.  Efficient usage might require
> additional features but for now one needs to get very familiar with all the
> various operator variants that allow the user to choose whether to return
> json or text and to pick the correct one for their needs.
>
> David J.
> 
>
>

Re: [HACKERS] PoC: Partial sort

2016-03-23 Thread Peter Geoghegan

Hi,

On Tue, Mar 1, 2016 at 7:06 AM, Alexander Korotkov  wrote:
> I finally went over your review.

I'll respond to your points here. Note that I'm reviewing
"partial-sort-basic-7.patch", which you sent on March 13. I respond
here because this is where you answered my questions (I had no
feedback on "partial-sort-basic-6.patch", which didn't use the new
upper planner pathification stuff, unlike this latest version).

> On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan  wrote:
>>
>> Explain output
>> ---

>> I think it might be a good idea to also have a "Sort Groups: 2" field
>> above. That illustrates that you are in fact performing 2 small sorts
>> per group, which is the reality. As you said, it's good to have this
>> be high, because then the sort operations don't need to do too many
>> comparisons, which could be expensive.
>
>
> I agree with your notes. In the attached version of the patch, the explain
> output was revised as you proposed.

Cool.

>> Sort Method
>> 
>>
>> Even though the explain analyze above shows "top-N heapsort" as its
>> sort method, that isn't really true. I actually ran this through a
>> debugger, which is why the above plan took so long to execute, in case
>> you wondered. I saw that in practice the first sort executed for the
>> first group uses a quicksort, while only the second sort (needed for
>> the 2nd and last group in this example) used a top-N heapsort.

> With partial sort we run multiple sorts in the same node. Ideally, we need
> to provide some aggregated information over runs.
> This situation looks very similar to subplan which is called multiple times.
> I checked how it works for now.

Noticed this in nodeSort.c:

+   if (node->tuplesortstate != NULL)
+   {
+   tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+   node->groupsCount++;
+   }
+   else
+   {
+   /* Support structures for cmpSortSkipCols - already
sorted columns */
+   if (skipCols)
+   prepareSkipCols(plannode, node);

+   /*
+* Only pass on remaining columns that are unsorted.
Skip abbreviated
+* keys usage for partial sort.  We unlikely will have
huge groups
+* with partial sort.  Therefore usage of abbreviated
keys would be
+* likely a waste of time.
+*/
tuplesortstate = tuplesort_begin_heap(tupDesc,

You should comment on which case is which, and put common case (no
skip cols) first. Similarly, the ExecSort() for(;;) should put the
common (non-partial) case first, which it does, but then the "first
tuple in partial sort" case first, then the "second or subsequent
partial sort" case last.

More comments here, please:

+typedef struct SkipKeyData
+{
+ FunctionCallInfoData fcinfo;
+ FmgrInfo flinfo;
+ OffsetNumber attno;
+} SkipKeyData;

(What's SkipKeyData?)

Also want comments for new SortState fields. SortState.prev is a
palloc()'d copy of tuple, which should be directly noted, as it is for
similar aggregate cases, etc.

Should you be more aggressive about freeing memory allocated for
SortState.prev tuples?

The new function cmpSortSkipCols() should say "Special case for
NULL-vs-NULL, else use standard comparison", or something. "Let's
pretend NULL is a value for implementation convenience" cases are
considered the exception, and are always noted as the exception.

> In the case of subplan explain analyze gives us just information about last
> subplan run. This makes me uneasy. From one side, it's probably OK that
> partial sort behaves like subplan while showing information just about last
> sort run. From the other side, we need some better solution for that in
> general case.

I see what you mean, but I wasn't so much complaining about that, as
complaining about the simple fact that we use a top-N heap sort *at
all*. This feels like the "limit" case is playing with partial sort
sub-sorts in a way that it shouldn't.

I see you have code like this to make this work:

+   /*
+* Adjust bound_Done with number of tuples we've actually sorted.
+*/
+   if (node->bounded)
+   {
+   if (node->finished)
+   node->bound_Done = node->bound;
+   else
+   node->bound_Done = Min(node->bound,
node->bound_Done + nTuples);

But, why bother? Why not simply prevent tuplesort.c from ever using
the top-N heapsort method when it is called from nodeSort.c for a
partial sort (probably in the planner)?

Why, at a high level, does it make sense to pass down a limit to *any*
sort opera
