Re: [Patch] Checksums for SLRU files

2018-07-17 Thread Thomas Munro
On Wed, Jul 18, 2018 at 5:41 PM, Andrey Borodin  wrote:
>> I think we'd want pg_upgrade tests showing an example of each SLRU
>> growing past one segment, and then being upgraded, and then being
>> accessed in various different pages and segment files, so that we can
>> see that we're able to shift the data to the right place successfully.
>> For example I think I'd want to see that a single aborted transaction
>> surrounded by many committed transactions shows up in the right place
>> after an upgrade.
>
> Can you elaborate a bit on how to implement this test? I've searched for some
> automated pg_upgrade tests but didn't find one.
> Should it be a one-time test script or something "make check-world"-able?

Hi Andrey,

Like this (also reached by check-world):

$ cd src/bin/pg_upgrade
$ make check

It looks like the interesting bits are in test.sh.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Simon Riggs
On 17 July 2018 at 23:10, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> I've done plenty of research into the history of this hack. It was
>> your work, but it does actually make sense in the context of today's
>> nbtree code. It is essential with scankey-wise duplicates, since
>> groveling through hundreds or even thousands of pages full of
>> duplicates to find free space (and avoid a page split) is going to
>> have a very serious downside for latency.
>
> Well, the actual problem was O(N^2) behavior:
>
> https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=40549e9cb5abd2986603883e4ab567dab34723c6
>
> I certainly have no objection to improving matters, but let's be sure
> we don't re-introduce any two-decade-old problems.

Looking at this in terms of CPU cost for a single insert: insertion at
the end of the list of duplicates was O(N), which was much improved by
the O(k) behavior.
Keeping the index entries in order is O(log N).

So there is a higher cost to keeping the entries in order, though that
becomes a win because of the reduced bloat in cases where we have a
high number of deletes. That is a win in insertion time as well,
since by packing the index more tightly we cause fewer costly splits
and less I/O.

If we knew that we were never going to do deletes/non-HOT updates on
the table we could continue to use the existing mechanism; otherwise
we will be better off using sorted index entries. However, it does
appear as if keeping entries in sorted order would be a win on
concurrency, from reduced block contention on the first few blocks of
the index key, so it may also be a win in cases where there are heavy
concurrent inserts but no deletes.

Bulk loading will improve because adding a new data block via COPY
will cause all of the resulting index entries to be added with more
adjacency, so a block full of duplicate entries could be sorted and
then merged efficiently into the index.

I hope we can see a patch that just adds the sorting-by-TID property
so we can evaluate that aspect before we try to add other more
advanced index ideas.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Patch] Checksums for SLRU files

2018-07-17 Thread Andrey Borodin
Hi, Thomas!

> 
> I think we'd want pg_upgrade tests showing an example of each SLRU
> growing past one segment, and then being upgraded, and then being
> accessed in various different pages and segment files, so that we can
> see that we're able to shift the data to the right place successfully.
> For example I think I'd want to see that a single aborted transaction
> surrounded by many committed transactions shows up in the right place
> after an upgrade.

Can you elaborate a bit on how to implement this test? I've searched for some
automated pg_upgrade tests but didn't find one.
Should it be a one-time test script or something "make check-world"-able?

Best regards, Andrey Borodin.


print_path is missing GatherMerge and CustomScan support

2018-07-17 Thread Masahiko Sawada
Hi,

While debugging the planner I realized that the print_path() function is not
aware of either GatherMerge paths or CustomScan paths. The attached small
patch fixes that.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_print_path.patch
Description: Binary data


Re: GSOC 2018 Project - A New Sorting Routine

2018-07-17 Thread Andrey Borodin
Hi, Tomas!

> On 15 July 2018, at 1:20, Tomas Vondra 
> wrote:
> 
> So I doubt it's this, but I've tweaked the scripts to also set this GUC
> and restarted the tests on both machines. Let's see what that does.

Do you observe any different results?

Thanks!

Best regards, Andrey Borodin.

More consistency for some file-related error message

2018-07-17 Thread Michael Paquier
Hi all,

While looking at the source code for more consistency work on error
messages, I have bumped into a couple of messages which could be
simplified, as the context they add basically repeats the same
information already present in the name of the manipulated file.

I have ended up with the attached.  Note that for most of the messages
I think that the context can be useful, for example for the stats
temporary file, which can be user-defined, so those are left out.
There are also other cases, say the reorder buffer spill file, where we
may not know the path being worked on, so those messages are kept
consistent with HEAD.

That's a bit less work to do for translators, particularly with
pg_stat_statements.

Thoughts?
--
Michael
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index cc9efab243..33f9a79f54 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -642,19 +642,19 @@ pgss_shmem_startup(void)
 read_error:
 	ereport(LOG,
 			(errcode_for_file_access(),
-			 errmsg("could not read pg_stat_statement file \"%s\": %m",
+			 errmsg("could not read file \"%s\": %m",
 	PGSS_DUMP_FILE)));
 	goto fail;
 data_error:
 	ereport(LOG,
 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-			 errmsg("ignoring invalid data in pg_stat_statement file \"%s\"",
+			 errmsg("ignoring invalid data in file \"%s\"",
 	PGSS_DUMP_FILE)));
 	goto fail;
 write_error:
 	ereport(LOG,
 			(errcode_for_file_access(),
-			 errmsg("could not write pg_stat_statement file \"%s\": %m",
+			 errmsg("could not write file \"%s\": %m",
 	PGSS_TEXT_FILE)));
 fail:
 	if (buffer)
@@ -761,7 +761,7 @@ pgss_shmem_shutdown(int code, Datum arg)
 error:
 	ereport(LOG,
 			(errcode_for_file_access(),
-			 errmsg("could not write pg_stat_statement file \"%s\": %m",
+			 errmsg("could not write file \"%s\": %m",
 	PGSS_DUMP_FILE ".tmp")));
 	if (qbuffer)
 		free(qbuffer);
@@ -1871,7 +1871,7 @@ qtext_store(const char *query, int query_len,
 error:
 	ereport(LOG,
 			(errcode_for_file_access(),
-			 errmsg("could not write pg_stat_statement file \"%s\": %m",
+			 errmsg("could not write file \"%s\": %m",
 	PGSS_TEXT_FILE)));
 
 	if (fd >= 0)
@@ -1913,7 +1913,7 @@ qtext_load_file(Size *buffer_size)
 		if (errno != ENOENT)
 			ereport(LOG,
 	(errcode_for_file_access(),
-	 errmsg("could not read pg_stat_statement file \"%s\": %m",
+	 errmsg("could not read file \"%s\": %m",
 			PGSS_TEXT_FILE)));
 		return NULL;
 	}
@@ -1923,7 +1923,7 @@ qtext_load_file(Size *buffer_size)
 	{
 		ereport(LOG,
 (errcode_for_file_access(),
- errmsg("could not stat pg_stat_statement file \"%s\": %m",
+ errmsg("could not stat file \"%s\": %m",
 		PGSS_TEXT_FILE)));
 		CloseTransientFile(fd);
 		return NULL;
@@ -1939,7 +1939,7 @@ qtext_load_file(Size *buffer_size)
 		ereport(LOG,
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("out of memory"),
- errdetail("Could not allocate enough memory to read pg_stat_statement file \"%s\".",
+ errdetail("Could not allocate enough memory to read file \"%s\".",
 		   PGSS_TEXT_FILE)));
 		CloseTransientFile(fd);
 		return NULL;
@@ -1958,7 +1958,7 @@ qtext_load_file(Size *buffer_size)
 		if (errno)
 			ereport(LOG,
 	(errcode_for_file_access(),
-	 errmsg("could not read pg_stat_statement file \"%s\": %m",
+	 errmsg("could not read file \"%s\": %m",
 			PGSS_TEXT_FILE)));
 		free(buf);
 		CloseTransientFile(fd);
@@ -2088,7 +2088,7 @@ gc_qtexts(void)
 	{
 		ereport(LOG,
 (errcode_for_file_access(),
- errmsg("could not write pg_stat_statement file \"%s\": %m",
+ errmsg("could not write file \"%s\": %m",
 		PGSS_TEXT_FILE)));
 		goto gc_fail;
 	}
@@ -2118,7 +2118,7 @@ gc_qtexts(void)
 		{
 			ereport(LOG,
 	(errcode_for_file_access(),
-	 errmsg("could not write pg_stat_statement file \"%s\": %m",
+	 errmsg("could not write file \"%s\": %m",
 			PGSS_TEXT_FILE)));
 			hash_seq_term(&hash_seq);
 			goto gc_fail;
@@ -2136,14 +2136,14 @@ gc_qtexts(void)
 	if (ftruncate(fileno(qfile), extent) != 0)
 		ereport(LOG,
 (errcode_for_file_access(),
- errmsg("could not truncate pg_stat_statement file \"%s\": %m",
+ errmsg("could not truncate file \"%s\": %m",
 		PGSS_TEXT_FILE)));
 
 	if (FreeFile(qfile))
 	{
 		ereport(LOG,
 (errcode_for_file_access(),
- errmsg("could not write pg_stat_statement file \"%s\": %m",
+ errmsg("could not write file \"%s\": %m",
 		PGSS_TEXT_FILE)));
 		qfile = NULL;
 		goto gc_fail;
@@ -2203,7 +2203,7 @@ gc_fail:
 	if (qfile == NULL)
 		ereport(LOG,
 (errcode_for_file_access(),
- errmsg("could not write new pg_stat_statement file \"%s\": %m",
+ errmsg("could not recreate file \"%s\": %m",
 		PGSS_TEXT_FILE)));
 	else
 		FreeFile(qfile);
@@ -2255,7 +2255,7 @@ entry_reset(void)
 	{
 		ereport(LOG,
 (errcode_for_file_access(),
- errmsg("could not create 

Re: Make foo=null a warning by default.

2018-07-17 Thread David Fetter
On Tue, Jul 17, 2018 at 08:34:17AM -0400, Fabien COELHO wrote:
> 
> Hello David,
> 
> A few comments about this v2.
> 
> ISTM that there is quite strong opposition to having "warn" as a default, so
> probably you should set it to "off"?

Done.

> >>I do not really understand the sort order of this array. Maybe it could be
> >>alphabetical, or per value? Or maybe it is standard values and then
>>synonyms, in which case maybe a comment at the end of the list.
> >
> >I've changed it to per value. Is this better?
> 
> At least, I can see the logic.

I've now ordered it in what seems like a reasonable way, namely all
the "off" options, then the "on" option.

> >>Or anything in better English.
> >
> >I've changed this, and hope it's clearer.
> 
> Yep, and more elegant than my proposal.

I assure you that you express yourself in English a good deal
better than I do in Portuguese.

> >>* Test
> >>
> >> +select 1=null;
> >> + ?column?
> >>
> >>Maybe use as to describe the expected behavior, so that it is easier to
> >>check, and I think that "?column?" looks silly eg:
> >>
> >>  select 1=null as expect_{true,false}[_and_a_warning/error];
> >
> >Changed to more descriptive headers.
> 
> Cannot see it in the patch update?

Oops. They should be there now.

> >>   create table cnullchild (check (f1 = 1 or f1 = null)) 
> >> inherits(cnullparent);
> >>  +WARNING:  = NULL can only produce a NULL
> >>
> >>I'm not sure of this test case. Should it be turned into "is null"?
> >
> >This was just adjusting the existing test output to account for the
> >new warning. I presume it was phrased that way for a reason.
> 
> Hmmm. Probably you are right. I think that the test case deserves better
> comments, as it is most peculiar. It was added by Tom for testing CHECK
> constant NULL expressions simplifications.
> 
> TRUE OR NULL is TRUE, but FALSE OR NULL is NULL. Why not. Combined with the
> fact that NULL is considered ok for a CHECK, this leads to strange but
> intentional behavior, such as:
> 
>   -- this CHECK is always ignored in practice
>   CREATE TABLE foo (i INT CHECK(i = 1 OR NULL));
>   INSERT INTO foo(i) VALUES(2); # ACCEPTED
> 
>   -- but this one is not
>   CREATE TABLE foo (i INT CHECK(i = 1 OR FALSE));
>   INSERT INTO foo(i) VALUES(2); # REFUSED
> 
> Can't say I'm thrilled about that, and the added warning looks appropriate.

Per request, the warning is off by default, so the warning went away.
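
(For reference, here is a minimal SQL sketch of the semantics being discussed.
The off/on behavior is existing PostgreSQL behavior; the warn and error values
are the new settings proposed by this patch, so those two lines are
illustrative only.)

  -- Default (transform_null_equals = off): SQL-spec semantics.
  SELECT 1 = NULL AS expect_null;     -- returns NULL (unknown)
  SELECT 1 IS NULL AS expect_false;   -- returns false

  -- Long-standing compatibility behavior:
  SET transform_null_equals = on;
  SELECT 1 = NULL AS expect_false;    -- treated as "1 IS NULL", returns false

  -- Proposed new values from this patch (illustrative):
  SET transform_null_equals = warn;   -- keep SQL semantics, emit a warning
  SET transform_null_equals = error;  -- reject "expr = NULL" outright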

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 87d004f2a054a0c0999145aa9b44d97d7951b19c Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sun, 15 Jul 2018 16:11:08 -0700
Subject: [PATCH] Make transform_null_equals into an enum
To: pgsql-hack...@postgresql.org

The transform_null_equals GUC can now be off, warn, error, or on.
---
 doc/src/sgml/config.sgml  | 20 ++---
 src/backend/parser/parse_expr.c   | 30 ++---
 src/backend/utils/misc/guc.c  | 44 +--
 src/backend/utils/misc/postgresql.conf.sample |  2 +-
 src/include/parser/parse_expr.h   | 11 -
 src/test/regress/expected/guc.out | 30 +
 src/test/regress/sql/guc.sql  | 18 
 7 files changed, 128 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4d48d93305..b0d951190f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8066,7 +8066,7 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
  
 
  
-  transform_null_equals (boolean)
+  transform_null_equals (enum)
   IS NULL
   
transform_null_equals configuration parameter
@@ -8074,15 +8074,23 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
   

-When on, expressions of the form expr =
+When enabled, expressions of the form expr =
 NULL (or NULL =
 expr) are treated as
 expr IS NULL, that is, they
 return true if expr evaluates to the null value,
-and false otherwise. The correct SQL-spec-compliant behavior of
-expr = NULL is to always
-return null (unknown). Therefore this parameter defaults to
-off.
+and false otherwise.  Valid values are off, warn, error, and on.
+   
+
+   
+The correct SQL-spec-compliant behavior of
+expr = null_expression
+is to always return null (unknown); PostgreSQL allows NULL
+as a null-valued expression of unknown type, which is an extension to the spec.
+The transformation of expr = NULL
+to expr IS NULL is inconsistent
+with the normal semantics of the SQL specification, and is therefore set to "off"
+by default.

 

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c

Re: [bug fix] Produce a crash dump before main() on Windows

2018-07-17 Thread Kyotaro HORIGUCHI
At Wed, 18 Jul 2018 11:12:06 +0800, Craig Ringer  wrote 
in 
> On 26 February 2018 at 12:06, Tsunakawa, Takayuki <
> tsunakawa.ta...@jp.fujitsu.com> wrote:
> 
> > From: Craig Ringer [mailto:cr...@2ndquadrant.com]
> > > The patch proposed here means that early crashes will invoke WER. If
> > we're
> > > going to allow WER we should probably just do so unconditionally.
> > >
> > > I'd be in favour of leaving WER on when we find out we're in a
> > noninteractive
> > > service too, but that'd be a separate patch for pg11+ only.
> >
> > As for PG11+, I agree that we want to always leave WER on.  That is, call
> > SetErrorMode(SEM_FAILCRITICALERRORS) but not specify
> > SEM_NOGPFAULTERRORBOX.  The problem with the current specification of
> > PostgreSQL is that the user can only get crash dumps in a fixed folder
> > $PGDATA\crashdumps.  That location is bad because the crash dumps will be
> > backed up together with the database cluster without the user noticing it.
> > What's worse, the crash dumps are large.  With WER, the user can control
> > the location and size of crash dumps.
> >
> 
> Yeah, that's quite old and dates back to when Windows didn't offer much if
> any control over WER in services.

Yeah. If we want to take a crash dump, we cannot have
auto-restart. Since that is inevitable, what we could do here is
add a new knob for it, one which cannot be turned on together
with restart_after_crash...?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: AtEOXact_ApplyLauncher() and subtransactions

2018-07-17 Thread Amit Khandekar
On 17 July 2018 at 03:29, Robert Haas  wrote:
> On Mon, Jul 16, 2018 at 2:36 AM, Amit Khandekar  
> wrote:
>> 0001 patch contains the main fix. In this patch I have used some
>> naming conventions and some comments that you used in your patch,
>> plus, I used your method of lazily allocating new stack level. The
>> stack is initially Null.
>
> Committed and back-patched to v11 and v10.

Thanks Robert.



Re: missing toast table for pg_policy

2018-07-17 Thread Michael Paquier
On Tue, Jul 17, 2018 at 06:03:26PM +0900, Michael Paquier wrote:
> On Tue, Jul 17, 2018 at 02:55:07PM +0700, John Naylor wrote:
>> I didn't dig deeper, since TOAST and the large object facility are
>> mutually exclusive so there shouldn't be a toast table here anyway.
>> Hope this helps.
> 
> [... digging ...]
> This comes from get_rel_infos where large objects are treated as user
> data.  Rather than the comment you added, I would rather do the
> following:
> "Large object catalogs and toast tables are mutually exclusive and large
> object data is handled as user data by pg_upgrade, which would cause
> failures."

So, I have been playing with the previous patch and tortured it through
a couple of upgrade scenarios, where it proves to work.  Attached is an
updated patch which adds toast tables except for pg_class, pg_attribute,
pg_index and pg_largeobject* with a proposal of commit message.  Any
objections for the commit of this stuff?
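
(For anyone who wants to check the current state, the catalogs still lacking a
toast table can be listed with a plain catalog query like the one below; this
is just standard catalog inspection, not part of the patch.)

  SELECT relname
    FROM pg_class
   WHERE relnamespace = 'pg_catalog'::regnamespace
     AND relkind = 'r'
     AND reltoastrelid = 0
   ORDER BY relname;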
--
Michael
From 816dc770bc6bd2914b00ded5f1523265cf6c6d48 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 18 Jul 2018 12:56:21 +0900
Subject: [PATCH] Add toast tables to most system catalogs

It has been project policy to create toast tables only for those catalogs
that might reasonably need one.  Since this judgment call can change over
time, just create one for every catalog, as this can be useful when
creating rather long entries in catalogs, with recent examples being
policy expressions or custom-formatted SCRAM verifiers.

To prevent circular dependencies and to avoid adding complexity to VACUUM
FULL logic, exclude pg_class, pg_attribute, and pg_index.  Also, to
prevent pg_upgrade from seeing a non-empty new cluster, exclude
pg_largeobject and pg_largeobject_metadata from the set as large object
data is handled as user data.  Those relations have no reason to use a
toast table anyway.

Author: Joe Conway, John Naylor
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/84ddff04-f122-784b-b6c5-353680449...@joeconway.com
---
 src/backend/catalog/catalog.c | 18 -
 src/include/catalog/toasting.h| 40 +++-
 src/test/regress/expected/misc_sanity.out | 80 +++
 src/test/regress/sql/misc_sanity.sql  | 12 ++--
 4 files changed, 82 insertions(+), 68 deletions(-)

diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index a42155eeea..6061428bcc 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -253,12 +253,24 @@ IsSharedRelation(Oid relationId)
 		relationId == SubscriptionNameIndexId)
 		return true;
 	/* These are their toast tables and toast indexes (see toasting.h) */
-	if (relationId == PgShdescriptionToastTable ||
-		relationId == PgShdescriptionToastIndex ||
+	if (relationId == PgAuthidToastTable ||
+		relationId == PgAuthidToastIndex ||
+		relationId == PgDatabaseToastTable ||
+		relationId == PgDatabaseToastIndex ||
 		relationId == PgDbRoleSettingToastTable ||
 		relationId == PgDbRoleSettingToastIndex ||
+		relationId == PgPlTemplateToastTable ||
+		relationId == PgPlTemplateToastIndex ||
+		relationId == PgReplicationOriginToastTable ||
+		relationId == PgReplicationOriginToastIndex ||
+		relationId == PgShdescriptionToastTable ||
+		relationId == PgShdescriptionToastIndex ||
 		relationId == PgShseclabelToastTable ||
-		relationId == PgShseclabelToastIndex)
+		relationId == PgShseclabelToastIndex ||
+		relationId == PgSubscriptionToastTable ||
+		relationId == PgSubscriptionToastIndex ||
+		relationId == PgTablespaceToastTable ||
+		relationId == PgTablespaceToastIndex)
 		return true;
 	return false;
 }
diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index 3db39b8f86..f259890e43 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -46,25 +46,59 @@ extern void BootstrapToastTable(char *relName,
  */
 
 /* normal catalogs */
+DECLARE_TOAST(pg_aggregate, 4159, 4160);
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
+DECLARE_TOAST(pg_collation, 4161, 4162);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
+DECLARE_TOAST(pg_default_acl, 4143, 4144);
 DECLARE_TOAST(pg_description, 2834, 2835);
+DECLARE_TOAST(pg_event_trigger, 4145, 4146);
+DECLARE_TOAST(pg_extension, 4147, 4148);
+DECLARE_TOAST(pg_foreign_data_wrapper, 4149, 4150);
+DECLARE_TOAST(pg_foreign_server, 4151, 4152);
+DECLARE_TOAST(pg_foreign_table, 4153, 4154);
+DECLARE_TOAST(pg_init_privs, 4155, 4156);
+DECLARE_TOAST(pg_language, 4157, 4158);
+DECLARE_TOAST(pg_namespace, 4163, 4164);
+DECLARE_TOAST(pg_partitioned_table, 4165, 4166);
+DECLARE_TOAST(pg_policy, 4167, 4168);
 DECLARE_TOAST(pg_proc, 2836, 2837);
 DECLARE_TOAST(pg_rewrite, 2838, 2839);
 DECLARE_TOAST(pg_seclabel, 3598, 3599);
 DECLARE_TOAST(pg_statistic, 2840, 2841);
 DECLARE_TOAST(pg_statistic_ext, 3439, 3440);
 DECLARE_TOAST(pg_trigger, 2336, 2337);
+DECLARE_TOAST(pg_ts_dict, 4169, 4170);
+DECLARE_TOAST(pg_type, 4171, 4172);

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

2018-07-17 Thread Kyotaro HORIGUCHI
Hello. I confirmed that this patch fixes the crash.

At Tue, 17 Jul 2018 20:01:05 -0400, Tom Lane  wrote in 
<14892.1531872...@sss.pgh.pa.us>
> I wrote:
> >> So I said I didn't want to do extra work on this, but I am looking into
> >> fixing it by having these aux process types run a ResourceOwner that can
> >> be told to clean up any open buffer pins at exit.
> 
> > That turned out to be a larger can of worms than I'd anticipated, as it
> > soon emerged that we'd acquired a whole lot of cargo-cult programming
> > around ResourceOwners. ...
> > I'm mostly pretty happy with this, but I think there are a couple of
> > loose ends in logicalfuncs.c and slotfuncs.c: those are creating
> > non-standalone ResourceOwners (children of whatever the active
> > ResourceOwner is) and doing nothing much to clean them up.  That seems
> > pretty bogus.
> 
> Further investigation showed that the part of that code that was
> actually needed was not the creation of a child ResourceOwner, but rather
> restoration of the old CurrentResourceOwner setting after exiting the
> logical decoding loop.  Apparently we can come out of that with the
> TopTransaction resowner being active.  This still seems a bit bogus;
> maybe there should be a save-and-restore happening somewhere else?
> But I'm not really excited about doing more than commenting it.

CurrentResourceOwner doesn't seem to be changed. It is just saved
during snapshot export and used only as a flag for checking for
duplicate snapshot exports.

> Also, most of the other random creations of ResourceOwners seem to just
> not be necessary at all, even with the new rule that you must have a
> resowner to acquire buffer pins.  So the attached revised patch just
> gets rid of them, and improves some misleading/wrong comments on the
> topic.  It'd still be easy to use CreateAuxProcessResourceOwner in any
> process where we discover we need one, but I don't see the value in adding
> useless overhead.

+1 for unifying on one resowner for auxiliary processes. I found the
comment below just before the ending cleanup in the auxiliary process
main functions.

| * These operations are really just a minimal subset of
| * AbortTransaction().  We don't have very many resources to worry
| * about in checkpointer, but we do have LWLocks, buffers, and temp
| * files.

So couldn't we use TopTransactionResourceOwner instead of
AuxProcessResourceOwner? I feel a bit uneasy that bootstrap and
standalone-backend have an *AuxProcess*ResourceOwner.


It's not about this patch, but while looking at this more closely, I found
the following comment on ShutdownXLOG().

| /*
|  * This must be called ONCE during postmaster or standalone-backend shutdown
|  */

Is the "postmaster" typo of "bootstrap process"? It is also
called from checkpointer when non-standlone mode.


> At this point I'm leaning to just applying this in HEAD and calling it
> good.  The potential for assertion failures isn't relevant to production
> builds, and my best guess at this point is that there isn't really any
> other user-visible bug.  The resowners that were being created and not
> adequately cleaned up all seem to have been dead code anyway (ie they'd
> never acquire any resources).

Agreed.

>We could have a problem if, say, the
> bgwriter exited via the FATAL path while holding a pin, but I don't think
> there's a reason for it to do that except in a database-wide shutdown,
> where the consequences of a leaked pin seem pretty minimal.
> 
> Any objections?  Anyone want to do further review?

FWIW I think we won't be concerned about leaked pins after FATAL.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [bug fix] Produce a crash dump before main() on Windows

2018-07-17 Thread Craig Ringer
On 26 February 2018 at 12:06, Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Craig Ringer [mailto:cr...@2ndquadrant.com]
> > The patch proposed here means that early crashes will invoke WER. If
> we're
> > going to allow WER we should probably just do so unconditionally.
> >
> > I'd be in favour of leaving WER on when we find out we're in a
> noninteractive
> > service too, but that'd be a separate patch for pg11+ only.
>
> As for PG11+, I agree that we want to always leave WER on.  That is, call
> SetErrorMode(SEM_FAILCRITICALERRORS) but not specify
> SEM_NOGPFAULTERRORBOX.  The problem with the current specification of
> PostgreSQL is that the user can only get crash dumps in a fixed folder
> $PGDATA\crashdumps.  That location is bad because the crash dumps will be
> backed up together with the database cluster without the user noticing it.
> What's worse, the crash dumps are large.  With WER, the user can control
> the location and size of crash dumps.
>

Yeah, that's quite old and dates back to when Windows didn't offer much if
any control over WER in services.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Peter Geoghegan
On Tue, Jul 17, 2018 at 5:45 PM, Robert Haas  wrote:
> Yes, that's a good point.  Also, and I think pretty importantly, this
> seems essential if we want to allow retail index tuple deletion, which
> has its own set of advantages.

Retail index tuple deletion is clearly something we should have in
some form, and is the top reason to pursue the project. I could
probably come up with dozens of other reasons if I had to, though.
Here are a few of the less obvious ones:

* This improves usage_count of heap page buffers automatically, by
recognizing correlated references when there are many duplicates. [1]

* How else is VACUUM/deletion supposed to work with indirect indexes?
I think that it's impossible without either retail deletion/retail
updates, unless an AEL is okay.

* Global indexes (indexes that simultaneously cover multiple
partitions) are just a generalization of this idea, with a partition
number attribute before the heap TID attribute.

* The sort order of duplicates actually matters quite a lot to the
executor, and to the planner, which this lets us reason about in many
cases. [2][3]

* There is nothing to stop us from exploiting this within
tidbitmap.c/bitmap index scans. We can build disjunct lists of sorted
TIDs, one per value. Merging lists like this together is typically
very cheap, even when there are relatively many distinct lists. We
could probably do that on-the-fly.

* When building a bitmap for a low cardinality value from a secondary
index, why bother even visiting the leaf pages? We can make a lossy
bitmap from a subset of the internal pages.

* Techniques like bitmap semi-join, which are important for
star-schema queries, are dependent on the merging and intersection of
fact table bitmaps being cheap (basically, you build several bitmaps
through nestloop joins with some dimension table on the outer side,
and some fact table index on the inner side -- that's relatively easy
to parallelize well). Tom's "Improve OR conditions on joined columns"
patch sorts a list of TIDs [4], and I suspect that there is some
high-level connection there.

* Heap prefetching for nbtree index scans is a lot more effective than
it might otherwise be.

I really could go on, but you get the idea. The merits of each of
these ideas are debatable, but the fact that there are so many ideas
that are at least plausible-sounding seems significant to me.

> I don't think you're going to be able to rip out the getting-tired
> code, though, because presumably we'll have to continue support
> existing btree indexes that don't include TIDs for some
> probably-not-small number of future releases, even if we decide that
> all future btree indexes (and any that are reindexed) should have
> TIDs.

That's definitely true. There is no way that it's okay to ignore this
issue with pg_upgrade.

[1] 
https://postgr.es/m/cah2-wzkrkkw88yq4-bln5n3drfv93p8r+ti-kfbter08p8c...@mail.gmail.com
[2] https://postgr.es/m/20170707234119.gn17...@telsasoft.com
[3] 
https://postgr.es/m/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com
[4] https://postgr.es/m/22002.1487099934%40sss.pgh.pa.us
-- 
Peter Geoghegan



Re: PG 10: could not generate random cancel key

2018-07-17 Thread Michael Paquier
On Tue, Jul 17, 2018 at 02:28:14PM +0100, Dean Rasheed wrote:
> From what I understand from here [1], some parts of OpenSSL call
> RAND_poll() once on initialisation, and that's enough to get the PRNG
> going. It's not obvious that calling it multiple times would have any
> benefit.
> 
> They also don't appear to bother checking the return code from
> RAND_poll() [2]. If it did fail, there'd not be much you could do
> anyway, so you might as well just let it continue and let RAND_bytes()
> fail. In fact it may even be possible for RAND_poll() to fail, but
> just do enough to cause RAND_bytes() to succeed.
>
> [1] https://wiki.openssl.org/index.php/Random_Numbers

This quote from the wiki is scary, so that's not quite clean either for
Windows:
"Be careful when deferring to RAND_poll on some Unix systems because it
does not seed the generator. See the code guarded with
OPENSSL_SYS_VXWORKS in rand_unix.c. Additionally, RAND_poll can have
negative interactions on newer Windows platforms, so your program could
hang or crash depending on the potential issue. See Windows Issues
below."

> [2] 
> https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c

This repository is outdated, on OpenSSL HEAD I am seeing this used only
in rand_win.c.  And this commit is sort of interesting because there was
a retry loop done with RAND_poll().  Please see this one:
commit: c16de9d8329d41a2433d0f273c080d9d06ad7a87
author: Dr. Matthias St. Pierre 
date: Thu, 31 Aug 2017 23:16:22 +0200
committer: Ben Kaduk 
date: Wed, 18 Oct 2017 08:39:20 -0500
Fix reseeding issues of the public RAND_DRBG

apps/ocsp.c also has the wisdom to check for a failure on RAND_poll().
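
To make the pattern being discussed concrete, here is a minimal standalone C
sketch of the seed-then-generate flow.  The function name and error handling
are illustrative only, not PostgreSQL code; the OpenSSL calls are the standard
RAND_status()/RAND_poll()/RAND_bytes() API.

  #include <stdio.h>
  #include <openssl/rand.h>

  /* Illustrative only: try to ensure the PRNG is seeded, then let
   * RAND_bytes() be the final authority on success or failure. */
  static int
  generate_random_bytes(unsigned char *buf, int len)
  {
      if (RAND_status() == 0)
          (void) RAND_poll();     /* best effort; RAND_bytes() decides below */

      if (RAND_bytes(buf, len) != 1)
      {
          fprintf(stderr, "could not generate random bytes\n");
          return 0;
      }
      return 1;
  }

  int
  main(void)
  {
      unsigned char key[4];

      if (generate_random_bytes(key, sizeof(key)))
          printf("%02x%02x%02x%02x\n", key[0], key[1], key[2], key[3]);
      return 0;
  }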
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process

2018-07-17 Thread Thomas Munro
On Sat, Jul 14, 2018 at 5:34 AM, Heikki Linnakangas  wrote:
> On 18/04/18 09:55, Thomas Munro wrote:
>> Here's a draft patch that does that.  One contentious question is:
>> should you have to opt *in* to auto-exit-on-postmaster death?  Andres
>> opined that you should.  I actually think it's not so bad if you don't
>> have to do that, and instead have to opt out.  I think of it as a kind
>> of 'process cancellation point' or a quiet PANIC that you can opt out
>> of.  It's nice to remove the old boilerplate code without having to
>> add a new boilerplate event that you have to remember every time.  Any
>> other opinions?
>
> Hmm. Exiting on postmaster death by default does feel a bit too magical to
> me. But as your patch points out, there are currently no places where you
> *don't* want to exit on postmaster death, some callers just prefer to handle
> it themselves. And I like having a default or something to make sure that
> all call sites in the future will also exit quickly.
>
> I'd suggest that we add a new WL_EXIT_ON_POSTMASTER_DEATH flag, making it
> opt-on.

Ok, it's now 2 against 1.  So here's a version that uses an explicit
WL_EXIT_ON_PM_DEATH value.  I like that name because it's shorter and
more visually distinctive (harder to confuse with
WL_POSTMASTER_DEATH).

> But add an assertion in WaitLatchOrSocket:
>
> Assert ((wakeEvents & (WL_EXIT_POSTMASTER_DEATH | WL_POSTMASTER_DEATH)) !=
> 0);

Ok.  Done for the WaitLatchXXX() interface.  The WaitEventSet
interface (rarely used directly) is less amenable to that change.

Here are some of the places I had to add WL_EXIT_ON_PM_DEATH:
gather_readnext(), shm_mq_send_bytes(), shm_mq_receive_bytes(),
shm_mq_wait_internal(), ProcSleep(), ProcWaitForSignal(), pg_sleep(),
pgfdw_get_result().

Was it intentional that any of those places don't currently exit on
postmaster vaporisation?
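
To illustrate, a converted call site might end up looking like the fragment
below (backend-only code, so not standalone; WL_EXIT_ON_PM_DEATH and its
semantics come from the attached patch, the rest is the existing latch API):

  for (;;)
  {
      int         rc;

      /* With WL_EXIT_ON_PM_DEATH the process exits if the postmaster dies,
       * so no explicit WL_POSTMASTER_DEATH handling is needed here. */
      rc = WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     1000L /* ms */, WAIT_EVENT_PG_SLEEP);

      if (rc & WL_LATCH_SET)
      {
          ResetLatch(MyLatch);
          break;
      }
  }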

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Add-WL_EXIT_ON_PM_DEATH-pseudo-event-v3.patch
Description: Binary data


Re: PG 10: could not generate random cancel key

2018-07-17 Thread Michael Paquier
On Tue, Jul 17, 2018 at 01:31:01PM -0400, Robert Haas wrote:
> On Tue, Jul 17, 2018 at 1:27 PM, Alvaro Herrera  
> wrote:
>> On 2018-Jul-17, Robert Haas wrote:
>>> On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed  
>>> wrote:
>>>> if (RAND_status() == 0)
>>>>     RAND_poll();
>>>
>>> Looks like a recipe for an infinite loop.  At least, I think we ought
>>> to have a CHECK_FOR_INTERRUPTS() in that loop.
>>
>> What loop?

Adding a CHECK_FOR_INTERRUPTS() could make sense if the number of
retries got high enough, but that just does not apply here.

> Ugh, I'm not doing very well today, am I?  I read that as while() but
> it says if().

Time for vacations :)
--
Michael


signature.asc
Description: PGP signature


Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

2018-07-17 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jul 17, 2018 at 8:55 PM, Tom Lane  wrote:
>> Uh, what?  There are only a few callers of those, and they'd all have
>> crashed already if they were somehow dealing with an invalid buffer.

> Sorry, I meant Assert(owner != NULL).

Oh, gotcha: so that if an external developer hits it, he can more
easily see that this is from an (effective) API change and not some
mysterious bug.  Makes sense, especially if we include a comment:

/* We used to allow pinning buffers without a resowner, but no more */
Assert(CurrentResourceOwner != NULL);

I think it's sufficient to do this in ResourceOwnerEnlargeBuffers,
though.  The other two should be unreachable without having gone
through that.

regards, tom lane



Re: patch to allow disable of WAL recycling

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 3:12 PM, Peter Eisentraut
 wrote:
> The actual implementation could use another round of consideration.  I
> wonder how this should interact with min_wal_size.  Wouldn't
> min_wal_size = 0 already do what we need (if you could set it to 0,
> which is currently not possible)?

Hmm, would that actually disable recycling, or just make it happen only rarely?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 8:55 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane  wrote:
>>> Any objections?  Anyone want to do further review?
>
>> LGTM.  I think this is an improvement.  However, it seems like it
>> might be a good idea for ResourceOwnerRememberBuffer and
>> ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if
>> somebody messes up it will trip an assertion rather than just seg
>> faulting.
>
> Uh, what?  There are only a few callers of those, and they'd all have
> crashed already if they were somehow dealing with an invalid buffer.

Sorry, I meant Assert(owner != NULL).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

2018-07-17 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane  wrote:
>> Any objections?  Anyone want to do further review?

> LGTM.  I think this is an improvement.  However, it seems like it
> might be a good idea for ResourceOwnerRememberBuffer and
> ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if
> somebody messes up it will trip an assertion rather than just seg
> faulting.

Uh, what?  There are only a few callers of those, and they'd all have
crashed already if they were somehow dealing with an invalid buffer.

regards, tom lane



Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 8:01 PM, Tom Lane  wrote:
> Any objections?  Anyone want to do further review?

LGTM.  I think this is an improvement.  However, it seems like it
might be a good idea for ResourceOwnerRememberBuffer and
ResourceOwnerForgetBuffer to Assert(buffer != NULL), so that if
somebody messes up it will trip an assertion rather than just seg
faulting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 5:16 PM, Peter Geoghegan  wrote:
> There is actually a flipside to that downside, though (i.e. the
> downside is also an upside): While not filling up leaf pages that have
> free space on them is bad, it's only bad when it doesn't leave the
> pages completely empty. Leaving the pages completely empty is actually
> good, because then VACUUM is in a position to delete entire pages,
> removing their downlinks from parent pages. That's a variety of bloat
> that we can reverse completely. I suspect that you'll see far more of
> that favorable case in the real world with my patch. It's pretty much
> impossible to do page deletions with pages full of duplicates today,
> because the roughly-uniform distribution of still-live tuples among
> leaf pages fails to exhibit any temporal locality. So, maybe my patch
> would still come out ahead of simply ripping out "getting tired" in
> this parallel universe where latency doesn't matter, and space
> utilization is everything.

Yes, that's a good point.  Also, and I think pretty importantly, this
seems essential if we want to allow retail index tuple deletion, which
has its own set of advantages.

I don't think you're going to be able to rip out the getting-tired
code, though, because presumably we'll have to continue support
existing btree indexes that don't include TIDs for some
probably-not-small number of future releases, even if we decide that
all future btree indexes (and any that are reindexed) should have
TIDs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Another usability issue with our TAP tests

2018-07-17 Thread Michael Paquier
On Tue, Jul 17, 2018 at 03:02:32PM -0400, Tom Lane wrote:
> Cute idea, but it seems not to work with older versions of prove:
> 
> $ which prove
> /usr/local/perl5.8.3/bin/prove
> $ prove --state=save
> Unknown option: s

I didn't know about this one, and that's actually nice, but I cannot
easily find a way to track down which tests failed during a parallel run...
And I am also used to hunting those down with "git status".

> We could just do a "touch .prove_running" beforehand and an "rm" after,
> perhaps (I think Michael suggested something like that already).

Yes, except that in the idea proposed upthread TestLib.pm would be in
control of it, so that there is less code churn in prove_check and
prove_installcheck.  But I am having second thoughts about putting the
test state directly in the module, as this is a four-liner in
Makefile.global.in, and that seems to work even if you put a die() to
fail hard or if only a subset of tests is failing.

Thoughts?
--
Michael
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 95d090e72d..fc81380ad2 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -417,15 +417,19 @@ endef
 ifeq ($(enable_tap_tests),yes)
 
 define prove_installcheck
+touch '$(CURDIR)'/tap_running
 rm -rf '$(CURDIR)'/tmp_check
 $(MKDIR_P) '$(CURDIR)'/tmp_check
 cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
+rm '$(CURDIR)'/tap_running
 endef
 
 define prove_check
+touch '$(CURDIR)'/tap_running
 rm -rf '$(CURDIR)'/tmp_check
 $(MKDIR_P) '$(CURDIR)'/tmp_check
 cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
+rm '$(CURDIR)'/tap_running
 endef
 
 else


signature.asc
Description: PGP signature


Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

2018-07-17 Thread Tom Lane
I wrote:
>> So I said I didn't want to do extra work on this, but I am looking into
>> fixing it by having these aux process types run a ResourceOwner that can
>> be told to clean up any open buffer pins at exit.

> That turned out to be a larger can of worms than I'd anticipated, as it
> soon emerged that we'd acquired a whole lot of cargo-cult programming
> around ResourceOwners. ...
> I'm mostly pretty happy with this, but I think there are a couple of
> loose ends in logicalfuncs.c and slotfuncs.c: those are creating
> non-standalone ResourceOwners (children of whatever the active
> ResourceOwner is) and doing nothing much to clean them up.  That seems
> pretty bogus.

Further investigation showed that the part of that code that was
actually needed was not the creation of a child ResourceOwner, but rather
restoration of the old CurrentResourceOwner setting after exiting the
logical decoding loop.  Apparently we can come out of that with the
TopTransaction resowner being active.  This still seems a bit bogus;
maybe there should be a save-and-restore happening somewhere else?
But I'm not really excited about doing more than commenting it.

Also, most of the other random creations of ResourceOwners seem to just
not be necessary at all, even with the new rule that you must have a
resowner to acquire buffer pins.  So the attached revised patch just
gets rid of them, and improves some misleading/wrong comments on the
topic.  It'd still be easy to use CreateAuxProcessResourceOwner in any
process where we discover we need one, but I don't see the value in adding
useless overhead.

At this point I'm leaning to just applying this in HEAD and calling it
good.  The potential for assertion failures isn't relevant to production
builds, and my best guess at this point is that there isn't really any
other user-visible bug.  The resowners that were being created and not
adequately cleaned up all seem to have been dead code anyway (ie they'd
never acquire any resources).  We could have a problem if, say, the
bgwriter exited via the FATAL path while holding a pin, but I don't think
there's a reason for it to do that except in a database-wide shutdown,
where the consequences of a leaked pin seem pretty minimal.

Any objections?  Anyone want to do further review?

regards, tom lane

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 4e32cff..30ddf94 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -37,7 +37,6 @@
 #include "utils/guc.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
-#include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/typcache.h"
 
@@ -1220,16 +1219,19 @@ ParallelWorkerMain(Datum main_arg)
 	Assert(ParallelWorkerNumber == -1);
 	memcpy(&ParallelWorkerNumber, MyBgworkerEntry->bgw_extra, sizeof(int));
 
-	/* Set up a memory context and resource owner. */
-	Assert(CurrentResourceOwner == NULL);
-	CurrentResourceOwner = ResourceOwnerCreate(NULL, "parallel toplevel");
+	/* Set up a memory context to work in, just for cleanliness. */
 	CurrentMemoryContext = AllocSetContextCreate(TopMemoryContext,
  "Parallel worker",
  ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Now that we have a resource owner, we can attach to the dynamic shared
-	 * memory segment and read the table of contents.
+	 * Attach to the dynamic shared memory segment for the parallel query, and
+	 * find its table of contents.
+	 *
+	 * Note: at this point, we have not created any ResourceOwner in this
+	 * process.  This will result in our DSM mapping surviving until process
+	 * exit, which is fine.  If there were a ResourceOwner, it would acquire
+	 * ownership of the mapping, but we have no need for that.
 	 */
 	seg = dsm_attach(DatumGetUInt32(main_arg));
 	if (seg == NULL)
@@ -1393,7 +1395,7 @@ ParallelWorkerMain(Datum main_arg)
 	/* Must exit parallel mode to pop active snapshot. */
 	ExitParallelMode();
 
-	/* Must pop active snapshot so resowner.c doesn't complain. */
+	/* Must pop active snapshot so snapmgr.c doesn't complain. */
 	PopActiveSnapshot();
 
 	/* Shut down the parallel-worker transaction. */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4049deb..5d01edd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6356,6 +6356,15 @@ StartupXLOG(void)
 	struct stat st;
 
 	/*
+	 * We should have an aux process resource owner to use, and we should not
+	 * be in a transaction that's installed some other resowner.
+	 */
+	Assert(AuxProcessResourceOwner != NULL);
+	Assert(CurrentResourceOwner == NULL ||
+		   CurrentResourceOwner == AuxProcessResourceOwner);
+	CurrentResourceOwner = AuxProcessResourceOwner;
+
+	/*
 	 * Verify XLOG status looks valid.
 	 */
 	if (ControlFile->state < DB_SHUTDOWNED ||
@@ -8467,6 +8476,15 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 void
 ShutdownXLOG(int code, Datum arg)
 

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-07-17 Thread Tomas Vondra


On 07/18/2018 12:41 AM, Konstantin Knizhnik wrote:
> ...
> 
> Teodor Sigaev has proposed an alternative approach for calculating the
> selectivity of a multicolumn join or compound index search.
> Usually a DBA creates compound indexes which can be used by the optimizer to
> build an efficient query execution plan based on index search.
> We can store statistics for the compound keys of such indexes (as is
> done now for functional indexes) and use them to estimate the selectivity
> of clauses. I have implemented this idea and will publish the patch in a
> separate thread soon.
> Now I just want to share some results for Tomas's examples.
> 
> So for vanilla Postgres without extra statistics the estimated number of rows
> is about 4 times smaller than the real one.
> 

Can you please post plans with parallelism disabled, and perhaps without
the aggregate? Both make reading the plans unnecessarily difficult ...
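
(For example, something along these lines before running EXPLAIN ANALYZE;
these are standard GUCs, nothing specific to this patch:)

  SET max_parallel_workers_per_gather = 0;  -- force a non-parallel plan
  EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 AND (b = 1 OR b = 2);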

>  postgres=# explain analyze SELECT count(*) FROM foo WHERE a=1 and (b=1
> or b=2);
> QUERY PLAN
> 
> 
> 
>  Aggregate  (cost=10964.76..10964.77 rows=1 width=8) (actual
> time=49.251..49.251 rows=1 loops=1)
>    ->  Bitmap Heap Scan on foo  (cost=513.60..10906.48 rows=23310
> width=0) (actual time=17.368..39.928 rows=9 loops=1)

ok, 23k vs. 90k

> 
> If we add statistic for a and b  columns:
> 
>  create statistics ab on a,b from foo;
>  analyze foo;
> 
> then the expected result is about 30% larger than the real one: 120k vs 90k:
> 

Eh? The plan however says 9k vs. 30k ...

>    ->  Parallel Bitmap Heap Scan on foo
> (cost=2561.33..13424.24 rows=9063 width=0) (actual time=15.551..26.057
> rows=3 loops=3)

...

> With compound index statistic estimation is almost equal to real value:
> 
>    ->  Bitmap Heap Scan on foo  (cost=1880.20..13411.66 rows=23310
> width=0) (actual time=38.776..61.050 rows=9 loops=1)

Well, this says 23k vs. 90k.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Peter Geoghegan
On Tue, Jul 17, 2018 at 3:10 PM, Tom Lane  wrote:
> Well, the actual problem was O(N^2) behavior:
>
> https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=40549e9cb5abd2986603883e4ab567dab34723c6

Oh, yeah. We still put the new tuple to the left of all existing
duplicates on the leaf that we decide to insert on, because "we'll
insert it before any existing equal keys because of the way
_bt_binsrch() works". I actually plan on mostly fixing that aspect of
_bt_binsrch() while I'm at it, which is why I didn't think to frame it
that way.

It is claimed that we need to do this _bt_binsrch() go-left-on-equal
thing for internal pages because we allow duplicates, contrary to L&Y.
I find it much easier to think of it as being necessary due to index
scans where only some columns are specified in the initial insertion
scan key that finds the initial leaf page (i.e. the _bt_first()
stuff). I'm not going to touch the _bt_binsrch() go-left-on-equal
thing, because it's actually necessary for any real world
implementation that has multiple columns in tuples. Fortunately, we
can usually avoid the go-left-on-equal thing for internal pages by
avoiding equality -- a sentinel heap scan TID value may be used for
this in many cases.

The Lehman and Yao paper is badly written. They should have talked
about the _bt_binsrch() go-left-on-equal thing, rather than leaving it
to the reader to figure it out. It's not why L&Y require unique keys,
though.

Sorry if all this detail isn't useful right now. The important point
is that I actually think that this code gets most things right
already.

> I certainly have no objection to improving matters, but let's be sure
> we don't re-introduce any two-decade-old problems.

A mountain of work remains to validate the performance of the patch.

-- 
Peter Geoghegan



Re: Fix some error handling for read() and errno

2018-07-17 Thread Michael Paquier
On Mon, Jul 16, 2018 at 04:04:12PM -0400, Alvaro Herrera wrote:
> No objection here -- incremental progress is better than none.

Thanks.  I have pushed 0001 now.  I have found some more work which
could be done, for which I'll spawn a new thread.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)

2018-07-17 Thread Fabien COELHO


Hello Heikki,

> [...] Let's keep it that way. I think the only change we need to make in
> the logic is to check at the end, if *any* progress reports at all have
> been printed, and print one if not.


Ok, this simplifies the condition.

> And do that only when the -P option is smaller than the -T option, I
> suppose.


Yep, why not.

> Oh. I'm a bit surprised we don't support decimals, i.e. -P 0.5. Actually, it
> seems to be accepted, but it's truncated down to the nearest integer.


Indeed, "atoi" does not detect errors, and it is true of the many uses in 
pgbench: clients, threads, scale, duration, fillfactor...
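
(As an aside, a minimal sketch of what error-checked parsing with strtol()
looks like, just to illustrate the difference from atoi(); the function name
is made up for the example and is not from the pgbench sources:)

  #include <errno.h>
  #include <limits.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Parse a strictly positive int, rejecting input that atoi() would
   * silently truncate or misread, such as "0.5" or "2x". */
  static int
  parse_positive_int(const char *s, const char *what)
  {
      char   *end;
      long    val;

      errno = 0;
      val = strtol(s, &end, 10);
      if (errno != 0 || end == s || *end != '\0' || val <= 0 || val > INT_MAX)
      {
          fprintf(stderr, "invalid %s: \"%s\"\n", what, s);
          exit(1);
      }
      return (int) val;
  }

  int
  main(void)
  {
      printf("%d\n", parse_positive_int("10", "duration"));
      return 0;
  }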



> That's not very nice :-(. But it's a separate issue.


Yep. For human consumption, seconds seem okay.


>> The more reasonable alternative could be to always last 2 seconds under
>> -T 2, even if the execution can be shortened because there is nothing to do
>> at all, i.e. remove the environment-based condition but keep the sleep.


> That sounds reasonable. It's a bit silly to wait when there's nothing to do,
> but it's also weird if the test exits before the specified time is up. Seems
> less surprising to always sleep.


I did that in the attached version: no more environment variable hack, and 
no execution shortcut even if there is nothing to do.


I also had to reproduce the progress logic to keep on printing reports of
(no) progress in this tailing phase.


I'm not very happy because it is a change of behavior. I suggest that I 
could add a special "--strict-time-compliance" option to do this only when 
required... and it would only be required by tap tests.


--
Fabien.diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 41b756c089..02aff58a48 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -5597,6 +5597,96 @@ main(int argc, char **argv)
return 0;
 }
 
+/* display the progress report */
+static void
+doProgressReport(TState *thread, StatsData *plast, int64 *plast_report,
+int64 now, int64 thread_start)
+{
+   StatsData   cur;
+   int64   run = now - *plast_report,
+   ntx;
+   double  tps,
+   total_run,
+   latency,
+   sqlat,
+   lag,
+   stdev;
+   chartbuf[64];
+   int i;
+
+   /*
+* Add up the statistics of all threads.
+*
+* XXX: No locking. There is no guarantee that we get an
+* atomic snapshot of the transaction count and latencies, so
+* these figures can well be off by a small amount. The
+* progress report's purpose is to give a quick overview of
+* how the test is going, so that shouldn't matter too much.
+* (If a read from a 64-bit integer is not atomic, you might
+* get a "torn" read and completely bogus latencies though!)
+*/
+   initStats(&cur, 0);
+   for (i = 0; i < nthreads; i++)
+   {
+   mergeSimpleStats(&cur.latency, &thread[i].stats.latency);
+   mergeSimpleStats(&cur.lag, &thread[i].stats.lag);
+   cur.cnt += thread[i].stats.cnt;
+   cur.skipped += thread[i].stats.skipped;
+   }
+
+   /* we count only actually executed transactions */
+   ntx = (cur.cnt - cur.skipped) - (plast->cnt - plast->skipped);
+   total_run = (now - thread_start) / 1000000.0;
+   tps = 1000000.0 * ntx / run;
+   if (ntx > 0)
+   {
+   latency = 0.001 * (cur.latency.sum - plast->latency.sum) / ntx;
+   sqlat = 1.0 * (cur.latency.sum2 - plast->latency.sum2) / ntx;
+   stdev = 0.001 * sqrt(sqlat - 1000000.0 * latency * latency);
+   lag = 0.001 * (cur.lag.sum - plast->lag.sum) / ntx;
+   }
+   else
+   {
+   latency = sqlat = stdev = lag = 0;
+   }
+
+   if (progress_timestamp)
+   {
+   /*
+* On some platforms the current system timestamp is
+* available in now_time, but rather than get entangled
+* with that, we just eat the cost of an extra syscall in
+* all cases.
+*/
+   struct timeval tv;
+
+   gettimeofday(&tv, NULL);
+   snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
+(long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+   }
+   else
+   {
+   /* round seconds are expected, but the thread may be late */
+   snprintf(tbuf, sizeof(tbuf), "%.1f s", total_run);
+   }
+
+   fprintf(stderr,
+   "progress: %s, %.1f tps, lat %.3f ms stddev %.3f",
+   tbuf, tps, latency, stdev);
+
+   if (throttle_delay)
+   {
+   fprintf(stderr, ", lag %.3f ms", lag);
+   if (latency_limit)
+   fprintf(stderr, ", " 

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-07-17 Thread Konstantin Knizhnik




On 16.07.2018 23:55, Tomas Vondra wrote:


On 07/16/2018 02:54 PM, Dean Rasheed wrote:

On 16 July 2018 at 13:23, Tomas Vondra  wrote:

The top-level clauses allow us to make such deductions, with deeper
clauses it's much more difficult (perhaps impossible). Because for
example with (a=1 AND b=1) there can be just a single match, so if we
find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
b=2)) it's not that simple, because there may be multiple combinations
and so a match in MCV does not guarantee anything.

Actually, it guarantees a lower bound on the overall selectivity, and
maybe that's the best that we can do in the absence of any other
stats.


Hmmm, is that actually true? Let's consider a simple example, with two
columns, each with just 2 values, and a "perfect" MCV list:

 a | b | frequency
---+---+-----------
 1 | 1 | 0.5
 2 | 2 | 0.5

And let's estimate sel(a=1 & b=2).

OK. In this case, there are no MCV matches, so there is no lower bound (it's 0).

What we could do though is also impose an upper bound, based on the
sum of non-matching MCV frequencies. In this case, the upper bound is
also 0, so we could actually say the resulting selectivity is 0.


Hmmm, it's not very clear to me how we would decide which of these cases
applies, because in most cases we don't have an MCV list covering 100% of the rows.

Anyways, I've been thinking about how to modify the code to work the way
you proposed (in a way sufficient for a PoC). But after struggling with
it for a while it occurred to me it might be useful to do it on paper
first, to verify how it would work on your examples.

So let's use this data

create table foo(a int, b int);
insert into foo select 1,1 from generate_series(1,50000);
insert into foo select 1,2 from generate_series(1,40000);
insert into foo select 1,x/10 from generate_series(30,250000) g(x);
insert into foo select 2,1 from generate_series(1,30000);
insert into foo select 2,2 from generate_series(1,20000);
insert into foo select 2,x/10 from generate_series(30,500000) g(x);
insert into foo select 3,1 from generate_series(1,10000);
insert into foo select 3,2 from generate_series(1,5000);
insert into foo select 3,x from generate_series(3,600000) g(x);
insert into foo select x,x/10 from generate_series(4,750000) g(x);

Assuming we have perfectly exact statistics, we have these MCV lists
(both univariate and multivariate):

   select a, count(*), round(count(*) /2254937.0, 4) AS frequency
 from foo group by a order by 2 desc;

      a | count  | frequency
 -------+--------+-----------
      3 | 614998 |    0.2727
      2 | 549971 |    0.2439
      1 | 339971 |    0.1508
   1014 |      1 |    0.0000
  57220 |      1 |    0.0000
 ...

   select b, count(*), round(count(*) /2254937.0, 4) AS frequency
 from foo group by b order by 2 desc;

      b | count | frequency
 -------+-------+-----------
      1 | 90010 |    0.0399
      2 | 65010 |    0.0288
      3 |    31 |    0.0000
      7 |    31 |    0.0000
...

   select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
 from foo group by a, b order by 3 desc;

  a |     b | count | frequency
 ---+-------+-------+-----------
  1 |     1 | 50000 |    0.0222
  1 |     2 | 40000 |    0.0177
  2 |     1 | 30000 |    0.0133
  2 |     2 | 20000 |    0.0089
  3 |     1 | 10000 |    0.0044
  3 |     2 |  5000 |    0.0022
  2 | 12445 |    10 |    0.0000
 ...

And let's estimate the two queries with complex clauses, where the
multivariate stats gave 2x overestimates.

SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
-- actual 90000, univariate: 24253, multivariate: 181091

univariate:

  sel(a=1) = 0.1508
  sel(b=1) = 0.0399
  sel(b=2) = 0.0288
  sel(b=1 or b=2) = 0.0673

multivariate:
  sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)

The second multivariate estimate comes from assuming only the first 5
items make it to the multivariate MCV list (covering 6.87% of the data)
and extrapolating the selectivity to the non-MCV data too.

(Notice it's about 2x the actual selectivity, so this extrapolation due
to not realizing the MCV already contains all the matches is pretty much
responsible for the whole over-estimate).

So, how would the proposed algorithm work? Let's start with "a=1":

sel(a=1) = 0.1508

I don't see much point in applying the two "b" clauses independently (or
how would it be done, as it's effectively a single clause):

sel(b=1 or b=2) = 0.0673

And we get

total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101

 From the multivariate MCV we get

mcv_sel = 0.0399

And finally

total_sel = Max(total_sel, mcv_sel) = 0.0399
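
Expressed as a tiny stand-alone C sketch (purely illustrative, not the
extended-statistics code; the function name is made up and the constants
are just the numbers above):

    #include <stdio.h>

    /* combine independent per-clause selectivities with a multivariate MCV match */
    static double
    combined_selectivity(double simple_sel, double mcv_sel)
    {
        /* the MCV matches act as a lower bound on the combined estimate */
        return (mcv_sel > simple_sel) ? mcv_sel : simple_sel;
    }

    int
    main(void)
    {
        double  sel_a = 0.1508;     /* sel(a=1) */
        double  sel_b = 0.0673;     /* sel(b=1 or b=2) */
        double  mcv_sel = 0.0399;   /* multivariate MCV matches */

        /* independence gives 0.0101, so the MCV lower bound wins: 0.0399 */
        printf("%.4f\n", combined_selectivity(sel_a * sel_b, mcv_sel));
        return 0;
    }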

Which is great, but I don't see how that actually helped anything? We
still only have the estimate for the ~7% covered by the MCV list, and
not the remaining non-MCV part.

I could do the same thing for the second 

Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Tom Lane
Peter Geoghegan  writes:
> I've done plenty of research into the history of this hack. It was
> your work, but it does actually make sense in the context of today's
> nbtree code. It is essential with scankey-wise duplicates, since
> groveling through hundreds or even thousands of pages full of
> duplicates to find free space (and avoid a page split) is going to
> have a very serious downside for latency.

Well, the actual problem was O(N^2) behavior:

https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us

https://git.postgresql.org/gitweb/?p=postgresql.git=commitdiff=40549e9cb5abd2986603883e4ab567dab34723c6

I certainly have no objection to improving matters, but let's be sure
we don't re-introduce any two-decade-old problems.

regards, tom lane



Re: Allowing multiple DDL commands to run simultaneously

2018-07-17 Thread Simon Riggs
On 17 July 2018 at 19:47, Robert Haas  wrote:
> On Mon, Jul 9, 2018 at 6:00 AM, Simon Riggs  wrote:
>> Proposal would be to add a new lock mode "ShareUpdate", which does not
>> conflict with itself and yet conflicts with "ShareUpdateExclusive" or
>> higher. (Hence, it is a strong lock type). DDL would take a
>> ShareUpdateLock on the table, then during critical portions of
>> commands it would take a ShareUpdateExclusiveLock and then release it
>> again before commit.
>
> I think this would be quite prone to deadlocks.  Suppose someone tries
> to grab an AccessExclusiveLock on the table during a window in which
> we hold only ShareUpdateLock.  The next attempt to upgrade to
> ShareUpdateExclusiveLock will cause a simple deadlock.  In general,
> any approach that involves upgrading our lock strength is likely to
> have this problem.
>
> You might be able to work around this by inventing a whole new lock
> type, say "Relation Maintenance".  Make a rule that you can only take
> the "Relation Maintenance" lock while holding a Relation lock with
> strength >= ShareUpdateLock and that you do not need to bother
> acquiring it if you hold a self-exclusive lock that conflicts with
> ShareUpdateLock.  I think that works out to about the same thing as
> what you're proposing, except without the deadlock hazard.

Yes, it seems better to invent a new orthogonal lock type than have a
new lock level. Thanks.

Seems more like a critical section than a lock.

I'd make code take that lock, even if they have a self-exclusive lock,
just to avoid later problems when the lock level changes.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Peter Geoghegan
On Tue, Jul 17, 2018 at 1:29 PM, Tom Lane  wrote:
> Yeah ... if memory serves, there were specific usage patterns where
> that hack made things way better than they'd been before.  (I do not
> recall if the hack itself was mine, but I think I can be blamed for
> the "getting tired" comment ...)  I'd suggest git blaming your way
> to the commit that put that in, and then checking the hackers archives
> around that date for more info.

I've done plenty of research into the history of this hack. It was
your work, but it does actually make sense in the context of today's
nbtree code. It is essential with scankey-wise duplicates, since
groveling through hundreds or even thousands of pages full of
duplicates to find free space (and avoid a page split) is going to
have a very serious downside for latency.

Vadim wanted to fix the problem by more or less doing what I propose
[1], though he never got into figuring out how to make that practical
(i.e. how to make it not bloat up internal pages, which must represent
heap TID as just another part of the key space). Having unique keys
isn't just assumed by Lehman & Yao; I think that it's assumed by most
or all real-world B-Tree implementations.

[1] https://www.postgresql.org/message-id/18788.963953...@sss.pgh.pa.us
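
To make "heap TID as just another part of the key space" concrete, the
comparison ends up looking roughly like this (a toy sketch with a single
int4 key column; the function name is made up and this is not the actual
nbtree changes):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* toy sketch: compare the user-visible key first, heap TID as tiebreaker */
    static int
    compare_with_tid_tiebreak(int32 key1, ItemPointer tid1,
                              int32 key2, ItemPointer tid2)
    {
        if (key1 != key2)
            return (key1 < key2) ? -1 : 1;
        /* equal user keys: order by heap TID, so each tuple has exactly one place */
        return ItemPointerCompare(tid1, tid2);
    }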
-- 
Peter Geoghegan



Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Peter Geoghegan
On Tue, Jul 17, 2018 at 1:12 PM, Robert Haas  wrote:
> This seems like really interesting and important work.  I wouldn't
> have foreseen that the "getting tired" code would have led to this
> kind of bloat (even if I had known about it at all).

Thanks!

I'm glad that I can come up with concrete, motivating examples like
this, because it's really hard to talk about the big picture here.
With something like a pgbench workload, there are undoubtedly many
different factors in play, since temporal locality influences many
different things all at once. I don't think that I understand it all
just yet. Taking a holistic view of the problem seems very helpful,
but it's also very frustrating at times.

> I wonder,
> though, whether it's possible that the reverse could happen in some
> other scenario.  It seems to me that with the existing code, if you
> reinsert a value many copies of which have been deleted, you'll
> probably find partially-empty pages whose free space can be reused,
> but if there's one specific place where each tuple needs to go, you
> might end up having to split pages if the new TIDs are all larger or
> smaller than the old TIDs.

That's a legitimate concern. After all, what I've done boils down to
adding a restriction on space utilization that wasn't there before.
This clearly helps because it makes it practical to rip out the
"getting tired" thing, but that's not everything. There are good
reasons for that hack, but if latency magically didn't matter then we
could just tear the hack out without doing anything else. That would
make groveling through pages full of duplicates at least as discerning
about space utilization as my patch manages to be, without any of the
complexity.

There is actually a flipside to that downside, though (i.e. the
downside is also an upside): While not filling up leaf pages that have
free space on them is bad, it's only bad when it doesn't leave the
pages completely empty. Leaving the pages completely empty is actually
good, because then VACUUM is in a position to delete entire pages,
removing their downlinks from parent pages. That's a variety of bloat
that we can reverse completely. I suspect that you'll see far more of
that favorable case in the real world with my patch. It's pretty much
impossible to do page deletions with pages full of duplicates today,
because the roughly-uniform distribution of still-live tuples among
leaf pages fails to exhibit any temporal locality. So, maybe my patch
would still come out ahead of simply ripping out "getting tired" in
this parallel universe where latency doesn't matter, and space
utilization is everything.

I made one small mistake with my test case: It actually *is* perfectly
efficient at recycling space even at the end, since I don't delete all
the duplicates (just 90% of them). Getting tired might have been a
contributing factor there, too.

-- 
Peter Geoghegan



Re: patch to allow disable of WAL recycling

2018-07-17 Thread Tomas Vondra
On 07/17/2018 09:12 PM, Peter Eisentraut wrote:
> On 17.07.18 00:04, Jerry Jelinek wrote:
>> There have been quite a few comments since last week, so at this point I
>> am uncertain how to proceed with this change. I don't think I saw
>> anything concrete in the recent emails that I can act upon.
> 
> The outcome of this could be multiple orthogonal patches that affect the
> WAL file allocation behavior somehow.  I think your original idea of
> skipping recycling on a COW file system is sound.  But I would rather
> frame the option as "preallocating files is obviously useless on a COW
> file system" rather than "this will make things mysteriously faster with
> uncertain trade-offs".
> 

Makes sense, I guess. But I think many claims made in this thread are
mostly just assumptions at this point, based on our beliefs about how CoW or
non-CoW filesystems work. The results from ZFS (showing positive impact)
are an exception, but that's about it. I'm sure those claims are based
on real-world experience and are likely true, but it'd be good to have
data from a wider range of filesystems / configurations etc. so that we
can give better recommendations to users, for example.

That's something I can help with, assuming we agree on what tests we
want to do. I'd say the usual battery of write-only pgbench tests with
different scales (fits into s_b, fits into RAM, larger than RAM) on
common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and
different types of storage would be enough. I don't have any freebsd box
available, unfortunately.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Tom Lane
Robert Haas  writes:
> On Sun, Jul 8, 2018 at 7:59 PM, Peter Geoghegan  wrote:
>> The whole "getting tired" thing is the root of the problem here, which
>> is why the pending v3 of my patch will remove that code completely
>> (_bt_findinsertloc() is streamlined).

> This seems like really interesting and important work.  I wouldn't
> have foreseen that the "getting tired" code would have led to this
> kind of bloat (even if I had known about it at all).  I wonder,
> though, whether it's possible that the reverse could happen in some
> other scenario.  It seems to me that with the existing code, if you
> reinsert a value many copies of which have been deleted, you'll
> probably find partially-empty pages whose free space can be reused,
> but if there's one specific place where each tuple needs to go, you
> might end up having to split pages if the new TIDs are all larger or
> smaller than the old TIDs.

Yeah ... if memory serves, there were specific usage patterns where
that hack made things way better than they'd been before.  (I do not
recall if the hack itself was mine, but I think I can be blamed for
the "getting tired" comment ...)  I'd suggest git blaming your way
to the commit that put that in, and then checking the hackers archives
around that date for more info.

regards, tom lane



Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-17 Thread Robert Haas
On Sun, Jul 8, 2018 at 7:59 PM, Peter Geoghegan  wrote:
> The whole "getting tired" thing is the root of the problem here, which
> is why the pending v3 of my patch will remove that code completely
> (_bt_findinsertloc() is streamlined).

Peter,

This seems like really interesting and important work.  I wouldn't
have foreseen that the "getting tired" code would have led to this
kind of bloat (even if I had known about it at all).  I wonder,
though, whether it's possible that the reverse could happen in some
other scenario.  It seems to me that with the existing code, if you
reinsert a value many copies of which have been deleted, you'll
probably find partially-empty pages whose free space can be reused,
but if there's one specific place where each tuple needs to go, you
might end up having to split pages if the new TIDs are all larger or
smaller than the old TIDs.

I'm really glad that you are working on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2018-07-17 Thread Peter Eisentraut
On 17.07.18 00:04, Jerry Jelinek wrote:
> There have been quite a few comments since last week, so at this point I
> am uncertain how to proceed with this change. I don't think I saw
> anything concrete in the recent emails that I can act upon.

The outcome of this could be multiple orthogonal patches that affect the
WAL file allocation behavior somehow.  I think your original idea of
skipping recycling on a COW file system is sound.  But I would rather
frame the option as "preallocating files is obviously useless on a COW
file system" rather than "this will make things mysteriously faster with
uncertain trade-offs".

The actual implementation could use another round of consideration.  I
wonder how this should interact with min_wal_size.  Wouldn't
min_wal_size = 0 already do what we need (if you could set it to 0,
which is currently not possible)?  Should the new setting be something
like min_wal_size = -1?  Or even if it's a new setting, it might be
better to act on it in XLOGfileslop(), so these things are kept closer
together.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Another usability issue with our TAP tests

2018-07-17 Thread Tom Lane
Peter Eisentraut  writes:
> How about something like this:

> -PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir)
> +PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir) --state=save

Cute idea, but it seems not to work with older versions of prove:

$ which prove
/usr/local/perl5.8.3/bin/prove
$ prove --state=save
Unknown option: s

We could just do a "touch .prove_running" beforehand and an "rm" after,
perhaps (I think Michael suggested something like that already).

regards, tom lane



Re: untrusted PLs should be GRANTable

2018-07-17 Thread Peter Eisentraut
On 17.07.18 07:20, Craig Ringer wrote:
> A user has raised the point that our refusal to GRANT rights to
> untrusted PLs is counterproductive and inconsistent with how we behave
> elsewhere.

Previous discussion:
https://www.postgresql.org/message-id/flat/1357905627.24219.6.camel%40vanquo.pezone.net

What caused that to die was "What might actually be a problem in this
area is that, AFAICT, pg_dump does not save privileges granted to
objects in extensions."  But I think that is fixed/fixable now with
pg_init_privs.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allowing multiple DDL commands to run simultaneously

2018-07-17 Thread Robert Haas
On Mon, Jul 9, 2018 at 6:00 AM, Simon Riggs  wrote:
> Proposal would be to add a new lock mode "ShareUpdate", which does not
> conflict with itself and yet conflicts with "ShareUpdateExclusive" or
> higher. (Hence, it is a strong lock type). DDL would take a
> ShareUpdateLock on the table, then during critical portions of
> commands it would take a ShareUpdateExclusiveLock and then release it
> again before commit.

I think this would be quite prone to deadlocks.  Suppose someone tries
to grab an AccessExclusiveLock on the table during a window in which
we hold only ShareUpdateLock.  The next attempt to upgrade to
ShareUpdateExclusiveLock will cause a simple deadlock.  In general,
any approach that involves upgrading our lock strength is likely to
have this problem.

You might be able to work around this by inventing a whole new lock
type, say "Relation Maintenance".  Make a rule that you can only take
the "Relation Maintenance" lock while holding a Relation lock with
strength >= ShareUpdateLock and that you do not need to bother
acquiring it if you hold a self-exclusive lock that conflicts with
ShareUpdateLock.  I think that works out to about the same thing as
what you're proposing, except without the deadlock hazard.

In general, though, +1 for trying to do something about this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Another usability issue with our TAP tests

2018-07-17 Thread Peter Eisentraut
On 16.07.18 19:13, Tom Lane wrote:
> But a TAP test failure leaves nothing behind that git will consider
> unusual.  I've repeatedly had to run check-world with no parallelism
> (wasting many minutes) in order to locate which test actually failed.

How about something like this:

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 95d090e72d..12114d8427 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -396,7 +396,7 @@ endif
 PROVE = @PROVE@
 # There are common routines in src/test/perl, and some test suites have
 # extra perl modules in their own directory.
-PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir)
+PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir) --state=save
 # User-supplied prove flags such as --verbose can be provided in PROVE_FLAGS.
 PROVE_FLAGS =
 
@@ -420,12 +420,14 @@ define prove_installcheck
 rm -rf '$(CURDIR)'/tmp_check
 $(MKDIR_P) '$(CURDIR)'/tmp_check
 cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" 
PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' 
PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) 
$(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
+@rm -f .prove
 endef
 
 define prove_check
 rm -rf '$(CURDIR)'/tmp_check
 $(MKDIR_P) '$(CURDIR)'/tmp_check
 cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) 
PGPORT='6$(DEF_PGPORT)' 
PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) 
$(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
+@rm -f .prove
 endef
 
 else

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GiST VACUUM

2018-07-17 Thread Andrey Borodin
Hi!

> On 16 July 2018, at 21:24, Andrey Borodin wrote:
> 

I was checking WAL replay of the new scheme to log page deletes and found a bug
there (an incorrect value of the deleted downlink in the WAL record). Here's a
fixed patch, v10.

I've also added support for WAL identification of the new record, and made some
improvements to comments and naming in the data structures.

Thanks!

Best regards, Andrey Borodin.


0002-Physical-GiST-scan-during-VACUUM-v10.patch
Description: Binary data


0001-Delete-pages-during-GiST-VACUUM-v10.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2018-07-17 Thread Robert Haas
On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra
 wrote:
> Oh, right, I forgot the patch also adds the leader into the group, for
> some reason (I agree it's unclear why that would be necessary, as you
> pointed out later).
>
> But all this is happening while holding the partition lock (in exclusive
> mode). And the decoding backends do synchronize on the lock correctly
> (although, man, the rechecks are confusing ...).
>
> But now I see ProcKill accesses decodeGroupLeader in multiple places,
> and only the first one is protected by the lock, for some reason
> (interestingly enough the one in lockGroupLeader block). Is that what
> you mean?

I haven't traced out the control flow completely, but it sure looks to
me like there are places where decodeGroupLeader is checked without
holding any LWLock at all.  Also, it looks to me like some places
(like where we're trying to find a PGPROC by XID) we use ProcArrayLock
and in others -- I guess where we're checking the decodeGroupBlah
stuff -- we are using the lock manager locks.  I don't know how safe
that is, and there are not a lot of comments justifying it.  I also
wonder why we're using the lock manager locks to protect the
decodeGroup stuff rather than backendLock.

> FWIW I suspect the ProcKill part is borked due to incorrectly resolved
> merge conflict or something, per my initial response from today.

Yeah I wasn't seeing the code the way I thought you were describing it
in that response, but I'm dumb this week so maybe I just
misunderstood.

>> I think that's probably not going to work out, but of course it's up
>> to you how you want to spend your time!
>
> Well, yeah. I'm sure I could think of more fun things to do, but OTOH I
> also have patches that require the capability to decode in-progress
> transactions.

It's not a matter of fun; it's a matter of whether it can be made to
work.  Don't get me wrong -- I want the ability to decode in-progress
transactions.  I complained about that aspect of the design to Andres
when I was reviewing and committing logical slots & logical decoding,
and I complained about it probably more than I complained about any
other aspect of it, largely because it instantaneously generates a
large lag when a bulk load commits.  But not liking something about
the way things are is not the same as knowing how to make them better.
I believe there is a way to make it work because I believe there's a
way to make anything work.  But I suspect that it's at least one order
of magnitude more complex than this patch currently is, and likely an
altogether different design.

> But the way I understand it, it pretty much *is* a list of waiters,
> along with a couple of flags to allow the processes to notify the other
> side about lock/unlock/abort. It does resemble the lock groups, but I
> don't think it has the same goals.

So the parts that aren't relevant shouldn't be copied over.

>> That having been said, I still don't see how that's really going to
>> work.  Just to take one example, suppose that the leader is trying to
>> ERROR out, and the decoding workers are blocked waiting for a lock
>> held by the leader.  The system has no way of detecting this deadlock
>> and resolving it automatically, which certainly seems unacceptable.
>> The only way that's going to work is if the leader waits for the
>> worker by trying to acquire a lock held by the worker.  Then the
>> deadlock detector would know to abort some transaction.  But that
>> doesn't really work either - the deadlock was created by the
>> foreground process trying to abort, and if the deadlock detector
>> chooses that process as its victim, what then?  We're already trying
>> to abort, and the abort code isn't supposed to throw further errors,
>> or fail in any way, lest we break all kinds of other things.  Not to
>> mention the fact that running the deadlock detector in the abort path
>> isn't really safe to begin with, again because we can't throw errors
>> when we're already in an abort path.
>
> Fair point, not sure. I'll leave this up to Nikhil.

That's fine, but please understand that I think there's a basic design
flaw here that just can't be overcome with any amount of hacking on
the details here.  I think we need a much higher-level consideration
of the problem here and probably a lot of new infrastructure to
support it.  One idea might be to initially support decoding of
in-progress transactions only if they don't modify any catalog state.
That would leave out a bunch of cases we'd probably like to support,
such as CREATE TABLE + COPY in the same transaction, but it would
likely dodge a lot of really hard problems, too, and we could improve
things later.  One approach to the problem of catalog changes would be
to prevent catalog tuples from being removed even after transaction
abort until such time as there's no decoding in progress that might
care about them.  That is not by itself sufficient because a
transaction can abort after inserting a heap tuple but before

Re: Allow auto_explain to log to NOTICE

2018-07-17 Thread Daniel Gustafsson
> On 17 Jul 2018, at 19:11, Andrew Dunstan  
> wrote:
> 
> On 07/17/2018 12:04 PM, Daniel Gustafsson wrote:

>> Since DEBUG is not a defined loglevel, it seems superfluous to include it 
>> here.
>> It’s also omitted from the documentation so it should probably be omitted 
>> from
>> here.
>> 
>> +   {"debug", DEBUG2, true},
> 
> I treated this like we do for client_min_messages and log_min_messages - the 
alias is there but I don't think we document it either.
> 
> I don't mind removing it, was just trying to be consistent. It seems odd that 
> we would accept it in one place but not another.

Ooh..  I didn’t know that alias existed and didn’t find it when poking at the
code.  In that case I agree with you, the alias should stay so I withdraw that
comment.  I learned something new today =)

>>> I haven't added tests for auto_explain - I think that would be useful
>>> but it's a separate project.
>> Agreeing that this would be beneficial, the attached patch (to apply on top 
>> of
>> the patch in question) takes a stab at adding tests for this new 
>> functionality.
>> 
>> In order to test plan output we need to support COSTS in the explain output, 
>> so
>> a new GUC auto_explain.log_costs is added.  We also need to not print the
>> duration, so as a hack this patch omits the duration if 
>> auto_explain.log_timing
>> is set to off and auto_explain.log_analyze is set off.  This is a hack and 
>> not
>> a nice overloading, but it seems overkill to add a separate GUC just to turn
>> off the duration, any better ideas on how support omitting the duration?
> 
> Great, I'll check it out.

I’m not sure it’s worth adding this much to the code just to be able to test
it, but it seemed like a good exercise to write to have something to reason
about.

cheers ./daniel


Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-07-17 Thread Alvaro Herrera
On 2018-Jul-16, David Rowley wrote:

> On 16 July 2018 at 12:55, David Rowley  wrote:
> > Thinking about this some more, I don't quite see any reason that the
> > partitioned_rels for a single hierarchy couldn't just be a Bitmapset
> > instead of an IntList.
> 
> Of course, this is not possible since we can't pass a List of
> Bitmapsets to the executor due to Bitmapset not being a node type.

Maybe we can just add a new node type that wraps a lone bitmapset.  The
naive implementation (just print out individual members) should be
pretty straightforward; a more sophisticated one (print out the "words")
may end up more compact or not depending on density, but much harder for
debugging, and probably means a catversion bump when BITS_PER_BITMAPWORD
is changed, so probably not a great idea anyway.

I suppose the only reason we haven't done this yet is nobody has needed
it.  Sounds like its time has come.
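
For what it's worth, the naive wrapper could be as small as something like
this (just a sketch; the struct and field names are invented, not from any
patch):

    #include "postgres.h"
    #include "nodes/nodes.h"
    #include "nodes/bitmapset.h"

    /* hypothetical node wrapping a lone Bitmapset so it can live in a List */
    typedef struct BitmapsetWrapper
    {
        NodeTag     type;       /* would need a new node tag */
        Bitmapset  *members;    /* the wrapped set */
    } BitmapsetWrapper;

and outfuncs.c/readfuncs.c would then just emit and parse the individual
members.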

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PG 10: could not generate random cancel key

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 1:27 PM, Alvaro Herrera
 wrote:
> On 2018-Jul-17, Robert Haas wrote:
>
>> On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed  
>> wrote:
>> > if (RAND_status() == 0)
>> > RAND_poll();
>>
>> Looks like a recipe for an infinite loop.  At least, I think we ought
>> to have a CHECK_FOR_INTERRUPTS() in that loop.
>
> What loop?

Ugh, I'm not doing very well today, am I?  I read that as while() but
it says if().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-17 Thread Robert Haas
On Fri, Apr 6, 2018 at 4:25 PM, Claudio Freire  wrote:
>> True all that. My point is that the multi-segmented array isn't all that
>> simple and proven, compared to an also straightforward B-tree. It's pretty
>> similar to a B-tree, actually, except that it has exactly two levels, and
>> the node (= segment) sizes grow exponentially. I'd rather go with a true
>> B-tree, than something homegrown that resembles a B-tree, but not quite.
>
> I disagree.

Yeah, me too.  I think a segmented array is a lot simpler than a
home-grown btree.  I wrote a home-grown btree that ended up becoming
src/backend/utils/mmgr/freepage.c and it took me a long time to get
rid of all the bugs.  Heikki is almost certainly better at coding up a
bug-free btree than I am, but a segmented array is a dead simple data
structure, or should be if done properly, and a btree is not.
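
To illustrate what I mean by dead simple, a segmented append-only array is
roughly this much code (a toy sketch using int64 items instead of
ItemPointers, with made-up names and no error handling; not code from the
patch):

    #include <stdint.h>
    #include <stdlib.h>

    #define FIRST_SEGMENT_CAPACITY 1024

    typedef struct Segment
    {
        size_t      capacity;       /* doubles with each new segment */
        size_t      nused;
        int64_t    *items;
    } Segment;

    typedef struct SegmentedArray
    {
        int         nsegments;
        Segment    *segments;       /* level one: small array of segment headers */
    } SegmentedArray;

    static void
    segmented_array_init(SegmentedArray *sa)
    {
        sa->nsegments = 1;
        sa->segments = malloc(sizeof(Segment));
        sa->segments[0].capacity = FIRST_SEGMENT_CAPACITY;
        sa->segments[0].nused = 0;
        sa->segments[0].items = malloc(FIRST_SEGMENT_CAPACITY * sizeof(int64_t));
    }

    static void
    segmented_array_append(SegmentedArray *sa, int64_t item)
    {
        Segment    *last = &sa->segments[sa->nsegments - 1];

        if (last->nused == last->capacity)
        {
            /* last segment is full: add a new segment, twice as large */
            size_t      newcap = last->capacity * 2;

            sa->segments = realloc(sa->segments,
                                   (sa->nsegments + 1) * sizeof(Segment));
            last = &sa->segments[sa->nsegments++];
            last->capacity = newcap;
            last->nused = 0;
            last->items = malloc(newcap * sizeof(int64_t));
        }
        last->items[last->nused++] = item;
    }

Lookup by ordinal position is then just a short walk over a handful of
segment headers.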

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PG 10: could not generate random cancel key

2018-07-17 Thread Alvaro Herrera
On 2018-Jul-17, Robert Haas wrote:

> On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed  
> wrote:
> > if (RAND_status() == 0)
> > RAND_poll();
> 
> Looks like a recipe for an infinite loop.  At least, I think we ought
> to have a CHECK_FOR_INTERRUPTS() in that loop.

What loop?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: chained transactions

2018-07-17 Thread Heikki Linnakangas

On 01/03/18 05:35, Peter Eisentraut wrote:

The SQL standard offers the "chained transactions" feature to address
this.  The new command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN
immediately start a new transaction with the characteristics (isolation
level, read/write, deferrable) of the previous one.  So code that has
particular requirements regard transaction isolation and such can use
this to simplify code management.


Oh, is that all it does? That's disappointing, because that's a lot less 
powerful than how I understand chained transactions. And at the same 
time relieving, because that's a lot simpler to implement :-).


In Gray & Reuter's classic book, Transaction Processing, they describe
chained transactions as also keeping locks and cursors across the chain.
Unfortunately I don't have a copy at hand, but that's my recollection,
at least. I guess the SQL standard committee had a different idea.


- Heikki



Re: PG 10: could not generate random cancel key

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed  wrote:
> if (RAND_status() == 0)
> RAND_poll();

Looks like a recipe for an infinite loop.  At least, I think we ought
to have a CHECK_FOR_INTERRUPTS() in that loop.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow auto_explain to log to NOTICE

2018-07-17 Thread Andrew Dunstan




On 07/17/2018 12:04 PM, Daniel Gustafsson wrote:

On 4 Jul 2018, at 15:34, Andrew Dunstan  wrote:

On Wed, Jun 20, 2018 at 2:06 PM, Daniel Gustafsson  wrote:

On 27 Apr 2018, at 04:24, Andres Freund  wrote:

On 2018-04-27 11:52:18 +0930, Tom Dunstan wrote:

I'd argue this should contain the non-error cases. It's just as
reasonable to want to add this as a debug level or such.

So all of warning, info, debug and debug1-5?

Yea. Likely nobody will ever use debug5, but I don't see a point making
that judgement call here.

Did you have a chance to hack up a new version of the patch with Andres’ review
comments?  I’m marking this patch as Waiting on Author for now based on the
feedback so far in this thread.


Here is an updated version of the patch. Setting this to "needs review”.



Thanks for the review



Thanks!  Looking at the code, and documentation this seems a worthwhile
addition.  Manual testing proves that it works as expected, and the patch
follows the expected style.  A few small comments:

Since DEBUG is not a defined loglevel, it seems superfluous to include it here.
It’s also omitted from the documentation so it should probably be omitted from
here.

+   {"debug", DEBUG2, true},




I treated this like we do for client_min_messages and log_min_messages - 
the alias is there but I don't think we document it either.


I don't mind removing it, was just trying to be consistent. It seems odd 
that we would accept it in one place but not another.




The  tag should be closed with a matching .

+  auto_explain.log_level configuration 
parameter

With those fixed it’s ready for committer.




Thanks, will fix.



I haven't added tests for auto_explain - I think that would be useful
but it's a separate project.

Agreeing that this would be beneficial, the attached patch (to apply on top of
the patch in question) takes a stab at adding tests for this new functionality.

In order to test plan output we need to support COSTS in the explain output, so
a new GUC auto_explain.log_costs is added.  We also need to not print the
duration, so as a hack this patch omits the duration if auto_explain.log_timing
is set to off and auto_explain.log_analyze is set off.  This is a hack and not
a nice overloading, but it seems overkill to add a separate GUC just to turn
off the duration, any better ideas on how support omitting the duration?




Great, I'll check it out.

cheers

andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow auto_explain to log to NOTICE

2018-07-17 Thread Daniel Gustafsson
> On 4 Jul 2018, at 15:34, Andrew Dunstan  
> wrote:
> 
> On Wed, Jun 20, 2018 at 2:06 PM, Daniel Gustafsson  wrote:
>>> On 27 Apr 2018, at 04:24, Andres Freund  wrote:
>>> 
>>> On 2018-04-27 11:52:18 +0930, Tom Dunstan wrote:
> I'd argue this should contain the non-error cases. It's just as
> reasonable to want to add this as a debug level or such.
 
 So all of warning, info, debug and debug1-5?
>>> 
>>> Yea. Likely nobody will ever use debug5, but I don't see a point making
>>> that judgement call here.
>> 
>> Did you have a chance to hack up a new version of the patch with Andres’ 
>> review
>> comments?  I’m marking this patch as Waiting on Author for now based on the
>> feedback so far in this thread.
>> 
> 
> Here is an updated version of the patch. Setting this to "needs review”.

Thanks!  Looking at the code, and documentation this seems a worthwhile
addition.  Manual testing proves that it works as expected, and the patch
follows the expected style.  A few small comments:

Since DEBUG is not a defined loglevel, it seems superfluous to include it here.
It’s also omitted from the documentation so it should probably be omitted from
here.

+   {"debug", DEBUG2, true},

The  tag should be closed with a matching .

+  auto_explain.log_level configuration 
parameter

With those fixed it’s ready for committer.

> I haven't added tests for auto_explain - I think that would be useful
> but it's a separate project.

Agreeing that this would be beneficial, the attached patch (to apply on top of
the patch in question) takes a stab at adding tests for this new functionality.

In order to test plan output we need to support COSTS in the explain output, so
a new GUC auto_explain.log_costs is added.  We also need to not print the
duration, so as a hack this patch omits the duration if auto_explain.log_timing
is set to off and auto_explain.log_analyze is set off.  This is a hack and not
a nice overloading, but it seems overkill to add a separate GUC just to turn
off the duration, any better ideas on how support omitting the duration?

cheers ./daniel



auto_explain_tests.diff
Description: Binary data







Re: Another fun fact about temp tables and wraparound

2018-07-17 Thread Alvaro Herrera
On 2018-Jul-17, Grigory Smolkin wrote:

> Hello, hackers!
> 
> Recently I was investigating a case of the 'stuck in wraparound' problem.
> The PostgreSQL instance (9.6.9) in question reached the
> 'million-before-wraparound' threshold and switched to read-only mode.
> Running vacuum in single-user mode gave no results; datfrozenxid was not
> advancing:
> 
> backend> vacuum freeze;
> 2018-07-13 16:43:58 MSK [3666-3] WARNING:  database "database_name" must be
> vacuumed within 991565 transactions
> 2018-07-13 16:43:58 MSK [3666-4] HINT:  To avoid a database shutdown,
> execute a database-wide VACUUM in that database.
> You might also need to commit or roll back old prepared
> transactions.
> backend>
> 
> pg_prepared_xacts was empty.
> After some poking around it became clear that some old temp table was
> holding the oldest relfrozenxid!

Hmm, autovacuum is supposed to drop temp tables that are above the
wraparound xid age to avoid this problem -- see autovacuum lines 2046ff.
(Except it doesn't do anything if the owning backend is active.  I guess
this could be a problem if the owning backend fails to do anything about
those tables.  Maybe this part is a mistake.)  Obviously, during
single-user mode autovacuum doesn't run anyway.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Another fun fact about temp tables and wraparound

2018-07-17 Thread Grigory Smolkin

Hello, hackers!

Recently I was investigating a case of the 'stuck in wraparound' problem.
The PostgreSQL instance (9.6.9) in question reached the
'million-before-wraparound' threshold and switched to read-only mode.
Running vacuum in single-user mode gave no results; datfrozenxid was not
advancing:


backend> vacuum freeze;
2018-07-13 16:43:58 MSK [3666-3] WARNING:  database "database_name" must 
be vacuumed within 991565 transactions
2018-07-13 16:43:58 MSK [3666-4] HINT:  To avoid a database shutdown, 
execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared 
transactions.

backend>

pg_prepared_xacts was empty.
After some poking around it became clear that some old temp table was 
holding the oldest relfrozenxid!
vacuum ignored temp tables during get_rel_oids(), but didn't when it came
to calculating the oldest relfrozenxid.

Dropping all temp schemas helped


Crude way to reproduce:

postgres=# create temp table t1();

gdb: set ShmemVariableCache->nextXid = ShmemVariableCache->xidStopLimit 
+ 100



pg_ctl stop -D PGDATA

with open('path_to_clog_file', 'w') as f:
    x = 0
    while x < 20:
        f.write(chr(1))
        x = x + 1

postgres --single -D $PGDATA

PostgreSQL stand-alone backend 9.6.9
backend> vacuum freeze;
WARNING:  database "postgres" must be vacuumed within 47 transactions
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in 
that database.
You might also need to commit or roll back old prepared 
transactions.

backend>
backend> 
backend> vacuum freeze;
backend>

I think the root cause of all temp table problems is that they are
present in the catalog. I think they should not be present in the catalog.
And vacuum probably should ignore them during the datfrozenxid calculation,
in single-user mode at least. Or just drop them in single-user mode.

And it would be good to have the advice 'drop temp schemas' in the HINT message.


--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Support for Secure Transport SSL library on macOS as OpenSSL alternative

2018-07-17 Thread Heikki Linnakangas

On 27/06/18 21:57, Daniel Gustafsson wrote:

On 27 Jun 2018, at 14:32, Daniel Gustafsson  wrote:



Attached is an updated patch for supporting the native macOS Secure Transport
library, rebased on top of current master.


Courtesy of the ever-present Murphy I managed to forget some testcode in
src/backend/Makefile which broke compilation for builds without secure
transport, attached v8 patch fixes that.


I've read through this patch now in more detail. Looks pretty good, but 
I have a laundry list of little things below. The big missing item is 
documentation.


--- laundry list begins ---

What exactly does 'ssl_is_server_start' mean? I don't understand that 
mechanism. ISTM it's only checked in load_key(), via 
be_tls_open_server(), which should only be called after be_tls_init(), 
in which case it's always 'true' when it's checked. Should there be an 
Assertion on it or something?


The "-framework" option, being added to CFLAGS, is clang specific. I 
think we need some more autoconf magic, to make this work with gcc.


In be_tls_open_server(), I believe there are several cases of using 
variables uninitialized. For example, load_identity_keychain() doesn't 
always set the 'identity' output parameter, but there's an "if (identity 
== NULL)" check after the call to it. And 'certificates' is left 
uninitialized, if load_identity_keychain() is used. Also, 'dh_len' is 
used uninitialized here in the "if (!dh_buf || dh_len == 0)" line, if 
the 'ssl_dh_params_file' option is not set. Did clang not give warnings 
about these?



+   /*
+* Certificate Revocation List are not supported in the Secure Transport
+* API
+*/


The corresponding client code has a longer comment on that:


+   /*
+* There is no API to load CRL lists in Secure Transport, they can 
however
+* be imported into a Keychain with the commandline application 
"certtool".
+* For libpq to use them, the certificate/key and root certificate 
needs to
+* be using an identity in a Keychain into which the CRL have been
+* imported. That needs to be documented.
+*/


Is it possible to programmatically create a temporary keychain, in 
memory, and load the CRL to it? (I googled a bit, and couldn't find any 
suitable API for it, so I guess not..)




+   if (ssl_crl_file[0])
+   ereport(FATAL,
+   (errmsg("CRL files not supported with Secure 
Transport")));


I think this should be COMMERROR, like other errors around this. We 
don't want to pass that error message to the client. Although I guess it 
won't reach the client as we haven't negotiated TLS yet.



+   /*
+* We use kTryAuthenticate here since we don't know which sslmode the
+* client is using. If we were to use kAlwaysAuthenticate then sslmode
+* require won't work as intended.
+*/
+   if (ssl_loaded_verify_locations)
+   SSLSetClientSideAuthenticate((SSLContextRef) port->ssl, 
kTryAuthenticate);


That comment sounds wrong. This is in backend code, and 
SSLSetClientSideAuthenticate() is all about checking the client's 
certificate in the server, while libpq 'sslmode' is about checking the 
server's certificate in the client.




+   for (;;)
+   {
+   status = SSLHandshake((SSLContextRef) port->ssl);
+   if (status == noErr)
+   break;
+
+   if (status == errSSLWouldBlock)
+   continue;
+   ...
+   }


Does this busy-wait, if there's not enough data from the client? That 
seems bad. In the corresponding client-side code, there's a comment on 
that too, but it's still bad.


In be_tls_get_version and PQsslAttribute, can we add support for 
kTLSProtocol13? Perhaps with an autoconf test, if the constant is not 
available on all macOS versions.


In be_tls_get_peerdn_name, wouldn't SecCertificateCopyCommonName() be 
more appropriate than SecCertificateCopyLongDescription()?


In be_tls_get_cipher, returning "unknown" would seem better than 
erroring out, if the cipher isn't recognized.


Check error message style. eg.:


+   ereport(COMMERROR,
+   (errmsg("could not load server certificate \"%s\": 
\"%s\"",
+   ssl_cert_file, 
pg_SSLerrmessage(status;



Why is the result of pg_SSLerrmessage() in quotes? Maybe errdetail?


+ * Any consumers of the Keychain array should always call this to ensure that
+ * it is set up in a manner that reflect the configuration. If it not, then


s/reflect/reflects/


+   else if (keychain_count == 2)
+   {
+   if (ssl_keychain_file[0] == '\0' || !ssl_keychain_use_default)
+   goto error;
+
+   return;
+   }
+   else
+   /* Fallthrough to erroring out */
+
+error:
+   ereport(FATAL,
+   

Re: PG 10: could not generate random cancel key

2018-07-17 Thread Dean Rasheed
On 17 July 2018 at 14:04, Michael Paquier  wrote:
> On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote:
>> Looking for precedents elsewhere, I found [2] which does exactly that,
>> although I'm slightly dubious about the need for the for-loop there. I
>> also found a thread [3], which recommends simply doing
>>
>> if (RAND_status() == 0)
>> RAND_poll();
>>
>> which seems preferable. Attached is a patch to do this in pg_strong_random().
>
> Checking for the return result of RAND_poll() would also be good thing
> to do.  From what I read in OpenSSL code it could fail as well, and
> we could combine that with a loop attempted to feed the machinery a
> decided amount of times, just failing after successive failures.

From what I understand from here [1], some parts of OpenSSL call
RAND_poll() once on initialisation, and that's enough to get the PRNG
going. It's not obvious that calling it multiple times would have any
benefit.

They also don't appear to bother checking the return code from
RAND_poll() [2]. If it did fail, there'd not be much you could do
anyway, so you might as well just let it continue and let RAND_bytes()
fail. In fact it may even be possible for RAND_poll() to fail, but
just do enough to cause RAND_bytes() to succeed.

Regards,
Dean


[1] https://wiki.openssl.org/index.php/Random_Numbers
[2] 
https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c



Re: [bug fix] Produce a crash dump before main() on Windows

2018-07-17 Thread Haribabu Kommi
On Thu, Mar 1, 2018 at 4:14 PM Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> > Another idea to add to the current patch is to move the call to
> SetErrorMode()
> > to the below function, which is called first in main().  How about this?
> >
> > void
> > pgwin32_install_crashdump_handler(void)
> > {
> >   SetUnhandledExceptionFilter(crashDumpHandler);
> > }
>
> I moved SetErrorMode() to the beginning of main().  It should be placed
> before any code which could crash.  The current location is a bit late: in
> fact, write_stderr() crashed when WSAStartup() failed.
>

I reviewed the patch and it works as described in the subject, but I am not
able to verify the actual bug reported upthread. Moving the SetErrorMode()
call to the start of the main() function can handle all the cases that can
lead to a crash with no popup.

Resetting SetErrorMode(0) before the start of any process makes it possible
to generate a core dump for any issue that occurs before control reaches the
main() function, but couldn't this change stall the postmaster restart
process if no one is present to observe the scenario? And couldn't it create
backward-compatibility problems for particular users?

I don't have any other comments with the current patch.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: PG 10: could not generate random cancel key

2018-07-17 Thread Michael Paquier
On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote:
> Looking for precedents elsewhere, I found [2] which does exactly that,
> although I'm slightly dubious about the need for the for-loop there. I
> also found a thread [3], which recommends simply doing
> 
> if (RAND_status() == 0)
> RAND_poll();
> 
> which seems preferable. Attached is a patch to do this in pg_strong_random().

Checking the return result of RAND_poll() would also be a good thing
to do.  From what I read in the OpenSSL code it could fail as well, and
we could combine that with a loop that attempts to feed the machinery a
set number of times, only failing after successive failures.
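
Something along these lines, perhaps (untested sketch; the function name is
made up and the retry count is arbitrary):

    #include <stdbool.h>
    #include <openssl/rand.h>

    /* try to get the OpenSSL PRNG seeded, giving up after a few attempts */
    static bool
    ensure_openssl_prng_seeded(void)
    {
        int         i;

        for (i = 0; i < 8; i++)
        {
            if (RAND_status() == 1)
                return true;
            (void) RAND_poll();     /* may fail; the next RAND_status() decides */
        }

        return RAND_status() == 1;
    }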
--
Michael




Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-07-17 Thread Tomas Vondra

On 07/17/2018 11:09 AM, Dean Rasheed wrote:

On 16 July 2018 at 21:55, Tomas Vondra  wrote:



...
>>

So, how would the proposed algorithm work? Let's start with "a=1":

sel(a=1) = 0.1508

I don't see much point in applying the two "b" clauses independently (or
how would it be done, as it's effectively a single clause):

sel(b=1 or b=2) = 0.0673

And we get

total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101

 From the multivariate MCV we get

mcv_sel = 0.0399

And finally

total_sel = Max(total_sel, mcv_sel) = 0.0399

Which is great, but I don't see how that actually helped anything? We
still only have the estimate for the ~7% covered by the MCV list, and
not the remaining non-MCV part.



Right. If these are the only stats available, and there are just 2
top-level clauses like this, it either returns the MCV estimate, or
the old univariate estimate (whichever is larger). It avoids
over-estimating, but will almost certainly under-estimate when the MCV
matches are incomplete.



Yeah :-(

In my experience under-estimates tend to have much worse consequences 
(say a nested loop chosen by under-estimate vs. hash join chosen by 
over-estimate). This certainly influenced some of the choices I've made 
in this patch (extrapolation to non-MCV part is an example of that), but 
I agree it's not a particularly scientific approach and I'd very much want 
something better.





I could do the same thing for the second query, but the problem there is
actually exactly the same - extrapolation from MCV to non-MCV part
roughly doubles the estimate.

So unless I'm applying your algorithm incorrectly, this does not seem
like a very promising direction :-(



You could be right. Actually it's the order dependence with more than
2 top-level clauses that bothers me most about this algorithm. It's
also not entirely obvious how to include histogram stats in this
scheme.



I think for inequalities that's fairly simple - histograms work pretty 
well for that, and I have a hunch that replacing the 0.5 estimate for 
partially-matching buckets with convert_to_scalar-like logic and the 
geometric mean (as you proposed) will work well enough.
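
(By convert_to_scalar-like logic I mean plain linear interpolation within the
partially-matched bucket instead of the flat 0.5, i.e. roughly this toy
function; just a sketch with a made-up name, not patch code:)

    /* fraction of a histogram bucket [lo, hi) assumed to satisfy "x < constval" */
    static double
    bucket_fraction_below(double lo, double hi, double constval)
    {
        if (constval <= lo)
            return 0.0;
        if (constval >= hi)
            return 1.0;
        /* linear interpolation, as ineq_histogram_selectivity does for scalars */
        return (constval - lo) / (hi - lo);
    }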


For equalities it's going to be hard. The only thing I can think of at 
the moment is checking if there are any matching buckets at all, and 
using that to decide whether to extrapolate the MCV selectivity to the 
non-MCV part or not (or perhaps to what part of the non-MCV part).



A different approach that I have been thinking about is, in
mcv_update_match_bitmap(), attempt to work out the probability of all
the clauses matching and it not being an MCV value. For example, given
a clause like a=1 whose univariate estimate we know (0.1508 in the
above example), knowing that the MCV values account for 0.0222+0.0177
of that, the remainder must be from non-MCV values. So we could say
that the probability that a=1 and it not being and MCV is
0.1508-0.0222-0.0177 = 0.1109. So then the question is could we
combine these non-MCV probabilities to give an estimate of how many
non-MCV values we need to worry about? I've not fully thought that
through, but it might be useful.


Could we use it to derive some upper boundaries on the non-MCV part?


The problem is, it's still likely to
over-estimate when the MCV matches fully cover all possibilities, and
I still don't see a way to reliably detect that case.



I guess we can use a histogram to limit the over-estimates, as explained 
above. It may not be 100% reliable as it depends on the sample and how 
exactly the buckets are formed, but it might help.


But perhaps these over-estimates aren't a particularly serious issue? When 
you already get matches in a MCV, the number of matching rows is going 
to be pretty significant. If you over-estimate 2x, so what? IMHO that's 
still pretty accurate estimate.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable Storage - Andres's take

2018-07-17 Thread Haribabu Kommi
On Mon, Jul 16, 2018 at 11:35 PM Andres Freund  wrote:

> On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote:
> > On Tue, Jul 3, 2018 at 5:06 PM Andres Freund  wrote:
>
> > > The most fundamental issues I had with Haribabu's last version from [2]
> > > are the following:
> > >
> > > - The use of TableTuple, a typedef from void *, is bad from multiple
> > >   fronts. For one it reduces just about all type safety. There were
> > >   numerous bugs in the patch where things were just cast from HeapTuple
> > >   to TableTuple to HeapTuple (and even to TupleTableSlot).  I think
> it's
> > >   a really, really bad idea to introduce a vague type like this for
> > >   development purposes alone, it makes it way too hard to refactor -
> > >   essentially throwing the biggest benefit of type safe languages out
> of
> > >   the window.
> > >
> >
> > My earlier intention was to remove the HeapTuple usage entirely and
> replace
> > it with slot everywhere outside the tableam. But it ended up with
> TableTuple
> > before it reach to the stage because of heavy use of HeapTuple.
>
> I don't think that's necessary - a lot of the system catalog accesses
> are going to continue to be heap tuple accesses. And the conversions you
> did significantly continue to access TableTuples as heap tuples - it was
> just that the compiler didn't warn about it anymore.
>
> A prime example of that is the way the rewriteheap / cluster
> integreation was done. Cluster continued to sort tuples as heap tuples -
> even though that's likely incompatible with other tuple formats which
> need different state.
>

OK. Understood.


> > > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to
> > >   a higher memory usage, because the resulting tuples weren't freed or
> > >   anything. There might be a reason for doing such a change - we've
> > >   certainly discussed that before - but I'm *vehemently* against doing
> > >   that at the same time we introduce pluggable storage. Analyzing the
> > >   performance effects will be hard enough without changes like this.
> > >
> >
> > How about using of slot instead of tuple and reusing of it? I don't know
> > yet whether it is possible everywhere.
>
> Not quite sure what you mean?
>

I thought using slots everywhere could reduce the use of heap_copytuple();
I understand from your reply in the other mail that you have already made
those changes.


> > Tasks / Questions:
> > >
> > > - split up patch
> > >
> >
> > How about generating refactoring changes as patches first based on
> > the code in your repo as discussed here[1]?
>
> Sure - it was just at the moment too much work ;)
>

Yes, it is too much work. How about doing this once most of the
open items are finished?


>
> > > - Change heap table AM to not allocate handler function for each table,
> > >   instead allocate it statically. Avoids a significant amount of data
> > >   duplication, and allows for a few more compiler optimizations.
> > >
> >
> > Some kind of static handler variable for each tableam, but we need to check
> > how we can access that static handler from the relation.
>
> I'm not sure what you mean by "how can we access"? We can just return a
> pointer from the constant data from the current handler? Except that
> adding a bunch of consts would be good, the external interface wouldn't
> need to change?
>

I mean we may need to store some tableam ID in each table, so that based on
that ID we get the static tableam handler, because at any given time there may
be tables using different tableam methods.


>
> > > - change scan level slot creation to use tableam function for doing so
> > > - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid
> > >
> >
> > so with this there shouldn't be a way to map from slot to tid, or there
> > should be some other way.
>
> I'm not sure I follow?
>

Regarding replacing heap tuples with slots: currently only the tid is passed
to the tableam methods via the slot. To get rid of the tid from the slot, we
may need some other way of passing it?


> > - bitmap index scans probably need a new tableam.h callback, abstracting
> > >   bitgetpage()
> > >
> >
> > OK.
>
> Any chance you could try to tackle this?  I'm going to be mostly out
> this week, so we'd probably not run across each other's feet...
>


OK, I will take care of the above point.

Regards,
Haribabu Kommi
Fujitsu Australia


Re[2]: Alter index rename concurrently to

2018-07-17 Thread Andrey Klychkov
> Monday, 16 July 2018, 22:19 +03:00 from Andrey Borodin 
> :
>
>Hi!
>
>> On 16 July 2018 at 22:58, Andrey Klychkov < aaklych...@mail.ru > wrote:
>> Dear hackers!
>> 
>> I have an idea to facilitate work with index rebuilding.
>> 
>> "ALTER INDEX ... RENAME CONCURRENTLY TO ..."
>
>The idea seems useful. I'm not an expert in CIC, but your post does not cover 
>one very important topic. When we use CREATE INDEX CONCURRENTLY we pay for the 
>less intrusive lock by scanning the data twice. What is the price of RENAME 
>CONCURRENTLY? Should this mode be the default?
Hello and thank you for the answer!

I don't think "alter index ... rename concurrently to ..." has large overheads
cause it just locks table and copies a new name instead of an old name
to the field relform->relname of the FormData_pg_class struct.

./src/include/catalog/pg_class.h: typedef of FormData_pg_class
./src/backend/commands/tablecmds.c: RenameRelation() and 
RenameRelationInternal()

As I wrote before, in my patch I've added the option to change the locking from
AccessShareLock to ShareUpdateExclusiveLock
by passing "concurrently" in "alter"; it's similar to the way the
index_drop()/index_create()
functions do it.

It changes only one name field and nothing more.
I believe it is safe.
Also, I ran tests again: select/insert queries in loops in several sessions
while changing an index name concurrently in parallel many times.
After that, the index was valid and its index_scan count had increased.
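
Roughly, the check was of this shape (just a sketch with made-up object names;
the CONCURRENTLY syntax is the one proposed by the patch):

  -- sessions 1..N, running in loops:
  SELECT count(*) FROM test_tab WHERE id = 1;
  INSERT INTO test_tab (id) VALUES (1);

  -- another session, renaming back and forth many times in parallel:
  ALTER INDEX test_tab_id_idx RENAME CONCURRENTLY TO test_tab_id_idx_tmp;
  ALTER INDEX test_tab_id_idx_tmp RENAME CONCURRENTLY TO test_tab_id_idx;

  -- afterwards: the index is still valid and has been used
  SELECT indisvalid FROM pg_index
   WHERE indexrelid = 'test_tab_id_idx'::regclass;
  SELECT idx_scan FROM pg_stat_user_indexes
   WHERE indexrelname = 'test_tab_id_idx';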

However, I don't have much experience and maybe I am wrong.

-- 
Kind regards,
Andrey Klychkov


Re: Make foo=null a warning by default.

2018-07-17 Thread Fabien COELHO


Hello David,

A few comments about this v2.

ISTM that there is quite strong opposition to having "warn" as the default, 
so probably you should set it to "off"?



  transform_null_equals_options[] = { [...]

I do not really understand the sort order of this array. Maybe it could be
alphabetical, or per value? Or maybe it is standard values and then
synonyms, in which case maybe add a comment at the end of the list.


I've changed it to per value. Is this better?


At least, I can see the logic.


Or anything in better English.


I've changed this, and hope it's clearer.


Yep, and more elegant than my proposal.


* Test

 +select 1=null;
 + ?column?

Maybe use AS to describe the expected behavior, so that it is easier to
check, and I think that "?column?" looks silly, e.g.:

  select 1=null as expect_{true,false}[_and_a_warning/error];


Changed to more descriptive headers.


Cannot see it in the patch update?


   create table cnullchild (check (f1 = 1 or f1 = null)) inherits(cnullparent);
  +WARNING:  = NULL can only produce a NULL

I'm not sure of this test case. Should it be turned into "is null"?


This was just adjusting the existing test output to account for the
new warning. I presume it was phrased that way for a reason.


Hmmm. Probably you are right. I think that the test case deserves better 
comments, as it is most peculiar. It was added by Tom for testing CHECK 
constant NULL expression simplifications.


TRUE OR NULL is TRUE, but FALSE OR NULL is NULL. Why not. Combined with 
the fact that NULL is considered ok for a CHECK, this leads to strange but 
intentional behavior, such as:


  -- this CHECK is always ignored in practice
  CREATE TABLE foo (i INT CHECK(i = 1 OR NULL));
  INSERT INTO foo(i) VALUES(2); # ACCEPTED

  -- but this one is not
  CREATE TABLE foo (i INT CHECK(i = 1 OR FALSE));
  INSERT INTO foo(i) VALUES(2); # REFUSED

Can't say I'm thrilled about that, and the added warning looks 
appropriate.
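
For reference, the three-valued logic behind that example can be checked
directly (a quick illustration only, not part of the patch):

  SELECT (1 = 1 OR NULL)          AS true_or_null;           -- TRUE
  SELECT (2 = 1 OR NULL) IS NULL  AS false_or_null_is_null;  -- true
  -- and a NULL CHECK result counts as "passed", hence the always-accepted
  -- constraint above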


--
Fabien.

PG 10: could not generate random cancel key

2018-07-17 Thread Dean Rasheed
Last week I upgraded 15 servers from various pre-10 versions to 10.4.
At first everything looked OK, but then (around 4 days later) one of
them failed with this in the logs:

2018-07-14 01:53:35.840 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:37.233 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:37.245 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:38.553 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:38.581 BST  LOG:  could not generate random cancel key
2018-07-14 01:54:43.851 BST  WARNING:  worker took too long to start; canceled
2018-07-14 01:54:43.862 BST  LOG:  could not generate random cancel key
2018-07-14 01:55:09.861 BST  LOG:  could not generate random cancel key
2018-07-14 01:55:09.874 BST  LOG:  could not generate random cancel key
...

After that it would not accept any new connections until I restarted
postmaster a few hours later. Since then, it has been OK.

It was built using --with-openssl and strong random support enabled,
so it was OpenSSL's RAND_bytes() that failed for some reason. I
attempted to reproduce it with a small C program directly calling
RAND_bytes(), but it refused to fail, even if I disabled haveged and
ran my tests in an @reboot cron job. So this failure is evidently
quite rare, but the documentation for RAND_bytes() says it *can* fail
(returning 0) if it isn't seeded with enough entropy, in which case
more must be added, which we're not doing.

In addition, once it does fail, repeated calls to RAND_bytes() will
continue to fail if it isn't seeded with more data -- hence the
inability to start any new backends until after a postmaster restart,
which is not a very friendly failure mode.

The OpenSSL documentation suggests that we should use RAND_status()
[1] to check that the generator has been seeded with enough data:

RAND_status() indicates whether or not the CSPRNG has been sufficiently
seeded. If not, functions such as RAND_bytes(3) will fail.

and if not, RAND_poll() can be used to fix that:

RAND_poll() uses the system's capabilities to seed the CSPRNG using
random input obtained from polling various trusted entropy sources. The
default choice of the entropy source can be modified at build time using
the --with-rand-seed configure option, see also the NOTES section. A
summary of the configure options can be displayed with the OpenSSL
version(1) command.

Looking for precedents elsewhere, I found [2] which does exactly that,
although I'm slightly dubious about the need for the for-loop there. I
also found a thread [3], which recommends simply doing

if (RAND_status() == 0)
RAND_poll();

which seems preferable. Attached is a patch to do this in pg_strong_random().

Thoughts?

Regards,
Dean


[1] https://www.openssl.org/docs/man1.1.1/man3/RAND_status.html
[2] https://github.com/nodejs/node/blob/master/src/node_crypto.cc
[3] https://github.com/openssl/openssl/issues/4148
diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
new file mode 100644
index bc7a8aa..fd5ad92
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -103,6 +103,9 @@ pg_strong_random(void *buf, size_t len)
 	 * When built with OpenSSL, use OpenSSL's RAND_bytes function.
 	 */
 #if defined(USE_OPENSSL_RANDOM)
+	/* Make sure that OpenSSL's CSPRNG has been sufficiently seeded */
+	if (RAND_status() == 0)
+		(void) RAND_poll();
 	if (RAND_bytes(buf, len) == 1)
 		return true;
 	return false;


Re: [HACKERS] WAL logging problem in 9.4.3?

2018-07-17 Thread Andrew Dunstan




On 07/16/2018 08:01 PM, Michael Paquier wrote:



I also doubt that we'd be able to catch all the holes, as the
conditions under which the optimization can be run safely are
basically impossible to catch beforehand.  I'd like to vote for getting
rid of this optimization for COPY; it can hurt more than it is
helpful.  Per the lack of complaints, this could happen only in HEAD?



Well, we'd be getting rid of it because of a danger of data loss which 
we can't otherwise mitigate. Maybe it does need to be backpatched, even 
if we haven't had complaints.
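
For context, the pattern that exercises the optimization in question is
roughly the following (assuming wal_level = minimal; the table and path names
are made up):

  BEGIN;
  CREATE TABLE t (a int, b text);        -- or TRUNCATE of an existing table
  COPY t FROM '/path/to/data.csv' CSV;   -- may skip WAL, syncing the relation
                                         -- at the end instead
  COMMIT;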


cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Libpq support to connect to standby server as priority

2018-07-17 Thread Haribabu Kommi
On Tue, Jul 17, 2018 at 12:42 AM Laurenz Albe 
wrote:

> Haribabu Kommi wrote:
> > > On Wed, Jul 4, 2018 at 11:14 PM Laurenz Albe 
> wrote:
> > > > Haribabu Kommi wrote:
> > > >
> > > > - I think the construction with "read_write_host_index" makes the
> > > >   code even more complicated than it already is.
> > > >
> > > >   What about keeping the first successful connection open and
> > > >   storing it in a variable if we are in "prefer-read" mode.
> > > >   If we get the read-only connection we desire, close that cached
> > > >   connection, otherwise use it.
> > >
> > > Even if we add a variable to cache the connection, I don't think the
> > > logic of checking the next host for a read-only server will change, but
> > > the extra connection request to the read-write host will be removed.
> >
> > I evaluated your suggestion of caching the connection and reusing it when
> > no read-only server is found, but I think it will add more complexity and
> > also delay the connections to the other servers; the cached connection may
> > also be closed by the server because of a timeout.
> >
> > I feel the extra connection time may be fine if the user prefers the
> > prefer-read mode, instead of adding more complexity in handling the cached
> > connection?
> >
> > comments?
>
> I tested the new patch, and it works as expected.
>

Thanks for the confirmation.


> I don't think that time-out of the cached session is a valid concern,
> because that
> would have to be a really short timeout.
> On the other hand, establishing the connection twice (first to check if it
> is read-only,
> then again because no read-only connection is found) can be quite costly.
>
> But that is a matter of debate, as is the readability of the code.
>

Thanks for your opinion, let's wait for opinion from others also.
I can go for the modification, if others also find it useful.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Make foo=null a warning by default.

2018-07-17 Thread Fabien COELHO



Hello,

Yeah, I was wondering about that too. But Fabien brought up a completely new 
use-case for this: people learning SQL.


Indeed.

This year, about 30% of my students wrote "= NULL" in a query at least 
once. Probably I or a colleague was called to the rescue each time.


So this warning would save me time, which explains why I reviewed the 
patch with such enthusiasm :-)


I'm fine with keeping the current behavior as a default.

--
Fabien.



Re[2]: Alter index rename concurrently to

2018-07-17 Thread Andrey Klychkov



>Monday, 16 July 2018, 22:40 +03:00 from Victor Yegorov :
>
>>Mon, 16 Jul 2018 at 21:58, Andrey Klychkov < aaklych...@mail.ru >:
>>I made a patch to solve this issue (see the attachment).
>>It allows to avoid locks by a query like this:
>>“alter index  rename CONCURRENTLY to ”.
>
>Please, have a look at previous discussions on the subject:
>- 2012  
>https://www.postgresql.org/message-id/cab7npqtys6juqdxuczbjb0bnw0kprw8wdzuk11kaxqq6o98...@mail.gmail.com
>- 2013  
>https://www.postgresql.org/message-id/cab7npqstfkuc7dzxcdx4hotu63yxhrronv2aouzyd-zz_8z...@mail.gmail.com
>-  https://commitfest.postgresql.org/16/1276/
>
Hello,
Thank you for this information!

In the first discussion the concurrent alter was mentioned.
In the next link and the commitfest info I only saw "Reindex concurrently 2.0".
It sounds great!
If this feature is added to core it will certainly facilitate index
rebuilding.

What about "alter index ... rename to" in concurrent mode?
Does "Reindex concurrently 2.0" add it?
From time to time we just need to rename some indexes.
Without a concurrent mode this "alter index" creates lock queues.
That may be a problem on high-load databases.

-- 
Kind regards,
Andrey Klychkov


Re: [HACKERS] [PATCH] kNN for SP-GiST

2018-07-17 Thread Nikita Glukhov

Attached 7th version of the patches:
 * renamed recheck fields and variables
 * fixed formatting of the one if-statement

On 15.07.2018 14:54, Andrey Borodin wrote:


14.07.2018, 1:31, Nikita Glukhov  wrote:

Attached  6th version of the patches.
...

2. The patch leaves contribs intact. Do extensions with sp-gist opclasses
need to update their behavior somehow to be used as-is? Or to support new
functionality?

There are no SP-GiST opclasses in contrib/, so there is nothing to change in
contrib/.  Moreover, existing extensions need to be updated only for support of
new distance-ordered searches -- they need to implement distances[][] array
calculation in their spgInnerConsistent() and spgLeafConsistent() support
functions.

So, if an extension does not update anything, it will keep all preexisting 
functionality, right?


Yes, of course.


I have some more nitpicks about naming
+   recheckQual = out.recheck;
+   recheckDist = out.recheckDistances;
Can we name these two things in a consistent fashion?


I would prefer to rename "spgLeafConsitentOut.recheck" to "recheckQual" but it
is not good to rename user-visible fields, so I decided to rename all fields
and variables to "recheck"/"recheckDistances".


Also, this may be a stupid question, but do we really need to distinguish these
two recheck cases? Is it certain that we cannot merge them?


These recheck cases are absolutely different.

In the first case, opclass support functions can not accurately determine
whether the leaf value satisfies the boolean search operator (compressed
values can be stored in leaves).

In the second case, the opclass returns only an approximate value (the lower
bound) of the leaf distances.

In the example below, the operator 'polygon >> polygon' does not need a recheck
because the bounding-box test (boxes are stored in the index instead of the
polygons) gives exact results, but the ordering operator 'polygon <-> point'
does need a recheck:

SELECT * FROM polygons
WHERE poly >> polygon(box '((0,0),(1,1))')
ORDER BY poly <-> point '(0,0)';


This if-formatting seems strange:
+   /* If allTheSame, they should all or none of 'em match */
+   if (innerTuple->allTheSame)
+   if (out.nNodes != 0 && out.nNodes != nNodes)
+   elog(ERROR, "inconsistent inner_consistent results for 
allTheSame inner tuple");
+
+   if (out.nNodes) // few lines before you compare with 0


Fixed.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
index f57380a..a4345ef 100644
--- a/src/backend/utils/adt/geo_ops.c
+++ b/src/backend/utils/adt/geo_ops.c
@@ -41,7 +41,6 @@ enum path_delim
 static int	point_inside(Point *p, int npts, Point *plist);
 static int	lseg_crossing(double x, double y, double px, double py);
 static BOX *box_construct(double x1, double x2, double y1, double y2);
-static BOX *box_fill(BOX *result, double x1, double x2, double y1, double y2);
 static bool box_ov(BOX *box1, BOX *box2);
 static double box_ht(BOX *box);
 static double box_wd(BOX *box);
@@ -451,7 +450,7 @@ box_construct(double x1, double x2, double y1, double y2)
 
 /*		box_fill		-		fill in a given box struct
  */
-static BOX *
+BOX *
 box_fill(BOX *result, double x1, double x2, double y1, double y2)
 {
 	if (x1 > x2)
@@ -482,7 +481,7 @@ box_fill(BOX *result, double x1, double x2, double y1, double y2)
 /*		box_copy		-		copy a box
  */
 BOX *
-box_copy(BOX *box)
+box_copy(const BOX *box)
 {
 	BOX		   *result = (BOX *) palloc(sizeof(BOX));
 
diff --git a/src/include/utils/geo_decls.h b/src/include/utils/geo_decls.h
index a589e42..94806c2 100644
--- a/src/include/utils/geo_decls.h
+++ b/src/include/utils/geo_decls.h
@@ -182,6 +182,7 @@ typedef struct
 extern double point_dt(Point *pt1, Point *pt2);
 extern double point_sl(Point *pt1, Point *pt2);
 extern double pg_hypot(double x, double y);
-extern BOX *box_copy(BOX *box);
+extern BOX *box_copy(const BOX *box);
+extern BOX *box_fill(BOX *box, double xlo, double xhi, double ylo, double yhi);
 
 #endif			/* GEO_DECLS_H */
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index c4e8a3b..9279756 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -14,9 +14,9 @@
  */
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/relscan.h"
-#include "catalog/pg_type.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -543,7 +543,6 @@ getNextNearest(IndexScanDesc scan)
 {
 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
 	bool		res = false;
-	int			i;
 
 	if (scan->xs_hitup)
 	{
@@ -564,45 +563,10 @@ getNextNearest(IndexScanDesc scan)
 			/* found a heap item at currently minimal distance */
 			scan->xs_ctup.t_self = item->data.heap.heapPtr;
 			scan->xs_recheck = item->data.heap.recheck;
-			scan->xs_recheckorderby = 

Re: Add SKIP LOCKED to VACUUM and ANALYZE

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 2:21 AM, Michael Paquier  wrote:
> For this part, it seems to me that we can do better than what is in
> HEAD:
> - Call RangeVarGetRelidExtended without lock.

I haven't been following this work closely, but I just want to point
out that the reason why RangeVarGetRelidExtended exists is because:

(1) we cannot lock a relation without looking up its OID, and should
not lock it without doing permissions checks first, and at least as
currently designed, those functions expect an OID, but

(2) we cannot look up a relation first and only lock it afterwards,
because DDL might happen in the middle and leave us locking the wrong
relation

When RangeVarGetRelidExtended is called with an argument other than
NoLock, it effectively makes locking, permissions checking, and the
name lookup happen simultaneously, so that it is neither possible to
lock a relation for which you don't have permissions, nor to change
the identity of the relation after the name lookup has been done and
before the lock is acquired.  On the other hand, when you call it with
NoLock, you don't get those nice behaviors.

So I'm inclined to think that any place in the source code that calls
RangeVarGetRelidExtended with NoLock is a bug, and we should be trying
to get rid of the cases we still have, not adding more.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Refactor documentation for wait events (Was: pgsql: Add wait event for fsync of WAL segments)

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 12:19 AM, Michael Paquier  wrote:
> And the patch previously sent removes them, but perhaps I am missing
> your point?

I was just confused.  Sorry for the noise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: foreign key to foreign table

2018-07-17 Thread Ashutosh Bapat
On Tue, Jul 17, 2018 at 12:13 PM, Kaye Ann Ignacio
 wrote:
> Hi,
>
> I have a foreign table created with postgres_fdw and with that, I tried to
> create a local table to reference the foreign table in order to set a
> foreign key constraint in my local table but I get an error message saying
> that the referenced table is not a table. Is there a way I can reference a
> foreign table?

I don't think so. Since the data in a foreign table resides on a
foreign server and can be manipulated without the local server knowing
about it, it's very possible that such foreign key constraints would
become invalid without the local server knowing about it.
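
For reference, a minimal sketch of what fails (hypothetical object names,
assuming a postgres_fdw foreign server is already set up):

  CREATE FOREIGN TABLE remote_customers (id int, name text)
    SERVER remote_srv OPTIONS (table_name 'customers');

  CREATE TABLE orders (
    id          serial PRIMARY KEY,
    customer_id int REFERENCES remote_customers (id)
  );
  -- fails with something like:
  -- ERROR:  referenced relation "remote_customers" is not a table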

This looks like a question for general mailing list to me.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-17 Thread Peter Eisentraut
On 17.07.18 08:58, Michael Paquier wrote:
> Hm...  I am wondering if we actually want the "auto" mode where we make
> the option smarter automatically.  I am afraid of users trying to use it
> and being surprised that there is no gain while they expected some.  I
> would rather switch that to an on/off switch, which defaults to "off",
> or simply what is available now.  huge_pages=try was a bad idea as the
> result is not deterministic, so I would not have more of that...

Why do you think that was a bad idea?  Doing the best possible job by
default without requiring explicit configuration in each case seems like
an attractive feature.

> Putting CloneFile and check_reflink in a location that other frontend
> binaries could use would be nice, like pg_rewind.

This could be done in subsequent patches, but the previous iteration of
this patch for CREATE DATABASE integration already showed that each of
those cases needs separate consideration.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: 65279 Invisible ASCII Character

2018-07-17 Thread Christoph Moench-Tegeder
## ramsiddu007 (ramsiddu...@gmail.com):

>  If i remove first character it's run. That first
> character is invisible, I have checked that *ascii* value, it is *65279*.

That's not an ASCII value; ASCII is only 7 bits, so its values stop at 127.
What you've got there is a Unicode Byte Order Mark: 65279 is 0xfeff
(one of the well-known constants).
I'd suggest you configure your editor to write files without a
BOM. Maybe there's a workaround via locale settings - but I have
no machine with a UTF-16 locale available. Another approach is running
recode on your files before concatenating.
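
As a quick sanity check from the database side (assuming a UTF-8 database),
65279 is indeed that code point:

  SELECT to_hex(65279)           AS hex,     -- feff
         chr(65279) = U&'\FEFF'  AS is_bom;  -- true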

This question isn't really for pgsql-hackers - I'm redirecting to -general.

Regards,
Christoph

-- 
Spare Space



Re: [HACKERS] Restricting maximum keep segments by repslots

2018-07-17 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 17 Jul 2018 13:37:59 +0900, Masahiko Sawada  
wrote in 
> >> The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the
> >> destination at 4th argument the wal_segment_size will be changed in
> >> the above expression. The regression tests by PostgreSQL Patch Tester
> >
> > I'm not sure I get this correctly, the definition of the macro is
> > as follows.
> >
> > | #define XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes) \
> > |   (dest) = (segno) * (wal_segsz_bytes) + (offset)
> >
> > The destination is the *3rd* parameter and the forth is segment
> > size which is not to be written.
> 
> Please see commit a22445ff0b which flipped input and output arguments.
> So maybe you need to rebase the patches to current HEAD.

Mmm. Thanks. I hadn't realized such a change had happened, but it was
accidentally taken out in the latest patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




foreign key to foreign table

2018-07-17 Thread Kaye Ann Ignacio
Hi,

I have a foreign table created with postgres_fdw and with that, I tried to
create a local table to reference the foreign table in order to set a
foreign key constraint in my local table but I get an error message saying
that the referenced table is not a table. Is there a way I can reference a
foreign table?

Thank you.

-- 
Kaye Ann Ignacio, Programmer
proceedit "the BPaaS company"
kaye.igna...@proceedit.com +34 679188011 (mobile)


65279 Invisible ASCII Character

2018-07-17 Thread ramsiddu007
Hi all,
  Today I hit a problem: I have saved more than one procedure in a
folder and then concatenated those files into a single file with a
shell script. When I try to run that single file on my server, it shows a
syntax error. If I remove the first character it runs. That first
character is invisible; I have checked its *ascii* value, it is *65279*.
How can I overcome this problem? I want to run my single file without this
error.

With the command below I am concatenating the individual script files into a single one:
*:> cat *.* > .sql*

select ascii('');

select chr(65279);

select length('');

Please give me suggestions; this is a blocker for my work.


Thanking you,

-- 
*Best Regards:*
Ramanna Gunde


Re: [HACKERS] WAL logging problem in 9.4.3?

2018-07-17 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 16 Jul 2018 16:14:09 -0400, Alvaro Herrera  
wrote in <20180716201409.2qfcneo4qkdwjvpv@alvherre.pgsql>
> On 2018-Jul-12, Heikki Linnakangas wrote:
> 
> > > > Thanks for the pointer.  My tap test has been covering two out of
> > > > the three scenarios you have in your script.  I have been able to
> > > > convert the extra as the attached, and I have added as well an
> > > > extra test with TRUNCATE triggers.  So it seems to me that we want
> > > > to disable the optimization if any type of trigger is defined on
> > > > the relation copied to as it could be possible that these triggers
> > > > work on the blocks copied as well, for any BEFORE/AFTER and
> > > > STATEMENT/ROW triggers.  What do you think?
> > > 
> > > Yeah, this seems like the only sane approach.
> > 
> > Doesn't have to be a trigger, could be a CHECK constraint, datatype
> > input function, etc. Admittedly, having a datatype input function that
> > inserts to the table is worth a "huh?", but I'm not feeling very confident
> > that we can catch all such cases, and some of them might even be
> > sensible.
> 
> A counterexample could be a JSON compression scheme that uses a catalog
> for a dictionary of keys.  Hasn't this been described already?  Also not
> completely out of the question for GIS data, I think (Not sure if
> PostGIS does this already.)

In the third case, IIUC, disabling bulk-insertion after any
WAL-logged insertion has happened seems to work. The attached diff to
the v2 patch makes the three TAP tests pass. It uses the relcache to
store the last insertion XID, which will not be invalidated
during a COPY operation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 72395a50b8..e5c651b498 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2509,6 +2509,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 
 	MarkBufferDirty(buffer);
 
+	/*
+	 * Bulk insertion is not safe after a WAL-logged insertion in the same
+	 * transaction. We don't start bulk insertion under inhibiting conditions,
+	 * but we also need to cancel WAL-skipping in the case where a WAL-logged
+	 * insertion happens during a bulk insertion. This can be caused by anything
+	 * that inserts a tuple during bulk insertion, such as triggers,
+	 * constraints or type conversions. We need not worry about a relcache flush
+	 * happening while a bulk insertion is running.
+	 */
+	if (relation->last_logged_insert_xid == xid)
+		options &= ~HEAP_INSERT_SKIP_WAL;
+
 	/* XLOG stuff */
 	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
 	{
@@ -2582,6 +2594,12 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
 		PageSetLSN(page, recptr);
+
+		/*
+		 * If this happens during a bulk insertion, stop WAL skipping for the
+		 * rest of the current command.
+		 */
+		relation->last_logged_insert_xid = xid;
 	}
 
 	END_CRIT_SECTION();
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 7674369613..7b9a7af2d2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2416,10 +2416,8 @@ CopyFrom(CopyState cstate)
 	{
 		hi_options |= HEAP_INSERT_SKIP_FSM;
 
-		if (!XLogIsNeeded() &&
-			cstate->rel->trigdesc == NULL &&
-			RelationGetNumberOfBlocks(cstate->rel) == 0)
-			hi_options |= HEAP_INSERT_SKIP_WAL;
+		if (!XLogIsNeeded() && RelationGetNumberOfBlocks(cstate->rel) == 0)
+ 			hi_options |= HEAP_INSERT_SKIP_WAL;
 	}
 
 	/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 30a956822f..34a692a497 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1575,6 +1575,9 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged,
 		{
 			/* Immediate, non-rollbackable truncation is OK */
 			heap_truncate_one_rel(rel);
+
+			/* Allow bulk-insert */
+			rel->last_logged_insert_xid = InvalidTransactionId;
 		}
 		else
 		{
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6125421d39..99fb7e1dd8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1243,6 +1243,8 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	/* It's fully valid */
 	relation->rd_isvalid = true;
 
+	relation->last_logged_insert_xid = InvalidTransactionId;
+
 	return relation;
 }
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c97f9d1b43..6ee575ad14 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,9 @@ typedef struct RelationData
 
 	/* use "struct" here to avoid needing to include pgstat.h: */
 	struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+	/* XID of the last transaction on which WAL-logged insertion happened */
+	TransactionId		last_logged_insert_xid;
 } RelationData;
 
 


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-07-17 Thread Dean Rasheed
On 16 July 2018 at 21:55, Tomas Vondra  wrote:
>
>
> On 07/16/2018 02:54 PM, Dean Rasheed wrote:
>> On 16 July 2018 at 13:23, Tomas Vondra  wrote:
> The top-level clauses allow us to make such deductions, with deeper
> clauses it's much more difficult (perhaps impossible). Because for
> example with (a=1 AND b=1) there can be just a single match, so if we
> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
> b=2)) it's not that simple, because there may be multiple combinations
> and so a match in MCV does not guarantee anything.

 Actually, it guarantees a lower bound on the overall selectivity, and
 maybe that's the best that we can do in the absence of any other
 stats.

>>> Hmmm, is that actually true? Let's consider a simple example, with two
>>> columns, each with just 2 values, and a "perfect" MCV list:
>>>
>>> a | b | frequency
>>>---
>>> 1 | 1 | 0.5
>>> 2 | 2 | 0.5
>>>
>>> And let's estimate sel(a=1 & b=2).
>>
>> OK. In this case, there are no MCV matches, so there is no lower bound
>> (it's 0).
>>
>> What we could do though is also impose an upper bound, based on the
>> sum of non-matching MCV frequencies. In this case, the upper bound is
>> also 0, so we could actually say the resulting selectivity is 0.
>>
>
> Hmmm, it's not very clear to me how we would decide which of these cases
> applies, because in most cases we don't have an MCV list covering 100% of rows.
>
> Anyway, I've been thinking about how to modify the code to work the way
> you proposed (in a way sufficient for a PoC). But after struggling with
> it for a while it occurred to me it might be useful to do it on paper
> first, to verify how it would work on your examples.
>
> So let's use this data
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,5);
> insert into foo select 1,2 from generate_series(1,4);
> insert into foo select 1,x/10 from generate_series(30,25) g(x);
> insert into foo select 2,1 from generate_series(1,3);
> insert into foo select 2,2 from generate_series(1,2);
> insert into foo select 2,x/10 from generate_series(30,50) g(x);
> insert into foo select 3,1 from generate_series(1,1);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,60) g(x);
> insert into foo select x,x/10 from generate_series(4,75) g(x);
>
> Assuming we have perfectly exact statistics, we have these MCV lists
> (both univariate and multivariate):
>
>   select a, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a order by 2 desc;
>
>  a| count  | frequency
>   ++---
> 3 | 614998 |0.2727
> 2 | 549971 |0.2439
> 1 | 339971 |0.1508
>  1014 |  1 |0.
> 57220 |  1 |0.
> ...
>
>   select b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by b order by 2 desc;
>
>  b| count | frequency
>   +---+---
> 1 | 90010 |0.0399
> 2 | 65010 |0.0288
> 3 |31 |0.
> 7 |31 |0.
>...
>
>   select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a, b order by 3 desc;
>
>  a|   b| count | frequency
>   ++---+---
> 1 |  1 | 5 |0.0222
> 1 |  2 | 4 |0.0177
> 2 |  1 | 3 |0.0133
> 2 |  2 | 2 |0.0089
> 3 |  1 | 1 |0.0044
> 3 |  2 |  5000 |0.0022
> 2 |  12445 |10 |0.
> ...
>
> And let's estimate the two queries with complex clauses, where the
> multivariate stats gave 2x overestimates.
>
> SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
> -- actual 9, univariate: 24253, multivariate: 181091
>
>univariate:
>
>  sel(a=1) = 0.1508
>  sel(b=1) = 0.0399
>  sel(b=2) = 0.0288
>  sel(b=1 or b=2) = 0.0673
>
>multivariate:
>  sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)
>
> The second multivariate estimate comes from assuming only the first 5
> items make it to the multivariate MCV list (covering 6.87% of the data)
> and extrapolating the selectivity to the non-MCV data too.
>
> (Notice it's about 2x the actual selectivity, so this extrapolation due
> to not realizing the MCV already contains all the matches is pretty much
> responsible for the whole over-estimate).
>

Agreed. I think the actual MCV stats I got included the first 6
entries, but yes, that's only around 7% of the data.


> So, how would the proposed algorithm work? Let's start with "a=1":
>
>sel(a=1) = 0.1508
>
> I don't see much point in applying the two "b" clauses independently (or
> how would it be done, as it's effectively a single clause):
>
>sel(b=1 or b=2) = 0.0673
>
> And we get
>
>total_sel = 

Re: missing toast table for pg_policy

2018-07-17 Thread Michael Paquier
On Tue, Jul 17, 2018 at 02:55:07PM +0700, John Naylor wrote:
> I didn't dig deeper, since TOAST and the large object facility are
> mutually exclusive so there shouldn't be a toast table here anyway.
> Hope this helps.

[... digging ...]
This comes from get_rel_infos where large objects are treated as user
data.  Rather than the comment you added, I would rather do the
following:
"Large object catalogs and toast tables are mutually exclusive and large
object data is handled as user data by pg_upgrade, which would cause
failures."
--
Michael




Re: missing toast table for pg_policy

2018-07-17 Thread John Naylor
On 7/17/18, Michael Paquier  wrote:
> I was just having a second look at this patch, and did a bit more tests
> with pg_upgrade which passed.
>
> +-- 2. pg_largeobject and pg_largeobject_metadata, to avoid problems
> +-- with pg_upgrade
> John, what's actually the failure that was seen here?  It would be nice
> to see this patch committed but the reason here should be more
> explicit about why this cannot happen.

I'll copy what I wrote upthread last month:

On 6/19/18, John Naylor  wrote:
> On 2/20/18, Michael Paquier  wrote:
>> Regression tests of pg_upgrade are failing as follows:
>> New cluster database "postgres" is not empty
>> Failure, exiting
>> + rm -rf /tmp/pg_upgrade_check-Xn0ZLe
>
> I looked into this briefly. The error comes from
> check_new_cluster_is_empty() in src/bin/pg_upgrade/check.c, which
> contains the comment
>
> /* pg_largeobject and its index should be skipped */

I didn't dig deeper, since TOAST and the large object facility are
mutually exclusive so there shouldn't be a toast table here anyway.
Hope this helps.

-John Naylor



Re: Another usability issue with our TAP tests

2018-07-17 Thread Michael Paquier
On Mon, Jul 16, 2018 at 01:13:36PM -0400, Tom Lane wrote:
> Since "make check-world" is rather chatty, if you get a failure while
> running it under high parallelism, the location of the failure has often
> scrolled off the terminal window by the time all the other subjobs
> exit.

Yes, that has been bothering me a bit lately as well.

> This is not a huge problem for tests using our traditional infrastructure,
> because you can just run "git status" to look for regression.diffs files.
> But a TAP test failure leaves nothing behind that git will consider
> unusual.  I've repeatedly had to run check-world with no parallelism
> (wasting many minutes) in order to locate which test actually failed.

Even for tests currently running, regression.diffs is created but remains
empty.

> I'm not sure about a good way to improve this.  One idea that comes
> to mind is to tweak the "make check" rules so that the tmp_check
> subdirectories are automatically deleted on successful completion,
> but not on failure, and then remove tmp_check from the .gitignore lists.
> But the trouble with that is sometimes you want to look at the test logs
> afterwards, even when make thought the test succeeded.

I definitely want to be able to look at TAP test logs even if they
succeed, and only wipe them out at the next run, so I think that this
should be independent.  What about an on-disk empty file then which gets
simply created in the INIT() block of TestLib.pm, and removed in the END
block only if all tests pass?  We know the base directory thanks to
$ENV{TESTDIR} which should just be "." if undefined.  Then we name it
say, "prove_running" or similar.
--
Michael




Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-17 Thread Michael Paquier
On Wed, Jun 06, 2018 at 11:58:14AM -0400, Peter Eisentraut wrote:
> 
>  The setting always requires the use of reflinks.  If
>  they are not supported, the pg_upgrade run
>  will abort.  Use this in production to limit the upgrade run time.
>  The setting auto uses reflinks when available,
>  otherwise it falls back to a normal copy.  This is the default.  The
>  setting never prevents use of reflinks and always
>  uses a normal copy.  This can be useful to ensure that the upgraded
>  cluster has its disk space fully allocated and not shared with the old
>  cluster.
> 

Hm...  I am wondering if we actually want the "auto" mode where we make
the option smarter automatically.  I am afraid of users trying to use it
and being surprised that there is no gain while they expected some.  I
would rather switch that to an on/off switch, which defaults to "off",
or simply what is available now.  huge_pages=try was a bad idea as the
result is not deterministic, so I would not have more of that...

Putting CloneFile and check_reflink in a location that other frontend
binaries could use would be nice, like pg_rewind.
--
Michael




Re: missing toast table for pg_policy

2018-07-17 Thread Michael Paquier
On Sat, Jul 14, 2018 at 03:47:38PM +0700, John Naylor wrote:
> I hope you don't mind, but since it might be tedious to piece together
> the addenda I left behind in this thread, I thought it might be useful
> to update Joe's patch. The attached was rebased over the new
> regression test, passes the pg_upgrade test, and has a draft commit
> message.

I was just having a second look at this patch, and did a bit more tests
with pg_upgrade which passed.

+-- 2. pg_largeobject and pg_largeobject_metadata, to avoid problems
+-- with pg_upgrade
John, what's actually the failure that was seen here?  It would be nice
to see this patch committed but the reason here should be more
explicit about why this cannot happen.
--
Michael




Re: Add SKIP LOCKED to VACUUM and ANALYZE

2018-07-17 Thread Michael Paquier
On Thu, Jul 12, 2018 at 02:37:28PM +0900, Michael Paquier wrote:
> On Wed, Jun 13, 2018 at 08:29:12PM +, Bossart, Nathan wrote:
>> Previous thread: 
>> https://postgr.es/m/4BC0F3CD-F4B5-4F23-AADB-80607F9E4B4E%40amazon.com
>> 
>> This is a new thread for tracking the work to add SKIP LOCKED to
>> VACUUM and ANALYZE.  I've attached a rebased set of patches.
> 
> I am beginning to look at this stuff, and committed patch 4 and 5 on the
> way as they make sense independently.  I am still reviewing the rest,
> which applies with some fuzz, and the tests proposed still pass.

I have been looking at the rest of the patch set, and I find that the
way SKIP_LOCKED is handled is a bit inconsistent.  First, if the manual
VACUUM does not specify a list of relations, which can never happen for
autovacuum, then we get to use get_all_vacuum_rels to decide the list of
relations to look at, and then it is up to vacuum_rel() to decide if a
relation can be skipped or not.  If a list of relations is given by the
caller, then it is up to expand_vacuum_rel() to do the call.

In expand_vacuum_rel(), an OID could be defined for a relation, which is
something used by autovacuum.  If no OID is defined, which happens for a
manual VACUUM, then we use directly RangeVarGetRelidExtended() at this
stage and see if the relation is available.  If the relation listed in
the manual VACUUM is a partitioned table, then its full set of
partition OIDs (including lower layers) are included in the list of
relations to process.  And we definitely want to take, then release,
AccessShareLock once so that the partition tree lookup can happen.  This also
uses find_all_inheritors() with NoLock, so we rely on vacuum_rel()
to skip the relation or not.

The first thing that strikes me is that we may actually *not* want
to check for lock skipping within expand_vacuum_rel(), as that's mainly a
function aimed at building the list of relations which are going to be
vacuumed, and all the lock skipping operations should happen at the beginning
of vacuum_rel().  This way, even if the parent of a partition tree is
listed but cannot have its RangeVar opened immediately, we still have a
chance to vacuum some of its partitions after listing them.
This point is debatable as there are pros and cons.  I would be of the
opinion to not change expand_vacuum_rel()'s mission to build the list of
relations to VACUUM, and actually complain about a lock not available
when the relation is ready to be processed individually.  It is also
perfectly possible to skip locks for partitions by listing them
individually in the VACUUM command used. I am pretty sure that Andres
would complain at the sight of this paragraph as this is backwards with
the reason behind the refactoring of RangeVarGetRelidExtended but I like
the new API better :p 
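
For illustration, listing partitions individually with lock skipping would
look something like this (using the parenthesized SKIP_LOCKED spelling argued
for below, with made-up partition names):

  VACUUM (SKIP_LOCKED) measurements_2018_06, measurements_2018_07;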

For this part, it seems to me that we can do better than what is in
HEAD:
- Call RangeVarGetRelidExtended without lock.
- Check for has_subclass(), which should be extended so that it does not
complain about a missing relation, letting vacuum.c do the error
handling.
- Call RangeVarGetRelidExtended a second time if a lookup with
find_all_inheritors is necessary.  If the relation goes missing
in-between then we just get an error as the current behavior.

I am also questioning the idea of complicating vac_open_indexes() by
skipping a full relation vacuum if one of its indexes cannot be opened,
which would be the case for an ALTER INDEX for example.  It seems to me
that skipping locks should just happen for the parent relation, so I
would not bother much about this case, and just document the behavior.
If somebody argues for this option, we could always have a SKIP_INDEXES
or such.

"SKIP LOCKED" should be named "SKIP_LOCKED", with an underscore for
consistency with the other options.

-void
-cluster_rel(Oid tableOid, Oid indexOid, bool recheck, bool verbose)
+bool
+cluster_rel(Oid tableOid, Oid indexOid, bool recheck, bool verbose,
+			bool skip_locked)
 {
Some refactoring of CLUSTER on the way here?  It would be nice to avoid
having three boolean arguments here, and it would open the door for an extended
CLUSTER grammar with parentheses as well as a SKIP_LOCKED option for
it.  And this could be refactored first.  VACUUM FULL being a variant of
CLUSTER, we could just reuse the same options...  Now using separate
option names could be cleaner.

The stuff of get_elevel_for_skipped_relation could be refactored into
something used as well by cluster_rel as the same logic is used in three
places (vacuum_rel, analyze_rel and cluster_rel with try_relation_open).

I think that it would be most interesting to get the refactoring around
get_elevel_for_skip_locked and cluster_rel done first.  The improvement
for RangeVarGetRelidExtended with relations not having subclasses could
be worth done separately as well.
--
Michael

