Re: A strange GiST error message or fillfactor of GiST build

2018-08-29 Thread Kyotaro HORIGUCHI
Hello.

At Wed, 29 Aug 2018 10:42:59 -0300, Andrey Borodin  wrote 
in <6fbe12b2-4f59-4db9-bde9-62c880118...@yandex-team.ru>
> >> postgres=# create table y as  select cube(array(SELECT random() as a FROM 
> >> generate_series(1,1000))) from generate_series(1,1e3,1); 
> >> SELECT 1000
> >> postgres=# create index on y using gist(cube );
> >> ERROR:  index row size 8016 exceeds maximum 8152 for index "y_cube_idx"
> > 
> > This is indeed strange. It happens because the message doesn't
> > take the fill factor into account at that point. It is fixed by
> > passing freespace to gistSplit(), which allows gistfitpage() to
> > consider the fillfactor, as the TODO comment within it suggests.
> > 
> > After the attached patch is applied, the above message becomes as
> > follows. (And the index can be built, just a bit sparser due to the
> > fill factor.)
> 
> We are passing freespace everywhere. Also, we pass GistInsertState, and 
> GistState.
> Maybe let's put GistState into GistInsertState, GistState already has free 
> space, and pass just GistInsertState everywhere?

Yeah, I thought of something like that at first. GISTSTATE doesn't have
the freespace size, but we could refactor so that all insert-related
routines use GISTInsertState and make GISTBuildState have it (patch 1).
But this prevents future back-patching, so I don't think it is
acceptable.

The second patch corresponds to the original patch, which is not
shrunk by much.
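
For readers puzzled by the numbers in the quoted report (8016 is smaller
than 8152): the failing check reserves fillfactor space that the message
does not mention. A standalone sketch of that arithmetic, using assumed
approximate constants rather than the server's actual code:

#include <stdio.h>

/* Assumed round numbers: 8 kB blocks, ~40 bytes of page header plus GiST
 * special space, and the default GiST fillfactor of 90.  These stand in
 * for the real BLCKSZ / GiSTPageSize / GIST_DEFAULT_FILLFACTOR values. */
#define DEMO_BLCKSZ          8192
#define DEMO_PAGE_OVERHEAD     40
#define DEMO_FILLFACTOR        90

int
main(void)
{
    int raw_capacity = DEMO_BLCKSZ - DEMO_PAGE_OVERHEAD;          /* ~8152, the reported "maximum" */
    int freespace = DEMO_BLCKSZ * (100 - DEMO_FILLFACTOR) / 100;  /* ~819 bytes reserved at build time */
    int effective = raw_capacity - freespace;                     /* what the build actually allows */

    printf("reported maximum:   %d\n", raw_capacity);
    printf("fillfactor reserve: %d\n", freespace);
    printf("effective maximum:  %d\n", effective);

    /* An ~8016-byte index row fits under 8152 but not under ~7333, which
     * is why the unpatched message looks self-contradictory. */
    return 0;
}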

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5fe611fe9edea2294c53ec9556237e7bf686cb7f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 30 Aug 2018 14:25:18 +0900
Subject: [PATCH 1/2] Refactor parameter of GiST insertion routines

Many of the GiST routines related to insertion require a Relation,
a GISTSTATE and freespace, which are members of GISTInsertState.  This
patch refactors such routines to take a single GISTInsertState instead
of the individual values.
---
 src/backend/access/gist/gist.c | 133 ++---
 src/backend/access/gist/gistbuild.c|  66 +++---
 src/backend/access/gist/gistbuildbuffers.c |  25 +++---
 src/backend/access/gist/gistsplit.c|  67 ---
 src/backend/access/gist/gistutil.c |  32 ---
 src/include/access/gist_private.h  |  43 --
 6 files changed, 177 insertions(+), 189 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..33b9532bff 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -28,16 +28,15 @@
 
 
 /* non-export function prototypes */
-static void gistfixsplit(GISTInsertState *state, GISTSTATE *giststate);
+static void gistfixsplit(GISTInsertState *state);
 static bool gistinserttuple(GISTInsertState *state, GISTInsertStack *stack,
-GISTSTATE *giststate, IndexTuple tuple, OffsetNumber oldoffnum);
+IndexTuple tuple, OffsetNumber oldoffnum);
 static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
- GISTSTATE *giststate,
  IndexTuple *tuples, int ntup, OffsetNumber oldoffnum,
  Buffer leftchild, Buffer rightchild,
  bool unlockbuf, bool unlockleftchild);
 static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
-GISTSTATE *giststate, List *splitinfo, bool releasebuf);
+List *splitinfo, bool releasebuf);
 static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
 
 
@@ -152,31 +151,36 @@ gistinsert(Relation r, Datum *values, bool *isnull,
 		   IndexUniqueCheck checkUnique,
 		   IndexInfo *indexInfo)
 {
-	GISTSTATE  *giststate = (GISTSTATE *) indexInfo->ii_AmCache;
+	GISTInsertState state;
 	IndexTuple	itup;
 	MemoryContext oldCxt;
 
+	state.giststate = (GISTSTATE *) indexInfo->ii_AmCache;
+	state.r = r;
+	state.freespace = 0;
+	state.stack = NULL;
+
 	/* Initialize GISTSTATE cache if first call in this statement */
-	if (giststate == NULL)
+	if (state.giststate == NULL)
 	{
 		oldCxt = MemoryContextSwitchTo(indexInfo->ii_Context);
-		giststate = initGISTstate(r);
-		giststate->tempCxt = createTempGistContext();
-		indexInfo->ii_AmCache = (void *) giststate;
+		state.giststate = initGISTstate(r);
+		state.giststate->tempCxt = createTempGistContext();
+		indexInfo->ii_AmCache = (void *) state.giststate;
 		MemoryContextSwitchTo(oldCxt);
 	}
 
-	oldCxt = MemoryContextSwitchTo(giststate->tempCxt);
+	oldCxt = MemoryContextSwitchTo(state.giststate->tempCxt);
 
-	itup = gistFormTuple(giststate, r,
+	itup = gistFormTuple(&state,
 		 values, isnull, true /* size is currently bogus */ );
 	itup->t_tid = *ht_ctid;
 
-	gistdoinsert(r, itup, 0, giststate);
+	gistdoinsert(&state, itup);
 
 	/* cleanup */
 	MemoryContextSwitchTo(oldCxt);
-	MemoryContextReset(giststate->tempCxt);
+	MemoryContextReset(state.giststate->tempCxt);
 
 	return false;
 }
@@ -212,7 +216,7 @@ gistinsert(Relation r, Datum *values, bool *isnull,
  * Returns 'true' if the page was split, 'false' otherwise.
  */
 bool
-gistplacetopage(Relation rel, Size freespace, GISTSTATE 

Re: Use C99 designated initializers for some structs

2018-08-29 Thread Andres Freund
Hi,

On 2018-08-29 18:51:24 -0400, Tom Lane wrote:
> I agree that assuming that they're physically zeroes is OK from a
> portability standpoint, because we'd have a whole lot of other issues
> if they weren't.  But I have a different point to make, which is that
> it's fairly standard practice for us to initialize all fields of a struct
> explicitly, even when setting them to zero/false/NULL.  I don't think we
> should walk away from that practice just because C99 offers a shiny new
> syntax for doing so.
> 
> The main practical advantage I see to writing such "redundant" explicit
> field initializations is that it aids greppability: when you're adding a
> new field Y beside field X, grepping for X is a good way of finding the
> places where you need to initialize/copy/write/read/generically-frob Y
> too.  Omitting mention of X just because you're implicitly initializing
> it puts a big hole in the reliability of that technique.

FWIW, I think this has far bigger costs than gains.  You can't rely on
it being done everywhere anyway - there are *heaps* of places where we
don't set all members - and it makes changing field names etc. way more
verbose than it has to be.

Greetings,

Andres Freund



Re: Use C99 designated initializers for some structs

2018-08-29 Thread Andres Freund
Hi,

On 2018-08-29 20:35:57 -0400, Chapman Flack wrote:
> On 08/29/18 18:51, Tom Lane wrote:
> 
> > As against that, of course, explicitly zeroing fields that you know very
> > well are already zero eats some cycles.  I've occasionally wondered if
> 
> I haven't checked what a smart C99 compiler actually emits for a
> designated initializer giving a field a compile-time known constant zero.
> Is it sure to eat any more cycles than the same initializer with the field
> unmentioned?

It's unlikely that any compiler worth its salt will emit redundant zero
initializations after a struct initialization (it's dead trivial to
recognize that in any SSA-like form, which I think most compilers use
these days, certainly gcc and clang).  What it can't optimize away
however is the x = makeNode(SomeType); x->foo = EquivalentToZero;
case.  Currently the compiler has no way to know that the memory is zero
initialized (except for the type member, which the compiler can see
being set).
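
To make the two cases concrete, here is a standalone sketch; DemoNode and
make_demo_node() are hypothetical stand-ins for a Node type and for
makeNode()/palloc0(), whose zeroing happens behind an out-of-line call in
the backend:

#include <stdio.h>
#include <stdlib.h>

/* hypothetical node type, standing in for a backend Node struct */
typedef struct DemoNode
{
    int   tag;
    int   foo;
    char *bar;
} DemoNode;

/*
 * Stand-in for makeNode()/palloc0().  In the backend the zeroing is done in
 * another translation unit, so callers' compilers cannot assume the memory
 * is already zero; calloc() here merely keeps the sketch self-contained.
 */
static DemoNode *
make_demo_node(void)
{
    DemoNode *n = calloc(1, sizeof(DemoNode));

    if (n == NULL)
        exit(1);
    n->tag = 1;
    return n;
}

int
main(void)
{
    /*
     * Struct initialization: the optimizer sees every store, so redundant
     * explicit zeroes (.foo, .bar) are trivially removable.
     */
    DemoNode  a = {.tag = 1, .foo = 0, .bar = NULL};

    /*
     * Allocate-then-assign: when the allocator's zeroing is not visible to
     * the compiler, the explicit zero store below has to be emitted.
     */
    DemoNode *b = make_demo_node();

    b->foo = 0;

    printf("%d %d\n", a.foo, b->foo);
    free(b);
    return 0;
}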

Greetings,

Andres Freund



Fix comments of IndexInfo

2018-08-29 Thread Yugo Nagata
Hi,

Attached is a patch to fix comments of IndexInfo.  ii_KeyAttrNumbers was
renamed to ii_IndexAttrNumbers and ii_Am was added but these are
not reflected to the comment.

Regards,
-- 
Yugo Nagata 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41fa2052a2..c830f141b1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -120,7 +120,7 @@ typedef struct ExprState
  *
  *		NumIndexAttrs		total number of columns in this index
  *		NumIndexKeyAttrs	number of key columns in index
- *		KeyAttrNumbers		underlying-rel attribute numbers used as keys
+ *		IndexAttrNumbers	underlying-rel attribute numbers used as keys
  *			(zeroes indicate expressions). It also contains
  * 			info about included columns.
  *		Expressions			expr trees for expression entries, or NIL if none
@@ -138,6 +138,7 @@ typedef struct ExprState
  *		Concurrent			are we doing a concurrent index build?
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Am					Oid of index AM
 *		AmCache				private cache area for index AM
 *		Context				memory context holding this IndexInfo
  *


Re: [PATCH] Fix formatting in pg_upgrade manpage doc

2018-08-29 Thread Bruce Momjian
On Wed, Aug 29, 2018 at 08:02:54PM -0700, Martín Fernández wrote:
> *
> Stumbled upon this formatting issue while reading the pg_upgrade docs.

I am confused.   What is the value of moving "without" to the next line
in the sgml file?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



[PATCH] Fix formatting in pg_upgrade manpage doc

2018-08-29 Thread Martín Fernández
Stumbled upon this formatting issue while reading the pg_upgrade docs.

Martín

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..4f69425c59 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -661,7 +661,8 @@ psql --username=postgres --file=script.sql postgres
 
   

-If you ran pg_upgrade without
+If you ran pg_upgrade
+without
 --link or did not start the new server, the
 old cluster was not modified except that, if linking
 started, a .old suffix was appended to


Re: Postgres, fsync, and OSs (specifically linux)

2018-08-29 Thread Craig Ringer
On 15 August 2018 at 07:32, Thomas Munro 
wrote:

> On Wed, Aug 15, 2018 at 11:08 AM, Asim R P  wrote:
> > I was looking at the commitfest entry for feature
> > (https://commitfest.postgresql.org/19/1639/) for the most recent list
> > of patches to try out.  The list doesn't look correct/complete.  Can
> > someone please check?
>
> Hi Asim,
>
> This thread is a bit tangled up.  There are two related patchsets in it:
>
> 1.  Craig Ringer's PANIC-on-EIO patch set, to cope with the fact that
> Linux throws away buffers and errors after reporting an error, so the
> checkpointer shouldn't retry as it does today.  The latest is here:
>
> https://www.postgresql.org/message-id/CAMsr%2BYFPeKVaQ57PwHqmRNjPCPABsdbV%
> 3DL85he2dVBcr6yS1mA%40mail.gmail.com
>
> 2.  Andres Freund's fd-sending fsync queue, to cope with the fact that
> some versions of Linux only report writeback errors that occurred
> after you opened the file, and all versions of Linux and some other
> operating systems might forget about writeback errors while no one has
> it open.
>
> Here is the original patchset:
>
> https://www.postgresql.org/message-id/20180522010823.
> z5bdq7wnlsna5qoo%40alap3.anarazel.de
>
> Here is a fix-up you need:
>
> https://www.postgresql.org/message-id/20180522185951.
> 5sdudzl46spktyyz%40alap3.anarazel.de
>
> Here are some more fix-up patches that I propose:
>
> https://www.postgresql.org/message-id/CAEepm%3D2WSPP03-20XHpxohSd2UyG_
> dvw5zWS1v7Eas8Rd%3D5e4A%40mail.gmail.com
>
> I will soon post some more fix-up patches that add EXEC_BACKEND
> support, Windows support, and a counting scheme to fix the timing
> issue that I mentioned in my first review.  I will probably squash it
> all down to a tidy patch-set after that.
>


Thanks very much, Thomas.

I've had to back off from this a bit after posting my initial
panic-for-safety patch, as the changes Andres proposed are a bit out of my
current depth and time capacity.

I still think the panic patch is needed and appropriate, but agree it's not
*sufficient*.


Re: Fix help option of contrib/oid2name

2018-08-29 Thread Tatsuro Yamada

On 2018/08/28 22:36, Michael Paquier wrote:

On Mon, Aug 27, 2018 at 07:03:18PM +0900, Michael Paquier wrote:

Thanks, I have looked at the patch set.


I have been through the set once again, and pushed both things.  Thanks
a lot Yamada-san.


Thank you very much for taking the time to review and revise the patch! :)
This year, I will try to review someone's patch, as you did.
Please let me know if you need a reviewer.

Regards,
Tatsuro Yamada




RE: speeding up planning with partitions

2018-08-29 Thread Tsunakawa, Takayuki
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> We can definitely try, but I'm not immediately sure if the further
> improvements will come from continuing to fix the planner.  Maybe, the
> overhead of partitioning could be attributed to other parts of the system.

> Actually, I wrote that for patch 0002.  The next patch (0003) is meant to
> fix that.  So, the overhead you're seeing is even after making sure that
> only the selected partitions are locked.

Thanks for sharing your thoughts.  I understand we should find the bottleneck
with profiling first.


Regards
Takayuki Tsunakawa





Re: speeding up planning with partitions

2018-08-29 Thread Amit Langote
On 2018/08/30 10:09, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
>> I measured the gain in performance due to each patch on a modest virtual
>> machine.  Details of the measurement and results follow.
> 
> Amazing!

Thanks.

> UPDATE
>> nparts  master    0001    0002    0003
>> ======  ======   =====   =====   =====
>>      0    2856    2893    2862    2816
>>      8     507    1115    1447    1872
> 
> SELECT
>> nparts  master    0001    0002    0003
>> ======  ======   =====   =====   =====
>>      0    2290    2329    2319    2268
>>      8    1058    1077    1414    1788
> 
> Even a small number of partitions still introduces a not-small overhead 
> (UPDATE:34%, SELECT:22%).

Yeah, that's true.

> Do you think this overhead can be reduced further?

We can definitely try, but I'm not immediately sure if the further
improvements will come from continuing to fix the planner.  Maybe, the
overhead of partitioning could be attributed to other parts of the system.

> What part do you guess would be relevant?  This one?
> 
>> that it is now performed after pruning. However, it doesn't do anything
>> about the fact that partitions are all still locked in the earlier phase.

Actually, I wrote that for patch 0002.  The next patch (0003) is meant to
fix that.  So, the overhead you're seeing is even after making sure that
only the selected partitions are locked.

Thanks,
Amit




RE: speeding up planning with partitions

2018-08-29 Thread Tsunakawa, Takayuki
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> I measured the gain in performance due to each patch on a modest virtual
> machine.  Details of the measurement and results follow.

Amazing!

UPDATE
> nparts  master    0001    0002    0003
> ======  ======   =====   =====   =====
>      0    2856    2893    2862    2816
>      8     507    1115    1447    1872

SELECT
> nparts  master    0001    0002    0003
> ======  ======   =====   =====   =====
>      0    2290    2329    2319    2268
>      8    1058    1077    1414    1788

Even a small number of partitions still introduces a not-small overhead 
(UPDATE:34%, SELECT:22%).  Do you think this overhead can be reduced further?  
What part do you guess would be relevant?  This one?


> that it is now performed after pruning. However, it doesn't do anything
> about the fact that partitions are all still locked in the earlier phase.


Regards
Takayuki Tsunakawa



Re: speeding up planning with partitions

2018-08-29 Thread Amit Langote
On 2018/08/30 7:27, David Rowley wrote:
> On 30 August 2018 at 00:06, Amit Langote  
> wrote:
>> nparts  master    0001    0002    0003
>> ======  ======   =====   =====   =====
>>      0    2856    2893    2862    2816
>>      8     507    1115    1447    1872
>>     16     260     765    1173    1892
>>     32     119     483     922    1884
>>     64      59     282     615    1881
>>    128      29     153     378    1835
>>    256      14      79     210    1803
>>    512       5      40     113    1728
>>   1024       2      17      57    1616
>>   2048       0*      9      30    1471
>>   4096       0+      4      15    1236
>>   8192       0=      2       7     975
> 
> Those look promising. Are you going to submit these patches to the
> September 'fest?

Thanks, I just did.

https://commitfest.postgresql.org/19/1778/

Regards,
Amit




Re: Use C99 designated initializers for some structs

2018-08-29 Thread Chapman Flack
On 08/29/18 18:51, Tom Lane wrote:

> As against that, of course, explicitly zeroing fields that you know very
> well are already zero eats some cycles.  I've occasionally wondered if

I haven't checked what a smart C99 compiler actually emits for a
designated initializer giving a field a compile-time known constant zero.
Is it sure to eat any more cycles than the same initializer with the field
unmentioned?

-Chap



Re: Postmaster doesn't send SIGTERM to bgworker during fast shutdown when pmState == PM_STARTUP

2018-08-29 Thread Michael Paquier
On Wed, Aug 29, 2018 at 09:09:08AM +0200, Alexander Kukushkin wrote:
> Yeah, good catch, it starts the checkpointer, bgwriter and in some cases
> even the archiver process (when archive_mode=always) while pmState still
> equals PM_STARTUP.
> Please find attached the new version of the fix.

Thanks, pushed and back-patched down to 9.5, which is where the bug was
introduced; before that, SignalUnconnectedWorkers() was doing all
the work.
--
Michael




Re: Use C99 designated initializers for some structs

2018-08-29 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-29 15:51:07 -0500, David Steele wrote:
>> One thing: I'm not sure that excluding the InvalidOid assignment in the
>> TopTransactionStateData initializer is a good idea.  That is, it's not clear
>> that InvalidOid is 0.
>> NULL, false, and 0 seem like no-brainers, but maybe it would be better to
>> explicitly include constants that we define that are not obviously 0, or
>> maybe just all of them.

> We rely on that in many other places imo. So I don't think it's worth
> starting to care about it here.

I agree that assuming that they're physically zeroes is OK from a
portability standpoint, because we'd have a whole lot of other issues
if they weren't.  But I have a different point to make, which is that
it's fairly standard practice for us to initialize all fields of a struct
explicitly, even when setting them to zero/false/NULL.  I don't think we
should walk away from that practice just because C99 offers a shiny new
syntax for doing so.

The main practical advantage I see to writing such "redundant" explicit
field initializations is that it aids greppability: when you're adding a
new field Y beside field X, grepping for X is a good way of finding the
places where you need to initialize/copy/write/read/generically-frob Y
too.  Omitting mention of X just because you're implicitly initializing
it puts a big hole in the reliability of that technique.
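
As a concrete illustration of that practice (hypothetical struct, shown with
the C99 syntax under discussion): every field is spelled out, so grepping for
a field name finds this initializer even when its value is just zero.

#include <stdio.h>

/* hypothetical struct that is about to grow a new field next to x */
typedef struct DemoState
{
    int   w;
    int   x;
    char *y;
    void *z;
} DemoState;

int
main(void)
{
    /*
     * "Redundant" explicit initializations: x, y and z are zero/NULL anyway,
     * but naming them here means a grep for any of them turns up this site
     * when a neighboring field is added later.
     */
    DemoState s = {
        .w = 42,
        .x = 0,
        .y = NULL,
        .z = NULL,
    };

    printf("%d %d\n", s.w, s.x);
    return 0;
}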

As against that, of course, explicitly zeroing fields that you know very
well are already zero eats some cycles.  I've occasionally wondered if
it's worth doing things like

mynode = makeNode(MyNodeType);
mynode->w = 42;
/* mynode->x = 0; */
/* mynode->y = NULL; */
mynode->z = ptr;

but that seems mighty ugly.

An argument I *don't* buy is that leaving out redundant field assignments
is a good savings of programmer time.  It's not a useful savings to the
original developer, and it's a net negative for anyone who has to review
or modify such code later.  I think it was Brooks who said something to
the effect that any programmer would happily type out every line of code
thrice over, if only that would guarantee no bugs.  It doesn't, of course,
but there are virtues in verbosity nonetheless.

regards, tom lane



Re: speeding up planning with partitions

2018-08-29 Thread David Rowley
On 30 August 2018 at 00:06, Amit Langote  wrote:
> nparts  master    0001    0002    0003
> ======  ======   =====   =====   =====
>      0    2856    2893    2862    2816
>      8     507    1115    1447    1872
>     16     260     765    1173    1892
>     32     119     483     922    1884
>     64      59     282     615    1881
>    128      29     153     378    1835
>    256      14      79     210    1803
>    512       5      40     113    1728
>   1024       2      17      57    1616
>   2048       0*      9      30    1471
>   4096       0+      4      15    1236
>   8192       0=      2       7     975

Those look promising. Are you going to submit these patches to the
September 'fest?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: patch to allow disable of WAL recycling

2018-08-29 Thread Jerry Jelinek
Tomas,

This is really interesting data, thanks a lot for collecting all of it and
formatting the helpful graphs.

Jerry


On Sun, Aug 26, 2018 at 4:14 PM, Tomas Vondra 
wrote:

>
>
> On 08/25/2018 12:11 AM, Jerry Jelinek wrote:
> > Alvaro,
> >
> > I have previously posted ZFS numbers for SmartOS and FreeBSD to this
> > thread, although not with the exact same benchmark runs that Tomas did.
> >
> > I think the main purpose of running the benchmarks is to demonstrate
> > that there is no significant performance regression with wal recycling
> > disabled on a COW filesystem such as ZFS (which might just be intuitive
> > for a COW filesystem). I've tried to be sure it is clear in the doc
> > change with this patch that this tunable is only applicable to COW
> > filesystems. I do not think the benchmarks will be able to recreate the
> > problematic performance state that was originally described in Dave's
> > email thread here:
> >
> > https://www.postgresql.org/message-id/flat/
> CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#
> cacukrjo7djvub8e2aijoayj8bfkk3xxbtwu3kkaritr67m3...@mail.gmail.com
> >
>
> I agree - the benchmarks are valuable both to show improvement and lack
> of regression. I do have some numbers from LVM/ext4 (with snapshot
> recreated every minute, to trigger COW-like behavior, and without the
> snapshots), and from ZFS (on Linux, using zfsonlinux 0.7.9 on kernel
> 4.17.17).
>
> Attached are PDFs with summary charts, more detailed results are
> available at
>
>   https://bitbucket.org/tvondra/wal-recycle-test-xeon/src/master/
>
>
>
> lvm/ext4 (no snapshots)
> ---
> > This pretty much behaves like plain ext4, at least for scales 200 and
> 2000. I don't have results for scale 8000, because the test ran out of
> disk space (I've used part of the device for snapshots, and it was
> enough to trigger the disk space issue).
>
>
> lvm/ext4 (snapshots)
> -
> On the smallest scale (200), there's no visible difference. On scale
> 2000 disabling WAL reuse gives about 10% improvement (21468 vs. 23517
> tps), although it's not obvious from the chart. On the largest scale
> (6000, to prevent the disk space issues) the improvement is about 10%
> again, but it's much clearer.
>
>
> zfs (Linux)
> ---
> On scale 200, there's pretty much no difference. On scale 2000, the
> throughput actually decreased a bit, by about 5% - from the chart it
> seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
> for some reason.
>
> I have no idea what happened at the largest scale (8000) - on master
> there's a huge drop after ~120 minutes, which somewhat recovers at ~220
> minutes (but not fully). Without WAL reuse there's no such drop,
> although there seems to be some degradation after ~220 minutes (i.e. at
> about the same time the master partially recovers. I'm not sure what to
> think about this, I wonder if it might be caused by almost filling the
> disk space, or something like that. I'm rerunning this with scale 600.
>
> I'm also not sure how much can we extrapolate this to other ZFS configs
> (I mean, this is a ZFS on a single SSD device, while I'd generally
> expect ZFS on multiple devices, etc.).
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-08-29 Thread Peter Geoghegan
On Tue, Aug 28, 2018 at 11:32 PM, Peter Eisentraut
 wrote:
> Do you plan to submit this patch to the upcoming commit fest perhaps?  I
> have done some testing on it and it seems worth pursuing further.

I should make sure that this makes it into the September 'fest. Thanks
for reminding me. I've been distracted by other responsibilities, but
I think that this project is well worthwhile. I intend to spend lots
of time on it, as I think that it has real strategic importance. I
would be most grateful if you signed up to review the patch. I've been
testing it with quite a variety of real-world data, which seems like
the way to go until the code really matures.

As I indicated to Simon on August 2nd, it seems like I should further
refine my current working draft of the next version to have less
magic. I have a cumbersome test suite that proves that I have
something that improves fan-out on TPC-C and TPC-H indexes by quite a
bit (e.g. the main TPC-C order_line pkey is about 7% smaller after
initial preload, despite not even having smaller pivot tuples due to
alignment). It also proves that the benefits of not "getting tired" in
the event of many duplicates are preserved. We need to have both at
once.

The code to pick a split point during a page split is rather tricky.
That's what's holding the next version up. I can post what I have
right now on the other thread, to give you a better idea of the
direction of the patch. You'll have to hold your nose when you look at
the code that picks a split point, though. Let me know if you think
that that makes sense. I wouldn't want you to spend too much time on
old-ish code.

-- 
Peter Geoghegan



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
I wrote:
> We could perhaps fix this with a less invasive change than what you
> suggest here, by attacking the missed-call-due-to-recursion aspect
> rather than monkeying with how relcache rebuild itself works.

Seeing that rearranging the relcache rebuild logic is looking less than
trivial, I thought it'd be a good idea to investigate this alternative
approach.

Essentially, the problem here is that lmgr.c supposes that
AcceptInvalidationMessages is an atomic operation, which it most
certainly isn't.  In reality, it's unsafe to suppose that we can skip
reading incoming inval messages until we have *completed*
AcceptInvalidationMessages, not just invoked it.

However, we can fix that very easily, by retaining one additional bit
of state in each LOCALLOCK entry, which records whether we know we have
completed AcceptInvalidationMessages at least once subsequent to obtaining
that lock.  In the attached draft patch, I refer to that state as having
"cleared" the lock, but I'm not super pleased with that terminology ---
anybody have a better idea?

A small problem with the lock.c API as it stands is that we'd have to do
an additional hashtable lookup to re-find the relevant LOCALLOCK entry.
In the attached, I fixed that by extending LockAcquireExtended's API
to allow the caller to obtain a pointer to the LOCALLOCK entry, making
it trivially cheap to set the flag after we finish
AcceptInvalidationMessages.  LockAcquireExtended has only one external
caller right now, which makes me think it's OK to change its API rather
than introduce YA variant entry point, but that could be argued.

There are two ways we could adjust the return value from
LockAcquire(Extended) to cope with this requirement.  What I did here
was to add an additional LockAcquireResult code, so that "already held"
can be distinguished from "already held and cleared".  But we don't
really need to make that distinction, at least not in the core code.
So another way would be to redefine LOCKACQUIRE_ALREADY_HELD as meaning
"held and cleared", and then just return LOCKACQUIRE_OK if you haven't
called MarkLockClear for the lock.  I'm not entirely sure which way is
less likely to break any third-party code that might be inspecting the
result of LockAcquire.  You could probably argue it either way depending
on what you think the third-party code is doing with the result.
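
As a concrete reading of the first option, here is a standalone sketch of the
result codes and the caller-side decision; the names follow the draft patch
below, and the three pre-existing values mirror the current LockAcquireResult
as I understand it:

#include <stdio.h>
#include <stdbool.h>

/* sketch of the adjusted result codes; not the real lock.h declaration */
typedef enum DemoLockAcquireResult
{
    LOCKACQUIRE_NOT_AVAIL,      /* lock not available, and dontWait=true */
    LOCKACQUIRE_OK,             /* lock successfully acquired */
    LOCKACQUIRE_ALREADY_HELD,   /* incremented count for lock already held */
    LOCKACQUIRE_ALREADY_CLEAR   /* already held, and invals already absorbed */
} DemoLockAcquireResult;

/* caller-side logic corresponding to the patched LockRelationOid() below */
static bool
must_accept_invalidations(DemoLockAcquireResult res)
{
    return res != LOCKACQUIRE_ALREADY_CLEAR;
}

int
main(void)
{
    printf("%d\n", must_accept_invalidations(LOCKACQUIRE_ALREADY_HELD));  /* 1 */
    printf("%d\n", must_accept_invalidations(LOCKACQUIRE_ALREADY_CLEAR)); /* 0 */
    return 0;
}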

Anyway, the attached appears to fix the problem: Andres' test script
fails in fractions of a second with HEAD on my workstation, but it's
run error-free for quite awhile now with this patch.  And this is sure
a lot simpler than any relcache rebuild refactoring we're likely to come
up with.

Discuss.

regards, tom lane

PS: this also fixes what seems to me to be a buglet in the fast-path
locks stuff: there are failure paths out of LockAcquire that don't
remove the failed LOCALLOCK entry.  Not any more.

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 2e07702..e924c96 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -661,7 +661,8 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
 
 	SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
 
-	LockAcquireExtended(&locktag, AccessExclusiveLock, true, false, false);
+	LockAcquireExtended(&locktag, AccessExclusiveLock, true, false,
+		false, NULL);
 }
 
 static void
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 7b2dcb6..dc0a439 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -105,11 +105,12 @@ void
 LockRelationOid(Oid relid, LOCKMODE lockmode)
 {
 	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
 	LockAcquireResult res;
 
	SetLocktagRelationOid(&tag, relid);
 
-	res = LockAcquire(&tag, lockmode, false, false);
+	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
 
 	/*
 	 * Now that we have the lock, check for invalidation messages, so that we
@@ -120,9 +121,18 @@ LockRelationOid(Oid relid, LOCKMODE lockmode)
 	 * relcache entry in an undesirable way.  (In the case where our own xact
 	 * modifies the rel, the relcache update happens via
 	 * CommandCounterIncrement, not here.)
+	 *
+	 * However, in corner cases where code acts on tables (usually catalogs)
+	 * recursively, we might get here while still processing invalidation
+	 * messages in some outer execution of this function or a sibling.  The
+	 * "cleared" status of the lock tells us whether we really are done
+	 * absorbing relevant inval messages.
 	 */
-	if (res != LOCKACQUIRE_ALREADY_HELD)
+	if (res != LOCKACQUIRE_ALREADY_CLEAR)
+	{
 		AcceptInvalidationMessages();
+		MarkLockClear(locallock);
+	}
 }
 
 /*
@@ -138,11 +148,12 @@ bool
 ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 {
 	LOCKTAG		tag;
+	LOCALLOCK  *locallock;
 	LockAcquireResult res;
 
	SetLocktagRelationOid(&tag, relid);
 
-	res = LockAcquire(&tag, lockmode, false, true);
+	res = LockAcquireExtended(&tag, lockmode, 

Re: Use C99 designated initializers for some structs

2018-08-29 Thread Thomas Munro
On Thu, Aug 30, 2018 at 8:51 AM David Steele  wrote:
> On 8/29/18 5:14 AM, Peter Eisentraut wrote:
> > On 29/08/2018 12:13, Peter Eisentraut wrote:
> >> Here is a patch to change some struct initializations to use C99-style
> >> designated initializers.  These are just a few particularly egregious
> >> cases that were hard to read and write, and error prone because of many
> >> similar adjacent types.
> >>
> >> (The PL/Python changes currently don't compile with Python 3 because of
> >> the situation described in the parallel thread "PL/Python: Remove use of
> >> simple slicing API".)
> >>
> >> Thoughts?
>
> +1.  This is an incredible win for readability/maintainability.

+1.  Much nicer.

I know several people have the goal of being able to compile
PostgreSQL as C and C++, and this syntax is not in the C++ standard.
Happily, popular compilers seem OK with it, and it's already been voted
into the draft for C++20.  So no problem on that front.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Use C99 designated initializers for some structs

2018-08-29 Thread Andres Freund
Hi,

On 2018-08-29 15:51:07 -0500, David Steele wrote:
> One thing: I'm not sure that excluding the InvalidOid assignment in the
> TopTransactionStateData initializer is a good idea.  That is, it's not clear
> that InvalidOid is 0.
> 
> NULL, false, and 0 seem like no-brainers, but maybe it would be better to
> explicitly include constants that we define that are not obviously 0, or
> maybe just all of them.

We rely on that in many other places imo. So I don't think it's worth
starting to care about it here.

Greetings,

Andres Freund



Re: Use C99 designated initializers for some structs

2018-08-29 Thread David Steele

On 8/29/18 5:14 AM, Peter Eisentraut wrote:

On 29/08/2018 12:13, Peter Eisentraut wrote:

Here is a patch to change some struct initializations to use C99-style
designated initializers.  These are just a few particularly egregious
cases that were hard to read and write, and error prone because of many
similar adjacent types.

(The PL/Python changes currently don't compile with Python 3 because of
the situation described in the parallel thread "PL/Python: Remove use of
simple slicing API".)

Thoughts?


+1.  This is an incredible win for readability/maintainability.

One thing: I'm not sure that excluding the InvalidOid assignment in the 
TopTransactionStateData initializer is a good idea.  That is, it's not 
clear that InvalidOid is 0.


NULL, false, and 0 seem like no-brainers, but maybe it would be better 
to explicitly include constants that we define that are not obviously 0, 
or maybe just all of them.


Regards,
--
-David
da...@pgmasters.net



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-29 14:00:12 -0400, Tom Lane wrote:
>> 2. I think we may need to address the same order-of-operations hazards
>> as RelationCacheInvalidate() worries about.  Alternatively, maybe we
>> could simplify that function by making it use the same
>> delayed-revalidation logic as we're going to develop for this.

> Hm, just to make sure I understand correctly: You're thinking about
> having to rebuild various nailed entries in a specific order [2]?

> [2] Hm, I'm right now a bit confused as to why "Other nailed relations
> go to the front of rebuildList" is a good idea.

I'm not recalling the reason right now either, but I'm pretty sure that
logic was added out of necessity ... it hasn't always been like that.

regards, tom lane



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-29 12:56:07 -0400, Tom Lane wrote:
>> BTW, I now have a theory for why we suddenly started seeing this problem
>> in mid-June: commits a54e1f158 et al added a ScanPgRelation call where
>> there had been none before (in RelationReloadNailed, for non-index rels).
>> That didn't create the problem, but it probably increased the odds of
>> seeing it happen.

> Yea.  Doesn't explain why it's only really visible on the BF in
> 11/master though :/

Probably we changed around the tests that run in parallel with vacuum.sql
enough during v11 to make it happen with noticeable probability.  I think
this is sufficient evidence that it's happening now in the back branches:

https://www.postgresql.org/message-id/20180829140149.go23...@telsasoft.com

regards, tom lane



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Andres Freund
Hi,

On 2018-08-29 14:00:12 -0400, Tom Lane wrote:
> A couple thoughts after reading and reflecting for awhile:

Thanks. This definitely is too complicated for a single brain :(


> 1. I don't much like the pending_rebuilds list, mainly because of this
> consideration: what happens if we hit an OOM error trying to add an entry
> to that list?  As you've drafted the patch, we don't even mark the
> relevant relcache entry rd_invalid before that fails, so that's surely
> bad.  Now, I'm not sure how bulletproof relcache inval is in general
> with respect to OOM failures, but let's not add more hazards.

Obviously you're right and we first need to mark the entry as invalid -
and I think that should also mostly [1] address the OOM issue. If we
error out there's no need to eagerly rebuild the entry during
invalidation - subsequent RelationIdGetRelation() calls will trigger the
necessary rebuild.

But it sounds like your concern might be larger?  I don't quite see how
we could get by without such a list?  I mean I think it'd be better if
it were an array, for efficiency, but that seems like a fairly minor
point?

[1] I'm kinda doubtful that's entirely bulletproof in case an OOM, or
similar, error is caught with a savepoint, and then execution
continues outside. That in turn might access the relcache entry
without a RelationIdGetRelation call. But that doesn't seem new.


> 2. I think we may need to address the same order-of-operations hazards
> as RelationCacheInvalidate() worries about.  Alternatively, maybe we
> could simplify that function by making it use the same
> delayed-revalidation logic as we're going to develop for this.

Hm, just to make sure I understand correctly: You're thinking about
having to rebuild various nailed entries in a specific order [2]?  I'm
not quite sure I really see a problem here - in contrast to the current
RelationCacheInvalidate() logic, what I'm proposing marks the entries as
invalid in the first phase, so any later access is guaranteed to rebuild
the entry as necessary.

[2] Hm, I'm right now a bit confused as to why "Other nailed relations
go to the front of rebuildList" is a good idea. That means they might be
built while ClassOidIndexId isn't yet rebuilt?  Practically I don't
think it matters, because the lookup for RelationRelationId will rebuild
it, but I don't quite understand what that logic is precisely getting
at.


> 3. I don't at all like the ProcessPendingRelcacheRebuilds call you added
> to ProcessInvalidationMessages.  That's effectively assuming that the
> "func" *must* be LocalExecuteInvalidationMessage and not anything else;
> likewise, the lack of such a call inside ProcessInvalidationMessagesMulti
> presumes that that one is never called to actually execute invalidations.
> (While those are true statements, it's a horrible violation of modularity
> for these two functions to know it.)  Probably better to put this into the
> callers, which will know what the actual semantics are.

Yea, I mostly hacked this together quickly to have a proposal on the
table, I'm not happy with this as it stands.


> 4. The call added to the middle of ReceiveSharedInvalidMessages doesn't
> seem all that safe either; the problem is its relationship to the
> "catchup" processing.  We are caught up at the moment we exit the loop,
> but are we sure we still would be after doing assorted work for relcache
> rebuild?  Swapping the order of the two steps might help, but then we
> have to consider what happens if we error out from SICleanupQueue.

I'm not sure I understand the problem here. If there's new invalidation
processing inside the rebuilds, that'd mean there was another
ReceiveSharedInvalidMessages, which then would have triggered the "new"
ReceiveSharedInvalidMessages, to then process the new pending rebuilds.
That's why I made ProcessPendingRelcacheRebuilds() unlink the queue it
processes first.  I
think locking should prevent issues with entries that are currently
accessed during rebuild from being re-invalidated.

Did I misunderstand?

As the comment in ProcessPendingRelcacheRebuilds() notes, I'm concerned
what happens if we error out after unlinking the current pending
items. But I think, if we handle the memory leak with a PG_CATCH, that
should be ok, because after the error new accesses should trigger the
rebuilds.


> (In general, the hard part of all this stuff is being sure that sane
> things happen if you error out part way through ...)

Yea :/

Greetings,

Andres Freund



Re: Removing useless \. at the end of copy in pgbench

2018-08-29 Thread Andres Freund
On 2018-08-29 21:42:42 +0200, Fabien COELHO wrote:
> 
> Hello,
> 
> > What about:
> > 
> > """
> > Pgbench requires a PostgreSQL version 8.2 or above server.
> > """
> > 
> > Some information is provided...
> > 
> > I understood that Tom found that an explicit compatibility note would be
> > welcome, so I provided one. I'm also fine with saying nothing.
> 
> Here is a patch with the following accurate information:
> 
> """
> In order to run, pgbench requires a PostgreSQL server version 8.2 or above.
> """
> 
> 8.2 has been tested by Tom, and is required because of DROP TABLE IF EXISTS
> & CREATE TABLE ... FILLFACTOR, which I pointed out in a mail upthread.
> 
> Now if the information is not added to the doc, this is also fine with me.

I'd vote for not adding it.  It seems almost guaranteed to get out of
date without anybody noticing so.  Maybe that's overly pragmatic, but I
really can't see the harm of not documenting which precise ancient
version pgbench doesn't support anymore...

Greetings,

Andres Freund



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Andres Freund
Hi,

On 2018-08-29 12:56:07 -0400, Tom Lane wrote:
> I wrote:
> > * We now recursively enter ScanPgRelation, which (again) needs to do a
> > search using pg_class_oid_index, so it (again) opens and locks that.
> > BUT: LockRelationOid sees that *this process already has share lock on
> > pg_class_oid_index*, so it figures it can skip AcceptInvalidationMessages.
> 
> BTW, I now have a theory for why we suddenly started seeing this problem
> in mid-June: commits a54e1f158 et al added a ScanPgRelation call where
> there had been none before (in RelationReloadNailed, for non-index rels).
> That didn't create the problem, but it probably increased the odds of
> seeing it happen.

Yea.  Doesn't explain why it's only really visible on the BF in
11/master though :/


> Also ... isn't the last "relation->rd_isvalid = true" in
> RelationReloadNailed wrong?  If it got cleared during ScanPgRelation,
> I do not think we want to believe that we got an up-to-date row.

I don't really think so - note how a normal relcache inval essentially
does the same. RelationClearRelation() first marks the entry as invalid,
then goes and builds a new entry that's *not* hooked into the hashtable
(therefore doesn't receive new invals), and then moves the contents
over. That overwrites rd_isvalid to true, as that's guaranteed to be set
by RelationBuildDesc(). During the move no new invalidations are
accepted.   So this really is just behaving equivalently.

The harder question is why that's safe. I think I convinced myself that
it is a couple times over the years, but I don't think we've properly
documented it. As the header says:
 *  The following code contains many undocumented hacks.  Please be
 *  careful

We definitely relied on RelationClearRelation() always returning a valid
record for a while, c.f. RelationIdGetRelation()'s rd_isvalid assertion,
and the lack of a loop in that function.

(There's no coffee in this hotel at 4am. Shame.)

Ah, yes.  This assumption is currently safe because the locking on
relations being looked up better guarantees that there are no critical
changes to relcache entries while the entry is being rebuilt.

I think we'd also run into trouble with clobber cache recursively etc
without it.

Greetings,

Andres Freund



Re: rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-29 Thread Andres Freund
Hi,

On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:
> To test postgres 11, I still regularly run series of short sessions of
> pgbench-over-logical-replication (basically the same thing that I used last
> year [1] - now in a perl incarnation).  Most of the time the replication is
> stable and finishes correctly but sometimes (rarely) I get:
> 
> TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
> snap->xmin))", File: "snapbuild.c", Line: 580)
> 
> This will probably be difficult to reproduce and to act upon but I wanted to
> report it anyway as in the course of the last few months I have seen it
> several times, on several machines. Always rarely, always postgres 11 (I did
> not try other versions).

Thanks for testing! Could you possibly run the tests with core files
enabled, so we at least get a backtrace in case of trouble?  Knowing what the
values here are would be tremendously helpful...

Greetings,

Andres Freund



Re: FailedAssertion on partprune

2018-08-29 Thread Robert Haas
On Fri, Aug 17, 2018 at 2:58 AM, David Rowley
 wrote:
>> I'm baffled as to why looking through Gather to find
>> Append/MergeAppend subpaths would ever be a sane thing to do.
>
> Can you explain why it's less sane than what the current code is
> doing?  Below a Gather there will be partial paths, but we can also
> get those in a Parallel Append, which the accumulate_append_subpath()
> code already attempts to handle.

Sorry, I lost track of this email.  accumulate_append_subpath's job is
to flatten stacked Append nodes into a single Append node.  But you
can't in general flatten two Append nodes that are separated by
something else in the middle, because that other node probably does
something.  To take a really simple example, consider:

Append
-> Result
  -> Append
-> ...
-> ...

Well, the Result node computes a new target list, so the inner Append
node can't just blindly be combined with the outer Append node.  You
might be able to make it work if you adjusted the inner Append node's
children to produce the same target list that the Result node
produces, but you can't just flatten it blindly.

The same thing is true of Gather:

Append
-> Gather
  -> Append
-> ...
-> ...

The child paths of the inner Append node are known to be parallel-safe
and are necessarily partial paths unless the inner Append is a
Parallel Append; the paths under the outer Append are definitely not
partial and might not even be parallel-safe (they can't be
parallel-unsafe, or we couldn't have a Gather anywhere in the plan,
but they *could* be parallel-restricted).  If you tried to flatten the
inner Append into the outer one, you'd no longer have anything like a
valid plan, because you can't have a partial path anywhere in the tree
without a Gather someplace higher up, which wouldn't be true here any
more, which I guess is what you meant by ...

> If the Gather Path is there already
> then I guess one difference would be that the caller would need to
> ensure that another Gather path is placed below the Parallel Append
> again.

...this, but that's not right.  The Gather would have to go ABOVE the
Parallel Append.

The broader point here, though, is that even if you could do this kind
of surgery, it's the wrong way of going about the problem.  In any
case where Append-Gather-Append could be flattened to Gather-Append,
we should have just generated Gather-Append initially, rather than
trying to rejigger things that way later.  And I think that's what the
code does.  I'm not sure exactly what's going wrong with runtime
partition pruning in this case, but ISTM that if runtime partition
pruning doesn't work in some obscure case, that's just a limitation we
should accept until somebody gets around to working on it.  The key
thing right now is not to have crashes or wrong behavior, so that we
can ship.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] proposal: schema variables

2018-08-29 Thread Fabien COELHO



Hello Pavel L.

I do not understand your point, as usual. I raise a factual issue about 
security, and you do not answer how this can be solved with your proposal, 
but appeal to argument of authority and declare your "strong opinion".


I do not see any intrinsic opposition between having session objects and 
transactions. Nothing prevents a session object from being transactional, 
beyond your wish that it should not be.


Now, I do expect all PostgreSQL features to be sound security-wise, whatever 
their scope.


I do not think that security should be traded for "cheap & fast", especially 
as the sole use case for the feature is a security pattern that cannot be 
implemented securely with it. This appears to me to be a huge contradiction, 
hence my opposition to this feature as proposed.


I can't agree with your position.

Consider this example. I want to record some inappropriate user actions 
in an audit table and roll back the transaction. But aborting the transaction 
will also abort the record in the audit table. So, do not use tables, because 
they have security implications.


Indeed, you cannot record a transaction failure from a transaction.


This is very similar to your approach.


I understand that your point is that some use case could require a 
non-transactional session variable. I'm not sure how the use case would play 
out, though, because once the "attacker" disconnects, the session variable 
disappears, so it does not record that there was a problem.


Anyway, I'm not against having session variables per se. I'm arguing 
that there is a good case for having them transactional by default, and 
possibly an option to make them non-transactional if some use case really 
needs it.


The only use case put forward by Pavel S. is the security audit one 
where a session variable stores that audit checks have been performed, 
which AFAICS cannot be implemented securely with the proposed 
non-transactional session variables.


--
Fabien.



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Andres Freund  writes:
> A bit of food, a coke and a talk later, here's a first draft *prototype*
> of how this could be solved. ...
> Obviously this is far from clean enough, but what do you think about the
> basic approach?  It does, in my limited testing, indeed solve the "could
> not read block" issue.

A couple thoughts after reading and reflecting for awhile:

1. I don't much like the pending_rebuilds list, mainly because of this
consideration: what happens if we hit an OOM error trying to add an entry
to that list?  As you've drafted the patch, we don't even mark the
relevant relcache entry rd_invalid before that fails, so that's surely
bad.  Now, I'm not sure how bulletproof relcache inval is in general
with respect to OOM failures, but let's not add more hazards.

2. I think we may need to address the same order-of-operations hazards
as RelationCacheInvalidate() worries about.  Alternatively, maybe we
could simplify that function by making it use the same
delayed-revalidation logic as we're going to develop for this.

3. I don't at all like the ProcessPendingRelcacheRebuilds call you added
to ProcessInvalidationMessages.  That's effectively assuming that the
"func" *must* be LocalExecuteInvalidationMessage and not anything else;
likewise, the lack of such a call inside ProcessInvalidationMessagesMulti
presumes that that one is never called to actually execute invalidations.
(While those are true statements, it's a horrible violation of modularity
for these two functions to know it.)  Probably better to put this into the
callers, which will know what the actual semantics are.

4. The call added to the middle of ReceiveSharedInvalidMessages doesn't
seem all that safe either; the problem is its relationship to the
"catchup" processing.  We are caught up at the moment we exit the loop,
but are we sure we still would be after doing assorted work for relcache
rebuild?  Swapping the order of the two steps might help, but then we
have to consider what happens if we error out from SICleanupQueue.

(In general, the hard part of all this stuff is being sure that sane
things happen if you error out part way through ...)

regards, tom lane



Re: Removing useless \. at the end of copy in pgbench

2018-08-29 Thread Fabien COELHO




The "may" is because I could *not* test:


Works for me with 8.2.


Thanks for the confirmation.

Earlier branches fail immediately: dropping old tables... ERROR: 
syntax error at or near "exists" LINE 1: drop table if exists 
pgbench_accounts, pgbench_branches, pgb...

 ^


Ok, even before the create table with a fillfactor.


We could probably fix that if anyone cared, but I doubt anyone does.


Indeed, if people needed to use a new pgbench against such an old server, 
they would have complained.


The point was just to document the current status of pgbench 
compatibility, and 8.2 onward is the answer.


--
Fabien.



Re: FailedAssertion on partprune

2018-08-29 Thread Robert Haas
On Mon, Aug 27, 2018 at 6:05 PM, Jonathan S. Katz  wrote:
> On behalf of the RMT, I just want to make sure this keeps moving along.
> It sounds like the next step is for Robert to verify that [3] is the
> expected
> behavior and then David can decide what to do from there.

Yes, that's the expected behavior.  If we didn't regenerate the paths
there, we'd end up with

Result
-> Append
 -> [various paths that produce a tlist which needs to be adjusted
later by the result node]

Instead of:

Append
-> [various paths that produce an adjusted tlist]

Paths of the former kind have already been generated; we regenerate
paths here to get the latter kind as well, which end up displacing the
old ones on cost.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
I wrote:
> * We now recursively enter ScanPgRelation, which (again) needs to do a
> search using pg_class_oid_index, so it (again) opens and locks that.
> BUT: LockRelationOid sees that *this process already has share lock on
> pg_class_oid_index*, so it figures it can skip AcceptInvalidationMessages.

BTW, I now have a theory for why we suddenly started seeing this problem
in mid-June: commits a54e1f158 et al added a ScanPgRelation call where
there had been none before (in RelationReloadNailed, for non-index rels).
That didn't create the problem, but it probably increased the odds of
seeing it happen.

Also ... isn't the last "relation->rd_isvalid = true" in
RelationReloadNailed wrong?  If it got cleared during ScanPgRelation,
I do not think we want to believe that we got an up-to-date row.

regards, tom lane



Re: Something's busted in plpgsql composite-variable handling

2018-08-29 Thread Jonathan S. Katz
On 8/28/18 12:06 PM, Pavel Stehule wrote:
>
>
> 2018-08-28 17:04 GMT+02:00 Jonathan S. Katz :
>
>
>> On Aug 28, 2018, at 10:45 AM, Pavel Stehule wrote:
>>
>>
>>
>> 2018-08-28 16:38 GMT+02:00 Jonathan S. Katz :
>>
>>
>> > On Aug 26, 2018, at 4:09 PM, Tom Lane wrote:
>> > 
>> > I wrote:
>> >> [ dropping and recreating a composite type confuses plpgsql ]
>> >> That's not very nice.  What's worse is that it works
>> cleanly in v10,
>> >> making this a regression, no doubt caused by the hacking I
>> did on
>> >> plpgsql's handling of composite variables.
>> > 
>> > So I'm now inclined to withdraw this as an open item.  On
>> the other
>> > hand, it is a bit worrisome that I happened to hit on a
>> case that
>> > worked better before.  Maybe I'm wrong to judge this
>> unlikely to
>> > happen in the field.
>> > 
>> > Thoughts?
>>
>> Typically if you’re creating a composite type, you’re
>> planning to store
>> data in that type, so you’re probably not going to just drop
>> it without
>> an appropriate migration strategy around it, which would
>> (hopefully)
>> prevent the above case from happening.
>>
>> I wouldn’t let this block the release, so +1 for removing
>> from open
>> items.
>>
>>
>> That depends - the question is - what is a reason of this issue,
>> and how to fix it?
>
> Tom explained the cause and a proposed a fix earlier in the
> thread, and
> cautioned that it could involve a performance hit.
>
>> It is not strong issue, but it is issue, that breaks without
>> outage deployment.
>
> Have you encountered this issue in the field? It is a bug, but it
> seems to
> be an edge case based on normal usage of PostgreSQL, and I still don’t
> see a reason why it needs to be fixed prior to the release of 11.
> If there’s
> an easier solution for solving it, yes, we could go ahead, but it
> sounds like
> there’s a nontrivial amount of work + testing to do.
>
> I do think it should be fixed for 12 now that we’ve identified it.
> We could move
> it from the “Open Items” to the “Live Issues” list for what it’s
> worth.
>
>
> +1

I've gone ahead and moved this to "Live Issues" - Thanks!

Jonathan





Re: some pg_dump query code simplification

2018-08-29 Thread Tom Lane
Peter Eisentraut  writes:
> OK, updated patch attached.  If the updated style is acceptable, I'll
> start running more extensive tests against the older branches.

Looks sane to me.

regards, tom lane



Re: Removing useless \. at the end of copy in pgbench

2018-08-29 Thread Tom Lane
Fabien COELHO  writes:
>> Also, I don't find any reason why 8.2 is the cutoff, and saying that it 
>> may work down to 8.2 (implying that it may not) is content-free.

> The "may" is because I could *not* test:

Works for me with 8.2.  Earlier branches fail immediately:

dropping old tables...
ERROR:  syntax error at or near "exists"
LINE 1: drop table if exists pgbench_accounts, pgbench_branches, pgb...
  ^

We could probably fix that if anyone cared, but I doubt anyone does.

regards, tom lane



Re: Continue work on changes to recovery.conf API

2018-08-29 Thread Sergei Kornilov
Hello

The current patch moves the recovery.conf settings into the GUC system:
- if the startup process finds a recovery.conf, it throws an error
- recovery mode is turned on if the new special file recovery.signal is found
- the standby_mode setting was removed; standby mode is enabled if the startup 
process finds the new special file standby.signal
- if both standby.signal and recovery.signal are present, standby mode is used
(this design is from the previous thread)
- the recovery parameters recovery_target_inclusive, recovery_target_timeline and 
recovery_target_action become new GUCs with no logic changes
- recovery_target (immediate), recovery_target_name, recovery_target_time, 
recovery_target_xid and recovery_target_lsn are replaced by recovery_target_type 
and recovery_target_value (discussed and changed in the previous thread)
- restore_command, archive_cleanup_command, recovery_end_command, 
primary_conninfo, primary_slot_name and recovery_min_apply_delay are simply new 
GUCs
- trigger_file was renamed to promote_signal_file for clarity (my change; it was 
removed in the previous thread)
- all new GUCs are PGC_POSTMASTER (discussed in the previous thread)
- pg_basebackup with the -R (--write-recovery-conf) option will create a 
standby.signal file and append primary_conninfo and primary_slot_name (if used) 
to postgresql.auto.conf instead of writing a new recovery.conf
- appropriate changes in tests and docs

regards, Sergei



rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-29 Thread Erik Rijkers

Hello,

To test postgres 11, I still regularly run a series of short sessions of 
pgbench-over-logical-replication (basically the same thing that I used 
last year [1] - now in a perl incarnation).  Most of the time the 
replication is stable and finishes correctly, but sometimes (rarely) I 
get:


TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, 
snap->xmin))", File: "snapbuild.c", Line: 580)


This will probably be difficult to reproduce and to act upon but I 
wanted to report it anyway as in the course of the last few months I 
have seen it several times, on several machines. Always rarely, always 
postgres 11 (I did not try other versions).
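
For readers wanting to locate the check: a minimal sketch of what the
failing assertion verifies, reconstructed from the report above (the
function name and the surrounding call are from memory and may differ
in detail):

	/*
	 * Sketch only -- abridged from memory of SnapBuildInitialSnapshot()
	 * in src/backend/replication/logical/snapbuild.c.  The exported
	 * snapshot's xmin must not be older than the oldest xid that is
	 * still safe to decode; otherwise catalog rows it needs could
	 * already have been vacuumed away.
	 */
	TransactionId safeXid = GetOldestSafeDecodingTransactionId(false);

	Assert(TransactionIdPrecedesOrEquals(safeXid, snap->xmin));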



Erik Rijkers


[1] 
https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl



source/version: bf2d0462cd73 / REL_11_STABLE  (compiled 20180828_1629)

1 primary, 4 replicas (on 1 host).
Basically: pgbench --port=6515 --quiet --initialize --scale=5 postgres
  then: pgbench -M prepared -c 93 -j 8 -T 10 -P 2 -p 6515 postgres
  then: wait for log-repl sync.
(I also added extra data type columns to the basic pgbench tables to 
test all the different data types)


Below is a grep for the assert message, with 15 surrounding lines from 
the latest occurrence.


$ p=REL_11_STABLE; cd /var/data1/pg_stuff/tmp/cascade/$p ; grep -A 15 -B 
15 -Ei 'trap|assert' 65*/logfile.65* | less
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] LOG:  
logical decoding found consistent point at 0/89E038E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:14.522 CEST [139539] LOG:  
received replication command: CREATE_REPLICATION_SLOT 
"sub2_6517_6517_18834_sync_18804" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] LOG:  
logical decoding found initial starting point at 0/8AF5D590
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] DETAIL:  
Waiting for transactions (approximately 1) older than 886297 to end.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] LOG:  
logical decoding found consistent point at 0/8B018310
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.860 CEST [139407] LOG:  
received replication command: START_REPLICATION SLOT 
"sub2_6517_6517_18834_sync_18793" LOGICAL 0/89E03918 (proto_version '1', 
publication_names '"pub1_6516"')
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] LOG:  
starting logical decoding for slot "sub2_6517_6517_18834_sync_18793"
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] DETAIL:  
Streaming transactions committing after 0/89E03918, reading WAL from 
0/841EE960.
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] LOG:  
logical decoding found consistent point at 0/841EF3E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] DETAIL:  
Logical decoding will begin using saved snapshot.
6516JTq_E8/logfile.6516-2018-08-29 14:28:18.953 CEST [139541] LOG:  
received replication command: CREATE_REPLICATION_SLOT 
"sub2_6517_6517_18834_sync_18814" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] LOG:  
logical decoding found consistent point at 0/8C6BA408
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516:TRAP: 
FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", 
File: "snapbuild.c", Line: 580)
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:  
server process (PID 139541) was terminated by signal 6: Aborted
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] DETAIL:  
Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE 
READ
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:  
terminating any other active server processes
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] WARNING:  
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] DETAIL:  
The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] HINT:  In 
a moment you should be able to reconnect to the database and repeat your 
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] CONTEXT:  
slot "sub2_6517_6517", output plugin "pgoutput", in the change callback, 
associated LSN 0/8B11CF00
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] WARNING:  
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] DETAIL:  
The postmaster has 

Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Justin Pryzby  writes:
> I've seen this message now a handful of times recently.  It seems to happen
> overnight, during a maintenance job which reindexes things, including system
> catalog indices.
> It's easy to reproduce the error under 10.5, but not under 10.3 nor 10.4.
> while :; do for a in pg_class_oid_index pg_class_relname_nsp_index 
> pg_class_tblspc_relfilenode_index; do psql ts -qc "REINDEX INDEX $a"; done; 
> done&

This looks suspiciously like the issue under discussion in

https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us

As far as we can tell, that bug is a dozen years old, so it's not clear
why you find that you can reproduce it only in 10.5.  But there might be
some subtle timing change accounting for that.

regards, tom lane



Re: Is child process of postmaster able to access all the databases?

2018-08-29 Thread Tom Lane
Hubert Zhang  writes:
> I wonder if there is a way to let a child process of the postmaster access
> all the databases one by one?

No.  For starters, you'd need some way to flush all database-specific
information from relcache, catcache, and a boatload of other places;
but that logic does not exist.

regards, tom lane



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Tom Lane
Andres Freund  writes:
> It's not OK to rebuild relcache entries in the middle of
> ReceiveSharedInvalidMessages() - a later entry in the invalidation queue
> might be relmapper invalidation, and thus immediately processing a
> relcache invalidation might attempt to scan a relation that does not
> exist anymore.

Check, but the bug is actually more subtle than that.  I've reproduced the
problem using your scripts, changed the ERROR to PANIC, and studied the
stack traces of a few core dumps, and here's what I find is the scenario:

* Code tries to open some relation or other.  After acquiring lock, it
goes off to do AcceptInvalidationMessages (cf. LockRelationOid).  That
reads a relcache inval message for "t", and because "t"'s relcache entry
is already open, it decides to rebuild it immediately.

* Therefore it calls ScanPgRelation, which needs to do a search using
pg_class_oid_index, so it opens and locks that.  Again, having acquired
the lock in LockRelationOid, it goes off to do AcceptInvalidationMessages
(recursively).

* Now we read another relcache inval (for "t_pkey", in the cases I've
traced through in detail).  Again, that's open and we decide we'd better
do RelationReloadIndexInfo for it.

* We now recursively enter ScanPgRelation, which (again) needs to do a
search using pg_class_oid_index, so it (again) opens and locks that.
BUT: LockRelationOid sees that *this process already has share lock on
pg_class_oid_index*, so it figures it can skip AcceptInvalidationMessages.

Had we done AcceptInvalidationMessages at this point, we'd have found
the relcache inval message that's waiting for us about pg_class_oid_index,
and we'd have processed it before attempting to read pg_class_oid_index,
and all would be well.  It's the missed AIM call due to the recursive
entry to ScanPgRelation that is causing us to fail to read all available
inval messages before believing that our info about pg_class_oid_index
is up to date.
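
For reference, here is the relevant bit of lmgr.c, abridged and slightly
simplified from memory (the comments are mine; see the source for the
authoritative version):

void
LockRelationOid(Oid relid, LOCKMODE lockmode)
{
	LOCKTAG		tag;
	LockAcquireResult res;

	SET_LOCKTAG_RELATION(tag, MyDatabaseId, relid);

	res = LockAcquire(&tag, lockmode, false, false);

	/*
	 * Only a freshly acquired lock triggers invalidation processing; when
	 * the lock was already held (LOCKACQUIRE_ALREADY_HELD) the call is
	 * skipped, which is exactly the hole the recursive ScanPgRelation
	 * falls into.
	 */
	if (res != LOCKACQUIRE_ALREADY_HELD)
		AcceptInvalidationMessages();
}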

> Instead, receiving a relcache invalidation now just queues an entry onto
> a new "pending rebuilds" list, that is then processed in a second stage,
> after emptying the entire sinval queue.  At that stage we're guaranteed
> that the relmapper is up2date.

More to the point, we're guaranteed that we've seen all inval messages
that were sent before we acquired whatever lock we just acquired.
(In the cases I've looked at, the local relmapper *is* up to date,
it's the relcache entry for pg_class_oid_index that is not.)

My first gut feeling after understanding the recursion aspect was that
what you're suggesting would not be enough to be bulletproof.  We will
still have the hazard that while we're rebuilding relcache entries,
we may end up doing recursive ScanPgRelation calls and the inner ones
will decide they needn't call AcceptInvalidationMessages.  So we might
well still be rebuilding stuff without having seen every available
inval message.  But it should be OK, because any inval messages we could
possibly need to worry about will have been scanned and queued by the
outer call.

We could perhaps fix this with a less invasive change than what you
suggest here, by attacking the missed-call-due-to-recursion aspect
rather than monkeying with how relcache rebuild itself works.  But I'm
thinking this way might be a good idea anyhow, in order to reduce the
depth of recursion that occurs in scenarios like this.

I've not read your patch yet, so no comment on specifics.

regards, tom lane



Re: some pg_dump query code simplification

2018-08-29 Thread Peter Eisentraut
On 28/08/2018 23:43, Tom Lane wrote:
> I think I had this discussion already with somebody, but ... I do not
> like this style at all:
> 
> tbinfo->attidentity[j] = (i_attidentity >= 0 ? *(PQgetvalue(res, 
> j, i_attidentity)) : '\0');

OK, updated patch attached.  If the updated style is acceptable, I'll
start running more extensive tests against the older branches.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 1a51258aa317f8068c18f034906c3536d570b354 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 29 Aug 2018 16:45:32 +0200
Subject: [PATCH v2] pg_dump: Reorganize getTableAttrs()

Instead of repeating almost the same large query in each version branch,
use one query and add a few columns to the SELECT list depending on the
version.  This saves a lot of duplication.
---
 src/bin/pg_dump/pg_dump.c | 198 --
 1 file changed, 62 insertions(+), 136 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d833c41147..f0ea83e6a9 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8162,150 +8162,77 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int 
numTables)
 
resetPQExpBuffer(q);
 
+   appendPQExpBuffer(q,
+ "SELECT\n"
+ "a.attnum,\n"
+ "a.attname,\n"
+ "a.atttypmod,\n"
+ "a.attstattarget,\n"
+ "a.attstorage,\n"
+ "t.typstorage,\n"
+ "a.attnotnull,\n"
+ "a.atthasdef,\n"
+ "a.attisdropped,\n"
+ "a.attlen,\n"
+ "a.attalign,\n"
+ "a.attislocal,\n"
+ 
"pg_catalog.format_type(t.oid, a.atttypmod) AS atttypname,\n");
+
if (fout->remoteVersion >= 11)
-   {
-   /* atthasmissing and attmissingval are new in 11 */
-   appendPQExpBuffer(q, "SELECT a.attnum, a.attname, 
a.atttypmod, "
- "a.attstattarget, 
a.attstorage, t.typstorage, "
- "a.attnotnull, 
a.atthasdef, a.attisdropped, "
- "a.attlen, 
a.attalign, a.attislocal, "
- 
"pg_catalog.format_type(t.oid,a.atttypmod) AS atttypname, "
- 
"array_to_string(a.attoptions, ', ') AS attoptions, "
- "CASE WHEN 
a.attcollation <> t.typcollation "
- "THEN a.attcollation 
ELSE 0 END AS attcollation, "
- "a.attidentity, "
- 
"pg_catalog.array_to_string(ARRAY("
- "SELECT 
pg_catalog.quote_ident(option_name) || "
- "' ' || 
pg_catalog.quote_literal(option_value) "
- "FROM 
pg_catalog.pg_options_to_table(attfdwoptions) "
- "ORDER BY option_name"
- "), E',\n') AS 
attfdwoptions ,"
+   appendPQExpBuffer(q,
  "CASE WHEN 
a.atthasmissing AND NOT a.attisdropped "
- "THEN a.attmissingval 
ELSE null END AS attmissingval "
- "FROM 
pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_type t "
- "ON a.atttypid = 
t.oid "
- "WHERE a.attrelid = 
'%u'::pg_catalog.oid "
- "AND a.attnum > 
0::pg_catalog.int2 "
- "ORDER BY a.attnum",
- 
tbinfo->dobj.catId.oid);
-   }
-   else if (fout->remoteVersion >= 10)
-   {
-   /*
-* 

Re: Removing useless \. at the end of copy in pgbench

2018-08-29 Thread Alvaro Herrera
On 2018-Aug-29, Fabien COELHO wrote:

> The "may" is because I could *not* test: I could not run a 8.2 on my laptop,
> "initdb" fails on:
> 
>  creating template1 database in 
> <...>/src/test/regress/./tmp_check/data/base/1 ... ok
>  initializing pg_authid ... FATAL:  wrong number of index expressions
>  STATEMENT:  CREATE TRIGGER pg_sync_pg_database   AFTER INSERT OR UPDATE OR 
> DELETE ON pg_database   FOR EACH STATEMENT EXECUTE PROCEDURE 
> flatfile_update_trigger();

Yeah, that's a problem when compiling 8.2 and 8.3 with newer gcc.  If
you grab the copy from git, it has only one commit after the 8.2.23 tag
and that one fixes this problem.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: some more error location support

2018-08-29 Thread Fabien COELHO




The majority rule (34 make & 22 free) suggests that it is used more often
than not. I'd suggest sticking to that for consistency & homogeneity.


But it's consistently not used in DDL command implementations, only in
normal query parsing.


I try to avoid complicated (context-sensitive) rules when I can, especially 
as some functions may be called from both DDL and DML.


But fine with me.

--
Fabien.



Re: Removing useless \. at the end of copy in pgbench

2018-08-29 Thread Fabien COELHO



Hello Peter,


"""
Pgbench is expected to work with all PostgreSQL supported versions at
the time it is released. Some options may work only with newer servers. It
may work with older version down to 8.2.
"""


It is generally expected (nowadays) that client programs work
independent of the server version, unless restrictions are specifically
documented (e.g., pg_dump) or there is some obvious feature dependency
(e.g., unlogged tables).


Yep, that is roughly what I said in the mail.


So the above paragraph does not add any useful information.


It tells the reader that it will not work on 8.1 and below.

Also, I don't find any reason why 8.2 is the cutoff, and saying that it 
may work down to 8.2 (implying that it may not) is content-free.


The "may" is because I could *not* test: I could not run a 8.2 on my 
laptop, "initdb" fails on:


 creating template1 database in <...>/src/test/regress/./tmp_check/data/base/1 
... ok
 initializing pg_authid ... FATAL:  wrong number of index expressions
 STATEMENT:  CREATE TRIGGER pg_sync_pg_database   AFTER INSERT OR UPDATE OR 
DELETE ON pg_database   FOR EACH STATEMENT EXECUTE PROCEDURE 
flatfile_update_trigger();

So I did not feel confident in saying that it would work for 8.2. That is 
just me being precise:-)


As explained in the previous mail, pgbench "CREATE TABLE" always uses 
"FILLFACTOR" which was introduced in 8.2, so it is sure to fail before 
that.


What about:

"""
Pgbench requires a PostgreSQL version 8.2 or above server.
"""

Some information is provided...

I understood that Tom found that an explicit compatibility note would be 
welcome, so I provided one. I'm also fine with saying nothing.


--
Fabien.



Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -r option)

2018-08-29 Thread Yugo Nagata
On Wed, 29 Aug 2018 21:09:03 +0900
Yugo Nagata  wrote:

> On Wed, 29 Aug 2018 13:46:38 +0200
> Magnus Hagander  wrote:
> 
> > On Wed, Aug 29, 2018 at 1:37 PM, Michael Banck 
> > wrote:
> > 
> > > Hi,
> > >
> > > On Wed, Aug 29, 2018 at 08:33:43PM +0900, Yugo Nagata wrote:
> > > > On Wed, 29 Aug 2018 10:28:33 +0200
> > > > Daniel Gustafsson  wrote:
> > > > > > On 27 Aug 2018, at 14:05, Yugo Nagata  wrote:
> > > > > > On Mon, 27 Aug 2018 13:34:12 +0200
> > > > > > Michael Banck  wrote:
> > > > > >> On Mon, Aug 27, 2018 at 07:53:36PM +0900, Yugo Nagata wrote:
> > > > > >>> On Fri, 24 Aug 2018 18:01:09 +0200
> > > > > >>> Peter Eisentraut  wrote:
> > > > >  I'm curious about this option:
> > > > > 
> > > > >   -r RELFILENODE check only relation with specified
> > > relfilenode
> > > > > 
> > > > >  but there is no facility to specify a database.
> > > > > 
> > > > >  Also, referring to the relfilenode of a mapped relation seems a
> > > bit
> > > > >  inaccurate.
> > > > > 
> > > > >  Maybe reframing this in terms of the file name of the file you
> > > want
> > > > >  checked would be better?
> > > > > >>>
> > > > > >>> If we specified 1234 to -r option, pg_verify_shceksums checks not
> > > only 1234
> > > > > >>> but also 1234_vm, 1234_fsm, and 1234.1, 1234.2, ... and so on, so
> > > I think
> > > > > >>> it makes senses to allow to specify a relfilenode instead of a
> > > file name.
> > > > > >>>
> > > > > >>> I think it is reasonable to add a option to specify a database,
> > > although
> > > > > >>> I don't know which character is good because both -d and -D are
> > > already used
> > > > > >>
> > > > > >> Maybe the -d (debug) option should be revisited as well. Mentioning
> > > > > >> every scanned block generates a huge amount of output which might 
> > > > > >> be
> > > > > >> useful during development but does not seem very useful for a 
> > > > > >> stable
> > > > > >> release. AFAICT there is no other debug output for now.
> > > > > >>
> > > > > >> So it could be renamed to -v (verbose) and only mention each 
> > > > > >> scanned
> > > > > >> file, e.g. (errors/checksum mismatches are still reported of
> > > course).
> > >
> > > I still think this should be changed as well, i.e. -v should not report
> > > every block scanned, as that really is debug output and IMO not useful
> > > in general? AFAICT your patch does not change the output at all, just
> > > renames the option.
> > >
> > >
> > I agree with this (though it's my fault initially :P). Per-page output is
> > going to be useless in pretty much all production cases. It makes sense to
> > also change it to be per-file.
> 
> I updated the patch to output only per-file information in verbose mode.
> Does this behavior match what you expect?

I am sorry but I attached a wrong file in the previous post.
Attached is the correct version of the updated patch.

Regards,
-- 
Yugo Nagata 
commit 7c673e1d712c74c51c708ddc4a151b6fb8cc2a8f
Author: Yugo Nagata 
Date:   Wed Aug 29 19:37:13 2018 +0900

    Rename pg_verify_checksums debug option -d to -v / --verbose
    
    Also, change it to only mention each scanned file, not every block.

diff --git a/doc/src/sgml/ref/pg_verify_checksums.sgml b/doc/src/sgml/ref/pg_verify_checksums.sgml
index ecc5501eae..a1ff060c2b 100644
--- a/doc/src/sgml/ref/pg_verify_checksums.sgml
+++ b/doc/src/sgml/ref/pg_verify_checksums.sgml
@@ -64,8 +64,8 @@ PostgreSQL documentation
   -d
   

-Enable debug output. Lists all checked blocks and their checksum.
-   
+Enable debug output. Lists all checked files.
+   
   
  
 
diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 938b92282a..6be138eb75 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -31,7 +31,7 @@ static int64 badblocks = 0;
 static ControlFileData *ControlFile;
 
 static char *only_relfilenode = NULL;
-static bool debug = false;
+static bool verbose = false;
 
 static const char *progname;
 
@@ -43,7 +43,7 @@ usage()
 	printf(_("  %s [OPTION]... [DATADIR]\n"), progname);
 	printf(_("\nOptions:\n"));
 	printf(_(" [-D, --pgdata=]DATADIR  data directory\n"));
-	printf(_("  -d debug output, list all checked blocks\n"));
+	printf(_("  -v, --verbose  output verbose messages, list all checked files\n"));
 	printf(_("  -r RELFILENODE check only relation with specified relfilenode\n"));
 	printf(_("  -V, --version  output version information, then exit\n"));
 	printf(_("  -?, --help show this help, then exit\n"));
@@ -120,11 +120,11 @@ scan_file(char *fn, int segmentno)
 		progname, fn, blockno, csum, header->pd_checksum);
 			badblocks++;
 		}
-		else if (debug)
-			fprintf(stderr, _("%s: checksum verified in file \"%s\", block %d: %X\n"),
-	progname, fn, blockno, csum);
 	}
 
+	

Is child process of postmaster able to access all the databases?

2018-08-29 Thread Hubert Zhang
Hello all.

A background worker can use SPI to read a database, but it can call
BackgroundWorkerInitializeConnection(dbname) only once.

I wonder if there is a way to let a child process of the postmaster access
all the databases one by one?
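
To illustrate what I mean, a minimal sketch (hypothetical worker entry
point; the three-argument signature is the PG 11 form, older branches take
only dbname and username):

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

void
my_worker_main(Datum main_arg)		/* hypothetical entry point */
{
	BackgroundWorkerUnblockSignals();

	/*
	 * A worker binds to exactly one database, once, at startup; there is
	 * no supported call to switch to another database afterwards.
	 */
	BackgroundWorkerInitializeConnection("postgres", NULL, 0);

	/* ... SPI_connect() and queries against that single database ... */
}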


-- 
Thanks

Hubert Zhang


10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-29 Thread Justin Pryzby
I've seen this message now a handful of times recently.  It seems to happen
overnight, during a maintenance job which reindexes things, including system
catalog indices.

It's easy to reproduce the error under 10.5, but not under 10.3 nor 10.4.

while :; do for a in pg_class_oid_index pg_class_relname_nsp_index 
pg_class_tblspc_relfilenode_index; do psql ts -qc "REINDEX INDEX $a"; done; 
done&

[pryzbyj@database ~]$ while :; do psql ts -qc ''; done
psql: FATAL:  could not read block 1 in file "base/16400/313430687": read only 
0 of 8192 bytes
psql: FATAL:  could not read block 0 in file "base/16400/313430708": read only 
0 of 8192 bytes
psql: FATAL:  could not read block 0 in file "base/16400/313430711": read only 
0 of 8192 bytes
^C

postgres=# SELECT * FROM postgres_log WHERE error_severity ='FATAL';
log_time   | 2018-08-28 22:19:58.822-05
user_name  | telsasoft
database   | ts
pid| 22445
connection_from| 192.168.122.12:58318
session_id | 5b8610de.57ad
session_line   | 1
command_tag| startup
session_start_time | 2018-08-28 22:19:58-05
virtual_transaction_id | 26/280967
transaction_id | 0
error_severity | FATAL
sql_state_code | XX001
message| could not read block 0 in file "base/16400/313316300": 
read only 0 of 8192 bytes

Justin



Re: A strange GiST error message or fillfactor of GiST build

2018-08-29 Thread Andrey Borodin
Hi!

> 29 авг. 2018 г., в 5:32, Kyotaro HORIGUCHI  
> написал(а):
> 
> Hello.
> 
> In the discussion about cube's dimention limit [1], I found that
> the error messages looks strange.
> 
> https://www.postgresql.org/message-id/f0e1a404-a495-4f38-b817-06355b537...@yandex-team.ru
> 
>> postgres=# create table y as  select cube(array(SELECT random() as a FROM 
>> generate_series(1,1000))) from generate_series(1,1e3,1); 
>> SELECT 1000
>> postgres=# create index on y using gist(cube );
>> ERROR:  index row size 8016 exceeds maximum 8152 for index "y_cube_idx"
> 
> This is apparently strange. This is because the message doesn't
> count fill factor at the time. It is fixed by passing freespace
> to gistSplit() and that allows gistfitpage() to consider
> fillfactor as TODO comment within.
> 
> After the attached patch applied, the above messages becomes as
> follows. (And index can be built being a bit sparse by fill
> factor.)

We are passing freespace everywhere. Also, we pass GistInsertState, and 
GistState.
Maybe let's put GistState into GistInsertState, GistState already has free 
space, and pass just GistInsertState everywhere?

Best regards, Andrey Borodin.


Re: Continue work on changes to recovery.conf API

2018-08-29 Thread Peter Eisentraut
On 01/07/2018 13:45, Sergei Kornilov wrote:
> Commitfest 2018-09 is now open and, as planned, I created one entry: 
> https://commitfest.postgresql.org/19/1711/
> I also attach a new version due to a merge conflict with HEAD.

Could you please describe in detail what this current patch does?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Removing useless \. at the end of copy in pgbench

2018-08-29 Thread Peter Eisentraut
On 29/07/2018 01:59, Fabien COELHO wrote:
> """
> Pgbench is expected to work with all PostgreSQL supported versions at 
> the time it is released. Some options may work only with newer servers. It 
> may work with older version down to 8.2.
> """

It is generally expected (nowadays) that client programs work
independent of the server version, unless restrictions are specifically
documented (e.g., pg_dump) or there is some obvious feature dependency
(e.g., unlogged tables).  So the above paragraph does not add any useful
information.  Also, I don't find any reason why 8.2 is the cutoff, and
saying that it may work down to 8.2 (implying that it may not) is
content-free.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: some more error location support

2018-08-29 Thread Peter Eisentraut
On 27/08/2018 11:17, Fabien COELHO wrote:
> About patch 3: applies cleanly independently of the 2 others, compiles, 
> "make check" is okay.
> 
> A few comments:
> 
> There seem to be several somewhat unrelated changes: one about copy,
> one about triggers and one about constraints? The two latter changes do not 
> seem to impact the tests, though.

added more tests

> In "CreateTrigger", you moved "make_parsestate" but removed 
> "free_parsestate". I'd rather move it than remove it.

See also previous discussion, but I've moved it around for now.

> In "value.h", the added location field deserves a "/* token location, or 
> -1 if unknown */" comment like others in "parsenode.h", "plannode.h" and 
> "primnodes.h".

done

> Copying and comparing values are updaed, but value in/out functions are 
> not updated to read/write the location, although other objects have their 
> location serialized. ISTM that it should be done as well.

Hmm, maybe that's a problem, because the serialization of a Value node
is just a scalar.  It doesn't have any structure in which to put additional
fields.  Maybe we should think about not using Value as a parse
representation for column name lists.  Let me think about that.

Attached is another patch set.  I think the first two patches are OK
now, but the third one might need to be rethought.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 1816353610c6af345f72f4753cdb629c5304123d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 22 Aug 2018 08:42:49 +0200
Subject: [PATCH v2 1/3] Error position support for defaults and check
 constraints

Add support for error position reporting for the expressions contained
in defaults and check constraint definitions.  This currently works only
for CREATE TABLE, not ALTER TABLE, because the latter is not set up to
pass around the original query string.
---
 src/backend/catalog/heap.c | 4 +++-
 src/backend/commands/tablecmds.c   | 9 +
 src/include/catalog/heap.h | 3 ++-
 src/test/regress/output/constraints.source | 2 ++
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index ac5a677c5f..910d3db063 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -2460,7 +2460,8 @@ AddRelationNewConstraints(Relation rel,
  List *newConstraints,
  bool allow_merge,
  bool is_local,
- bool is_internal)
+ bool is_internal,
+ const char *queryString)
 {
List   *cookedConstraints = NIL;
TupleDesc   tupleDesc;
@@ -2489,6 +2490,7 @@ AddRelationNewConstraints(Relation rel,
 * rangetable entry.  We need a ParseState for transformExpr.
 */
pstate = make_parsestate(NULL);
+   pstate->p_sourcetext = queryString;
rte = addRangeTableEntryForRelation(pstate,

rel,

NULL,
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2a12d64aeb..552ad8c592 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -985,7 +985,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 */
if (rawDefaults || stmt->constraints)
AddRelationNewConstraints(rel, rawDefaults, stmt->constraints,
- true, true, 
false);
+ true, true, 
false, queryString);
 
ObjectAddressSet(address, RelationRelationId, relationId);
 
@@ -5587,7 +5587,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, 
Relation rel,
 * _list_ of defaults, but we just do one.
 */
AddRelationNewConstraints(rel, list_make1(rawEnt), NIL,
- false, true, 
false);
+ false, true, 
false, NULL);
 
/* Make the additional catalog changes visible */
CommandCounterIncrement();
@@ -6178,7 +6178,7 @@ ATExecColumnDefault(Relation rel, const char *colName,
 * _list_ of defaults, but we just do one.
 */
AddRelationNewConstraints(rel, list_make1(rawEnt), NIL,
- false, true, 
false);
+ false, true, 

Re: some more error location support

2018-08-29 Thread Peter Eisentraut
On 28/08/2018 08:58, Fabien COELHO wrote:
> 
>>> Even if there is some under-the-hood garbage collection, I'd suggest to
>>> add a free after the call to ComputePartitionAttrs.
>>
>> Hmm, I didn't know about free_parsestate().  It doesn't seem to be used
>> consistently.  I suppose you'll want to use it when you have a target
>> relation that will be closed by it, but otherwise, for DDL commands,
>> it's not all that useful.
> 
> Probably.
> 
> The majority rule (34 make & 22 free) suggests that it is used more often 
> than not. I'd suggest sticking to that for consistency & homogeneity.

But it's consistently not used in DDL command implementations, only in
normal query parsing.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: some pg_dump query code simplification

2018-08-29 Thread Andrew Dunstan




On 08/28/2018 06:10 PM, Stephen Frost wrote:



Andrew has a buildfarm module that does precisely that, although
I'm not sure what its test dataset is --- probably the regression
database from each branch.  I also have a habit of doing such testing
manually whenever I touch version-sensitive parts of pg_dump.

I've gotten better about doing that back-branch testing myself and
certainly prefer it, but I think we should also have buildfarm coverage.
I don't think we have the full matrix covered, or, really, anything
anywhere near it, so I'm looking for other options to at least get that
code exercised.





If by matrix you mean the version matrix, then the module in question 
covers every live branch as the source and every same or later live 
branch as the destination.


It could conceivably cover older branches as well - the branch names 
aren't actually hardcoded - I'd just have to be able to build them and 
do a standard buildfarm run once.


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: some pg_dump query code simplification

2018-08-29 Thread Andrew Dunstan




On 08/28/2018 06:05 PM, Tom Lane wrote:

Stephen Frost  writes:

I wonder- what if we had an option to pg_dump to explicitly tell it what
the server's version is and then have TAP tests to run with different
versions?

Uh ... telling it what the version is doesn't make that true, so I'd
have no confidence in a test^H^H^H^Hkluge done that way.  The way
to test is to point it at an *actual* back-branch server.

Andrew has a buildfarm module that does precisely that, although
I'm not sure what its test dataset is --- probably the regression
database from each branch.  I also have a habit of doing such testing
manually whenever I touch version-sensitive parts of pg_dump.



It's all the databases from a buildfarm run apart from the TAP tests. Since 
it uses USE_MODULE_DB=1 it covers most of the contrib modules plus the 
standard regression db, as well as the isolation test and pl_regression 
databases (which should be taught to do separate DBs for each PL if requested).



There is no reason it couldn't test more.



Dunno about the idea of running the pg_dump TAP tests against back
branches.  I find that code sufficiently unreadable that maintaining
several more copies of it doesn't sound like fun at all.





Agreed. The code could do with a lot of comments. I recently looked at 
adding something to it and decided I had more pressing things to do.


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -r option)

2018-08-29 Thread Yugo Nagata
On Wed, 29 Aug 2018 13:46:38 +0200
Magnus Hagander  wrote:

> On Wed, Aug 29, 2018 at 1:37 PM, Michael Banck 
> wrote:
> 
> > Hi,
> >
> > On Wed, Aug 29, 2018 at 08:33:43PM +0900, Yugo Nagata wrote:
> > > On Wed, 29 Aug 2018 10:28:33 +0200
> > > Daniel Gustafsson  wrote:
> > > > > On 27 Aug 2018, at 14:05, Yugo Nagata  wrote:
> > > > > On Mon, 27 Aug 2018 13:34:12 +0200
> > > > > Michael Banck  wrote:
> > > > >> On Mon, Aug 27, 2018 at 07:53:36PM +0900, Yugo Nagata wrote:
> > > > >>> On Fri, 24 Aug 2018 18:01:09 +0200
> > > > >>> Peter Eisentraut  wrote:
> > > >  I'm curious about this option:
> > > > 
> > > >   -r RELFILENODE check only relation with specified
> > relfilenode
> > > > 
> > > >  but there is no facility to specify a database.
> > > > 
> > > >  Also, referring to the relfilenode of a mapped relation seems a
> > bit
> > > >  inaccurate.
> > > > 
> > > >  Maybe reframing this in terms of the file name of the file you
> > want
> > > >  checked would be better?
> > > > >>>
> > > > >>> If we specified 1234 to -r option, pg_verify_shceksums checks not
> > only 1234
> > > > >>> but also 1234_vm, 1234_fsm, and 1234.1, 1234.2, ... and so on, so
> > I think
> > > > >>> it makes senses to allow to specify a relfilenode instead of a
> > file name.
> > > > >>>
> > > > >>> I think it is reasonable to add a option to specify a database,
> > although
> > > > >>> I don't know which character is good because both -d and -D are
> > already used
> > > > >>
> > > > >> Maybe the -d (debug) option should be revisited as well. Mentioning
> > > > >> every scanned block generates a huge amount of output which might be
> > > > >> useful during development but does not seem very useful for a stable
> > > > >> release. AFAICT there is no other debug output for now.
> > > > >>
> > > > >> So it could be renamed to -v (verbose) and only mention each scanned
> > > > >> file, e.g. (errors/checksum mismatches are still reported of
> > course).
> >
> > I still think this should be changed as well, i.e. -v should not report
> > every block scanned, as that really is debug output and IMO not useful
> > in general? AFAICT your patch does not change the output at all, just
> > renames the option.
> >
> >
> I agree with this (though it's my fault initially :P). Per-page output is
> going to be useless in pretty much all production cases. It makes sense to
> also change it to be per-file.

I updated the patch to output only per-file information in verbose mode.
Does this behavior match what you expect?

Regards,
-- 
Yugo Nagata 
commit ec21c608cd78f65747916487b56a197c685649c8
Author: Yugo Nagata 
Date:   Wed Aug 29 19:37:13 2018 +0900

    Rename pg_verify_checksums debug option -d to -v / --verbose
    
    Also, change it to only mention each scanned file, not every block.

diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 938b92282a..0f931b7579 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -31,7 +31,7 @@ static int64 badblocks = 0;
 static ControlFileData *ControlFile;
 
 static char *only_relfilenode = NULL;
-static bool debug = false;
+static bool verbose = false;
 
 static const char *progname;
 
@@ -43,7 +43,7 @@ usage()
 	printf(_("  %s [OPTION]... [DATADIR]\n"), progname);
 	printf(_("\nOptions:\n"));
 	printf(_(" [-D, --pgdata=]DATADIR  data directory\n"));
-	printf(_("  -d debug output, list all checked blocks\n"));
+	printf(_("  -v, --verbose  output verbose messages, list all checked blocks\n"));
 	printf(_("  -r RELFILENODE check only relation with specified relfilenode\n"));
 	printf(_("  -V, --version  output version information, then exit\n"));
 	printf(_("  -?, --help show this help, then exit\n"));
@@ -120,7 +120,7 @@ scan_file(char *fn, int segmentno)
 		progname, fn, blockno, csum, header->pd_checksum);
 			badblocks++;
 		}
-		else if (debug)
+		else if (verbose)
 			fprintf(stderr, _("%s: checksum verified in file \"%s\", block %d: %X\n"),
 	progname, fn, blockno, csum);
 	}
@@ -208,6 +208,7 @@ main(int argc, char *argv[])
 {
 	static struct option long_options[] = {
 		{"pgdata", required_argument, NULL, 'D'},
+		{"verbose", no_argument, NULL, 'v'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -234,12 +235,12 @@ main(int argc, char *argv[])
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "D:r:d", long_options, _index)) != -1)
+	while ((c = getopt_long(argc, argv, "D:r:v", long_options, _index)) != -1)
 	{
 		switch (c)
 		{
-			case 'd':
-debug = true;
+			case 'v':
+verbose = true;
 break;
 			case 'D':
 DataDir = optarg;


speeding up planning with partitions

2018-08-29 Thread Amit Langote
It is more or less well known that the planner doesn't perform well with
more than a few hundred partitions even when only a handful of partitions
are ultimately included in the plan.  Situation has improved a bit in PG
11 where we replaced the older method of pruning partitions one-by-one
using constraint exclusion with a much faster method that finds relevant
partitions by using partitioning metadata.  However, we could only use it
for SELECT queries, because UPDATE/DELETE are handled by a completely
different code path, whose structure doesn't allow it to call the new
pruning module's functionality.  Actually, not being able to use the new
pruning is not the only problem for UPDATE/DELETE, more on which further
below.

While situation improved with new pruning where it could, there are still
overheads in the way planner handles partitions.  As things stand today,
it will spend cycles and allocate memory for partitions even before
pruning is performed, meaning most of that effort could be for partitions
that were better left untouched.  Currently, the planner will lock and heap_open
*all* partitions, create range table entries and AppendRelInfos for them,
and finally initialize RelOptInfos for them, even touching the disk file
of each partition in the process, in an earlier phase of planning.  All of
that processing is in vain for partitions that are pruned, because they won't
be included in the final plan.  This problem grows worse as the number of
partitions grows beyond thousands, because the range table grows too big.

That could be fixed by delaying all that per-partition activity to a point
where pruning has already been performed, so that we know the partitions
to open and create planning data structures for, such as somewhere
downstream to query_planner.  But before we can do that we must do
something about the fact that UPDATE/DELETE won't be able to cope with
that because the code path that currently handles the planning of
UPDATE/DELETE on partitioned tables (inheritance_planner called from
subquery_planner) relies on AppendRelInfos for all partitions having been
initialized by an earlier planning phase.  Delaying it to query_planner
would be too late, because inheritance_planner calls query_planner for
each partition, not for the parent.  That is, if query_planner, which is
downstream of inheritance_planner, were in charge of determining which
partitions to open, the latter wouldn't know which partitions to call the
former for. :)

That would be fixed if there is no longer this ordering dependency, which
is what I propose to do with the attached patch 0001.  I've tried to
describe how the patch manages to do that in its commit message, but I'll
summarize here.  As things stand today, inheritance_planner modifies the
query for each leaf partition to make the partition act as the query's
result relation instead of the original partitioned table and calls
grouping_planner on the query.  That means anything that's joined to
partitioned table looks to instead be joined to the partition and join
paths are generated likewise.  Also, the resulting path's targetlist is
adjusted to be suitable for the result partition.  Upon studying how this
works, I concluded that the same result can be achieved if we call
grouping_planner only once and repeat the portions of query_planner's and
grouping_planner's processing that generate the join paths and appropriate
target list, respectively, for each partition.  That way, we can rely on
query_planner determining result partitions for us, which in turn relies
on the faster partprune.c based method of pruning.  That speeds things up
in two ways.  Faster pruning and we no longer repeat common processing for
each partition.


With 0001 in place, there is nothing that requires that partitions be
opened by an earlier planning phase, so, I propose patch 0002, which
refactors the opening and creation of planner data structures for
partitions such that it is now performed after pruning. However, it
doesn't do anything about the fact that partitions are all still locked in
the earlier phase.

With various overheads gone thanks to 0001 and 0002, locking of all
partitions via find_all_inheritors can be seen as the single largest
bottleneck, which 0003 tries to address.  I've kept it as a separate patch,
because I'll need to think a bit more before I can say that it's actually safe
to defer locking to late planning, due mainly to the concern about the change
in the order of locking from the current method.  I'm attaching it here,
because I also want to show the performance improvement we can expect with it.


I measured the gain in performance due to each patch on a modest virtual
machine.  Details of the measurement and results follow.

* Benchmark scripts

update.sql
update ht set a = 0 where b = 1;

select.sql
select * from ht where b = 1;

* Table:

create table ht (a int, b int) partition by hash (b)
create table ht_1 partition of ht for values with (modulus N, remainder 0)
..
create table 

Re: pg_verify_checksums vs windows

2018-08-29 Thread Magnus Hagander
On Wed, Aug 29, 2018 at 1:44 PM, Amit Kapila 
wrote:

> On Wed, Aug 29, 2018 at 5:05 PM Magnus Hagander 
> wrote:
> >
> > On Wed, Aug 29, 2018 at 1:31 PM, Amit Kapila 
> wrote:
> >>
> >> So, I think we need to open the file in binary mode as in other parts
> >> of the code.  The attached patch fixes the problem for me.
> >>
> >> Thoughts?
> >
> >
> > Yikes. Yes, I believe you are correct, and that looks like the correct
> fix.
> >
> > I wonder why this was not caught on the buildfarm. We do have regression
> tests for it, AFAIK?
> >
>
> I am not able to find regression tests for it, but maybe I am not
> seeing it properly.  By any chance, did you remove it during the revert
> of "Allow on-line enabling and disabling of data checksums"?
>
>
Oh meh. You are right, it's in the reverted patch, I was looking in the
wrong branch :/ Sorry about that. And that certainly explains why we don't
have it.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -r option)

2018-08-29 Thread Magnus Hagander
On Wed, Aug 29, 2018 at 1:37 PM, Michael Banck 
wrote:

> Hi,
>
> On Wed, Aug 29, 2018 at 08:33:43PM +0900, Yugo Nagata wrote:
> > On Wed, 29 Aug 2018 10:28:33 +0200
> > Daniel Gustafsson  wrote:
> > > > On 27 Aug 2018, at 14:05, Yugo Nagata  wrote:
> > > > On Mon, 27 Aug 2018 13:34:12 +0200
> > > > Michael Banck  wrote:
> > > >> On Mon, Aug 27, 2018 at 07:53:36PM +0900, Yugo Nagata wrote:
> > > >>> On Fri, 24 Aug 2018 18:01:09 +0200
> > > >>> Peter Eisentraut  wrote:
> > >  I'm curious about this option:
> > > 
> > >   -r RELFILENODE check only relation with specified
> relfilenode
> > > 
> > >  but there is no facility to specify a database.
> > > 
> > >  Also, referring to the relfilenode of a mapped relation seems a
> bit
> > >  inaccurate.
> > > 
> > >  Maybe reframing this in terms of the file name of the file you
> want
> > >  checked would be better?
> > > >>>
> > > >>> If we specified 1234 to -r option, pg_verify_shceksums checks not
> only 1234
> > > >>> but also 1234_vm, 1234_fsm, and 1234.1, 1234.2, ... and so on, so
> I think
> > > >>> it makes senses to allow to specify a relfilenode instead of a
> file name.
> > > >>>
> > > >>> I think it is reasonable to add a option to specify a database,
> although
> > > >>> I don't know which character is good because both -d and -D are
> already used
> > > >>
> > > >> Maybe the -d (debug) option should be revisited as well. Mentioning
> > > >> every scanned block generates a huge amount of output which might be
> > > >> useful during development but does not seem very useful for a stable
> > > >> release. AFAICT there is no other debug output for now.
> > > >>
> > > >> So it could be renamed to -v (verbose) and only mention each scanned
> > > >> file, e.g. (errors/checksum mismatches are still reported of
> course).
>
> I still think this should be changed as well, i.e. -v should not report
> every block scanned, as that really is debug output and IMO not useful
> in general? AFAICT your page does not change the output at all, just
> renames the option.
>
>
I agree with this (though it's my fault initially :P). Per-page output is
going to be useless in pretty much all production cases. It makes sense to
also change it to be per-file.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_verify_checksums vs windows

2018-08-29 Thread Amit Kapila
On Wed, Aug 29, 2018 at 5:05 PM Magnus Hagander  wrote:
>
> On Wed, Aug 29, 2018 at 1:31 PM, Amit Kapila  wrote:
>>
>> So, I think we need to open the file in binary mode as in other parts
>> of the code.  The attached patch fixes the problem for me.
>>
>> Thoughts?
>
>
> Yikes. Yes, I believe you are correct, and that looks like the correct fix.
>
> I wonder why this was not caught on the buildfarm. We do have regression 
> tests for it, AFAIK?
>

I am not able to find regression tests for it, but maybe I am not
seeing it properly.  By any chance, did you remove it during the revert
of "Allow on-line enabling and disabling of data checksums"?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -r option)

2018-08-29 Thread Michael Banck
Hi,

On Wed, Aug 29, 2018 at 08:33:43PM +0900, Yugo Nagata wrote:
> On Wed, 29 Aug 2018 10:28:33 +0200
> Daniel Gustafsson  wrote:
> > > On 27 Aug 2018, at 14:05, Yugo Nagata  wrote:
> > > On Mon, 27 Aug 2018 13:34:12 +0200
> > > Michael Banck  wrote:
> > >> On Mon, Aug 27, 2018 at 07:53:36PM +0900, Yugo Nagata wrote:
> > >>> On Fri, 24 Aug 2018 18:01:09 +0200
> > >>> Peter Eisentraut  wrote:
> >  I'm curious about this option:
> >  
> >   -r RELFILENODE check only relation with specified relfilenode
> >  
> >  but there is no facility to specify a database.
> >  
> >  Also, referring to the relfilenode of a mapped relation seems a bit
> >  inaccurate.
> >  
> >  Maybe reframing this in terms of the file name of the file you want
> >  checked would be better?
> > >>> 
> > >>> If we specified 1234 to -r option, pg_verify_shceksums checks not only 
> > >>> 1234
> > >>> but also 1234_vm, 1234_fsm, and 1234.1, 1234.2, ... and so on, so I 
> > >>> think
> > >>> it makes senses to allow to specify a relfilenode instead of a file 
> > >>> name.
> > >>> 
> > >>> I think it is reasonable to add a option to specify a database, although
> > >>> I don't know which character is good because both -d and -D are already 
> > >>> used
> > >> 
> > >> Maybe the -d (debug) option should be revisited as well. Mentioning
> > >> every scanned block generates a huge amount of output which might be
> > >> useful during development but does not seem very useful for a stable
> > >> release. AFAICT there is no other debug output for now.
> > >> 
> > >> So it could be renamed to -v (verbose) and only mention each scanned
> > >> file, e.g. (errors/checksum mismatches are still reported of course).

I still think this should be changed as well, i.e. -v should not report
every block scanned, as that really is debug output and IMO not useful
in general? AFAICT your patch does not change the output at all, just
renames the option.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Two proposed modifications to the PostgreSQL FDW

2018-08-29 Thread Etsuro Fujita

(2018/08/21 16:03), Amit Langote wrote:

On 2018/08/20 23:43, Tom Lane wrote:

Chris Travers  writes:

I am looking at trying to make two modifications to the PostgreSQL FDW and
would like feedback on this before I do.



1.  INSERTMETHOD=[insert|copy] option on foreign table.



One significant limitation of the PostgreSQL FDW is that it does a prepared
statement insert on each row written which imposes a per-row latency.  This
hits environments where there is significant latency or few latency
guarantees particularly hard, for example, writing to a foreign table that
might be physically located on another continent.  The idea is that
INSERTMETHOD would default to insert and therefore have no changes but
where needed people could specify COPY which would stream the data out.
Updates would still be unaffected.



A different thing we could think about is enabling COPY TO/FROM a
foreign table.


Fwiw, the following commit did introduce COPY FROM support for foreign
tables, although using a FDW INSERT interface, so not exactly optimized
for bulk-loading yet.

commit 3d956d9562aa4811b5eaaaf5314d361c61be2ae0
Author: Robert Haas
Date:   Fri Apr 6 19:16:11 2018 -0400

 Allow insert and update tuple routing and COPY for foreign tables.


That's right.  To improve efficiency, I plan to work on COPY FROM/TO 
a foreign table for PG12 (in [1], I proposed new FDW APIs for COPY FROM).


Sorry, I'm late to the party.

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/23990375-45a6-5823-b0aa-a6a7a6a957f0%40lab.ntt.co.jp




Re: pg_verify_checksums vs windows

2018-08-29 Thread Magnus Hagander
On Wed, Aug 29, 2018 at 1:31 PM, Amit Kapila 
wrote:

> While trying to debug a recent bug report on hash indexes [1], I
> noticed that pg_verify_checksums don't work on Windows (or at least in
> my environment).
>
> initdb -k ..\..\data
> pg_verify_checksums.exe  ..\..\Data
> pg_verify_checksums: short read of block 0 in file
> "..\..\Data/global/1136", got only 15 bytes
>
> I have debugged and found that below code is the culprit.
>
> scan_file(char *fn, int segmentno)
> {
> ..
> f = open(fn, 0);
> ..
> int r = read(f, buf, BLCKSZ);
>
> if (r == 0)
> break;
>
> if (r != BLCKSZ)
> {
> fprintf(stderr, _("%s: short read of block %d in file \"%s\", got only
> %d bytes\n"),
> progname, blockno, fn, r);
> exit(1);
> }
> ..
> }
>
> We are opening the file in text mode and trying to read BLCKSZ
> bytes; however, if there is any Control-Z character, it is treated as EOF.
> This problem has been mentioned in the comments in c.h as follows:
> /*
>  * NOTE:  this is also used for opening text files.
>  * WIN32 treats Control-Z as EOF in files opened in text mode.
>  * Therefore, we open files in binary mode on Win32 so we can read
>  * literal control-Z.  The other affect is that we see CRLF, but
>  * that is OK because we can already handle those cleanly.
>  */
>
> So, I think we need to open the file in binary mode as in other parts
> of the code.  The attached patch fixes the problem for me.
>
> Thoughts?
>

Yikes. Yes, I believe you are correct, and that looks like the correct fix.

I wonder why this was not caught on the buildfarm. We do have regression
tests for it, AFAIK? Or maybe we just lucked out there because there was no
^Z char in the files there?
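
For readers skimming the thread, a minimal sketch of the kind of fix being
discussed (the attached patch is not shown here, so the exact change is an
assumption; PG_BINARY is the portability macro from c.h, O_BINARY on Win32
and 0 elsewhere):

	/* before: text-mode open, so Windows stops reading at a literal Ctrl-Z */
	f = open(fn, 0);

	/* after (assumed shape of the fix): read the segment in binary mode */
	f = open(fn, O_RDONLY | PG_BINARY);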


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -r option)

2018-08-29 Thread Yugo Nagata
On Wed, 29 Aug 2018 10:28:33 +0200
Daniel Gustafsson  wrote:

> > On 27 Aug 2018, at 14:05, Yugo Nagata  wrote:
> > 
> > On Mon, 27 Aug 2018 13:34:12 +0200
> > Michael Banck  wrote:
> > 
> >> Hi,
> >> 
> >> On Mon, Aug 27, 2018 at 07:53:36PM +0900, Yugo Nagata wrote:
> >>> On Fri, 24 Aug 2018 18:01:09 +0200
> >>> Peter Eisentraut  wrote:
>  I'm curious about this option:
>  
>   -r RELFILENODE check only relation with specified relfilenode
>  
>  but there is no facility to specify a database.
>  
>  Also, referring to the relfilenode of a mapped relation seems a bit
>  inaccurate.
>  
>  Maybe reframing this in terms of the file name of the file you want
>  checked would be better?
> >>> 
> >>> If we specified 1234 to -r option, pg_verify_shceksums checks not only 
> >>> 1234
> >>> but also 1234_vm, 1234_fsm, and 1234.1, 1234.2, ... and so on, so I think
> >>> it makes senses to allow to specify a relfilenode instead of a file name.
> >>> 
> >>> I think it is reasonable to add a option to specify a database, although
> >>> I don't know which character is good because both -d and -D are already 
> >>> used
> >> 
> >> Maybe the -d (debug) option should be revisited as well. Mentioning
> >> every scanned block generates a huge amount of output which might be
> >> useful during development but does not seem very useful for a stable
> >> release. AFAICT there is no other debug output for now.
> >> 
> >> So it could be renamed to -v (verbose) and only mention each scanned
> >> file, e.g. (errors/checksum mismatches are still reported of course).
> >> 
> >> Then -d could (in the future, I guess that is too late for v11) be used
> >> for -d/--dbname (or make that only a long option, if the above does not
> >> work).
> > 
> > I realized after sending the previous post that we cannot specify a database
> > by name, because pg_verify_checksums is run offline and cannot get the
> > OID from the database name.  Also, there are the global and pg_tblspc
> > directories, not only base/.  So it seems better to specify a directory
> > to scan under PGDATA.  We could use -d (or --directory?)
> > for this purpose.
> 
> Changing the functionality to what is discussed above is obviously 12 material,
> but since we are discussing changing the command-line API of the tool by
> repurposing -d: do we want to rename the current use of -d to -v (with the
> accompanying --verbose) before 11 ships?  It’s clearly way too late in the
> cycle, but it seems worth at least bringing up, since 11 will be the first
> version pg_verify_checksums ships in.  I’m happy to do the work asap if so.

I agree with this.  Almost no other command uses -d for a debug mode
(although there are a few exceptions); -v is used for verbose mode instead.
If we are going to change the command line of pg_verify_checksums, doing it
before the PG 11 release is best.  Attached is a patch to do this.

Regards,
-- 
Yugo Nagata 
commit 452f981c30c9d9c7263e6006dc4cda51278ff376
Author: Yugo Nagata 
Date:   Wed Aug 29 19:37:13 2018 +0900

Rename pg_verify_checksums debug option -d to -v / --verbose

diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 938b92282a..0f931b7579 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -31,7 +31,7 @@ static int64 badblocks = 0;
 static ControlFileData *ControlFile;
 
 static char *only_relfilenode = NULL;
-static bool debug = false;
+static bool verbose = false;
 
 static const char *progname;
 
@@ -43,7 +43,7 @@ usage()
 	printf(_("  %s [OPTION]... [DATADIR]\n"), progname);
 	printf(_("\nOptions:\n"));
 	printf(_(" [-D, --pgdata=]DATADIR  data directory\n"));
-	printf(_("  -d debug output, list all checked blocks\n"));
+	printf(_("  -v, --verbose  output verbose messages, list all checked blocks\n"));
 	printf(_("  -r RELFILENODE check only relation with specified relfilenode\n"));
 	printf(_("  -V, --version  output version information, then exit\n"));
 	printf(_("  -?, --help show this help, then exit\n"));
@@ -120,7 +120,7 @@ scan_file(char *fn, int segmentno)
 		progname, fn, blockno, csum, header->pd_checksum);
 			badblocks++;
 		}
-		else if (debug)
+		else if (verbose)
 			fprintf(stderr, _("%s: checksum verified in file \"%s\", block %d: %X\n"),
 	progname, fn, blockno, csum);
 	}
@@ -208,6 +208,7 @@ main(int argc, char *argv[])
 {
 	static struct option long_options[] = {
 		{"pgdata", required_argument, NULL, 'D'},
+		{"verbose", no_argument, NULL, 'v'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -234,12 +235,12 @@ main(int argc, char *argv[])
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "D:r:d", long_options, &option_index)) != -1)
+	while ((c = getopt_long(argc, argv, "D:r:v", long_options, &option_index)) != -1)
 	{
 		switch (c)
 		{
-			case 'd':
-

Re: Catalog corruption

2018-08-29 Thread Andrew Gierth
> "Mariel" == Mariel Cherkassky  writes:

 Mariel> Hi,

 Mariel> I sent already an email about this topic to pgsql-admins but I
 Mariel> think that it might be more relevant to this mailing list.

The -hackers mailing list is about the development of postgresql.

 Mariel> I'm trying to investigate a corruption that happened on a
 Mariel> machine of one of our clients.

A good place to get help with that is actually the IRC channel
(#postgresql on chat.freenode.net, or http://webchat.freenode.net for
the web interface - note that currently you need to register a nickname,
see http://freenode.net/kb/answer/registration because of some annoying
spammers, but we can help you with that if you just go ahead and try and
join the channel anyway).

-- 
Andrew (irc:RhodiumToad)



Re: pg_verify_checksums failure with hash indexes

2018-08-29 Thread Yugo Nagata
On Wed, 29 Aug 2018 16:01:53 +0530
Amit Kapila  wrote:

> > By the way, I think we could also fix this by clearing the header information
> > of the last page instead of setting a checksum on the unused page, although I
> > am not sure which way is better.
> >
> 
> I think that can complicate the WAL logging of this operation, which we
> can handle easily with log_newpage, and it sounds quite hacky.
> The fix I have posted seems better, but I am open to suggestions.

Thank you for your explanation.  I understand that this way could make the
code complicated, so I think the way you posted is better.


Regards,
-- 
Yugo Nagata 



Re: pg_verify_checksums failure with hash indexes

2018-08-29 Thread Dilip Kumar
On Wed, Aug 29, 2018 at 3:39 PM, Dilip Kumar  wrote:
> On Tue, Aug 28, 2018 at 8:33 PM, Bernd Helmle  wrote:
>> Am Dienstag, den 28.08.2018, 11:21 +0200 schrieb Peter Eisentraut:
>>> This is reproducible with PG11 and PG12:
>>>
>>> initdb -k data
>>> postgres -D data
>>>
>>> make installcheck
>>> # shut down postgres with Ctrl-C
>>>
>>
>> I tried to reproduce this, and by accident I had a blocksize=4 in my
>> configure script, and I immediately got failed installcheck results.
>> They seem hash-index-related and can easily be reproduced:
>>
>> SHOW block_size ;
>>  block_size
>> 
>>  4096
>>
>> CREATE TABLE foo(val text);
>> INSERT INTO foo VALUES('bernd');
>>
>> CREATE INDEX ON foo USING hash(val);
>> ERROR:  index "foo_val_idx" contains corrupted page at block 0
>> HINT:  Please REINDEX it.
>>
>> I have no idea whether this could be related, but I thought it wouldn't
>> harm to share it here.
>>
>
> This issue seems different from the one that got fixed in this thread.  The
> reason for this issue is that the size of hashm_mapp in HashMetaPageData is
> 4096 bytes, irrespective of the block size.  So when the block size is big
> enough (e.g. 8192) there is no problem, but when you set it to 4096, the
> hashm_mapp of the meta page overwrites the special space of the meta page.
> That's the reason it shows a corrupted page while checking the hash page.

Just to verify this, I hacked it like below and it worked.  I think we
need a more thoughtful value for HASH_MAX_BITMAPS.

diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 543d802..9909f69 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -232,7 +232,7 @@ typedef HashScanOpaqueData *HashScanOpaque;
  * needing to fit into the metapage.  (With 8K block size, 1024 bitmaps
  * limit us to 256 GB of overflow space...)
  */
-#define HASH_MAX_BITMAPS   1024
+#define HASH_MAX_BITMAPS   Min(BLCKSZ / 8, 1024)
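
To make the arithmetic behind this concrete, here is a small standalone sketch
(plain C outside the server; the real definitions live in
src/include/access/hash.h, where hashm_mapp is an array of uint32 bitmap page
numbers).  With the current fixed define, hashm_mapp alone needs 1024 * 4 =
4096 bytes, an entire 4K block, which is how it ends up tromping on the
metapage's special space:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const int	blcksz[] = {4096, 8192};

	for (int i = 0; i < 2; i++)
	{
		int			fixed_max = 1024;	/* current HASH_MAX_BITMAPS */
		int			scaled_max = (blcksz[i] / 8 < 1024) ? blcksz[i] / 8 : 1024;

		printf("BLCKSZ=%d: hashm_mapp needs %zu bytes today, %zu bytes with Min(BLCKSZ / 8, 1024)\n",
			   blcksz[i],
			   fixed_max * sizeof(uint32_t),
			   scaled_max * sizeof(uint32_t));
	}
	return 0;
}

With a 4K block the proposed Min() drops the bitmap array from 4096 to 2048
bytes, leaving room for the rest of HashMetaPageData and the special space;
with the default 8K block nothing changes.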

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: pg_verify_checksums failure with hash indexes

2018-08-29 Thread Amit Kapila
On Wed, Aug 29, 2018 at 3:30 PM Yugo Nagata  wrote:
>
> On Wed, 29 Aug 2018 14:39:10 +0530
> Amit Kapila  wrote:
>
> > On Tue, Aug 28, 2018 at 2:51 PM Peter Eisentraut
> >  wrote:
> > >
> > > This is reproducible with PG11 and PG12:
> > >
> > > initdb -k data
> > > postgres -D data
> > >
> > > make installcheck
> > > # shut down postgres with Ctrl-C
> > >
> > ..
> > >
> > > The files in question correspond to
> > >
> > > hash_i4_index
> > > hash_name_index
> > > hash_txt_index
> > >
> >
> > I have looked into this problem and found the cause of it.  This
> > problem is happening for the empty page in the hash index.  On a
> > split, we allocate a new splitpoint's worth of bucket pages, wherein we
> > initialize the last page with zeros; this is all fine, but we forgot
> > to set the checksum for that last page.  The attached patch fixes the
> > problem for me.
> >
> > Can someone try and share their findings?
>
> I confirmed that this patch fixes the problem by setting a checksum in the
> last page of the hash indexes, and pg_verify_checksums now completes
> successfully.
>

Thanks.

> regression=# select * from page_header(get_raw_page('hash_i4_index',65));
>     lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----------+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/41CACF0 |    18720 |     0 |    24 |  8176 |    8176 |     8192 |       4 |         0
> (1 row)
>
> By the way, I think we could also fix this by clearing the header information
> of the last page instead of setting a checksum on the unused page, although I
> am not sure which way is better.
>

I think that can complicate the WAL logging of this operation, which we
can handle easily with log_newpage, and it sounds quite hacky.
The fix I have posted seems better, but I am open to suggestions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Stored procedures and out parameters

2018-08-29 Thread Vladimir Sitnikov
David>JDBC driver or similar drivers to use the CALL command always from
PG11 on, then the meaning of {call f1(a, b)} will have changed and a

Note: technically speaking, JDBC has two flavours of syntax (though the
standard does not clarify the distinction):
S1) {? := call my_proc(?,?) }
S2) { call my_proc(?, ?) }

Theoretically speaking, we could presume that S1 is for calling functions
while S2 is for calling procedures.
However, that does not look like a way out, since:
1) It does not provide a way to call void-returning functions. Users might
have lots of void-returning functions since procedures were missing.
2) Other languages might have only a single syntax. For instance,
CommandType in .NET seems to be one of (StoredProcedure, TableDirect,
Text). There's no room for procedure vs. function.

Peter>JDBC driver or similar drivers to use the CALL command always from
Peter>PG11 on, then the meaning of {call f1(a, b)} will have changed and a
Peter>lot of things will break in dangerous ways.

PG10 did not have procedures, so only "functions" could theoretically break.
If CALL can be used to call functions, and the drivers can arrange for the
result set to look the way it did before, then no breakage happens.

Note: the driver does not blindly rewrite {call f1(a,b)} to CALL f1(a,b). It
also sends prepare/describe/etc. messages. At the end of the day, the
purpose of having a DB driver is to provide a consistent API that works across
DB versions and DB products.

David>Not implementing this optimization in pg11 but supporting functions
via call is something I could live with.

+1

Vladimir


Re: Use C99 designated initializers for some structs

2018-08-29 Thread Peter Eisentraut
On 29/08/2018 12:13, Peter Eisentraut wrote:
> Here is a patch to change some struct initializations to use C99-style
> designated initializers.  These are just a few particularly egregious
> cases that were hard to read and write, and error prone because of many
> similar adjacent types.
> 
> (The PL/Python changes currently don't compile with Python 3 because of
> the situation described in the parallel thread "PL/Python: Remove use of
> simple slicing API".)
> 
> Thoughts?

and with the actual patch

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 8af20d62778b44fc5bfe91f2f8fe7991ffc09bb9 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 29 Aug 2018 08:36:30 +0200
Subject: [PATCH 1/2] Use C99 designated initializers for some structs

---
 src/backend/access/transam/xact.c|  23 +-
 src/backend/catalog/heap.c   |  98 --
 src/bin/psql/tab-complete.c  | 434 ---
 src/pl/plpython/plpy_cursorobject.c  |  39 +--
 src/pl/plpython/plpy_planobject.c|  37 +--
 src/pl/plpython/plpy_plpymodule.c|  22 +-
 src/pl/plpython/plpy_resultobject.c  |  57 +---
 src/pl/plpython/plpy_subxactobject.c |  37 +--
 8 files changed, 242 insertions(+), 505 deletions(-)

diff --git a/src/backend/access/transam/xact.c 
b/src/backend/access/transam/xact.c
index cd8270d5fb..875be180fe 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -199,27 +199,8 @@ typedef TransactionStateData *TransactionState;
  * transaction at all, or when in a top-level transaction.
  */
 static TransactionStateData TopTransactionStateData = {
-	0,							/* transaction id */
-	0,							/* subtransaction id */
-	NULL,						/* savepoint name */
-	0,							/* savepoint level */
-	TRANS_DEFAULT,				/* transaction state */
-	TBLOCK_DEFAULT,				/* transaction block state from the client
-								 * perspective */
-	0,							/* transaction nesting depth */
-	0,							/* GUC context nesting depth */
-	NULL,						/* cur transaction context */
-	NULL,						/* cur transaction resource owner */
-	NULL,						/* subcommitted child Xids */
-	0,							/* # of subcommitted child Xids */
-	0,							/* allocated size of childXids[] */
-	InvalidOid,					/* previous CurrentUserId setting */
-	0,							/* previous SecurityRestrictionContext */
-	false,						/* entry-time xact r/o state */
-	false,						/* startedInRecovery */
-	false,						/* didLogXid */
-	0,							/* parallelModeLevel */
-	NULL						/* link to parent state block */
+   .state = TRANS_DEFAULT,
+   .blockState = TBLOCK_DEFAULT,
 };
 
 /*
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index ac5a677c5f..ecbe88569f 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -145,39 +145,87 @@ static List *insert_ordered_unique_oid(List *list, Oid 
datum);
  */
 
 static FormData_pg_attribute a1 = {
-   0, {"ctid"}, TIDOID, 0, sizeof(ItemPointerData),
-   SelfItemPointerAttributeNumber, 0, -1, -1,
-   false, 'p', 's', true, false, false, '\0', false, true, 0
+   .attname = {"ctid"},
+   .atttypid = TIDOID,
+   .attlen = sizeof(ItemPointerData),
+   .attnum = SelfItemPointerAttributeNumber,
+   .attcacheoff = -1,
+   .atttypmod = -1,
+   .attbyval = false,
+   .attstorage = 'p',
+   .attalign = 's',
+   .attnotnull = true,
+   .attislocal = true,
 };
 
 static FormData_pg_attribute a2 = {
-   0, {"oid"}, OIDOID, 0, sizeof(Oid),
-   ObjectIdAttributeNumber, 0, -1, -1,
-   true, 'p', 'i', true, false, false, '\0', false, true, 0
+   .attname = {"oid"},
+   .atttypid = OIDOID,
+   .attlen = sizeof(Oid),
+   .attnum = ObjectIdAttributeNumber,
+   .attcacheoff = -1,
+   .atttypmod = -1,
+   .attbyval = true,
+   .attstorage = 'p',
+   .attalign = 'i',
+   .attnotnull = true,
+   .attislocal = true,
 };
 
 static 

Use C99 designated initializers for some structs

2018-08-29 Thread Peter Eisentraut
Here is a patch to change some struct initializations to use C99-style
designated initializers.  These are just a few particularly egregious
cases that were hard to read and write, and error prone because of many
similar adjacent types.

(The PL/Python changes currently don't compile with Python 3 because of
the situation described in the parallel thread "PL/Python: Remove use of
simple slicing API".)

Thoughts?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_verify_checksums failure with hash indexes

2018-08-29 Thread Dilip Kumar
On Tue, Aug 28, 2018 at 8:33 PM, Bernd Helmle  wrote:
> Am Dienstag, den 28.08.2018, 11:21 +0200 schrieb Peter Eisentraut:
>> This is reproducible with PG11 and PG12:
>>
>> initdb -k data
>> postgres -D data
>>
>> make installcheck
>> # shut down postgres with Ctrl-C
>>
>
> I tried to reproduce this, and by accident I had a blocksize=4 in my
> configure script, and I immediately got failed installcheck results.
> They seem hash-index-related and can easily be reproduced:
>
> SHOW block_size ;
>  block_size
> 
>  4096
>
> CREATE TABLE foo(val text);
> INSERT INTO foo VALUES('bernd');
>
> CREATE INDEX ON foo USING hash(val);
> ERROR:  index "foo_val_idx" contains corrupted page at block 0
> HINT:  Please REINDEX it.
>
> I have no idea whether this could be related, but I thought it wouldn't
> harm to share it here.
>

This issue seems different from the one that got fixed in this thread.  The
reason for this issue is that the size of hashm_mapp in HashMetaPageData is
4096 bytes, irrespective of the block size.  So when the block size is big
enough (e.g. 8192) there is no problem, but when you set it to 4096, the
hashm_mapp of the meta page overwrites the special space of the meta page.
That's the reason it shows a corrupted page while checking the hash page.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: pg_verify_checksums failure with hash indexes

2018-08-29 Thread Yugo Nagata
On Wed, 29 Aug 2018 14:39:10 +0530
Amit Kapila  wrote:

> On Tue, Aug 28, 2018 at 2:51 PM Peter Eisentraut
>  wrote:
> >
> > This is reproducible with PG11 and PG12:
> >
> > initdb -k data
> > postgres -D data
> >
> > make installcheck
> > # shut down postgres with Ctrl-C
> >
> ..
> >
> > The files in question correspond to
> >
> > hash_i4_index
> > hash_name_index
> > hash_txt_index
> >
> 
> I have looked into this problem and found the cause of it.  This
> problem is happening for the empty page in the hash index.  On a
> split, we allocate a new splitpoint's worth of bucket pages wherein we
> initialize the last page with zero's, this is all fine, but we forgot
> to set the checksum for that last page.  Attached patch fixes the
> problem for me.
> 
> Can someone try and share their findings?

I confirmed that this patch fixes the problem by setting a checksum in the last
page of the hash indexes, and pg_verify_checksums now completes successfully.

regression=# select * from page_header(get_raw_page('hash_i4_index',65));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/41CACF0 |    18720 |     0 |    24 |  8176 |    8176 |     8192 |       4 |         0
(1 row)

By the way, I think we could also fix this by clearing the header information
of the last page instead of setting a checksum on the unused page, although I
am not sure which way is better.

Regards,

-- 
Yugo Nagata 



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-08-29 Thread Etsuro Fujita

(2018/08/29 0:21), Jonathan S. Katz wrote:

On Aug 24, 2018, at 8:38 AM, Etsuro Fujita  wrote:
(2018/08/24 11:47), Michael Paquier wrote:

On Thu, Aug 23, 2018 at 10:00:49PM +0900, Etsuro Fujita wrote:

I tried this today, but doing git behind the corporate firewall doesn't
work.  I don't know the clear cause of that, so I'll investigate that
tomorrow.


You may be able to tweak that by using https as origin point or proper
git proxy settings?


Yeah, my proxy settings were not correct.  With the help of my colleagues 
Horiguchi-san and Yamada-san, I corrected them but still can't clone the master 
repository.  Running git with GIT_CURL_VERBOSE shows that there is another 
issue in my terminal environment, so I'm trying to resolve that.


Are there any updates on getting this patch committed?


That investigation has shown that the cause is my company firewall, not 
my terminal environment; that firewall has to be configured to allow me 
to access to that repository.  So, I'm asking my company about that.


Best regards,
Etsuro Fujita



PL/Python: Remove use of simple slicing API

2018-08-29 Thread Peter Eisentraut
I have found some dying code in PL/Python.

The simple slicing API (sq_slice, sq_ass_slice) has been deprecated
since Python 2.0 and has been removed altogether in Python 3, so we can
remove those functions from the PLyResult class.  Instead, the non-slice
mapping functions mp_subscript and mp_ass_subscript can take slice
objects as an index.  Since we just pass the index through to the
underlying list object, we already support that.  Test coverage was
already in place.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 2676469f45fbc2a442a251d2b0ef8184b5cd5e2b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 29 Aug 2018 11:10:17 +0200
Subject: [PATCH] PL/Python: Remove use of simple slicing API

The simple slicing API (sq_slice, sq_ass_slice) has been deprecated
since Python 2.0 and has been removed altogether in Python 3, so remove
those functions from the PLyResult class.  Instead, the non-slice
mapping functions mp_subscript and mp_ass_subscript can take slice
objects as an index.  Since we just pass the index through to the
underlying list object, we already support that.  Test coverage was
already in place.
---
 src/pl/plpython/plpy_resultobject.c | 24 ++--
 1 file changed, 2 insertions(+), 22 deletions(-)

diff --git a/src/pl/plpython/plpy_resultobject.c 
b/src/pl/plpython/plpy_resultobject.c
index ca70e25689..4064f6a7a2 100644
--- a/src/pl/plpython/plpy_resultobject.c
+++ b/src/pl/plpython/plpy_resultobject.c
@@ -20,8 +20,6 @@ static PyObject *PLy_result_nrows(PyObject *self, PyObject 
*args);
 static PyObject *PLy_result_status(PyObject *self, PyObject *args);
 static Py_ssize_t PLy_result_length(PyObject *arg);
 static PyObject *PLy_result_item(PyObject *arg, Py_ssize_t idx);
-static PyObject *PLy_result_slice(PyObject *arg, Py_ssize_t lidx, Py_ssize_t 
hidx);
-static int PLy_result_ass_slice(PyObject *arg, Py_ssize_t lidx, Py_ssize_t 
hidx, PyObject *slice);
 static PyObject *PLy_result_str(PyObject *arg);
 static PyObject *PLy_result_subscript(PyObject *arg, PyObject *item);
 static int PLy_result_ass_subscript(PyObject *self, PyObject *item, 
PyObject *value);
@@ -35,9 +33,9 @@ static PySequenceMethods PLy_result_as_sequence = {
NULL,   /* sq_concat */
NULL,   /* sq_repeat */
PLy_result_item,/* sq_item */
-   PLy_result_slice,   /* sq_slice */
+   NULL,   /* sq_slice */
NULL,   /* sq_ass_item */
-   PLy_result_ass_slice,   /* sq_ass_slice */
+   NULL,   /* sq_ass_slice */
 };
 
 static PyMappingMethods PLy_result_as_mapping = {
@@ -254,24 +252,6 @@ PLy_result_item(PyObject *arg, Py_ssize_t idx)
return rv;
 }
 
-static PyObject *
-PLy_result_slice(PyObject *arg, Py_ssize_t lidx, Py_ssize_t hidx)
-{
-   PLyResultObject *ob = (PLyResultObject *) arg;
-
-   return PyList_GetSlice(ob->rows, lidx, hidx);
-}
-
-static int
-PLy_result_ass_slice(PyObject *arg, Py_ssize_t lidx, Py_ssize_t hidx, PyObject 
*slice)
-{
-   int rv;
-   PLyResultObject *ob = (PLyResultObject *) arg;
-
-   rv = PyList_SetSlice(ob->rows, lidx, hidx, slice);
-   return rv;
-}
-
 static PyObject *
 PLy_result_str(PyObject *arg)
 {
-- 
2.18.0



Re: pg_verify_checksums failure with hash indexes

2018-08-29 Thread Yugo Nagata
On Tue, 28 Aug 2018 15:02:56 +0200
Michael Banck  wrote:

> Hi,
> 
> On Tue, Aug 28, 2018 at 11:21:34AM +0200, Peter Eisentraut wrote:
> > This is reproducible with PG11 and PG12:
> > 
> > initdb -k data
> > postgres -D data
> > 
> > make installcheck
> > # shut down postgres with Ctrl-C
> > 
> > pg_verify_checksums data
> > 
> > pg_verify_checksums: checksum verification failed in file
> > "data/base/16384/28647", block 65: calculated checksum DC70 but expected 0
> > pg_verify_checksums: checksum verification failed in file
> > "data/base/16384/28649", block 65: calculated checksum 89D8 but expected 0
> > pg_verify_checksums: checksum verification failed in file
> > "data/base/16384/28648", block 65: calculated checksum 9636 but expected 0
> > Checksum scan completed
> > Data checksum version: 1
> > Files scanned:  2493
> > Blocks scanned: 13172
> > Bad checksums:  3
> > 
> > The files in question correspond to
> > 
> > hash_i4_index
> > hash_name_index
> > hash_txt_index
> > 
> > Discuss. ;-)
> 
> I took a look at hash_name_index, assuming the others are similar.
> 
> Page 65 is the last page, pageinspect barfs on it as well:
> 
> regression=# SELECT get_raw_page('hash_name_index', 'main', 65);
> WARNING:  page verification failed, calculated checksum 18066 but expected 0
> ERROR:  invalid page in block 65 of relation base/16384/28638
> 
> The pages before that one from page 35 on are empty:
> 
> regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 1));
>     lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----------+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/422D890 |     8807 |     0 |   664 |  5616 |    8176 |     8192 |       4 |         0
> (1 Zeile)
> [...]
> regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 34));
>     lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----------+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/422C690 |    18153 |     0 |   580 |  5952 |    8176 |     8192 |       4 |         0
> (1 Zeile)
> regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 35));
>  lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/0 |        0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
> [...]
> regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 64));
>  lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/0 |        0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
> regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 65));
> WARNING:  page verification failed, calculated checksum 18066 but expected 0
> ERROR:  invalid page in block 65 of relation base/16384/28638
> 
> Running pg_filedump on the last two pages results in the following (I am not
> sure the "Invalid header information." errors are legitimate, nor the
> checksum failure on block 64):
> 
> mba@fock:~/[...]postgresql/build/src/test/regress$ ~/tmp/bin/pg_filedump -R 
> 64 65 -k -f tmp_check/data/base/16384/28638 
> 
> --8<--
> ***
> * PostgreSQL File/Block Formatted Dump Utility - Version 10.1
> *
> * File: tmp_check/data/base/16384/28638
> * Options used: -R 64 65 -k -f 
> *
> * Dump created on: Tue Aug 28 14:53:37 2018
> ***
> 
> Block   64 
>  -
>  Block Offset: 0x0008 Offsets: Lower   0 (0x)
>  Block: Size0  Version0Upper   0 (0x)
>  LSN:  logid  0 recoff 0x  Special 0 (0x)
>  Items:0  Free Space:0
>  Checksum: 0x  Prune XID: 0x  Flags: 0x ()
>  Length (including item array): 24
> 
>  Error: Invalid header information.
> 
>  Error: checksum failure: calculated 0xc66a.
> 
>   :      
>   0010:  
> 
>  -- 
>  Empty block - no items listed 
> 
>  -
>  Error: Invalid special section encountered.
>  Error: Special section points off page. Unable to dump contents.
> 
> Block   65 
>  -
>  Block Offset: 0x00082000 Offsets: Lower  24 (0x0018)
>  Block: Size 8192  Version4Upper8176 (0x1ff0)
>  LSN:  logid  0 recoff 0x04229c20  Special  8176 (0x1ff0)
>  Items:0  Free Space: 8152
>  Checksum: 0x  Prune XID: 0x  

Re: pg_verify_checksums failure with hash indexes

2018-08-29 Thread Amit Kapila
On Tue, Aug 28, 2018 at 2:51 PM Peter Eisentraut
 wrote:
>
> This is reproducible with PG11 and PG12:
>
> initdb -k data
> postgres -D data
>
> make installcheck
> # shut down postgres with Ctrl-C
>
..
>
> The files in question correspond to
>
> hash_i4_index
> hash_name_index
> hash_txt_index
>

I have looked into this problem and found the cause of it.  This
problem is happening for the empty page in the hash index.  On a
split, we allocate a new splitpoint's worth of bucket pages, wherein we
initialize the last page with zeros; this is all fine, but we forgot
to set the checksum for that last page.  The attached patch fixes the
problem for me.

Can someone try and share their findings?
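
To illustrate the idea (a hypothetical sketch only, not the attached
setchecksum_empty_pages_v1.patch): the general pattern used elsewhere in the
tree for pages written out directly with smgrextend(), for example in
visibilitymap.c, is to compute the checksum in place just before the write.
Assuming the last bucket page of the new splitpoint is written that way, the
fix would look roughly like:

/*
 * Hypothetical sketch -- not the attached patch.  Give the directly-written
 * last page of the new splitpoint a checksum so that offline verification
 * does not find an initialized page whose checksum field is still zero.
 */
static void
write_last_splitpoint_page(Relation rel, BlockNumber lastblock)
{
	char		zerobuf[BLCKSZ];

	MemSet(zerobuf, 0, sizeof(zerobuf));
	_hash_pageinit((Page) zerobuf, BLCKSZ);

	/* no-op when data checksums are disabled */
	PageSetChecksumInplace((Page) zerobuf, lastblock);

	RelationOpenSmgr(rel);
	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
}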

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


setchecksum_empty_pages_v1.patch
Description: Binary data


Re: Reopen logfile on SIGHUP

2018-08-29 Thread Alexander Korotkov
Hi!

On Wed, Aug 29, 2018 at 5:05 AM Kyotaro HORIGUCHI
 wrote:
> At Tue, 28 Aug 2018 18:50:31 +0300, Alexander Korotkov 
>  wrote in 
> 
> > Also I found that this new pg_ctl isn't covered with tests at all.  So
> > I've added very simple tap tests, which ensures that when log file was
> > renamed, it reappeared again after pg_ctl logrotate.  I wonder how
> > that would work on Windows.  Thankfully commitfest.cputube.org have
> > Windows checking facility now.
>
> Thanks for the test. The documentation and help message look fine,
> including the changed ordering. (A 180-second retry may be a bit
> too long, but I'm fine with it.)

Thank you for the comments.  My idea for the retry logic was to provide
behavior similar to poll_query_until().

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Flexible configuration for full-text search

2018-08-29 Thread Aleksandr Parfenov
On Tue, 28 Aug 2018 12:40:32 +0700
Aleksandr Parfenov  wrote:

>On Fri, 24 Aug 2018 18:50:38 +0300
>Alexander Korotkov  wrote:
>>Agreed, backward compatibility is important here.  Probably we should
>>leave old dictionaries for that.  But I just meant that if we
>>introduce new (better) way of stop words handling and encourage users
>>to use it, then it would look strange if default configurations work
>>the old way...  
>
>I agree with Alexander. The only drawback I see is that after the addition
>of the new dictionaries, there will be three dictionaries for each language:
>the old one, a stop-word filter for the language, and a stemmer dictionary.

While working on the new version of the patch, I found an issue in the
proposed syntax. At the beginning of the conversation, there was a
suggestion to split stop word filtering from word normalization. At this
stage of development, we can use a different dictionary for stop word
detection, but if we drop the word, the word counter doesn't increase
and the stop word is processed as an unknown word.

Currently, I see two solutions:

1) Keep the old way of stop word filtering. The drawback of this
approach is that it mixes word normalization and stop word detection
logic inside a dictionary. That can be addressed by using the 'simple'
dictionary in accept=false mode as a stop word filter.

2) Add an action STOPWORD to go with KEEP and DROP (which is not
implemented in the previous patch, but I think it is good to have both of
them), in the sense of "increase the word counter but don't add the lexeme
to the vector".

Any suggestions on the issue?

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-29 Thread Andres Freund
On 2018-08-28 20:29:08 -0700, Andres Freund wrote:
> On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> > Locally that triggers the problem within usually a few seconds.
> 
> FWIW, it does so including versions as old as 9.2.
> 
> Now I need to look for power for my laptop and some for me ;)

A bit of food, a coke and a talk later, here's a first draft *prototype*
of how this could be solved.

It's not OK to rebuild relcache entries in the middle of
ReceiveSharedInvalidMessages() - a later entry in the invalidation queue
might be a relmapper invalidation, and thus immediately processing a
relcache invalidation might attempt to scan a relation that does not
exist anymore.

Instead, receiving a relcache invalidation now just queues an entry onto
a new "pending rebuilds" list, that is then processed in a second stage,
after emptying the entire sinval queue.  At that stage we're guaranteed
that the relmapper is up2date.

This might actually have performance benefits in some realistic
scenarios - while RelationFlushRelation() marks the relcache entry
invalid for each of the received entries,
ProcessPendingRelcacheRebuilds() can skip rebuilding if the entry has
since been rebuilt.


Obviously this is far from clean enough, but what do you think about the
basic approach?  It does, in my limited testing, indeed solve the "could
not read block" issue.
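
In pseudo-C, the deferred second stage could take roughly the following shape
(my own sketch of the idea only, not code taken from the attached prototype,
which is reproduced only partially below; it reuses the pending_rebuilds list
the patch introduces):

/*
 * Sketch: walk the list of relation OIDs queued by the invalidation
 * callbacks and rebuild whatever is still marked invalid, now that the
 * relmapper is guaranteed to be up to date.
 */
void
ProcessPendingRelcacheRebuilds(void)
{
	List	   *todo = pending_rebuilds;
	ListCell   *lc;

	pending_rebuilds = NIL;

	foreach(lc, todo)
	{
		Oid			relid = lfirst_oid(lc);
		Relation	relation;

		RelationIdCacheLookup(relid, relation);

		/* skip entries already dropped or rebuilt in the meantime */
		if (relation == NULL || relation->rd_isvalid)
			continue;

		RelationClearRelation(relation, true);
	}

	list_free(todo);
}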


Greetings,

Andres Freund
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5f4ae1291c6..d7037d02a75 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2096,6 +2096,7 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	ProcessPendingRelcacheRebuilds();
 }
 
 /*
diff --git a/src/backend/storage/ipc/sinval.c b/src/backend/storage/ipc/sinval.c
index 563e2906e36..4528a17a9f7 100644
--- a/src/backend/storage/ipc/sinval.c
+++ b/src/backend/storage/ipc/sinval.c
@@ -127,6 +127,8 @@ ReceiveSharedInvalidMessages(
 		 */
 	} while (nummsgs == MAXINVALMSGS);
 
+	ProcessPendingRelcacheRebuilds();
+
 	/*
 	 * We are now caught up.  If we received a catchup signal, reset that
 	 * flag, and call SICleanupQueue().  This is not so much because we need
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 16d86a29390..e4f70428999 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -456,6 +456,8 @@ ProcessInvalidationMessages(InvalidationListHeader *hdr,
 {
 	ProcessMessageList(hdr->cclist, func(msg));
 	ProcessMessageList(hdr->rclist, func(msg));
+
+	ProcessPendingRelcacheRebuilds();
 }
 
 /*
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6125421d39a..927b8cef4d9 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -144,6 +144,14 @@ bool		criticalSharedRelcachesBuilt = false;
  */
 static long relcacheInvalsReceived = 0L;
 
+/*
+ * List of oids for entries that need to be rebuilt. Entries get queued onto
+ * it while receiving invalidations, and are processed at the end. This both
+ * avoids errors due to relcache rebuilds relying on an up2date relmapper, and
+ * avoids redundant rebuilds.
+ */
+static List *pending_rebuilds = NIL;
+
 /*
  * eoxact_list[] stores the OIDs of relations that (might) need AtEOXact
  * cleanup work.  This list intentionally has limited size; if it overflows,
@@ -251,7 +259,7 @@ static void RelationClearRelation(Relation relation, bool rebuild);
 
 static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
-static void RelationFlushRelation(Relation relation);
+static void RelationFlushRelation(Relation relation, bool immed);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
@@ -2556,7 +2564,7 @@ RelationClearRelation(Relation relation, bool rebuild)
  *	 This is used when we receive a cache invalidation event for the rel.
  */
 static void
-RelationFlushRelation(Relation relation)
+RelationFlushRelation(Relation relation, bool immed)
 {
 	if (relation->rd_createSubid != InvalidSubTransactionId ||
 		relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
@@ -2581,10 +2589,69 @@ RelationFlushRelation(Relation relation)
 		 */
 		bool		rebuild = !RelationHasReferenceCountZero(relation);
 
-		RelationClearRelation(relation, rebuild);
+		/*
+		 * Can't immediately rebuild entry - in the middle of receiving
+		 * invalidations relmapper might be out of date and point to an older
+		 * version of required catalogs. Defer rebuilds until the queue has
+		 * been emptied.
+		 */
+		if (rebuild && !immed)
+		{
+			MemoryContext oldcontext;
+
+			oldcontext = 

Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -r option)

2018-08-29 Thread Daniel Gustafsson
> On 27 Aug 2018, at 14:05, Yugo Nagata  wrote:
> 
> On Mon, 27 Aug 2018 13:34:12 +0200
> Michael Banck  wrote:
> 
>> Hi,
>> 
>> On Mon, Aug 27, 2018 at 07:53:36PM +0900, Yugo Nagata wrote:
>>> On Fri, 24 Aug 2018 18:01:09 +0200
>>> Peter Eisentraut  wrote:
 I'm curious about this option:
 
  -r RELFILENODE check only relation with specified relfilenode
 
 but there is no facility to specify a database.
 
 Also, referring to the relfilenode of a mapped relation seems a bit
 inaccurate.
 
 Maybe reframing this in terms of the file name of the file you want
 checked would be better?
>>> 
>>> If we specified 1234 to the -r option, pg_verify_checksums checks not only 1234
>>> but also 1234_vm, 1234_fsm, and 1234.1, 1234.2, ... and so on, so I think
>>> it makes sense to allow specifying a relfilenode instead of a file name.
>>> 
>>> I think it is reasonable to add an option to specify a database, although
>>> I don't know which character is good because both -d and -D are already used
>> 
>> Maybe the -d (debug) option should be revisited as well. Mentioning
>> every scanned block generates a huge amount of output which might be
>> useful during development but does not seem very useful for a stable
>> release. AFAICT there is no other debug output for now.
>> 
>> So it could be renamed to -v (verbose) and only mention each scanned
>> file, e.g. (errors/checksum mismatches are still reported of course).
>> 
>> Then -d could (in the future, I guess that is too late for v11) be used
>> for -d/--dbname (or make that only a long option, if the above does not
>> work).
> 
> I realized after sending the previous post that we cannot specify a database
> by name, because pg_verify_checksums runs offline and cannot look up the OID
> from the database name.  Also, there are the global and pg_tblspc directories,
> not only base/.  So it seems good to me to specify a directory to scan under
> PGDATA.  We would be able to use -d (or --directory?) for this purpose.

Changing the functionality to what is discussed above is obviously 12 material,
but since we are discussing changing the command-line API of the tool by
repurposing -d: do we want to rename the current use of -d to -v (with the
accompanying --verbose) before 11 ships?  It’s clearly way too late in the
cycle, but it seems worth at least bringing up, since 11 will be the first
version pg_verify_checksums ships in.  I’m happy to do the work asap if so.

FWIW, personally I think verbose makes more sense for the output than debug.

cheers ./daniel


A strange GiST error message or fillfactor of GiST build

2018-08-29 Thread Kyotaro HORIGUCHI
Hello.

In the discussion about cube's dimension limit [1], I found that
the error message looks strange.

https://www.postgresql.org/message-id/f0e1a404-a495-4f38-b817-06355b537...@yandex-team.ru

> postgres=# create table y as  select cube(array(SELECT random() as a FROM 
> generate_series(1,1000))) from generate_series(1,1e3,1); 
> SELECT 1000
> postgres=# create index on y using gist(cube );
> ERROR:  index row size 8016 exceeds maximum 8152 for index "y_cube_idx"

This is apparently strange. It happens because the message doesn't
take the fillfactor into account at that point. It is fixed by passing
freespace down to gistSplit(), which allows gistfitpage() to consider
the fillfactor, as the TODO comment within it suggests.

After the attached patch is applied, the above message becomes as
follows. (And the index can be built a bit more sparsely, honoring the
fillfactor.)

> ERROR:  index row size 8016 exceeds maximum 7333 for index "y_cube_idx"
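
For what it's worth, the new maximum in that message can be reproduced with a
little arithmetic (my numbers, not from the patch; they assume BLCKSZ = 8192
and the default GiST fillfactor of 90):

/*
 * Rough arithmetic behind the two maximums shown above:
 *
 *   GiSTPageSize = BLCKSZ - SizeOfPageHeaderData - sizeof(GISTPageOpaqueData)
 *                = 8192 - 24 - 16
 *                = 8152        <- the old "maximum" in the message
 *
 *   freespace    = BLCKSZ * (100 - fillfactor) / 100
 *                = 8192 * 10 / 100
 *                = 819
 *
 *   GiSTPageSize - freespace = 8152 - 819 = 7333   <- the new "maximum"
 */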

I'm not sure why 277807bd9e didn't do that completely, so I may be
missing something. Are there any thoughts?


There's another issue where an incomprehensible message is issued
when the (root) page cannot store even two tuples. I'll send a fix for
that soon if no one objects to handling it separately.

> =# create table y as select cube(array(SELECT random() as a FROM 
> generate_series(1,900))) from generate_series(1,1e3,1);
> =# create index on y using gist (cube);
> ERROR:  failed to add item to index page in "y_cube_idx"

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..7773a2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -293,7 +293,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 memmove(itvec + pos, itvec + pos + 1, sizeof(IndexTuple) * (tlen - pos));
 		}
		itvec = gistjoinvector(itvec, &tlen, itup, ntup);
-		dist = gistSplit(rel, page, itvec, tlen, giststate);
+		dist = gistSplit(rel, page, itvec, tlen, freespace, giststate);
 
 		/*
 		 * Check that split didn't produce too many pages.
@@ -1352,6 +1352,7 @@ gistSplit(Relation r,
 		  Page page,
 		  IndexTuple *itup,		/* contains compressed entry */
 		  int len,
+		  int freespace,
 		  GISTSTATE *giststate)
 {
 	IndexTuple *lvectup,
@@ -1374,7 +1375,7 @@ gistSplit(Relation r,
 		ereport(ERROR,
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-		IndexTupleSize(itup[0]), GiSTPageSize,
+		IndexTupleSize(itup[0]), GiSTPageSize - freespace,
		RelationGetRelationName(r))));
 
 	memset(v.spl_lisnull, true, sizeof(bool) * giststate->tupdesc->natts);
@@ -1392,9 +1393,10 @@ gistSplit(Relation r,
 		rvectup[i] = itup[v.splitVector.spl_right[i] - 1];
 
 	/* finalize splitting (may need another split) */
-	if (!gistfitpage(rvectup, v.splitVector.spl_nright))
+	if (!gistfitpage(rvectup, v.splitVector.spl_nright, freespace))
 	{
-		res = gistSplit(r, page, rvectup, v.splitVector.spl_nright, giststate);
+		res = gistSplit(r, page, rvectup, v.splitVector.spl_nright, freespace,
+		giststate);
 	}
 	else
 	{
@@ -1404,12 +1406,13 @@ gistSplit(Relation r,
 		res->itup = gistFormTuple(giststate, r, v.spl_rattr, v.spl_risnull, false);
 	}
 
-	if (!gistfitpage(lvectup, v.splitVector.spl_nleft))
+	if (!gistfitpage(lvectup, v.splitVector.spl_nleft, freespace))
 	{
 		SplitedPageLayout *resptr,
    *subres;
 
-		resptr = subres = gistSplit(r, page, lvectup, v.splitVector.spl_nleft, giststate);
+		resptr = subres = gistSplit(r, page, lvectup, v.splitVector.spl_nleft,
+	freespace, giststate);
 
 		/* install on list's tail */
 		while (resptr->next)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dddfe0ae2c..09c7f621bc 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -74,7 +74,7 @@ gistnospace(Page page, IndexTuple *itvec, int len, OffsetNumber todelete, Size f
 }
 
 bool
-gistfitpage(IndexTuple *itvec, int len)
+gistfitpage(IndexTuple *itvec, int len, int freespace)
 {
 	int			i;
 	Size		size = 0;
@@ -82,8 +82,7 @@ gistfitpage(IndexTuple *itvec, int len)
 	for (i = 0; i < len; i++)
 		size += IndexTupleSize(itvec[i]) + sizeof(ItemIdData);
 
-	/* TODO: Consider fillfactor */
-	return (size <= GiSTPageSize);
+	return (size <= GiSTPageSize - freespace);
 }
 
 /*
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..cac3d0b8e5 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -407,7 +407,7 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 bool markleftchild);
 
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
-		  int len, GISTSTATE *giststate);
+int len, int freespace, GISTSTATE *giststate);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,

Re: logical decoding: ABI break in 10.5 et al

2018-08-29 Thread Christoph Berg
Re: Alvaro Herrera 2018-08-28 <20180828153806.fgfnul2imeltzmib@alvherre.pgsql>
> This commit made modules dependent on structs ReorderBuffer and
> ReorderBufferTXN compiled for prior minor releases no longer work with
> the new minors, because some new struct members were inserted in the
> middle of the struct.  Third-party modules such as pglogical, BDR,
> wal2json are affected and need to be recompiled; failing to do so can
> make them misbehave or crash.

Fwiw, I haven't seen any problems with the wal2json packages in
apt.postgresql.org that were compiled with the old PG packages. They
still pass the regression tests with the new ABI.

https://pgdgbuild.dus.dg-i.net/view/Testsuite/job/wal2json-adt/lastSuccessfulBuild/architecture=amd64,distribution=sid/console

Christoph



Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2018-08-29 Thread Ashutosh Bapat
On Thu, Aug 23, 2018 at 4:01 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>> On Fri, 27 Jul 2018 at 20:13, Robert Haas  wrote:
>>
>> On Fri, Jul 27, 2018 at 3:17 AM, Ashutosh Bapat
>>  wrote:
>> > Apart from the complexity, there's also a possibility that this
>> > skipping will actually reduce efficiency in normal cases. Consider
>> > a case where A and B have exactly matching partitions. The current
>> > partition matching algorithm compares any given range/bound only once,
>> > so we have an O(n) algorithm. If we use a binary search for every
>> > range comparison, however, it becomes an O(n log n) algorithm. There
>> > will be unnecessary comparisons during binary search. The current
>> > algorithm is always O(n), whereas binary search would be O(n log(n))
>> > with a possibility of having sub-O(n) complexity in some cases. I
>> > would go for an algorithm which has a consistent time complexity
>> > always and which is efficient in normal cases, rather than worrying
>> > about some cases which are not practical.
>>
>> Yeah, I think that's a good point.  The normal case here will be that
>> the partition bounds are equal, or that there are a few extra
>> partitions on one side that don't exist on the other.  We don't want
>> other cases to be crazily inefficient, but I suspect in practice that
>> if the partitioning bounds aren't pretty close to matching up exactly,
>> we're going to end up failing to be able to do a partition-wise join
>> at all.  It's not very likely that somebody happens to have a case
>> where they've partitioned two tables in randomly different ways, but
>> then they decide to join them anyway, but then it turns out that the
>> partition bounds happen to be compatible enough that this algorithm
>> works.
>
>> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat 
>>  wrote:
>> 1. those cases are rare enough that we can ignore those right now. How
>> many times we would encounter the case you have quoted, for example?
>> Usually the ranges will be continuous only differing in the first or
>> last partition e.g time-series data.
>
> Ok, but what about list partitioning? I believe its area of application can
> be more diverse than mostly just time-series data, and in the patch almost
> the same approach is used to merge list partitions.

The same arguments hold for list partitioning as well. For a list-partitioned
table the bounds are ordered by list datums and not by partitions, so it will
be rather odd to have large runs of mismatching datums (the case where binary
search fares well) on one side of the join. So my following argument still
holds true:

---
>> > I would go for an algorithm which has a consistent time complexity
>> > always and which is efficient in normal cases, rather than worrying
>> > about some cases which are not practical.
---
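
To spell out the single-pass comparison being referred to, here is a
simplified, self-contained sketch (plain integers stand in for partition
bounds; the real code works on PartitionBoundInfo and has to handle ranges,
defaults and merging, so this is only an illustration of the complexity
argument):

#include <stdio.h>

/*
 * Walk two sorted bound arrays in lockstep, pairing equal bounds and
 * reporting mismatches.  Each bound is looked at once, so the whole pass is
 * O(n1 + n2); a per-bound binary search would be O(n log n) instead.
 */
static void
merge_bounds(const int *b1, int n1, const int *b2, int n2)
{
	int			i = 0, j = 0;

	while (i < n1 && j < n2)
	{
		if (b1[i] == b2[j])
		{
			printf("match: %d\n", b1[i]);
			i++, j++;
		}
		else if (b1[i] < b2[j])
			printf("only in side 1: %d\n", b1[i++]);
		else
			printf("only in side 2: %d\n", b2[j++]);
	}
	while (i < n1)
		printf("only in side 1: %d\n", b1[i++]);
	while (j < n2)
		printf("only in side 2: %d\n", b2[j++]);
}

int
main(void)
{
	int			a[] = {10, 20, 30, 40};
	int			b[] = {10, 25, 30, 40, 50};

	merge_bounds(a, 4, b, 5);
	return 0;
}

Each bound from either side is visited once, which is the O(n1 + n2) behavior
argued for above; replacing the inner step with a binary search would add a
log factor without helping the common case of nearly identical bound lists.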


>
> Other than that everything seems fine from functionality point of view, and so
> far I couldn't find any situation that produces a wrong plan. Some of the 
> joins
> were not that intuitive from the first glance, but at the end everything was
> according to the documentation.

Thanks a lot for your tests and I am glad that you didn't find any failures.

> Taking this into account, I wonder if it's possible to somehow give hints in
> an EXPLAIN result about why it wasn't possible to apply a partition-wise join?
> When something wasn't clear to me I ended up looking at the value of the
> "merged" flag, and to figure out the actual reason one needs to trace the
> entire algorithm.

That's kind of tricky in PostgreSQL. The optimizer doesn't always
report why a particular path was not chosen. That said, we could add
trace logs printing that information; however, the main difficulty is
reporting the relations being joined. See 0004 for an example.

>
> Besides that I checked the performance in some simple cases, no problems on
> this side (but I'll also do some checks for more complicated joins).

Thanks a lot.

>
> And I still have some complaints about readability, although I can imagine 
> that
> it's just me:
>
> * Many functions carry some unrelated arguments just to pass them through,
>   which obscures the purpose of a function.

Can you please provide some examples?

>
> * I want to suggest introducing a new data structure for partition mapping
>   instead of just keeping the mappings in arrays (was it already discussed before?).

The best I could think of was a structure with just two arrays. So,
instead of encapsulating the arrays within a structure, I thought it
best to pass bare arrays around. If you have any other ideas, please
let me know.

>
> * What is the reason that almost everywhere in the patch there is a naming
>   convention "outer/inner" partition maps, and sometimes (e.g. in
>   generate_matching_part_pairs) it's "map1/map2"?

You will find that the optimizer in general uses outer/inner,
rel1/rel2 nomenclature interchangeably. This patch-set just inherits
that. But yes, we should do more to straighten it out.

I won't be working on this 

Re: Catalog corruption

2018-08-29 Thread Mariel Cherkassky
Yes indeed.
I took a cold backup - shut down the database and copied the entire data directory.

On Tue, Aug 28, 2018 at 22:34, Asim R P <aprav...@pivotal.io> wrote:

> On Tue, Aug 28, 2018 at 7:36 AM, Mariel Cherkassky
>  wrote:
> > Afterwards I vacuumed all the databases but nothing helped. I tried to
> > reindex the databases but I got the next error :
> >
> > 2018-08-28 21:50:03 +08 db2 24360  ERROR:  could not access status of
> > transaction 32212695
> > 2018-08-28 21:50:03 +08 db2 24360  DETAIL:  Could not open file
> > "pg_subtrans/01EB": No such file or directory.
> >
>
> Are you sure you created a checkpoint before copying the data
> directory over to your PC?
>


Re: Postmaster doesn't send SIGTERM to bgworker during fast shutdown when pmState == PM_STARTUP

2018-08-29 Thread Alexander Kukushkin
Hi,

2018-08-29 1:24 GMT+02:00 Michael Paquier :

> I have been studying your patch, but it seems to me that this is not
> complete as other processes could have been started before switching
> from PM_STARTUP to PM_RECOVERY.  I am talking here about the bgwriter
> and the checkpointer as well.  Shouldn't we switch pmState to
> PM_WAIT_BACKENDS?  Your patch is missing that.

Yeah, good catch: it starts the checkpointer, bgwriter, and in some cases
even the archiver process (when archive_mode=always) while pmState still
equals PM_STARTUP.
Please find attached the new version of the fix.


Regards,
--
Alexander Kukushkin
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b33cd..2215ebbb5a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2685,7 +2685,7 @@ pmdie(SIGNAL_ARGS)
 signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 signal_child(WalReceiverPID, SIGTERM);
-			if (pmState == PM_RECOVERY)
+			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
 


Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-08-29 Thread Peter Eisentraut
Do you plan to submit this patch to the upcoming commit fest perhaps?  I
have done some testing on it and it seems worth pursuing further.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services