Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Craig Ringer
On 28 March 2018 at 11:53, Tom Lane  wrote:

> Craig Ringer  writes:
> > TL;DR: Pg should PANIC on fsync() EIO return.
>
> Surely you jest.
>

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.

It's not necessary on ext3/ext4 with errors=remount-ro, but that's only
because the FS stops us dead in our tracks.

I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going
to lose a write, it should at minimum mark the FD as broken so no further
fsync() or anything else can succeed on the FD, and an app that cares about
durability must repeat the whole set of work since the prior successful
fsync(). Just reporting it once and forgetting it is madness.

But even if we convince the kernel folks of that, how do other platforms
behave? And how long before these kernels are out of use? We'd better deal
with it, crazy or no.

Please see my StackOverflow post for the kernel-level explanation. Note
also the test case link there. https://stackoverflow.com/a/42436054/398670

> > Retrying fsync() is not OK at least on Linux. When fsync() returns
> > success it means "all writes since the last fsync have hit disk" but we
> > assume it means "all writes since the last SUCCESSFUL fsync have hit
> > disk".
>
> If that's actually the case, we need to push back on this kernel brain
> damage, because as you're describing it fsync would be completely useless.
>

It's not useless, it's just telling us something other than what we think
it means. The promise it seems to give us is that if it reports an error
once, everything *after* that is useless, so we should throw our toys,
close and reopen everything, and redo from the last known-good state.

Though as Tomas posted below, it provides rather weaker guarantees than I
thought in some other areas too. See that lwn.net article he linked.


> Moreover, POSIX is entirely clear that successful fsync means all
> preceding writes for the file have been completed, full stop, doesn't
> matter when they were issued.
>

I can't find anything that says so to me. Please quote relevant spec.

I'm working from
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
states that

"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer is
implementation-defined. The fsync() function shall not return until the
system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is
detected. It doesn't say that error has to be persistent and that
subsequent calls must also report the error. It also says:

"If the fsync() function fails, outstanding I/O operations are not
guaranteed to have been completed."

but that doesn't clarify matters much either, because it can be read to
mean that once there's been an error reported for some IO operations
there's no guarantee those operations are ever completed even after a
subsequent fsync returns success.

I'm not seeking to defend what the kernel seems to be doing. Rather, I'm
saying that we might see similar behaviour on other platforms, crazy or
not. I haven't looked past Linux yet, though.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pg_class.reltuples of brin indexes

2018-03-28 Thread Masahiko Sawada
On Tue, Mar 27, 2018 at 11:28 PM, Alvaro Herrera
 wrote:
> Tom Lane wrote:
>> Tomas Vondra  writes:
>> > I think number of index tuples makes sense, as long as that's what the
>> > costing needs. That is, it's up to the index AM to define it. But it
>> > clearly should not flap like this ...
>>
>> > And it's not just BRIN. This is what I get with a GIN index:
>>
>> Sounds like the same kind of thing we just fixed for SP-GiST :-(
>
> Most likely I modelled the BRIN code after GIN.
>

Would it be better to add a new index AM callback that estimates the
number of index tuples, and to update the index stats using that
returned value?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Craig Ringer
On 29 March 2018 at 10:30, Michael Paquier  wrote:

> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer  writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> >
> > Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.



I covered this in my original post.

Yes, we check the return value. But what do we do about it? For fsyncs of
heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint
later, which will retry the fsync(). **Which will now appear to succeed**
because the kernel forgot that it lost our writes after telling us the
first time. So we do check the return code, but now it reports success, and
we complete the checkpoint and move on.

But we only retried the fsync, not the writes before the fsync.

So we lost data. Or rather, we failed to detect that the kernel lost it,
so our checkpoint was bad and should not have been reported as complete.

The problem is that we keep retrying checkpoints *without* repeating the
writes leading up to the checkpoint, and retrying fsync.

I don't pretend the kernel behaviour is sane, but we'd better deal with it
anyway.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Craig Ringer
On 29 March 2018 at 10:48, Thomas Munro 
wrote:

> On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier 
> wrote:
> > On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> >> Craig Ringer  writes:
> >>> TL;DR: Pg should PANIC on fsync() EIO return.
> >>
> >> Surely you jest.
> >
> > Any callers of pg_fsync in the backend code are careful enough to check
> > the returned status, sometimes doing retries like in mdsync, so what is
> > proposed here would be a regression.
>
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
>
> https://lwn.net/Articles/724307/


A variant of it, by the looks.

The problem in our case is that the kernel only tells us about the error
once. It then forgets about it. So yes, that seems like a variant of the
statement:


> "Current kernels might report a writeback error on an fsync() call,
> but there are a number of ways in which that can fail to happen."
>
> That's... I'm speechless.


Yeah.

It's a bit nuts.

I was astonished when I saw the behaviour, and that it appears undocumented.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Craig Ringer
On 29 March 2018 at 13:06, Thomas Munro 
wrote:

> On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby 
> wrote:
> > The retries are the source of the problem ; the first fsync() can return
> EIO,
> > and also *clears the error* causing a 2nd fsync (of the same data) to
> return
> > success.
>
> What I'm failing to grok here is how that error flag even matters,
> whether it's a single bit or a counter as described in that patch.  If
> write back failed, *the page is still dirty*.  So all future calls to
> fsync() need to try to flush it again, and (presumably) fail
> again (unless it happens to succeed this time around).
> 
>

You'd think so. But it doesn't appear to work that way. You can see for
yourself with the error device-mapper target mapped over part of a
volume.

I wrote a test case here.

https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c

I don't pretend the kernel behaviour is sane. And it's possible I've made
an error in my analysis. But since I've observed this in the wild, and seen
it in a test case, I strongly suspect that what I've described is just
what's happening, brain-dead or no.

Presumably the kernel marks the page clean when it dispatches it to the I/O
subsystem and doesn't dirty it again on I/O error? I haven't dug that deep
on the kernel side. See the stackoverflow post for details on what I found
in kernel code analysis.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Thomas Munro
On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby  wrote:
> The retries are the source of the problem ; the first fsync() can return EIO,
> and also *clears the error* causing a 2nd fsync (of the same data) to return
> success.

What I'm failing to grok here is how that error flag even matters,
whether it's a single bit or a counter as described in that patch.  If
write back failed, *the page is still dirty*.  So all future calls to
fsync() need to try to flush it again, and (presumably) fail
again (unless it happens to succeed this time around).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Justin Pryzby
On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer  writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> > 
> > Surely you jest.
> 
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

The retries are the source of the problem ; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.

(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
> https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to
change it:

 "Storing that value in the file structure has an important benefit: it makes
it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT
CALLS FSYNC()  In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR
OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using dmsetup "error" target, see
attached.

strace looks like this:

kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

 1  open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
 2  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 3  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 4  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 5  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 6  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 7  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 8  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
2560
 9  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
-1 ENOSPC (No space left on device)
10  dup(2)  = 4
11  fcntl(4, F_GETFL)   = 0x8402 (flags 
O_RDWR|O_APPEND|O_LARGEFILE)
12  brk(NULL)   = 0x1299000
13  brk(0x12ba000)  = 0x12ba000
14  fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
15  write(4, "write(1): No space left on devic"..., 34write(1): No space 
left on device
16  ) = 34
17  close(4)= 0
18  fsync(3)= -1 EIO (Input/output error)
19  dup(2)  = 4
20  fcntl(4, F_GETFL)   = 0x8402 (flags 
O_RDWR|O_APPEND|O_LARGEFILE)
21  fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
22  write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output 
error
23  ) = 29
24  close(4)= 0
25  close(3)= 0
26  open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
27  fsync(3)= 0
28  write(3, "\0", 1)   = 1
29  fsync(3)= 0
30  exit_group(0)   = ?

2: EIO isn't seen initially due to writeback page cache;
9: ENOSPC due to small device
18: original IO error reported by fsync, good
25: the original FD is closed
26: ..and file reopened
27: fsync on file with still-dirty data+EIO returns success BAD

10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror
should write to a separate FD (?)

Also note, close() ALSO returned success, which you might think exonerates
the 2nd fsync(), but which I think may itself be problematic, no?  In any
case, the 2nd
byte certainly never got written to DM error, and the failure status was lost
following fsync().

I get the exact same behavior if I break after one write() loop, such as to
avoid ENOSPC.

Justin
/*
% make CFLAGS='-Wall -Wextra -O3' /tmp/eio
% sudo lvcreate -L 9M -n tst data

echo '
0 1 linear /dev/data/tst 0
1 1 error 1
2 99 linear /dev/data/tst 2' |sudo dmsetup create eio

*/

#include <sys/types.h>

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#include <stdio.h>
#include <stdlib.h>

int main()
{
	char buf[8<<10];
	int fd;

	if (-1==(fd=open("/dev/mapper/eio", O_CREAT|O_RDWR, 00600))) {
	//if (-1==(fd=open("/mnt/t", O_CREAT|O_RDWR, 00600))) {
		perror("open");
		exit(1);
	}

	while (1) {
		// if (sizeof(buf)!=(write(fd, buf, sizeof(buf)))) {
		if (-1==write(fd, buf, sizeof(buf))) {
			perror("write(1)");
			

Re: PQHost() undefined behavior if connecting string contains both host and hostaddr types

2018-03-28 Thread Kyotaro HORIGUCHI
At Wed, 28 Mar 2018 10:34:49 +0900, Michael Paquier  wrote 
in <20180328013449.gc1...@paquier.xyz>
> On Wed, Mar 28, 2018 at 11:06:23AM +1100, Haribabu Kommi wrote:
> > On Wed, Mar 28, 2018 at 3:35 AM, Peter Eisentraut <
> > peter.eisentr...@2ndquadrant.com> wrote:
> >>
> >> Committed after fixing up the documentation a bit as suggested by others.
> > 
> > Thanks.
> 
> +1.  Thanks for working on this Hari, Peter for the commit and all
> others for your input!

Me too.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Problem while setting the fpw with SIGHUP

2018-03-28 Thread Kyotaro HORIGUCHI
At Wed, 28 Mar 2018 15:59:48 +0900, Michael Paquier  wrote 
in <20180328065948.gm1...@paquier.xyz>
> On Wed, Mar 28, 2018 at 03:40:59PM +0900, Kyotaro HORIGUCHI wrote:
> > The attached does that. I don't like that it uses ControlFileLock
> > to exclude concurrent UpdateFullPageWrites and StartupXLOG but
> > WALInsertLock cannot be used since UpdateFullPageWrites may take
> > the same lock.
> 
> You visibly forgot your patch.

Mmm, someone must have eaten that. I'm sure it is attached this
time.

I don't like that UpdateFullPageWrites uses ControlFileLock for
exclusion..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cb9c2a29cb..d2bb5e16ac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7558,9 +7558,11 @@ StartupXLOG(void)
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * record is written, ignoring any change of the full_page_writes GUC so far.
 	 */
+	WALInsertLockAcquireExclusive();
 	Insert->fullPageWrites = lastFullPageWrites;
+	WALInsertLockRelease();
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7783,6 +7785,9 @@ StartupXLOG(void)
 	 * itself, we use the info_lck here to ensure that there are no race
 	 * conditions concerning visibility of other recent updates to shared
 	 * memory.
+	 *
+	 * ControlFileLock excludes concurrent update of shared fullPageWrites in
+	 * addition to its ordinary use.
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	ControlFile->state = DB_IN_PRODUCTION;
@@ -7790,11 +7795,25 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryInProgress = false;
+	lastFullPageWrites = Insert->fullPageWrites;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Shared fullPageWrites at the end of recovery differs from our last
+	 * known value. The change was made by the checkpointer, but we haven't
+	 * written the corresponding XLOG_FPW_CHANGE record. Reread the GUC if it
+	 * has changed and try to update shared fullPageWrites ourselves; this
+	 * ends up doing nothing if the checkpointer already did the same thing.
+	 */
+	if (lastFullPageWrites != fullPageWrites)
+	{
+		HandleStartupProcInterrupts();
+		UpdateFullPageWrites();
+	}
+
 	/*
 	 * If there were cascading standby servers connected to us, nudge any wal
 	 * sender processes to notice that we've been promoted.
@@ -9525,8 +9544,10 @@ XLogReportParameters(void)
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
- * Note: this function assumes there is no other process running
- * concurrently that could update it.
+ * This function may be called concurrently from multiple processes, and
+ * concurrently with changes of shared fullPageWrites.  See
+ * StartupXLOG().
+ * 
  */
 void
 UpdateFullPageWrites(void)
@@ -9536,13 +9557,25 @@ UpdateFullPageWrites(void)
 	/*
 	 * Do nothing if full_page_writes has not been changed.
 	 *
-	 * It's safe to check the shared full_page_writes without the lock,
-	 * because we assume that there is no concurrently running process which
-	 * can update it.
+	 * We acquire ControlFileLock to exclude concurrent calls of this
+	 * function and changes of shared fullPageWrites.
 	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	WALInsertLockAcquireExclusive();
 	if (fullPageWrites == Insert->fullPageWrites)
+	{
+		WALInsertLockRelease();
+		LWLockRelease(ControlFileLock);
 		return;
+	}
+	WALInsertLockRelease();
 
+	/*
+	 * Need to set up XLogInsert working area before entering critical section
+	 * if we are not in recovery mode.
+	 */
+	(void) RecoveryInProgress();
+		
 	START_CRIT_SECTION();
 
 	/*
@@ -9578,6 +9611,8 @@ UpdateFullPageWrites(void)
 		WALInsertLockRelease();
 	}
 	END_CRIT_SECTION();
+
+	LWLockRelease(ControlFileLock);
 }
 
 /*


Re: [HACKERS] why not parallel seq scan for slow functions

2018-03-28 Thread Amit Kapila
On Thu, Mar 29, 2018 at 2:31 AM, Robert Haas  wrote:
> On Wed, Mar 28, 2018 at 3:06 AM, Amit Kapila  wrote:
>>
>> The above block takes 43700.0289 ms on Head and 45025.3779 ms with the
>> patch which is approximately 3% regression.
>
> Thanks for the analysis -- the observation that this seemed to affect
> cases where CP_LABEL_TLIST gets passed to create_projection_plan
> allowed me to recognize that I was doing an unnecessary copyObject()
> call.  Removing that seems to have reduced this regression below 1% in
> my testing.
>

I think that is within the acceptable range.  I am seeing a few regression
failures with the patch series.  The order of targetlist seems to have
changed for Remote SQL.  Kindly find the failure report attached.  I
have requested my colleague Ashutosh Sharma to cross-verify this and
he is also seeing the same failures.

Few comments/suggestions:

1.
typedef enum UpperRelationKind
 {
  UPPERREL_SETOP, /* result of UNION/INTERSECT/EXCEPT, if any */
+ UPPERREL_TLIST, /* result of projecting final scan/join rel */
  UPPERREL_PARTIAL_GROUP_AGG, /* result of partial grouping/aggregation, if
  * any */
  UPPERREL_GROUP_AGG, /* result of grouping/aggregation, if any */
...
...
  /*
  * Save the various upper-rel PathTargets we just computed into
@@ -2003,6 +2004,7 @@ grouping_planner(PlannerInfo *root, bool
inheritance_update,
  root->upper_targets[UPPERREL_FINAL] = final_target;
  root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
  root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;
+ root->upper_targets[UPPERREL_TLIST] = scanjoin_target;

It seems UPPERREL_TLIST is redundant in the patch now.  I think we can
remove it unless you have something else in mind.

2.
+ /*
+ * If the relation is partitioned, recursively apply the same changes to
+ * all partitions and generate new Append paths. Since Append is not
+ * projection-capable, that might save a separate Result node, and it also
+ * is important for partitionwise aggregate.
+ */
+ if (rel->part_scheme && rel->part_rels)
  {


I think the handling of partitioned rels looks okay, but we might want
to check its overhead once, unless you are sure that this shouldn't be a
problem.  If you think we should check it, then could we do it as a
separate patch, as it doesn't look to be directly linked to the main
patch?  It can be treated as an optimization for partitionwise
aggregates.  I think we can treat it along with the main patch as well,
but it might be somewhat simpler to verify if we do it separately.


> Also, keep in mind that we're talking about extremely small amounts of
> time here.  On a trivial query that you're not even executing, you're
> seeing a difference of (45025.3779 - 43700.0289)/1000000 = 0.00132 ms
> per execution.  Sure, it's still 3%, but it's 3% of the time in an
> artificial case where you don't actually run the query.  In real life,
> nobody runs EXPLAIN in a tight loop a million times without ever
> running the query, because that's not a useful thing to do.
>

Agreed, but this was to ensure that we do not do this optimization at
the cost of adding significant cycles to the planner time.

>  The
> overhead on a realistic test case will be smaller.  Furthermore, at
> least in my testing, there are now cases where this is faster than
> master.  Now, I welcome further ideas for optimization, but a patch
> that makes some cases slightly slower while making others slightly
> faster, and also improving the quality of plans in some cases, is not
> to my mind an unreasonable thing.
>

Agreed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


regression.diffs
Description: Binary data


Re: Parallel Aggregates for string_agg and array_agg

2018-03-28 Thread David Rowley
On 29 March 2018 at 03:05, Tomas Vondra  wrote:
> On 03/28/2018 03:54 PM, Tom Lane wrote:
>> I had in mind to look at exprType() of the argument.
>
> Right, I'm fine with that.

Attached is v9, which is based on Tom's v8 but includes the new tests
and I think the required fix to disable use of the serial/deserial
function for array_agg().

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


combinefn_for_string_and_array_aggs_v9.patch
Description: Binary data


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-03-28 Thread Masahiko Sawada
On Thu, Mar 29, 2018 at 2:27 AM, David Steele  wrote:
> On 2/27/18 2:21 AM, Masahiko Sawada wrote:
>>
>> Hmm, although I've thought about the concern in the case where we don't consider
>> local xids of un-resolved fdwxact in GetOldestXmin, I could not find
>> problem. Could you share your concern if you have? I'll try to find a
>> possibility based on it.
>
> It appears that this entry should be marked Waiting on Author so I have
> done that.
>
> I also think it may be time to move this patch to the next CF.
>

I agree to move this patch to the next CF.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Thomas Munro
On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier  wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
>> Craig Ringer  writes:
>>> TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

"Current kernels might report a writeback error on an fsync() call,
but there are a number of ways in which that can fail to happen."

That's... I'm speechless.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Small proposal to improve out-of-memory messages

2018-03-28 Thread Michael Paquier
On Thu, Mar 29, 2018 at 09:29:59AM +0800, Craig Ringer wrote:
> This would have been significantly useful to me in the past, so +1 from me.

As long as that does not cost more memory, +1.
--
Michael


signature.asc
Description: PGP signature


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Michael Paquier
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> Craig Ringer  writes:
>> TL;DR: Pg should PANIC on fsync() EIO return.
> 
> Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
--
Michael


signature.asc
Description: PGP signature


Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Haribabu Kommi
On Mon, Mar 26, 2018 at 9:32 AM, Andrew Dunstan  wrote:
>
>
> Thanks for this, all looks good. Here is the consolidate patch
> rebased. If there are no further comments I propose to commit this in
> a few days time.


I have some comments with the committed patch.

@@ -663,7 +671,23 @@ ExecFetchSlotTuple(TupleTableSlot *slot)
 	 * If we have a regular physical tuple then just return it.
 	 */
 	if (TTS_HAS_PHYSICAL_TUPLE(slot))
-		return slot->tts_tuple;
+	{
+		if (HeapTupleHeaderGetNatts(slot->tts_tuple->t_data) <
+			slot->tts_tupleDescriptor->natts)
+		{
+			MemoryContext oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
+
+			slot->tts_tuple = heap_expand_tuple(slot->tts_tuple,
+												slot->tts_tupleDescriptor);
+			slot->tts_shouldFree = true;
+			MemoryContextSwitchTo(oldContext);
+			return slot->tts_tuple;
+		}
+		else
+		{
+			return slot->tts_tuple;
+		}
+	}

In the above scenario, directly replacing slot->tts_tuple without freeing
the existing tuple will unnecessarily increase the slot context memory
usage, and this may lead to a problem if the same slot is used for many
tuples.  It would be better to use the ExecStoreTuple() function to
update the slot tuple.

Regards,
Hari Babu
Fujitsu Australia


Re: Flexible permissions for REFRESH MATERIALIZED VIEW

2018-03-28 Thread David G. Johnston
On Wed, Mar 28, 2018 at 6:38 PM, Isaac Morland 
wrote:

>
> One question I would have is: what proposals exist or have existed for
> additional privilege bits? How much pressure is there to use some of the
> remaining bits? I actually looked into the history of the permission bits
> and found that we can summarize and approximate the history as 10 years of
> expansion from 4 to 12, then nothing added in the last 10 years.
>

I made an argument for an "ANALYZE" grant a little while back, and it
kinda leads one to want one for VACUUM as well.

https://www.postgresql.org/message-id/CAKFQuwZ6dhjTFV7Bwmehe1N3%3Dk484y4mM22zuYjVEU2dq9V1aQ%40mail.gmail.com

David J.


Re: Flexible permissions for REFRESH MATERIALIZED VIEW

2018-03-28 Thread Isaac Morland
Thanks for taking the time to look at this. I think I was unclear in a
couple of places so I think my proposal may have appeared worse than it is.
Details below:

On 18 March 2018 at 20:25, Tom Lane  wrote:

> Isaac Morland  writes:
> > The original idea was to allow access to REFRESH MATERIALIZED VIEW to be
> a
> > grantable permission, rather than being reserved to the table owner.
>
> I'm not really on board with making that a separately grantable
> permission.  You can do what you need today by having the matview be owned
> by a single-purpose group role and granting membership in that role as
> needed.  We do not have an infinite supply of available privilege type
> bits --- in fact, without breaking on-disk compatibility, there are only
> four free bits in AclMode.  So I can't see using one of them up for
> REFRESH, let alone using three of them up for REFRESH, CLUSTER and
> REINDEX.


I didn't mean to suggest introducing 3 new permissions, just one, which
would allow all three operations (clustering and reindexing for tables,
including matviews, and refreshing matviews). So that still leaves 3
remaining bits for future use. I recognize that using up the limited supply
of privilege bits is a legitimate concern. However, I think of "refresh" as
a general concept that may apply differently to different objects. A future
permissions proposal may well be able to use it, similar to how USAGE was
re-used when it was decided to have a permission on types. So we aren't
fully using up even that bit, although clearly it is gone as to
table-related permissions, and it shouldn't be used for non-table objects
for something that doesn't feel like it can be described with the word
REFRESH.

One question I would have is: what proposals exist or have existed for
additional privilege bits? How much pressure is there to use some of the
remaining bits? I actually looked into the history of the permission bits
and found that we can summarize and approximate the history as 10 years of
expansion from 4 to 12, then nothing added in the last 10 years.

1996-07-09 - first git commit with append/read/write/rule - total 4
2001-05-27 - split write into update and delete, rename append to insert
and read to select, add references and trigger - total 7
2002-04-21 - add execute, usage, create, temp - total 11 - expand AclMode
from uint8 to uint32
2004-01-14 - add connect - total 12
2006-09-05 - remove rule, leaving gap - total 11
2008-09-08 - add truncate - total 12

If we do end up running out, how hard would it be to change AclItem to have
2 uint64 fields, one for the actual permissions and one for the grant
permissions? I haven't done a thorough study, but looking at the various
macros defined in utils/acl.h and where they are used it looks to me like
it might be not too bad (e.g. the only direct references to
AclItem.ai_privs are in acl.c and acl.h). I'm concerned about changing the
disk format, however - I don't know enough to say how painful that would be.

Also consider that these are the only non-DDL table-related actions that
are not controlled by permission bits but instead can only be done by owner
(at least I think they are - am I forgetting something?). Filling in that
gap feels to me like a reasonable thing to want to do.

> The patch so far uses TRUNCATE permission to control access as a
> > proof-of-concept.
>
> I can't see doing that either, TBH.  It's just ugly, and it surely doesn't
> scale to cases where the conflated operations all apply to the same kind
> of object.  (For instance, including CLUSTER permissions in TRUNCATE
> wouldn't be very sane.)
>

My original idea was to say that some combination of existing privileges
would grant the ability to refresh a materialized view. But I came to
prefer introducing a new privilege, especially once I realized it would be
sensible to include clustering and reindexing. The only reason I used an
existing privilege for this initial trivial version of the patch was to
check that the concept actually works without going to the effort of
actually writing in a new privilege. So I agree that using TRUNCATE in
particular and any existing set of privileges more generally would not be
my preference.


> It's conceivable that we could somehow allow new bits to get overlaid
> onto bits currently used only for other object types, without that being
> user-visible.  But I believe there are significant implementation issues
> that'd have to be surmounted; for instance aclitemout has no idea what
> sort of object the ACL it's printing is for.  Also, the ACL machinery
> doesn't currently think that "matview" is a distinct type of object
> from "table" anyway; it just sees 'em all as "relation".
>

This sounds harder than changing AclItem!

Or, maybe I'm wrong and people feel that REFRESH-by-non-owner is important
> enough to be worth consuming an AclMode bit for.  But I'm dubious.
>
> > I think backward compatibility is pretty good.
>
> 

Re: Small proposal to improve out-of-memory messages

2018-03-28 Thread Craig Ringer
On 29 March 2018 at 09:07, Tom Lane  wrote:

> I was looking at
> https://www.postgresql.org/message-id/CAG_=8kAYKjhQX3FmAWQBC95Evh3+
> qszoqxknmm1q4w1qo7+...@mail.gmail.com
>
> in which the most salient issue is
>
> > 2018-03-28 19:20:33.264 UTC [10580] cory@match ERROR:  out of memory
> > 2018-03-28 19:20:33.264 UTC [10580] cory@match DETAIL:  Failed on
> request of size 1610612736.
>
> and it suddenly struck me that this'd be noticeably more useful, at least
> for experts, if the errdetail included the name of the memory context
> we were trying to allocate in.  In this case it'd be nice to know right
> off the bat whether the failure occurred in MessageContext, which looked
> bloated already, or someplace else.
>
> In the wake of commit 442accc3f one might think that the message should
> also include the context "identifier" if any.  But I'm specifically not
> proposing that, because of the point made in the discussion of that patch
> that some identifier strings might contain security-sensitive query
> excerpts.  Now that that commit has required all context "names" to be
> compile-time constants, it's hard to see how there could be any security
> downside to showing them in a user-visible message.
>
> Of course, by the same token, maybe this change wouldn't be as useful
> as I'm thinking.  But I can think of past cases where being able to
> distinguish, say, allocation failures in a query's global ExecutorState
> versus ones in an ExprState would save some effort.
>
>
This would have been significantly useful to me in the past, so +1 from me.
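For illustration, the proposed enrichment amounts to something like the following — a hypothetical stand-alone formatter, not the actual ereport()/errdetail() call site, with the context name assumed to come from the (compile-time constant) memory context name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the enriched out-of-memory detail line: report the name of
 * the memory context in which the allocation failed alongside the
 * request size, so "MessageContext" vs. "ExecutorState" is visible at
 * a glance.
 */
static void
format_oom_detail(char *buf, size_t buflen, size_t request, const char *ctx)
{
    snprintf(buf, buflen,
             "Failed on request of size %zu in memory context \"%s\".",
             request, ctx);
}
```

Applied to the log excerpt above, the DETAIL line would immediately distinguish a bloated MessageContext from a failure elsewhere.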

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Andrew Dunstan
On Thu, Mar 29, 2018 at 10:19 AM, Tom Lane  wrote:
> Andrew Dunstan  writes:
>> On Wed, Mar 28, 2018 at 5:30 PM, Andres Freund  wrote:
>>> +   /*
>>> +    * Missing value for added columns. This is a one element array
>>> +    * which lets us store a value of the attribute type here.
>>> +    */
>>> +   anyarray        attmissingval BKI_DEFAULT(_null_);
>>> #endif
>>> } FormData_pg_attribute;
>>>
>>> Still think this is a bad location, and it'll reduce cache hit ratio for
>>> catalog lookups.
>
>> As I think I mentioned before, this was discussed previously and as I
>> understood it this was the consensus location for it.
>
> I don't have a problem with putting that in pg_attribute (and I certainly
> agree with not putting it in pg_attrdef).  But "anyarray" seems like a
> damn strange, and bulky, choice.  Why not just make it a bytea holding the
> bits for the value, nothing more?
>


That's what we started with and Andres suggested we change it. One
virtue of the array is that is displays nicely when examining
pg_attribute. But changing it back would be easy enough.

cheers

andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 08:50:21PM -0400, Robert Haas wrote:
> On Tue, Mar 27, 2018 at 11:53 PM, Pavan Deolasee  
> wrote:
>> Yeah, we should not do that. The patch surely does not intend to replay any
>> more WAL than what we do today. The idea is to just use a different
>> mechanism to find the prior checkpoint. But we should surely find the latest
>> prior checkpoint. In the rare scenario that Tom showed, we should just throw
>> an error and fix the patch if it's not doing that already.
> 
> It's not clear to me that there is any reasonable fix short of giving
> up on this approach altogether.  But I might be missing something.

One way to get things done right is to scan recursively segments instead
of records.  Not impossible, but this introduces more code churn in
pg_rewind's findLastCheckpoint, which is now dead-simple as it just needs
to look at things backwards.  This way you can know what the real last
checkpoint is, as well as the point where WAL has forked.  So for
multiple checkpoints on the same segment, the lookup just needs to happen
up to the point where WAL has forked, while keeping track of the last
checkpoint record found.

If we want to keep compatibility with backward searches, it may be nice
to keep a mostly-compatible API layer in xlogreader.c.  Now I have
honestly not thought about that much.  And there may be other tools
which rely on being able to do backward searches.  Breaking
compatibility for them could cause wrath, because that would mean less
availability when running a rewind-like operation.

This thread makes me wonder as well if we are overlooking other
things...
--
Michael


signature.asc
Description: PGP signature


Re: JIT compiling with LLVM v12

2018-03-28 Thread Amit Langote
On 2018/03/29 9:35, Michael Paquier wrote:
> On Wed, Mar 28, 2018 at 06:24:53PM -0400, David Steele wrote:
>> On 3/28/18 6:09 PM, Andres Freund wrote:
>>> Hah! Happy to, if there's enough people interested.  I've a talk about
>>> it too (state of jit, 2018 edition), but I wasn't planning to go into
>>> too low level details. More about what is good, what is bad, and how we
>>> make it better ;)
>>
>> +1 for an unconference session.  This is some seriously cool stuff.
> 
> Take room for two sessions then, with a break in-between to give enough
> time to people to recover from the damage of the first session :)
> 
> Jokes apart, an unconference session at PGcon would be great.

+1

Thanks,
Amit




Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Peter Geoghegan
On Wed, Mar 28, 2018 at 5:47 AM, Pavan Deolasee
 wrote:
> Mostly a nitpick, but I guess we should leave a comment after
> IndexBuildHeapScan() saying heap_endscan() is not necessary since
> IndexBuildHeapScan() does that internally. I stumbled upon that while
> looking for any potential leaks. I know at least one other caller of
> IndexBuildHeapScan() doesn't bother to say anything either, but it's
> helpful.

Fair point. Again, I'm going to suggest deferring to the committer. I
seem to have decision fatigue this week.

> FWIW I also looked at the 0001 patch and it looks fine to me.

I'm grateful that you didn't feel any need to encourage me to use
whatever the novel/variant filter du jour is! :-)

-- 
Peter Geoghegan



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-28 Thread Robert Haas
On Tue, Mar 27, 2018 at 11:53 PM, Pavan Deolasee
 wrote:
> Yeah, we should not do that. The patch surely does not intend to replay any
> more WAL than what we do today. The idea is to just use a different
> mechanism to find the prior checkpoint. But we should surely find the latest
> prior checkpoint. In the rare scenario that Tom showed, we should just throw
> an error and fix the patch if it's not doing that already.

It's not clear to me that there is any reasonable fix short of giving
up on this approach altogether.  But I might be missing something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Peter Geoghegan
On Wed, Mar 28, 2018 at 5:04 AM, Pavan Deolasee
 wrote:
> Hmm Ok. It still sounds backwards to me, but then English is not my first or
> even second language. "A heap scan later verifies the presence in the heap
> of all index tuples fingerprinted" sounds as if we expect to find all
> fingerprinted index tuples in the heap. But in reality, we check if all heap
> tuples have a fingerprinted index tuple.

Why don't we leave this to the committer that picks the patch up? I
don't actually mind very much. I suspect that I am too close to the
problem to be sure that I've explained it in the clearest possible
way.
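For readers following along, the verification scheme under discussion can be sketched roughly as follows. This is a toy Bloom filter, not the actual amcheck implementation — the hash functions, key format, and fixed filter size are all simplified for illustration. Index tuples are fingerprinted into the filter, and the later heap scan probes it for each heap tuple's expected index entry; a "definitely absent" answer indicates corruption:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 4096        /* toy size; the real filter is sized from reltuples */

typedef struct BloomFilter
{
    unsigned char bits[FILTER_BITS / 8];
} BloomFilter;

/* Two cheap string hashes standing in for the real hash functions. */
static uint32_t
hash1(const char *key)
{
    uint32_t h = 2166136261u;           /* FNV-1a style mix */
    for (; *key; key++)
        h = (h ^ (unsigned char) *key) * 16777619u;
    return h;
}

static uint32_t
hash2(const char *key)
{
    uint32_t h = 5381;                  /* djb2 style mix */
    for (; *key; key++)
        h = h * 33 + (unsigned char) *key;
    return h;
}

/* Fingerprint one (simplified) index tuple key into the filter. */
static void
bloom_add(BloomFilter *f, const char *key)
{
    uint32_t a = hash1(key) % FILTER_BITS;
    uint32_t b = hash2(key) % FILTER_BITS;

    f->bits[a / 8] |= 1 << (a % 8);
    f->bits[b / 8] |= 1 << (b % 8);
}

/*
 * False positives are possible (corruption can be missed), but false
 * negatives are not: "definitely absent" means the index lacks an entry
 * for the heap tuple, which is what gets reported as corruption.
 */
static int
bloom_maybe_contains(const BloomFilter *f, const char *key)
{
    uint32_t a = hash1(key) % FILTER_BITS;
    uint32_t b = hash2(key) % FILTER_BITS;

    return (f->bits[a / 8] & (1 << (a % 8))) &&
           (f->bits[b / 8] & (1 << (b % 8)));
}
```

The one-sided error is the point: the filter can only under-report, never over-report, missing index entries.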

>> You're right - there is a narrow window for REPEATABLE READ and
>> SERIALIZABLE transactions. This is a regression in v6, the version that
>> removed the TransactionXmin test.
>>
>> I am tempted to fix this by calling GetLatestSnapshot() instead of
>> GetTransactionSnapshot(). However, that has a problem of its own -- it
>> won't work in parallel mode, and we're dealing with parallel
>> restricted functions, not parallel unsafe functions. I don't want to
>> change that to fix such a narrow issue. IMV, a better fix is to treat
>> this as a serialization failure. Attached patch, which applies on top
>> of v7, shows what I mean.
>
>
> Ok. I am happy with that.

Cool.

> Well pg_index entry will be visible and should be visible. Otherwise how do
> you even maintain a newly created index. I am not sure, but I guess we take
> fresh MVCC snapshots for catalog searches, even in RR transactions.

True, but that isn't what would happen with an SQL query that queries
the system catalogs. That only applies to how the system catalogs are
accessed internally, not how they'd almost certainly be accessed when
using amcheck.

>> > Are there any interesting cases around
>> > INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
>> > tuples, especially if those tuples were inserted/deleted by our own
>> > transaction? It probably worth thinking.
>
>
> Anything here that you would like to check? I understand that you may see
> such tuples only for catalog tables.

I think that the WARNING ought to be fine. We shouldn't ever see it,
but if we do it's probably due to a bug in an extension or something.
It doesn't seem particularly likely that someone could ever insert
into the table concurrently despite our having a ShareLock on the
relation. Even with corruption.

>> IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
>> ShareUpdateExclusiveLock (it just says SharedLock), because that lock
>> strength doesn't have anything to do with IndexBuildHeapRangeScan()
>> when it operates with an MVCC snapshot. I think that this means that
>> this patch doesn't need to update comments within
>> IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
>> independent.
>
>
> Ok, I agree. But note that we are now invoking that code with
> AccessShareLock() whereas the existing callers either use ShareLock or
> ShareUpdateExclusiveLock. That probably does not matter, but it's a change
> worth noting.

Fair point, even though the ShareUpdateExclusiveLock case isn't
actually acknowledged by IndexBuildHeapRangeScan(). Can we leave this
one up to the committer, too? I find it very hard to argue either for
or against this, and I want to avoid "analysis paralysis" at this
important time.

> While that's true, I am thinking if there is anything we can do to stop a
> consistency-checking utility from creating other non-trivial side effects on
> the state of the database, even if that means we can not check all heap
> tuples. For example, can there be a way by which we allow concurrent vacuum
> or HOT prune to continue to prune away dead tuples, even if that means
> running bt_check_index() in some specialised way (such as not allowing it in
> a transaction block) and the heap scan might miss out some tuples? I don't
> know the answer to that, but given that sometimes bt_check_index() may take
> hours if not days to finish, it seems worth thinking about or at least
> documenting the side-effects somewhere.

That seems like a worthwhile goal for a heap checker that only checks
the structure of the heap, rather than something that checks the
consistency of an index against its table. Especially one that can run
without a connection to the database, for things like backup tools,
where performance is really important. There is certainly room for
that. For this particular enhancement, the similarity to CREATE INDEX
seems like an asset.

-- 
Peter Geoghegan



Re: [HACKERS] Pluggable storage

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 12:23:56PM -0400, David Steele wrote:
> I think this entry should be moved to the next CF.  I'll do that
> tomorrow unless there are objections.

Instead of moving things to the next CF by default, perhaps it would
make more sense to mark things as reviewed with feedback as this is the
last CF?  There is a 5-month gap between this commit fest and the next
one, I am getting afraid of flooding the beginning of v12 development
cycle with entries which keep rotting over time.  If the author(s) claim
that they will be able to work on it, then that's of course fine.

Sorry for the digression, patches ignored across CFs contribute to the
bloat we see, and those eat the time of the CFM.
--
Michael




Re: JIT compiling with LLVM v12

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 06:24:53PM -0400, David Steele wrote:
> On 3/28/18 6:09 PM, Andres Freund wrote:
>> Hah! Happy to, if there's enough people interested.  I've a talk about
>> it too (state of jit, 2018 edition), but I wasn't planning to go into
>> too low level details. More about what is good, what is bad, and how we
>> make it better ;)
> 
> +1 for an unconference session.  This is some seriously cool stuff.

Take room for two sessions then, with a break in-between to give enough
time to people to recover from the damage of the first session :)

Jokes apart, an unconference session at PGcon would be great.
--
Michael




Re: Proposal: http2 wire format

2018-03-28 Thread Peter Eisentraut
On 3/28/18 12:09, Andres Freund wrote:
> Yea, not the most descriptive... Returning multiple different resultsets
> from a function / procedure. Inability to do so is a serious limitation
> of postgres in comparison to some other language with procedures.

This is already possible as far as the protocol is concerned.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fix a typo in walsender.c

2018-03-28 Thread atorikoshi

On 2018/03/29 7:23, Bruce Momjian wrote:


Thanks, patch applied.


Thank you for committing!





Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-03-28 Thread Dean Rasheed
On 28 March 2018 at 15:50, Tomas Vondra  wrote:
> After thinking about this a bit more, I'm not sure if updating the info
> based on recursive calls makes sense. The fullmatch flag was supposed to
> answer a simple question - can there be just a single matching item?
>
> If there are equality conditions on all columns, there can be just a
> single matching item - if we have found it in the MCV (i.e. s1 > 0.0),
> then we don't need to inspect the non-MCV part.
>
> But handling this in recursive manner breaks this entirely, because with
> something like
>
>(a=1) AND (b=1 OR b=2)
>
> you suddenly can have multiple matching items. Which makes the fullmatch
> flag somewhat useless.
>
> So I think we should be looking at top-level equality clauses only, just
> like number_of_groups() does.
>

I'm not quite sure what you mean by that, but it sounds a bit limiting
in terms of the kinds of user queries that would be supported.


> I think we can remove the fullmatch flag from mcv_update_bitmap
> entirely. All we need to know is the presence of equality clauses and
> whether there was a match in MCV (which we know from s1 > 0.0).
>

I agree with removing the fullmatch flag, but I don't think we
actually need to know about the presence of equality clauses:

The way that mcv_update_bitmap() recursively computes the set of
matching MCVs seems to be correct. That gives us a value (call it
mcv_matchsel) for the proportion of the table's rows that are in the
MCV list and satisfy the clauses in stat_clauses.

We can also estimate that there are (1-mcv_totalsel)*N rows that are
not in the MCV list, for which the MCV stats therefore tell us
nothing. The best way to estimate those rows would seem to be to use
the logic from the guts of clauselist_selectivity(), without
consulting any extended MCV stats (but still consulting other extended
stats, I think). Doing that would return a selectivity value (call it
nonmcv_sel) for those remaining rows. Then a reasonable estimate for
the overall selectivity would seem to be

  mcv_matchsel + (1-mcv_totalsel) * nonmcv_sel

and there would be no need for mcv_update_bitmap() to track eqmatches
or return fullmatch, and it wouldn't actually matter whether or not we
had equality clauses or if all the MCV columns were used.
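The arithmetic above can be sketched as a one-line helper (hypothetical; the names mcv_matchsel, mcv_totalsel, and nonmcv_sel follow the text rather than any existing function):

```c
#include <assert.h>

/*
 * Combine MCV-based and non-MCV selectivity estimates: rows matching
 * within the MCV list, plus the fraction of the table outside the MCV
 * list scaled by the plain clauselist estimate for those rows.
 */
static double
combined_sel(double mcv_matchsel, double mcv_totalsel, double nonmcv_sel)
{
    return mcv_matchsel + (1.0 - mcv_totalsel) * nonmcv_sel;
}
```

For example, if the MCV list covers 60% of the rows, 10% of the table matches inside the MCV list, and the clauses select 25% of the remaining 40%, the overall estimate is 0.10 + 0.4 * 0.25 = 0.20.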

Regards,
Dean



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-28 Thread Tom Lane
Claudio Freire  writes:
> v10 counted the number of blocks with updated free space to vacuum the
> FSM only after a lot of changes to it were made. This will vacuum the
> FSM after *scanning* a lot of pages, even if little modifications were
> made to them.

Yes, that's exactly the point.  We still need to update the upper pages of
the FSM, regardless of exactly how many pages were changed by VACUUM
underneath.  The pages VACUUM didn't change might still be out of date in
the upper pages, since retail updates from other operations don't
immediately propagate; so there's a fixed amount of work that needs to be
done there and we might as well do it in more-or-less-predictable quanta.
I don't see any point in adding counting overhead/complexity for that.
In fact, I seriously considered just making it update after every
VACUUM_FSM_EVERY_PAGES period, but decided that for the case with indexes,
doing it right after each cleanup cycle was sensible, and then we might as
well make the no-index case look as much like that as we conveniently can.
The no-index case seems vestigial anyway; how often is that really going
to apply?  So spending a lot of complexity on it doesn't seem warranted,
especially when the argument for more complexity is at best dubious.

> This doesn't seem correct.

[ thinks for a bit... ]  Yeah, you're right, we need to round up not down
when determining the last slot to scan on the upper level.

I wonder how hard it is to avoid rounding up when the stop point actually
does fall right at a FSM page boundary.  OTOH, that might not happen often
enough to be worth sweating over.
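The rounding issue reduces to ordinary ceiling division (illustrative only; the real code works in terms of FSM addresses and the slots-per-page constant rather than raw block counts):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of upper-level slots needed to cover blocks [0, stop_block):
 * round up, so a partially-covered page still gets scanned.  Note the
 * case Tom mentions: when stop_block is an exact multiple of
 * slots_per_page, this formula adds no extra partial slot.
 */
static uint64_t
slots_to_scan(uint64_t stop_block, uint64_t slots_per_page)
{
    return (stop_block + slots_per_page - 1) / slots_per_page;
}
```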

regards, tom lane



Re: [bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary

2018-03-28 Thread Michael Paquier
On Thu, Mar 29, 2018 at 04:04:58AM +0900, Fujii Masao wrote:
> Pushed. Thanks!

Nice, thanks.
--
Michael




Re: Trigger file behavior with the standby

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 12:23:31PM -0700, Keiko Oda wrote:
> Thanks a lot for the answer, Michael (and sorry for the slow response)!

No problem.

> So, if I understand what you're saying correctly, I'm seeing this behavior
> because wal-e keeps fetching wal files from s3 regardless of this
> trigger_file, and these fetched wal files are in pg_wal (or pg_xlog),
> therefore Postgres just tries to restore whatever available in pg_wal
> before the failover. Or, even if there is no file in pg_wal, it still tries
> to fetch from the "archive" (s3).
> In other words, if I would like to do "immediate failover" (and do not care
> about WAL files available in archive or in pg_wal), I should be tweaking
> restore_command so that no further fetching/restoring happens.
> Is it... accurate?

Per the code and the documentation, the current behavior is clearly
intentional.  If you think about it, it can be relatively important
especially in the case of a base backup taken without WAL segments in
pg_wal while relying on a separate archive: this gives more guarantees
that the consistent point will be reached.  That also covers a bit what
people can look for in some cases with recovery_target = 'immediate'.

You could indeed tweak the restore command.  If a failure happens while
attempting to fetch a WAL segment, then the recovery would immediately
stop.  If you try to trigger a promotion without reaching a consistency
point, then you would get a complain from the startup process.  There
are some safeguards for this purpose.
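As a sketch of the tweak discussed above (paths and archive command are illustrative; adapt to your tooling), recovery.conf could gate the archive fetch on the trigger file, so that restore_command starts failing as soon as promotion is requested:

```
# recovery.conf (paths are illustrative)
# Stop restoring from the archive once the trigger file appears;
# a failing restore_command ends archive recovery at that point.
restore_command = 'test ! -f /tmp/postgresql.trigger && wal-e wal-fetch "%f" "%p"'
trigger_file = '/tmp/postgresql.trigger'
```

As noted, promotion before the consistent point is reached will still be refused by the startup process, so this only shortens replay of WAL beyond consistency.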

Please don't take me wrong.  There is room for a feature which does more
efficiently what you are looking for, but that would be a separate
problem.

(Sakura season here by the way, they are blooming this week)
--
Michael




Re: Using base backup exclusion filters to reduce data transferred with pg_rewind

2018-03-28 Thread Michael Paquier
On Thu, Mar 29, 2018 at 05:01:40AM +0900, Fujii Masao wrote:
> Committed. Thanks!

Thanks for including as well the documentation changes and committing
it.  The result looks good to me.
--
Michael




Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Tom Lane
Andrew Dunstan  writes:
> On Wed, Mar 28, 2018 at 5:30 PM, Andres Freund  wrote:
>> +   /*
>> +    * Missing value for added columns. This is a one element array
>> +    * which lets us store a value of the attribute type here.
>> +    */
>> +   anyarray        attmissingval BKI_DEFAULT(_null_);
>> #endif
>> } FormData_pg_attribute;
>> 
>> Still think this is a bad location, and it'll reduce cache hit ratio for
>> catalog lookups.

> As I think I mentioned before, this was discussed previously and as I
> understood it this was the consensus location for it.

I don't have a problem with putting that in pg_attribute (and I certainly
agree with not putting it in pg_attrdef).  But "anyarray" seems like a
damn strange, and bulky, choice.  Why not just make it a bytea holding the
bits for the value, nothing more?

regards, tom lane



Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Andrew Dunstan
On Wed, Mar 28, 2018 at 5:30 PM, Andres Freund  wrote:
> Hi,
>
> On 2018-03-26 09:02:10 +1030, Andrew Dunstan wrote:
>> Thanks for this, all looks good. Here is the consolidate patch
>> rebased. If there are no further comments I propose to commit this in
>> a few days time.
>
> Here's a bit of post commit review:
>
> @@ -1310,13 +1679,11 @@ slot_getsomeattrs(TupleTableSlot *slot, int attnum)
>
> /*
>  * If tuple doesn't have all the atts indicated by tupleDesc, read the
> -* rest as null
> +* rest as NULLs or missing values
>  */
> -   for (; attno < attnum; attno++)
> -   {
> -   slot->tts_values[attno] = (Datum) 0;
> -   slot->tts_isnull[attno] = true;
> -   }
> +   if (attno < attnum)
> +   slot_getmissingattrs(slot, attno, attnum);
> +
> slot->tts_nvalid = attnum;
>  }
>
> It's worthwhile to note that this'll re-process *all* missing values,
> even if they've already been deformed. I.e. if
> slot_getmissingattrs(.., first-missing)
> slot_getmissingattrs(.., first-missing + 1)
> slot_getmissingattrs(.., first-missing + 2)
> is called, all three missing values will be copied every time. That's
> because tts_nvalid isn't taken into account.  I wonder if slot_getmissingattrs
> could take tts_nvalid into account?
>
> I also wonder if this doesn't deserve an unlikely(), to avoid the cost
> of spilling registers in the hot branch of not missing any values.
>
>


One of us at least is very confused about this function.
slot_getmissingattrs() gets called at most once by slot_getsomeattrs
and never for any attribute that's already been deformed.



> +   else
> +   {
> +       /* if there is a missing values array we must process them one by one */
> +       for (missattnum = lastAttNum - 1;
> +            missattnum >= startAttNum;
> +            missattnum--)
> +       {
> +           slot->tts_values[missattnum] = attrmiss[missattnum].ammissing;
> +           slot->tts_isnull[missattnum] =
> +               !attrmiss[missattnum].ammissingPresent;
> +       }
> +   }
> +}
>
> Why is this done backwards? It's noticeably slower to walk arrays
> backwards on some CPU microarchitectures.
>
>

I'll change it.


>
> @@ -176,6 +179,23 @@ CreateTupleDescCopyConstr(TupleDesc tupdesc)
> }
> }
>
>
>
> @@ -469,6 +503,29 @@ equalTupleDescs(TupleDesc tupdesc1, TupleDesc tupdesc2)
> if (strcmp(defval1->adbin, defval2->adbin) != 0)
> return false;
> }
> +   if (constr1->missing)
> +   {
> +   if (!constr2->missing)
> +   return false;
> +   for (i = 0; i < tupdesc1->natts; i++)
> +   {
> +   AttrMissing *missval1 = constr1->missing + i;
> +   AttrMissing *missval2 = constr2->missing + i;
>
> It's a bit odd to not use array indexing here?
>


*shrug* Maybe. I'll change it if you like, doesn't seem that important or odd.

>
> @@ -625,7 +625,15 @@ ExecCopySlotMinimalTuple(TupleTableSlot *slot)
> if (slot->tts_mintuple)
> return heap_copy_minimal_tuple(slot->tts_mintuple);
> if (slot->tts_tuple)
> -   return minimal_tuple_from_heap_tuple(slot->tts_tuple);
> +   {
> +   if (TTS_HAS_PHYSICAL_TUPLE(slot) &&
> +   HeapTupleHeaderGetNatts(slot->tts_tuple->t_data)
> +   < slot->tts_tupleDescriptor->natts)
> +           return minimal_expand_tuple(slot->tts_tuple,
> +                                       slot->tts_tupleDescriptor);
> +   else
> +   return minimal_tuple_from_heap_tuple(slot->tts_tuple);
> +   }
>
>
> What's the TTS_HAS_PHYSICAL_TUPLE doing here? How can this be relevant
> given the previous tts_mintuple check? Am I missing something?
>
>

Hmm, that dates back to the original patch. My bad, I should have
picked that up. I'll just remove it.


>
> @@ -563,10 +569,63 @@ RelationBuildTupleDesc(Relation relation)
>
>                     MemoryContextAllocZero(CacheMemoryContext,
>                                            relation->rd_rel->relnatts *
>                                            sizeof(AttrDefault));
> -           attrdef[ndef].adnum = attp->attnum;
> +           attrdef[ndef].adnum = attnum;
>             attrdef[ndef].adbin = NULL;
> +
>             ndef++;
>         }
> +
> +       /* Likewise for a missing value */
> +       if 

Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Wed, Mar 28, 2018 at 4:19 PM, Isaac Morland 
wrote:

> On 28 March 2018 at 15:43, Joshua D. Drake  wrote:
>
>> On 03/28/2018 12:35 PM, David G. Johnston wrote:
>>
>> I like to call it "Character Separated Values" now for just that reason.
>>
>>
>> Isn't the actual wording Character Delimited Values? I may be picking at
>> hairs here but every single time I use anything to import a CSV or other
>> delimited file (TAB or | usually) that is what the import screen says.
>>
>
> CSV stands for Comma Separated Values, and not anything else common as far
> as I can tell.
>

Language evolves - Wikipedia just hasn't recognized this particular
evolution yet :)

The actual reason I'm posting this is because some of the discussion has
> made me a bit confused: there is already a CSV format defined for the COPY
> command and used by the psql \copy. I just want to check that what is being
> discussed here would use the exact same format as the existing CSV COPY
> format; and the configurability of them should be the exact same options
> (which already includes being able to change the delimiter).
>

Nope, the \copy meta-command, and corresponding COPY SQL command, have
additional options that I don't believe will be accessible here.
Specifically the "OIDS", "QUOTE", and various "FORCE_" options are not
controllable, nor are ESCAPE and ENCODING (this last I think, or at least
not as flexible).


> I think it's important that Postgres not have two CSVs with slightly
> different behaviours. Indeed, psql could use COPY behind the scenes to
> generate the CSV output, which would guarantee no nasty surprises.
>
> If this is already widely accepted or if I'm horribly misunderstanding the
> discussion then I'm sorry for the noise.
>

I agree this is a possible approach, but the present thinking is that this
is just another "\pset format" option with forced defaults excepting the
delimiter (DELIMITER), presence of the column names (HEADER), and NULL (I
think...)

Issuing \copy with the equivalent settings should result in the same
output...but the two mechanisms are distinct for better and worse.

David J.


Re: csv format for psql

2018-03-28 Thread Isaac Morland
On 28 March 2018 at 15:43, Joshua D. Drake  wrote:

> On 03/28/2018 12:35 PM, David G. Johnston wrote:
>
> I like to call it "Character Separated Values" now for just that reason.
>
>
> Isn't the actual wording Character Delimited Values? I may be picking at
> hairs here but every single time I use anything to import a CSV or other
> delimited file (TAB or | usually) that is what the import screen says.
>

CSV stands for Comma Separated Values, and not anything else common as far
as I can tell. A Google search for "csv" turns up the Wikipedia page
describing the file format as the first hit, followed by the Wikipedia
disambiguation page for CSV, which links to the aforementioned Wikipedia
page as the only file-format-related link.

The actual reason I'm posting this is because some of the discussion has
made me a bit confused: there is already a CSV format defined for the COPY
command and used by the psql \copy. I just want to check that what is being
discussed here would use the exact same format as the existing CSV COPY
format; and the configurability of them should be the exact same options
(which already includes being able to change the delimiter). I think it's
important that Postgres not have two CSVs with slightly different
behaviours. Indeed, psql could use COPY behind the scenes to generate the
CSV output, which would guarantee no nasty surprises.

If this is already widely accepted or if I'm horribly misunderstanding the
discussion then I'm sorry for the noise.


Re: disable SSL compression?

2018-03-28 Thread Andres Freund


On March 28, 2018 4:15:02 PM PDT, Peter Eisentraut 
 wrote:
>On 3/28/18 13:26, Konstantin Knizhnik wrote:
>> If SSL compression is deprecated, should we provide own compression?
>> I have implemented some prototype implementation of it (patch is
>attached).
>> I have added compression=on/off parameter to connection string and -Z
>> option to psql and pgbench utilities.
>
>What I'd like to see here is extensive protocol documentation that
>describes the compression method negotiation, and the interaction with
>SSL, and a test suite to support that.
>
>Maybe start a new thread.

+analysis of whether that's safe to do from a cryptographic POV. There's a 
reason compression has been disabled for just about all SSL/TLS libraries.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: committing inside cursor loop

2018-03-28 Thread Peter Eisentraut
On 3/28/18 11:34, Ildus Kurbangaliev wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
> 
> I have checked new version. Although I can miss something,  the patch looks 
> good to me.
> 
> The new status of this patch is: Ready for Committer

Committed, thanks!

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: disable SSL compression?

2018-03-28 Thread Peter Eisentraut
On 3/28/18 13:26, Konstantin Knizhnik wrote:
> If SSL compression is deprecated, should we provide own compression?
> I have implemented some prototype implementation of it (patch is attached).
> I have added compression=on/off parameter to connection string and -Z
> option to psql and pgbench utilities.

What I'd like to see here is extensive protocol documentation that
describes the compression method negotiation, and the interaction with
SSL, and a test suite to support that.

Maybe start a new thread.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PROPOSAL] Shared Ispell dictionaries

2018-03-28 Thread Arthur Zakirov
Here is the new version of the patch.

Now RemoveTSDictionaryById() and AlterTSDictionary() unpin the
dictionary DSM segment. So once all attached backends disconnect, the
allocated DSM segments will be released.

lookup_ts_dictionary_cache() may unpin the DSM mapping for all invalid
dictionary cache entries.

I added xmax to DictPointerData, and it is now used as part of the lookup
key too. This makes it possible to reload a dictionary after a DROP
command is rolled back.
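To illustrate one way the extra key field can help, here is a toy sketch (Python, all names invented; not the patch's actual C code): once a DROP stamps the catalog tuple with a different xmax, lookups that include xmax in the key map to a fresh segment instead of reusing the entry that the rolled-back DROP had already unpinned.

```python
# Hypothetical sketch of a shared-segment cache keyed by the dictionary's
# catalog tuple identity (dict OID, xmin, xmax).  All names are invented.

class DictCache:
    def __init__(self):
        self._segments = {}

    def lookup(self, dict_id, xmin, xmax, build):
        # Including xmax in the key: a tuple touched by a (possibly
        # rolled-back) DROP maps to a different entry than before.
        key = (dict_id, xmin, xmax)
        if key not in self._segments:
            self._segments[key] = build()
        return self._segments[key]
```

Repeated lookups with the same tuple identity share one segment; a changed xmax forces a rebuild.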

There was a bug in ts_dict_shmem_location(); I fixed it.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index b9fdd77e19..e071994523 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -78,6 +78,8 @@
 #define tmpalloc(sz)  MemoryContextAlloc(Conf->buildCxt, (sz))
 #define tmpalloc0(sz)  MemoryContextAllocZero(Conf->buildCxt, (sz))
 
+#define tmpstrdup(str) MemoryContextStrdup(Conf->buildCxt, (str))
+
 /*
  * Prepare for constructing an ISpell dictionary.
  *
@@ -498,7 +500,7 @@ NIAddSpell(IspellDict *Conf, const char *word, const char 
*flag)
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + 
strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0')
-   ? cpstrdup(Conf, flag) : VoidString;
+   ? tmpstrdup(flag) : VoidString;
Conf->nspell++;
 }
 
@@ -1040,7 +1042,7 @@ setCompoundAffixFlagValue(IspellDict *Conf, 
CompoundAffixFlag *entry,
entry->flag.i = i;
}
else
-   entry->flag.s = cpstrdup(Conf, s);
+   entry->flag.s = tmpstrdup(s);
 
entry->flagMode = Conf->flagMode;
entry->value = val;
@@ -1536,6 +1538,9 @@ nextline:
return;
 
 isnewformat:
+   pfree(recoded);
+   pfree(pstr);
+
if (oldformat)
ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/contrib/dict_int/dict_int.c b/contrib/dict_int/dict_int.c
index 56ede37089..8dd4959028 100644
--- a/contrib/dict_int/dict_int.c
+++ b/contrib/dict_int/dict_int.c
@@ -30,7 +30,7 @@ PG_FUNCTION_INFO_V1(dintdict_lexize);
 Datum
 dintdict_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
DictInt*d;
ListCell   *l;
 
@@ -38,7 +38,7 @@ dintdict_init(PG_FUNCTION_ARGS)
d->maxlen = 6;
d->rejectlong = false;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c
index a79ece240c..0b8a32d459 100644
--- a/contrib/dict_xsyn/dict_xsyn.c
+++ b/contrib/dict_xsyn/dict_xsyn.c
@@ -140,7 +140,7 @@ read_dictionary(DictSyn *d, const char *filename)
 Datum
 dxsyn_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
DictSyn*d;
ListCell   *l;
char   *filename = NULL;
@@ -153,7 +153,7 @@ dxsyn_init(PG_FUNCTION_ARGS)
d->matchsynonyms = false;
d->keepsynonyms = true;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index 247c202755..2a2fbee5fa 100644
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -267,12 +267,12 @@ PG_FUNCTION_INFO_V1(unaccent_init);
 Datum
 unaccent_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
TrieChar   *rootTrie = NULL;
boolfileloaded = false;
ListCell   *l;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/src/backend/commands/tsearchcmds.c 
b/src/backend/commands/tsearchcmds.c
index 3a843512d1..3753e32b2c 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -386,17 +386,25 @@ verify_dictoptions(Oid tmplId, List *dictoptions)
}
else
{
+   DictInitData init_data;
+
/*
 * Copy the options just in case init method thinks it can 
scribble on
 * them ...
 */
dictoptions = copyObject(dictoptions);
 
+   init_data.dict_options = dictoptions;
+   init_data.dict.id = InvalidOid;
+   init_data.dict.xmin = InvalidTransactionId;
+   init_data.dict.xmax = InvalidTransactionId;
+   

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-28 Thread Claudio Freire
On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane  wrote:
> Claudio Freire  writes:
>> Attached patches, rebased and modified as discussed:
>> 1 no longer does tree pruning, it just vacuums a range of the FSM
>> 2 reintroduces tree pruning for the initial FSM vacuum
>> 3 and 4 remain as they were, but rebased
>
> I reviewed and cleaned up 0001.  The API for FreeSpaceMapVacuumRange was
> underspecified, and the use of it seemed to have off-by-one errors.  Also,
> you still had vacuum doing a full FreeSpaceMapVacuum call at the end;
> I thought the point was to get rid of that.  We then need to make sure
> we clean up after a truncation, but we can do that by introducing a
> FreeSpaceMapVacuumRange call into FreeSpaceMapTruncateRel.  I think the
> attached 0001 is committable, but feel free to take another look.

+
+ /*
+  * Periodically do incremental FSM vacuuming to make newly-freed
+  * space visible on upper FSM pages.  Note: although we've cleaned
+  * the current block, we haven't yet updated its FSM entry (that
+  * happens further down), so passing end == blkno is correct.
+  */
+ if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+ {
+ FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum,
+ blkno);
+ next_fsm_block_to_vacuum = blkno;
+ }
  }

Using next_fsm_block_to_vacuum there seems strange.

v10 counted the number of blocks with updated free space and vacuumed the
FSM only after a lot of changes had been made to it. This will vacuum the
FSM after *scanning* a lot of pages, even if few modifications were
made to them. Not sure which is better, but the logic in v10 sounded
right to me.
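For concreteness, a toy sketch of the two trigger policies (Python, invented constants; not the patch code). With mostly-clean pages, the scanned-blocks policy still fires regularly, while the v10-style modified-blocks policy may never fire:

```python
# Stand-in for VACUUM_FSM_EVERY_PAGES; the real constant is ~8GB of blocks.
FSM_EVERY = 4

def fsm_vacuum_points_by_scanned(blocks):
    """Committed approach: vacuum FSM every FSM_EVERY scanned blocks,
    whether or not they were modified."""
    points, last = [], 0
    for blkno, _modified in blocks:
        if blkno - last >= FSM_EVERY:
            points.append(blkno)
            last = blkno
    return points

def fsm_vacuum_points_by_modified(blocks):
    """v10-style approach: vacuum FSM only once FSM_EVERY blocks have
    actually had their free space updated."""
    points, changed = [], 0
    for blkno, modified in blocks:
        changed += modified
        if changed >= FSM_EVERY:
            points.append(blkno)
            changed = 0
    return points
```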

! if (fsm_end.logpageno == addr.logpageno)
! end_slot = fsm_end_slot;
! else if (fsm_end.logpageno > addr.logpageno)
! end_slot = SlotsPerFSMPage;
! else
! end_slot = -1;
!
! for (slot = start_slot; slot < end_slot; slot++)

This doesn't seem correct.

The +1 in v10 was intentional. Suppose a leaf node of the FSM has a
parent in slot 3, and that one has a parent at slot 10.

This would vacuum slots 0-9 on the upper level, but not 10. That would
leave a whole branch (the one where the end block belongs) unvisited.

That's why the end_slot has to include the end block's slot: we still
have work to do when recursing into it.
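A toy model of the slot-3/slot-10 example above (Python, invented structure; not the FSM code) shows why the loop has to be inclusive of end_slot:

```python
def vacuum_upper_page(children, start_slot, end_slot, inclusive=True):
    """Visit slots start_slot..end_slot on an upper page and recurse
    into each slot's child branch (modeled as a dict of leaf lists)."""
    visited = []
    stop = end_slot + 1 if inclusive else end_slot
    for slot in range(start_slot, stop):
        visited.append(("slot", slot))
        visited.extend(children.get(slot, []))
    return visited
```

With the end block's parent in slot 10, an exclusive loop over slots 0-9 never descends into slot 10's branch, leaving the end block unvisited.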


> I still don't like 0002.  It's adding a lot of complexity, and
> not-negligible overhead, to solve yesterday's problem.

Are you sure it's non-negligible?

Most of my benchmarks couldn't measure any change whatsoever after
this patch in terms of run/CPU time.

The size of the FSM is so much smaller than the table, that the cost
of vacuuming it is drowned by all the other work done and buried under
the noise.

Maybe in very special cases where vacuum does nothing, skipping all
pages, it might be measurable. A heavily bloated-then-cleaned table
with few live rows per page, but all frozen, would be a total
worst case. But reading the visibility map to skip pages is comparable
work to vacuuming the FSM. There's no reason to think it would worsen
by *that* much. I might try to benchmark that a bit after the long
weekend.

Anyway, it's a separate patch to be independently committed/vetted.

> After 0001,
> there's no reason to assume that vacuum is particularly likely to get
> cancelled between having made cleanups and having updated the upper FSM
> levels.  (Maybe the odds are a bit more for the no-indexes case, but
> that doesn't seem like it's common enough to justify a special mechanism
> either.)

Why not?

Any kind of DDL (even those that don't rewrite the heap) would cancel
autovacuum.

You might think DDL isn't common enough to worry about, but I've seen
cases where regular REINDEX runs were required to keep index bloat in
check (and were cron'd), and those cancel autovacuum all the time.

> Not sure about 0004 either.  The fact that we can't localize what part of
> the index needs to be updated means that repeated IndexFreeSpaceMapVacuum
> calls are a roughly quadratic cost.

Well, the index bulk delete itself already is a huge elephant-sized
quadratic cost.

The FSM is only the cherry on top.

The updated part can't be localized, because it isn't local: all blocks
could potentially be changed. Even in correlated indexes, the upper
levels need not be physically correlated, and they would defeat the
"vacuum block range" optimization.

I could try to optimize the case where it's possible, by recording the
first and last blocks modified, but that would be a hugely invasive
patch (it would have to touch a lot of places in btree code).
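If the range-tracking idea were attempted, the bookkeeping itself is trivial; this sketch (Python, invented names) shows it, and deliberately leaves out the hard part, which is threading the calls through the btree code:

```python
class ModifiedRange:
    """Record the lowest and highest block an index pass touched, so that
    only that FSM span needs vacuuming afterwards."""

    def __init__(self):
        self.first = None
        self.last = None

    def note(self, blkno):
        if self.first is None or blkno < self.first:
            self.first = blkno
        if self.last is None or blkno > self.last:
            self.last = blkno

    def span(self):
        # None means nothing was modified, so no FSM vacuum is needed.
        return None if self.first is None else (self.first, self.last)
```

As noted above, uncorrelated upper-level pages can stretch the recorded span toward the whole index, which is what limits the payoff.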

And index bloat is a very real issue, as bad as heap bloat is.

>  Maybe in proportion to the other work
> we have to do, they're not much, but on the other hand how much 

Re: Updating parallel.sgml's treatment of parallel joins

2018-03-28 Thread Thomas Munro
On Fri, Mar 23, 2018 at 6:26 AM, Robert Haas  wrote:
> On Fri, Feb 23, 2018 at 10:30 PM, Thomas Munro
>  wrote:
>> Here is an attempt at updating parallel.sgml to cover Parallel Hash.
>> I will be neither surprised nor offended if Robert would like to put
>> it differently!
>
> Looks nice.  Committed.

Thanks.

One thing that is slightly odd about that page[1] is that some places
use the style "parallel sequential scan" (lower
case, emphasis on first mention, head noun = "scan" or "join") and
other places use the style "Partial Aggregate node"
(title case, fixed width typeface, head noun = "node").

I think that page also needs a new  for "Parallel Append".
Perhaps the authors of commit ab727167 would like to write a patch for
that?  I'll be interested to see which style they go for...

[1] https://www.postgresql.org/docs/devel/static/parallel-plans.html

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Fix for pg_stat_activity putting client hostaddr into appname field

2018-03-28 Thread Edmund Horner
I sent the original in haste, and now I need to make some corrections... sigh.

> Subject: Fix for pg_stat_activity putting client hostaddr into appname field

Actually, it's the hostname that appears in the appname field.

> I noticed when querying pg_stat_activity (in 10.1):

10.1 was where I first noticed the bug, but it's still present in master.

> I've tracked this down to bootstrap/pgstat.c.

Should be postmaster/pgstat.c.

> In the case of my query, the pointers for st_appname in the aux processes 
> happen to point into the BackendClientHostnameBuffer segment.

To be clear, I think this is a memory error.  These rogue pointers
could do a lot more damage than merely pointing to the wrong strings.

> It's an extra 3 kB ...

That rough estimate was also wrong.  7 aux processes * (1024 bytes
activity + 64 for hostname + 64 for appname) = about 8 kB.
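For the record, the corrected arithmetic:

```python
# 7 aux processes, each with an activity string plus hostname and
# appname buffers (sizes as given above).
aux_processes = 7
per_process = 1024 + 64 + 64   # bytes: activity + hostname + appname
total = aux_processes * per_process  # 8064 bytes, i.e. just under 8 kB
```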

I do apologise for the confusion!

Edmund



Re: JIT compiling with LLVM v12

2018-03-28 Thread David Steele

On 3/28/18 6:09 PM, Andres Freund wrote:
> On 2018-03-28 18:06:24 -0400, Peter Eisentraut wrote:
>> On 3/28/18 17:27, Andres Freund wrote:
>>> I've pushed these three.
>>
>> Great, now the only thing remaining is to prepare an unconference
>> session explaining all this to the rest of us. ;-)
>
> Hah! Happy to, if there's enough people interested.  I've a talk about
> it too (state of jit, 2018 edition), but I wasn't planning to go into
> too low level details. More about what is good, what is bad, and how we
> make it better ;)

+1 for an unconference session.  This is some seriously cool stuff.

--
-David
da...@pgmasters.net



Re: Fix a typo in walsender.c

2018-03-28 Thread Bruce Momjian
On Tue, Feb 27, 2018 at 07:22:20PM +0900, atorikoshi wrote:
> Hi,
> 
> Attached a minor patch for a typo: s/log/lag
> 
> Regards,
> 
> -- 
> Atsushi Torikoshi
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

> diff --git a/src/backend/replication/walsender.c 
> b/src/backend/replication/walsender.c
> index d46374d..8bb1919 100644
> --- a/src/backend/replication/walsender.c
> +++ b/src/backend/replication/walsender.c
> @@ -1245,7 +1245,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr 
> lsn, TransactionId xid,
>  /*
>   * LogicalDecodingContext 'update_progress' callback.
>   *
> - * Write the current position to the log tracker (see XLogSendPhysical).
> + * Write the current position to the lag tracker (see XLogSendPhysical).
>   */
>  static void
>  WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, 
> TransactionId xid)

Thanks, patch applied.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: JIT compiling with LLVM v12

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 18:06:24 -0400, Peter Eisentraut wrote:
> On 3/28/18 17:27, Andres Freund wrote:
> > I've pushed these three.
> 
> Great, now the only thing remaining is to prepare an unconference
> session explaining all this to the rest of us. ;-)

Hah! Happy to, if there's enough people interested.  I've a talk about
it too (state of jit, 2018 edition), but I wasn't planning to go into
too low level details. More about what is good, what is bad, and how we
make it better ;)

Greetings,

Andres Freund



Re: WIP: Covering + unique indexes.

2018-03-28 Thread Peter Eisentraut
On 1/25/18 23:19, Thomas Munro wrote:
> +  PRIMARY KEY ( <replaceable class="parameter">column_name</replaceable> [, ... ] ) <replaceable class="parameter">index_parameters</replaceable> <optional>INCLUDE
> (<replaceable class="parameter">column_name</replaceable> [,
> ...])</optional> |
> 
> I hadn't seen that use of "<optional>" before.  Almost everywhere else
> we use explicit [ and ] characters, but I see that there are other
> examples, and it is rendered as [ and ] in the output.

I think this will probably not come out right in the generated psql
help.  Check that please.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: JIT compiling with LLVM v12

2018-03-28 Thread Peter Eisentraut
On 3/28/18 17:27, Andres Freund wrote:
> I've pushed these three.

Great, now the only thing remaining is to prepare an unconference
session explaining all this to the rest of us. ;-)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-28 Thread Tom Lane
Claudio Freire  writes:
> Attached patches, rebased and modified as discussed:
> 1 no longer does tree pruning, it just vacuums a range of the FSM
> 2 reintroduces tree pruning for the initial FSM vacuum
> 3 and 4 remain as they were, but rebased

I reviewed and cleaned up 0001.  The API for FreeSpaceMapVacuumRange was
underspecified, and the use of it seemed to have off-by-one errors.  Also,
you still had vacuum doing a full FreeSpaceMapVacuum call at the end;
I thought the point was to get rid of that.  We then need to make sure
we clean up after a truncation, but we can do that by introducing a
FreeSpaceMapVacuumRange call into FreeSpaceMapTruncateRel.  I think the
attached 0001 is committable, but feel free to take another look.

I still don't like 0002.  It's adding a lot of complexity, and
not-negligible overhead, to solve yesterday's problem.  After 0001,
there's no reason to assume that vacuum is particularly likely to get
cancelled between having made cleanups and having updated the upper FSM
levels.  (Maybe the odds are a bit more for the no-indexes case, but
that doesn't seem like it's common enough to justify a special mechanism
either.)

Not sure what to think about 0003.  At this point I'd be inclined
to flush UpdateFreeSpaceMap altogether and use FreeSpaceMapVacuumRange
in its place.  I don't think the design of that function is any better
chosen than its name, and possible bugs in its subroutines don't make
it more attractive.

Not sure about 0004 either.  The fact that we can't localize what part of
the index needs to be updated means that repeated IndexFreeSpaceMapVacuum
calls are a roughly quadratic cost.  Maybe in proportion to the other work
we have to do, they're not much, but on the other hand how much benefit is
there?  Should we make the call conditional on how many index pages got
updated?  Also, I wonder why you only touched nbtree and spgist.  (For
that matter, I wonder why BRIN doesn't go through IndexFreeSpaceMapVacuum
like the rest of the index AMs.  Or perhaps it has the right idea and we
should get rid of IndexFreeSpaceMapVacuum as a useless layer.)

regards, tom lane

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index f9da24c..d2a0066 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***
*** 85,90 
--- 85,99 
  #define VACUUM_TRUNCATE_LOCK_TIMEOUT			5000	/* ms */
  
  /*
+  * When a table has no indexes, vacuum the FSM after every 8GB, approximately
+  * (it won't be exact because we only vacuum FSM after processing a heap page
+  * that has some removable tuples).  When there are indexes, this is ignored,
+  * and we vacuum FSM after each index/heap cleaning pass.
+  */
+ #define VACUUM_FSM_EVERY_PAGES \
+ 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
+ 
+ /*
   * Guesstimation of number of dead tuples per page.  This is used to
   * provide an upper limit to memory allocated when vacuuming small
   * tables.
*** lazy_vacuum_rel(Relation onerel, int opt
*** 285,293 
  	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
   PROGRESS_VACUUM_PHASE_FINAL_CLEANUP);
  
- 	/* Vacuum the Free Space Map */
- 	FreeSpaceMapVacuum(onerel);
- 
  	/*
  	 * Update statistics in pg_class.
  	 *
--- 294,299 
*** lazy_scan_heap(Relation onerel, int opti
*** 465,471 
  	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
  	TransactionId relminmxid = onerel->rd_rel->relminmxid;
  	BlockNumber empty_pages,
! vacuumed_pages;
  	double		num_tuples,		/* total number of nonremovable tuples */
  live_tuples,	/* live tuples (reltuples estimate) */
  tups_vacuumed,	/* tuples cleaned up by vacuum */
--- 471,478 
  	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
  	TransactionId relminmxid = onerel->rd_rel->relminmxid;
  	BlockNumber empty_pages,
! vacuumed_pages,
! next_fsm_block_to_vacuum;
  	double		num_tuples,		/* total number of nonremovable tuples */
  live_tuples,	/* live tuples (reltuples estimate) */
  tups_vacuumed,	/* tuples cleaned up by vacuum */
*** lazy_scan_heap(Relation onerel, int opti
*** 501,506 
--- 508,514 
  		relname)));
  
  	empty_pages = vacuumed_pages = 0;
+ 	next_fsm_block_to_vacuum = (BlockNumber) 0;
  	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
  
  	indstats = (IndexBulkDeleteResult **)
*** lazy_scan_heap(Relation onerel, int opti
*** 752,757 
--- 760,772 
  			vacrelstats->num_dead_tuples = 0;
  			vacrelstats->num_index_scans++;
  
+ 			/*
+ 			 * Vacuum the Free Space Map to make newly-freed space visible on
+ 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+ 			 */
+ 			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+ 			next_fsm_block_to_vacuum = blkno;
+ 
  			/* Report that we are once again scanning the heap 

Re: new buildfarm with gcc & clang "trunk" -Werror?

2018-03-28 Thread Peter Eisentraut
On 3/28/18 04:06, Fabien COELHO wrote:
> Would the project feel appropriate to:
> 
>   - fix these warnings (other than putting -Wno-format-truncation or
> similar workarounds...).

I've been tracking gcc-8, and I have a patch ready, but these things
change a bit with every compiler snapshot, so I'm waiting until things
settle down.

>   - add permanent gcc/clang trunk beasts with -Werror
> (if so, I'd suggest cannonball & seanettle for the names).

I would not like that.  The build farm success should ideally be a
function of correct PostgreSQL code, not external factors.  If an
in-progress compiler does funny things, what are we supposed to do about
that?

Generally, fixing up PostgreSQL for a new compiler release isn't a major
effort and can be done briefly before the release of the compiler.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: JIT compiling with LLVM v12

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 14:27:51 -0700, Andres Freund wrote:
> > 7/10 committed. Inlining, Explain, Docs remain.
> 
> I've pushed these three.

One tiny pending commit I have is to add a few pg_noinline annotations
to slow-path functions, to avoid very common spurious inlines. I'll play
a little bit more with the set that I think makes sense there, and will
send a separate email about that.

Greetings,

Andres Freund



Re: JIT compiling with LLVM v12

2018-03-28 Thread Andres Freund
On 2018-03-27 10:34:26 -0700, Andres Freund wrote:
> On 2018-03-27 10:05:47 -0400, Peter Eisentraut wrote:
> > On 3/13/18 19:40, Andres Freund wrote:
> > > I've pushed a revised and rebased version of my JIT patchset.
> > 
> > What is the status of this item as far as the commitfest is concerned?
> 
> 7/10 committed. Inlining, Explain, Docs remain.

I've pushed these three.

As explained in the inline commit, I've found an edge case where I could
hit an assert in LLVM when using a more efficient interaction with
on-disk files.  That appears to be a spurious assert, but I don't want
to ignore it until that's confirmed from the LLVM side of things.

For now LLVM is enabled by default when compiled --with-llvm. I'm mildly
inclined to leave it like that until shortly before the release, and
then disable it by default (i.e. change the default to jit=off). But I
think we can make that decision based on experience during the testing
window. I'm opening an open items entry for that.

Yay. Also: Tired.

Greetings,

Andres Freund



Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2018-03-28 Thread Dmitry Dolgov
> On 22 March 2018 at 14:18, Ashutosh Bapat  
> wrote:
> On Thu, Mar 22, 2018 at 4:32 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>> On 12 March 2018 at 06:00, Ashutosh Bapat  
>>> wrote:
>>> Thanks for the note. Here are rebased patches.
>>
>> Since I started to look at this patch, I can share few random notes (although
>> it's not a complete review, I'm in the progress now), maybe it can be 
>> helpful.
>>
>> In `partition_range_bounds_merge`
>>
>> + if (!merged)
>> + break;
>>
>> is a bit redundant I think, because every time `merged` set to false it
>> followed by break.
>
> Yes, right now. May be I should turn it into Assert(merged); What do you 
> think?

Thank you for the reply. Yes, that sounds good. But actually you also mentioned
another topic that bothers me about this patch. Different parts of the
algorithm implementation (at least for functions that build maps of matching
partitions) are quite dependent on each other in terms of shared state. At
first glance in `partition_range_bounds_merge` we have about a dozen of
variables of different mutability level, that affect the control flow:

outer_lb_index
inner_lb_index
merged
merged_index
overlap
merged_lb
merged_ub
finished_outer
finished_inner
ub_cmpval
lb_cmpval
inner_has_default
outer_has_default
jointype

It looks a bit too much for me, and would require comments like "if you
changed the logic here, also take a look there". But I'm not saying that I have
any specific suggestions for how to change it (although I'll definitely try to
do so, at least to get a better understanding of the underlying algorithm).

>>
>> I've noticed that in some places `IS_PARTITIONED_REL` was replaced
>>
>> - if (!IS_PARTITIONED_REL(joinrel))
>> + if (joinrel->part_scheme == NULL)
>>
>> but I'm not quite follow why? Is it because `boundinfo` is not available
>> anymore at this point? If so, maybe it makes sense to update the commentary 
>> for
>> this macro and mention to not use for joinrel.
>
>
> This is done in try_partitionwise_join(). As explained in the comment
>
>  * Get the list of matching partitions to be joined along with the
>  * partition bounds of the join relation. Because of the restrictions
>  * imposed by partition matching algorithm, not every pair of joining
>  * relations for this join will be able to use partition-wise join. But 
> all
>  * those pairs which can use partition-wise join will produce the same
>  * partition bounds for the join relation.
>
> boundinfo for the join relation is built in this function. So, we
> don't have join relation's partitioning information fully set up yet.
> So we can't use IS_PARTITIONED_REL() there. joinrel->part_scheme if
> set tells that the joining relations have matching partition schemes
> and thus the join relation can possibly use partition-wise join
> technique. If it's not set, then we can't use partition-wise join.
>
> But IS_PARTITIONED_REL() is still useful at a number of other places,
> where it's known to encounter a RelOptInfo whose partitioning
> properties are fully setup. So, I don't think we should change the
> macro or the comments above it.

Just to make myself clear, I wanted to suggest not changing the commentary for
`IS_PARTITIONED_REL` significantly, but just adding a sentence saying that you
need to check whether the given relation is fully set up.

Also, few more random notes (mostly related to readability, since I found some
parts of the patch hard to read, but of course it's arguable).

```
PartitionRangeBound outer_lb;
PartitionRangeBound outer_ub;
PartitionRangeBound inner_lb;
PartitionRangeBound inner_ub;
PartitionRangeBound *merged_lb = NULL;
PartitionRangeBound *merged_ub = NULL;
```

Maybe it would be better to not repeat the type here? Let's say:

```
PartitionRangeBound outer_lb,
outer_ub,
...
```

It's just too long and distracting.

```
partition_range_bounds_merge(int partnatts, FmgrInfo *partsupfuncs,
 Oid *partcollations, PartitionBoundInfo outer_bi,
 int outer_nparts, PartitionBoundInfo inner_bi,
 int inner_nparts, JoinType jointype,
 List **outer_parts, List **inner_parts)
```

From what I see in `partition.c`, there are a lot of functions that accept
`partnatts`, `partcollations`, etc. only to pass them down to, e.g.,
`partition_rbound_cmp`.
What do you think about introducing a data structure to keep these arguments,
and pass only an instance of this structure instead?
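Something like this, sketched in Python for brevity (the real thing would be a C struct; all names here are invented): the partitioning pieces travel as one object, so helper signatures shrink and the fields can't drift out of sync.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class PartitionKeyInfo:
    """Bundle of the arguments currently threaded through every helper."""
    partnatts: int
    partsupfuncs: Sequence[object]    # per-column comparison support
    partcollations: Sequence[object]  # per-column collations

def rbound_cmp(key: PartitionKeyInfo, bound1, bound2):
    """Toy stand-in for partition_rbound_cmp: compare bounds column by
    column, taking the key info from one parameter instead of three."""
    for i in range(key.partnatts):
        if bound1[i] != bound2[i]:
            return -1 if bound1[i] < bound2[i] else 1
    return 0
```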



Re: reorganizing partitioning code

2018-03-28 Thread Robert Haas
On Wed, Mar 28, 2018 at 12:07 AM, Amit Langote
 wrote:
> On 2018/03/22 11:45, Amit Langote wrote:
>> FWIW, I did manage to rebase it this morning and posting it here.
>
> Rebased again.
>
> I started wondering if we should separate out stuff related to partition
> bounds.  That is create a utils/partbound.h and put PartitionBoundInfo and
> related comparison and search functions into a utils/adt/partbound.c.  I
> had started thinking about that when I looked at the code added by the
> patch submitted on the "advanced partition matching algorithm for
> partition-wise join" thread [1].  I haven't done anything about that though.

adt = Abstract Data Type, which I think we've interpreted up until now
to mean an SQL-visible data type, so that seems like an odd choice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] why not parallel seq scan for slow functions

2018-03-28 Thread Robert Haas
On Wed, Mar 28, 2018 at 3:06 AM, Amit Kapila  wrote:
> Good idea, such an optimization will ensure that the cases reported
> above will not have regression.  However isn't it somewhat beating the
> idea you are using in the patch which is to avoid modifying the path
> in-place?

Sure, but you can't have everything.  I don't think modifying the
sortgroupref data in place is really quite the same thing as changing
the pathtarget in place; the sortgroupref stuff is some extra
information about the targets being computed, not really a change in
targets per se.  But in any case if you want to eliminate extra work
then we've gotta eliminate it...

> In any case, I think one will still see regression in cases
> where this optimization doesn't apply.  For example,
>
> DO $$
> DECLARE count integer;
> BEGIN
> For count In 1..100 Loop
> Execute 'explain Select sum(thousand)from tenk1 group by ten';
> END LOOP;
> END;
> $$;
>
> The above block takes 43700.0289 ms on Head and 45025.3779 ms with the
> patch which is approximately 3% regression.

Thanks for the analysis -- the observation that this seemed to affect
cases where CP_LABEL_TLIST gets passed to create_projection_plan
allowed me to recognize that I was doing an unnecessary copyObject()
call.  Removing that seems to have reduced this regression below 1% in
my testing.

Also, keep in mind that we're talking about extremely small amounts of
time here.  On a trivial query that you're not even executing, you're
seeing a difference of (45025.3779 - 43700.0289)/100 = 0.00132 ms
per execution.  Sure, it's still 3%, but it's 3% of the time in an
artificial case where you don't actually run the query.  In real life,
nobody runs EXPLAIN in a tight loop a million times without ever
running the query, because that's not a useful thing to do.  The
overhead on a realistic test case will be smaller.  Furthermore, at
least in my testing, there are now cases where this is faster than
master.  Now, I welcome further ideas for optimization, but a patch
that makes some cases slightly slower while making others slightly
faster, and also improving the quality of plans in some cases, is not
to my mind an unreasonable thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0003-Rewrite-the-code-that-applies-scan-join-targets-to-p.patch
Description: Binary data


0002-Postpone-generate_gather_paths-for-topmost-scan-join.patch
Description: Binary data


0001-Teach-create_projection_plan-to-omit-projection-wher.patch
Description: Binary data


Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Wednesday, March 28, 2018, Fabien COELHO  wrote:

>> And if we introduce csv-specific fieldsep, then we multiply this wrong
>> design. The fix in this direction is renaming fieldsep to fieldsep-unaliagn
>> - but it is probably too big change too. So this design is nothing what I
>> can mark as good solution.
>
> Good, we somehow agree on something!
>
>> I can live with because it is much better than using pipe as separator for
>> csv, and because real impact is small - for almost people it will be
>> invisible - but have not good feeling from it.
Concretely, I'm in favor of the "\pset fieldsep_csv ," solution, with the csv
format always using its existing value.  Teach \pset fieldsep to fail
if the current format is csv.  Being able to specify the csv fieldsep like
"\pset format csv ;" would be a plus.

Unaligned format could grow its own fieldsep if it wanted to, but it can
also just use the default-provided fieldsep var, whose default value is
pipe.  If it did grow one, it would need to understand "not set" in order
to preserve existing behavior.  We don't have that problem with csv.

I don't believe we can modify fieldsep without causing unwanted grief.

David J.


Re: Using base backup exclusion filters to reduce data transferred with pg_rewind

2018-03-28 Thread Fujii Masao
On Wed, Mar 28, 2018 at 7:54 AM, Michael Paquier  wrote:
> On Wed, Mar 28, 2018 at 04:13:25AM +0900, Fujii Masao wrote:
>> This code is almost ok in practice, but wouldn't using the check
>> "strstr(path, localpath) == path" be more robust here?
>
> No problems with that either.
>
>> Would using the following code instead be more robust?
>> The original code is almost ok in practice, though.
>>
>> filename = last_dir_separator(path);
>> if (filename == NULL)
>> filename = path;
>> else
>> filename++;
>> if (strcmp(filename, excludeFiles[excludeIdx]) == 0)
>
> Indeed, using last_dir_separator is a better idea for files.  I was
> looking for something like that actually.
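For what it's worth, the logic under discussion amounts to comparing only the last path component against the exclusion list. A Python rendering of that check (the exclude list here is just an illustrative subset of the real excludeFiles array):

```python
import os.path

# Illustrative subset; the real list lives in the excludeFiles array.
EXCLUDE_FILES = ["backup_label", "tablespace_map", "pg_internal.init",
                 "postmaster.opts", "postmaster.pid"]

def is_excluded_file(path):
    # Compare only the part after the last directory separator, so
    # "base/postmaster.pid" matches but "postmaster.pid.backup" does not.
    return os.path.basename(path) in EXCLUDE_FILES

assert is_excluded_file("postmaster.pid")
assert is_excluded_file("some/dir/backup_label")
assert not is_excluded_file("some/dir/backup_label.old")
```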
>
>> +  (everything except the relation files). Similarly to base backups,
>> +  the contents of the directories pg_dynshmem/,
>> +  pg_notify/, pg_replslot/,
>> +  pg_serial/, pg_snapshots/,
>> +  pg_stat_tmp/, and
>> +  pg_subtrans/ are omitted from the data copied
>> +  from the source cluster. Any file or directory beginning with
>> +  pgsql_tmp is omitted, as well as are
>> +  backup_label,
>> +  tablespace_map,
>> +  pg_internal.init,
>> +  postmaster.opts and
>> +  postmaster.pid.
>>
>> I don't think this description is necessary. The doc for pg_basebackup
>> also doesn't seem to have this kind of description.
>
> Those are listed in backup.sgml.  And I really think that we should at
> least document that the same type of exclusion filters as base backups
> are used in pg_rewind.

Okay, I revived that change in the doc.

>> So attached is the patch that I updated based on your patch and
>> am thinking to apply.
>
> Thanks.  Except for the documentation part, I am OK for the changes
> proposed.

Committed. Thanks!

Regards,

-- 
Fujii Masao



Re: csv format for psql

2018-03-28 Thread Joshua D. Drake

On 03/28/2018 12:35 PM, David G. Johnston wrote:
On Monday, March 26, 2018, Daniel Verite wrote:



We could even support only the comma and make it non-configurable
based on the fact it's Comma-Separated-Values, not
Whatever-Separated-Values, except that won't do much
to serve the users interests, as the reality is that
people use various separators.


I like to call it "Character Separated Values" now for just that reason.


Isn't the actual wording Character Delimited Values? I may be splitting 
hairs here, but every single time I use anything to import a CSV or other 
delimited file (TAB or | usually), that is what the import screen says.


JD



David J.



--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *



Re: Sample values for pg_stat_statements

2018-03-28 Thread David Steele
On 3/21/18 1:31 PM, David Steele wrote:
> 
> On 3/10/18 9:02 AM, Tomas Vondra wrote:
>>
>> I've looked at this patch today. I like the idea / intent in general, as
>> it helps with some investigation tasks. That being said, I have a couple
>> of questions/comments based on read through the patch:
> 
> It looks like there are some privacy concerns with this patch as well as
> the suggestions in the review from Tomas.
> 
> Will you be providing an updated patch for this CF?  If not, I think we
> should mark the entry Returned with Feedback for now and let you
> resubmit when you have an updated patch.

Since it doesn't appear that a new patch will be ready for this CF I
have marked this entry Returned with Feedback.

Regards,
-- 
-David
da...@pgmasters.net



Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Monday, March 26, 2018, Daniel Verite  wrote:

>
> We could even support only the comma and make it non-configurable
> based on the fact it's Comma-Separated-Values, not
> Whatever-Separated-Values, except that won't do much
> to serve the users interests, as the reality is that
> people use various separators.
>

I like to call it "Character Separated Values" now for just that reason.

David J.


Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Wednesday, March 28, 2018, David G. Johnston 
wrote:

> On Wednesday, March 28, 2018, Pavel Stehule 
> wrote:
>
>>
>> Are there some possible alternatives?

>>>
>>> Given the date and the fact that the cf end is 3 days away, the proposed
>>> short term alternative is Daniel's version, that I feel is reasonable. Ok,
>>> people have to do two pset to get comma-separated csv, otherwise they get
>>> pipe-separated csv in one pset.
>>>
>>
>>
> Could someone post how captions, rows-only, and footer pset settings
> factor into this?  Specifically are they fixed to on/off or will they
> hide/show if users request them explicitly?
>
>
I found the original post; it confirms that we simply fix these
values, which is what I was thinking.

> and something like --csv-fieldsep should be provided (I like the prefixing
> to tie the option lexically to the master --csv option).
>

Or, really, just make --csv take an optional argument which, if present, is
the delimiter.  No separate argument needed, and we ignore the pset
fieldsep argument like we ignore everything else.
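As a sketch of how an optional-argument flag behaves (a hypothetical CLI built with Python's argparse, not psql's actual option parser):

```python
import argparse

# nargs="?" makes the value optional; const supplies the default comma
# when --csv is given bare, and default=None means "csv mode off".
parser = argparse.ArgumentParser()
parser.add_argument("--csv", nargs="?", const=",", default=None)

assert parser.parse_args([]).csv is None             # csv mode off
assert parser.parse_args(["--csv"]).csv == ","       # bare flag: comma
assert parser.parse_args(["--csv", ";"]).csv == ";"  # explicit delimiter
```

One known drawback of this pattern: a bare --csv placed immediately before a positional argument will swallow it as the delimiter, which may argue for a separate --csv-fieldsep option after all.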

Need to decide whether to "not ignore" --tuples-only, which doesn't seem
controversial aside from being a pset argument that isn't being ignored in
this design...

David J.


Re: Trigger file behavior with the standby

2018-03-28 Thread Keiko Oda
Thanks a lot for the answer, Michael (and sorry for the slow response)!

So, if I understand what you're saying correctly, I'm seeing this behavior
because wal-e keeps fetching WAL files from S3 regardless of the
trigger_file, and these fetched WAL files are in pg_wal (or pg_xlog), so
Postgres just tries to restore whatever is available in pg_wal before the
failover. Or, even if there is no file in pg_wal, it still tries to fetch
from the "archive" (S3).
In other words, if I would like an "immediate failover" (and do not care
about WAL files available in the archive or in pg_wal), I should tweak
restore_command so that no further fetching/restoring happens.
Is that accurate?
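If it helps, here is roughly what such a tweak could look like, as a recovery.conf sketch; the paths and the wal-e invocation are illustrative assumptions, not tested configuration:

```
# Fail restore_command fast once the trigger file exists, so promotion
# does not wait for the remaining archived WAL to be fetched first.
restore_command = 'test ! -f /tmp/promote.trigger && wal-e wal-fetch "%f" "%p"'
trigger_file = '/tmp/promote.trigger'
```

Be aware that skipping the remaining archived WAL trades durability for promotion speed: any transactions present only in the archive are lost on the promoted standby.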

Thanks,
Keiko

On Mon, Mar 19, 2018 at 9:28 PM, Michael Paquier 
wrote:

> On Mon, Mar 19, 2018 at 01:27:21PM -0700, Keiko Oda wrote:
> > I'm seeing the following behavior with a trigger file which is very
> > confusing to me, I'd like to get some advice of what is the expected
> > behavior of the trigger file with the standby.
>
> This portion from the docs includes your answer:
> https://www.postgresql.org/docs/devel/static/warm-standby.html#STANDBY-SERVER-OPERATION
> "Standby mode is exited and the server switches to normal operation when
> pg_ctl promote is run or a trigger file is found (trigger_file). Before
> failover, any WAL immediately available in the archive or in pg_wal will
> be restored, but no attempt is made to connect to the master."
>
> So when creating a trigger file or signaling for promotion, any WAL
> files available are first fetched, and then promotion happens.  In your
> case all the WAL segments from the archives are retrieved first.
> --
> Michael
>


Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Wednesday, March 28, 2018, Pavel Stehule  wrote:

>
> Are there some possible alternatives?
>>>
>>
>> Given the date and the fact that the cf end is 3 days away, the proposed
>> short term alternative is Daniel's version, that I feel is reasonable. Ok,
>> people have to do two pset to get comma-separated csv, otherwise they get
>> pipe-separated csv in one pset.
>>
>
>
Could someone post how captions, rows-only, and footer pset settings factor
into this?  Specifically are they fixed to on/off or will they hide/show if
users request them explicitly?

My take on this is that --csv mode is/should be an alternate output mode
from the existing pset controlled one, and functions basically like "\copy
to stdout" and all echoing and metadata outputs are disabled and only query
results, with header and the user specified delimiter, are output.  No
control other than the delimiter seems to be provided in the current design
but that could be expanded upon.  In that specification the existing
fieldsep argument that is tied to pset should not be used and something
like --csv-fieldsep should be provided (I like the prefixing to tie the
option lexically to the master --csv option).

David J.


Re: [bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary

2018-03-28 Thread Fujii Masao
On Wed, Mar 28, 2018 at 11:24 AM, Michael Paquier  wrote:
> On Wed, Mar 28, 2018 at 04:45:49AM +0900, Fujii Masao wrote:
>> The patch looks good to me! Barring objection, I will commit it
>> and back-patch to 9.5 where pg_rewind was added.
>
> Thanks in advance!  I just eyeballed the patch again and I don't see
> issues with what's used here.  The thing should apply smoothly back to
> 9.5, of course let me know if you need help or an extra pair of eyes.

Pushed. Thanks!

Regards,

-- 
Fujii Masao



Re: Function to track shmem reinit time

2018-03-28 Thread David Steele
On 3/4/18 11:17 AM, Tomas Vondra wrote:
> 
> Furthermore, the patch is yet another victim of fd1a421fe - fixing the
> pg_proc entries is trivial, but a new version is needed.
> 
> I'd also like to see an example/explanation how this improves this
> situation compared to only having pg_postmaster_start_time.
> 
> So I'm setting this as waiting on author for now.

I'm not sure why this was set back to Needs Review since it neither
applies cleanly nor builds.

I'm setting this entry to Waiting on Author, but based on the discussion
I think it should be Returned with Feedback.

Regards,
-- 
-David
da...@pgmasters.net



Re: PL/pgSQL nested CALL with transactions

2018-03-28 Thread Peter Eisentraut
On 3/28/18 09:00, Tomas Vondra wrote:
> I see. I thought "fixed" means you adopted the #define, but that's not
> the case. I don't think an extra comment is needed - the variable name
> is descriptive enough IMHO.

Committed as presented then.  Thanks.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rangejoin rebased

2018-03-28 Thread David Steele
On 3/28/18 2:23 PM, Andres Freund wrote:
> On 2018-03-28 14:18:42 -0400, David Steele wrote:
>> It seems that a new patch is needed here but none has been presented.
>> I've marked this Waiting on Author for the moment, but I really think it
>> should be marked Returned with Feedback and submitted to the next CF
>> when a new patch is ready.
> 
> I'd just do so now. There's not been any progress for months, and
> there's been an update request weeks ago...
Done.

-- 
-David
da...@pgmasters.net



Re: Rangejoin rebased

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 14:18:42 -0400, David Steele wrote:
> It seems that a new patch is needed here but none has been presented.
> I've marked this Waiting on Author for the moment, but I really think it
> should be marked Returned with Feedback and submitted to the next CF
> when a new patch is ready.

I'd just do so now. There's not been any progress for months, and
there's been an update request weeks ago...

Greetings,

Andres Freund



Re: Rangejoin rebased

2018-03-28 Thread David Steele
On 3/2/18 11:44 AM, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 11:12 AM, Alexander Kuzmenkov
>  wrote:
>> On 16.01.2018 10:49, Jeff Davis wrote:
>>> My proposed fix is to make an internal opfamily identical to the
>>> external one, such that it's not recognized as part of the same EC,
>>> and the planner won't try to eliminate it. It loses out on potential
>>> optimizations, but those are mostly theoretical since the btree
>>> opclass ordering for ranges is not very interesting to a user.
>>
>> I think I figured out what to do with missing sort directions. We can change
>> select_outer_pathkeys_for_merge() to generate the pathkeys we need. Also,
>> find_mergeclauses_for_outer_pathkeys() has to be changed too, so that it
>> knows which pathkeys are compatible to which range join clauses.
>>
>> About the patch, do I understand it right that you are working on the next
>> version now?
> 
> I think we are quite clearly past the deadline to submit a new patch
> for inclusion in v11 at this point.

It seems that a new patch is needed here but none has been presented.
I've marked this Waiting on Author for the moment, but I really think it
should be marked Returned with Feedback and submitted to the next CF
when a new patch is ready.

Regards,
-- 
-David
da...@pgmasters.net



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 13:52:24 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Given, as explained nearby, we already do store transient data in the
> > ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
> > even a whiff of trouble, I'm currently not inclined to see a huge issue
> > here.  It'd be great if you could expand on your concerns here a bit, we
> > gotta figure out a way forward.
> 
> Just what I said.  There's a lot of code that knows how to follow tuple
> update chains, probably not all of it in core, and this will break it.
> But only in seldom-exercised corner cases, which is the worst of all
> possible worlds from a reliability standpoint.

How will it break it? They'll see an invalid ctid and conclude that the
tuple is dead? Without any changes, that's already something that can
happen if a later tuple in the chain has been pruned away.  Sure, that
code won't realize it should error out because the tuple is now in a
different partition, but neither would it with an infomask bit.

I think my big problem is that I just don't see what the worst that can
happen is. We'd potentially see a broken ctid chain, something that very
commonly happens, and consider the tuple to be invisible.  That seems
pretty sane behaviour for unadapted code, and not any worse than other
potential solutions.


> (I don't think ON CONFLICT is a counterexample because, IIUC, it's not
> a persistent state.)

Hm, it can be persistent if we error out at the right moment. But it's
not super common to encounter that over a long time, I grant you
that. Not that this'd be super persistent either; vacuum/pruning would
normally remove the tuple as well, it's dead after all.


> >> I would've been happier about expending an infomask bit towards this
> >> purpose.  Just eyeing what we've got, I can't help noticing that
> >> HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
> >> in a partitioned table.  Perhaps making these tests depend on
> >> partitioned-ness would be unworkably messy, but it's worth thinking
> >> about.
> 
> > They previously couldn't be set together IIRC, so we could just (mask &
> > (HEAP_MOVED_OFF |HEAP_MOVED_IN)) == (HEAP_MOVED_OFF |HEAP_MOVED_IN) but
> > that'd be permanently eating two infomask bits.
> 
> Hmm.  That objection only matters if we have realistic intentions of
> reclaiming those bits in future, which I've not heard anyone making
> serious effort towards.

I plan to submit a patch early v12 that keeps track of the last time a
table has been fully scanned (and when it was created). With part of the
goal being debuggability and part being able to reclaim things like
these bits.


Greetings,

Andres Freund




Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-28 Thread Tom Lane
Andres Freund  writes:
> On 2018-03-08 13:46:53 -0500, Tom Lane wrote:
>> ... Breaking fundamental invariants like
>> "ctid points to this tuple or its update successor" is going to cause
>> trouble.  There's a lot of code that knows that; more than knows the
>> details of what's in xmax, I believe.

> Given, as explained nearby, we already do store transient data in the
> ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
> even a whiff of trouble, I'm currently not inclined to see a huge issue
> here.  It'd be great if you could expand on your concerns here a bit, we
> gotta figure out a way forward.

Just what I said.  There's a lot of code that knows how to follow tuple
update chains, probably not all of it in core, and this will break it.
But only in seldom-exercised corner cases, which is the worst of all
possible worlds from a reliability standpoint.  (I don't think ON CONFLICT
is a counterexample because, IIUC, it's not a persistent state.)

Given that there are other ways we could attack it, I think throwing away
this particular invariant is an unnecessarily risky solution.

>> I would've been happier about expending an infomask bit towards this
>> purpose.  Just eyeing what we've got, I can't help noticing that
>> HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
>> in a partitioned table.  Perhaps making these tests depend on
>> partitioned-ness would be unworkably messy, but it's worth thinking
>> about.

> They previously couldn't be set together IIRC, so we could just (mask &
> (HEAP_MOVED_OFF |HEAP_MOVED_IN)) == (HEAP_MOVED_OFF |HEAP_MOVED_IN) but
> that'd be permanently eating two infomask bits.

Hmm.  That objection only matters if we have realistic intentions of
reclaiming those bits in future, which I've not heard anyone making
serious effort towards.  Rather than messing with the definition of ctid,
I'd be happier with saying that they're never going to be reclaimed, but
at least we're getting one bit's worth of real use out of them.
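To make the bit trick concrete, here is a toy rendering in Python. The constants follow my understanding of t_infomask, but treat the values as illustrative; the point is only that a combination of two bits that could never legitimately occur together can encode a new state without consuming a fresh bit:

```python
# Illustrative values for two flag bits that were never valid together.
HEAP_MOVED_OFF = 0x4000
HEAP_MOVED_IN = 0x8000
# Their conjunction is free to mean something new, e.g. "row moved to
# another partition" - at the cost of permanently reserving both bits.
HEAP_MOVED_MASK = HEAP_MOVED_OFF | HEAP_MOVED_IN

def moved_to_other_partition(infomask):
    # True only when BOTH bits are set; either bit alone keeps its
    # original (VACUUM FULL era) meaning.
    return (infomask & HEAP_MOVED_MASK) == HEAP_MOVED_MASK

assert moved_to_other_partition(HEAP_MOVED_OFF | HEAP_MOVED_IN)
assert not moved_to_other_partition(HEAP_MOVED_OFF)
assert not moved_to_other_partition(HEAP_MOVED_IN)
```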

regards, tom lane



Re: Reopen logfile on SIGHUP

2018-03-28 Thread David Steele
On 2/27/18 8:54 PM, Michael Paquier wrote:
> On Tue, Feb 27, 2018 at 05:52:20PM -0500, Tom Lane wrote:
>> It already does treat SIGUSR1 as a log rotation request.  Apparently
>> the point of this patch is that some people don't find that easy enough
>> to use, which is fair, because finding out the collector's PID from
>> outside isn't very easy.
> 
> True enough.  The syslogger does not show up in pg_stat_activity either,
> so I think that being able to do so would help for this case.

There does not seem to be any consensus on this patch so I'm marking it
Waiting on Author for the time being.  At the end of the CF I'll mark it
Returned with Feedback if there is no further activity.

Regards,
-- 
-David
da...@pgmasters.net



Re: disable SSL compression?

2018-03-28 Thread Konstantin Knizhnik



On 28.03.2018 20:26, Konstantin Knizhnik wrote:



On 17.03.2018 17:18, Peter Eisentraut wrote:

On 3/11/18 13:28, Tom Lane wrote:

My proposal is the attached patch that sets the default in libpq to off
and adjusts the documentation a bit so it doesn't sound like we have
missed the news altogether.

Seems reasonable as far as it goes, but do we need to make corresponding
server-side changes?

We could add a setting to disable SSL compression on the server, as some
web servers have been doing, but the niche of version combinations where
that would be applicable seems pretty low.




One of our customers managed to improve speed about 10 times by 
using SSL compression, for a system where the client and servers are 
located in different geographical regions
and query results are very large because of JSON columns. They 
actually do not need encryption, just compression.
I expect that it is not the only case where compression of the libpq 
protocol can be useful. Please note that logical replication also 
uses the libpq protocol.


If SSL compression is deprecated, should we provide our own compression?
I have implemented a prototype of it (patch is 
attached).
I have added a compression=on/off parameter to the connection string and a 
-Z option to the psql and pgbench utilities.

Below are some results:

Compression ratio (raw->compressed):


                    libz (level=1)      libzstd (level=1)
pgbench -i -s 10    16997209->2536330   16997209->268077
pgbench -t 10 -S    6289036->1523862    6600338<-900293
                    6288933->1777400    6600338<-1000318


There is no mistyping: libzstd compresses COPY data about 10 times 
better than libz, with a remarkable compression ratio of 63.
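The zlib side of those numbers is easy to sanity-check from the standard library (zstd is not in the Python stdlib, so this sketch covers only zlib; the data is a rough stand-in for pgbench COPY rows):

```python
import zlib

# Build ~10000 near-identical tab-separated rows, loosely imitating the
# highly repetitive COPY data pgbench -i generates.
row = "1\t0\t0\t" + " " * 80 + "\n"
raw = "".join(row.replace("1", str(i), 1) for i in range(1, 10001)).encode()

# Level 1 is the fastest setting, matching the tables above.
compressed = zlib.compress(raw, 1)
ratio = len(raw) / len(compressed)

# Repetitive COPY-style data compresses dramatically even at level 1.
assert ratio > 5
```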


Influence on execution time is minimal (I have tested a local 
configuration where client and server are on the same host):



                    no compression   libz (level=1)   libzstd (level=1)
pgbench -i -s 10    1.552            1.572            1.611
pgbench -t 10 -S    4.482            4.926            4.877





--
Konstantin Knizhnik
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company


Sorry, I have attached the wrong patch.
To use zstd compression, Postgres should be configured with --with-zstd. 
Otherwise compression will use zlib unless it is disabled by the 
--without-zlib option.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/configure b/configure
index 6f659a5..f2ef62d 100755
--- a/configure
+++ b/configure
@@ -698,6 +698,7 @@ ELF_SYS
 EGREP
 GREP
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 with_libxml
@@ -862,6 +863,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 enable_float4_byval
@@ -8016,6 +8018,86 @@ fi
 
 
 #
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test "${with_zstd+set}" = set; then :
+  withval=$with_zstd;
+  case $withval in
+yes)
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_zstd=no
+
+fi
+
+
+
+
+if test "$with_zstd" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5
+$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; }
+if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lzstd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char ZSTD_compress ();
+int
+main ()
+{
+return ZSTD_compress ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_zstd_ZSTD_compress=yes
+else
+  ac_cv_lib_zstd_ZSTD_compress=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5
+$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; }
+if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBZSTD 1
+_ACEOF
+
+  LIBS="-lzstd $LIBS"
+
+else
+  as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5
+fi
+
+fi
+
+
+
+#
 # Elf
 #
 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3bbdf17..08b8ab2 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -195,6 +195,7 @@ with_llvm	= @with_llvm@
 with_system_tzdata = @with_system_tzdata@
 with_uuid	= @with_uuid@
 with_zlib	= @with_zlib@
+with_zstd   = @with_zstd@
 enable_rpath	= @enable_rpath@
 enable_nls	= @enable_nls@
 enable_debug	= @enable_debug@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 6450c5a..1522c70 100644

Re: PostgreSQL crashes with SIGSEGV

2018-03-28 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Mar 26, 2018 at 5:14 AM, Kyotaro HORIGUCHI
>  wrote:
>> If no one objects, I'll mark this as Ready for Commit in a couple
>> of days.

> Thank you for the review, Horiguchi-san. It's hard to decide how
> important each goal is when coming up with a back-patchable fix like
> this. When the goals are somewhat in competition with each other, a
> second or a third opinion is particularly appreciated.

It looks good to me.  The only real objection would be if someone came
up with a test case proving that there's a significant performance
degradation from the extra copies.  But given that these are back
branches, it would take a pretty steep penalty for me to want to take
the risk of refactoring to avoid that.

I've pushed it with some cosmetic adjustments.

regards, tom lane



Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-03-28 Thread David Steele
On 2/27/18 2:21 AM, Masahiko Sawada wrote:
> 
> Hmm, although I've thought about the concern in the case where we don't
> consider local xids of un-resolved fdwxact in GetOldestXmin, I could not
> find a problem. Could you share your concern if you have one? I'll try to
> find a possibility based on it.

It appears that this entry should be marked Waiting on Author so I have
done that.

I also think it may be time to move this patch to the next CF.

Regards,
-- 
-David
da...@pgmasters.net



Re: disable SSL compression?

2018-03-28 Thread Konstantin Knizhnik



On 17.03.2018 17:18, Peter Eisentraut wrote:

On 3/11/18 13:28, Tom Lane wrote:

My proposal is the attached patch that sets the default in libpq to off
and adjusts the documentation a bit so it doesn't sound like we have
missed the news altogether.

Seems reasonable as far as it goes, but do we need to make corresponding
server-side changes?

We could add a setting to disable SSL compression on the server, as some
web servers have been doing, but the niche of version combinations where
that would be applicable seems pretty low.




One of our customers managed to improve speed about 10 times by 
using SSL compression, for a system where the client and servers are 
located in different geographical regions
and query results are very large because of JSON columns. They actually 
do not need encryption, just compression.
I expect that it is not the only case where compression of the libpq 
protocol can be useful. Please note that logical replication also 
uses the libpq protocol.


If SSL compression is deprecated, should we provide our own compression?
I have implemented a prototype of it (patch is attached).
I have added a compression=on/off parameter to the connection string and a 
-Z option to the psql and pgbench utilities.

Below are some results:

Compression ratio (raw->compressed):


                    libz (level=1)      libzstd (level=1)
pgbench -i -s 10    16997209->2536330   16997209->268077
pgbench -t 10 -S    6289036->1523862    6600338<-900293
                    6288933->1777400    6600338<-1000318


There is no mistyping: libzstd compresses COPY data about 10 times better 
than libz, with a remarkable compression ratio of 63.


Influence on execution time is minimal (I have tested a local 
configuration where client and server are on the same host):



                    no compression   libz (level=1)   libzstd (level=1)
pgbench -i -s 10    1.552            1.572            1.611
pgbench -t 10 -S    4.482            4.926            4.877






--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/configure b/configure
index 6f659a5..f2ef62d 100755
--- a/configure
+++ b/configure
@@ -698,6 +698,7 @@ ELF_SYS
 EGREP
 GREP
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 with_libxml
@@ -862,6 +863,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 enable_float4_byval
@@ -8016,6 +8018,86 @@ fi
 
 
 #
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test "${with_zstd+set}" = set; then :
+  withval=$with_zstd;
+  case $withval in
+yes)
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_zstd=no
+
+fi
+
+
+
+
+if test "$with_zstd" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5
+$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; }
+if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lzstd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char ZSTD_compress ();
+int
+main ()
+{
+return ZSTD_compress ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_zstd_ZSTD_compress=yes
+else
+  ac_cv_lib_zstd_ZSTD_compress=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5
+$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; }
+if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBZSTD 1
+_ACEOF
+
+  LIBS="-lzstd $LIBS"
+
+else
+  as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5
+fi
+
+fi
+
+
+
+#
 # Elf
 #
 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3bbdf17..08b8ab2 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -195,6 +195,7 @@ with_llvm	= @with_llvm@
 with_system_tzdata = @with_system_tzdata@
 with_uuid	= @with_uuid@
 with_zlib	= @with_zlib@
+with_zstd   = @with_zstd@
 enable_rpath	= @enable_rpath@
 enable_nls	= @enable_nls@
 enable_debug	= @enable_debug@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 6450c5a..1522c70 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -50,6 +50,14 @@ ifeq ($(with_systemd),yes)
 LIBS += -lsystemd
 endif
 
+ifeq ($(with_zstd),yes)
+LIBS += -lzstd
+endif
+
+ifeq ($(with_zlib),yes)
+LIBS += -lz
+endif
+
 ##
 
 all: submake-libpgport submake-schemapg postgres 

Re: [HACKERS] logical decoding of two-phase transactions

2018-03-28 Thread Simon Riggs
On 28 March 2018 at 16:28, Nikhil Sontakke  wrote:

> Simon, 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch
> is the exact patch that you had posted for an earlier commit.

0003 Pushed

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-28 Thread Teodor Sigaev

Hi!

I slightly modified the test for a clean demonstration of the difference 
between fastupdate on and off. I also added CheckForSerializableConflictIn() 
to the fastupdate=off codepath, but only in the case of a non-empty pending list.


The next question: why don't we lock entry leaf pages in some cases? 
As I understand it, a scan should lock any visited page, but currently that's 
true only for the posting tree. It seems it should also lock pages in the entry 
tree, because concurrent processes could add new entries which could be matched 
by a partial search, for example. BTW, partial search (see collectMatchBitmap()) 
correctly locks the entry tree, but regular startScanEntry() doesn't lock the 
entry page in the case of a posting tree, only in the case of a posting list.
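To restate the locking question in miniature: under SSI, every page a serializable scan visits should carry a predicate lock, so that a concurrent insert into that page raises a rw-conflict. A toy Python model (nothing like GIN's actual code) of what "visited pages" buys us:

```python
from collections import defaultdict

# page -> set of transactions holding a predicate lock on it
page_locks = defaultdict(set)

def scan_page(xact, page):
    # A serializable scan records every page it visits.
    page_locks[page].add(xact)

def insert_conflicts(xact, page):
    # An insert into a locked page conflicts with every other scanner.
    return {other for other in page_locks[page] if other != xact}

scan_page("T1", "entry-leaf-1")
assert insert_conflicts("T2", "entry-leaf-1") == {"T1"}
# A page the scan never visited (e.g. an unlocked entry page) raises
# no conflict - which is exactly the potential hole described above.
assert insert_conflicts("T2", "posting-leaf-9") == set()
```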



Dmitry Ivanov wrote:

I'd like to see fastupdate=on in the tests too; right now the tests cover
only the case without fastupdate. Please add them.


Here's a couple of tests for pending list (fastupdate = on).



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 37070b3b72..095b1192cb 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/xloginsert.h"
+#include "storage/predicate.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -515,6 +516,19 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			btree->fillRoot(btree, newrootpg,
 			BufferGetBlockNumber(lbuffer), newlpage,
 			BufferGetBlockNumber(rbuffer), newrpage);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(lbuffer));
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
+
 		}
 		else
 		{
@@ -524,6 +538,14 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			GinPageGetOpaque(newrpage)->rightlink = savedRightLink;
 			GinPageGetOpaque(newlpage)->flags |= GIN_INCOMPLETE_SPLIT;
 			GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
 		}
 
 		/*
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index f9daaba52e..3fb4fc8264 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -19,6 +19,7 @@
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/rel.h"
 
 /*
@@ -1423,7 +1424,7 @@ disassembleLeaf(Page page)
  * Any segments that acquire new items are decoded, and the new items are
  * merged with the old items.
  *
- * Returns true if any new items were added. False means they were all
+ * Returns true if any new items were added. false means they were all
  * duplicates of existing items on the page.
  */
 static bool
@@ -1759,7 +1760,7 @@ leafRepackItems(disassembledLeaf *leaf, ItemPointer remaining)
  */
 BlockNumber
 createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
-  GinStatsData *buildStats)
+  GinStatsData *buildStats, Buffer entrybuffer)
 {
 	BlockNumber blkno;
 	Buffer		buffer;
@@ -1810,6 +1811,12 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	page = BufferGetPage(buffer);
 	blkno = BufferGetBlockNumber(buffer);
 
+	/*
+	 * Copy a predicate lock from the entry tree leaf (containing the
+	 * posting list) to the posting tree.
+	 */
+	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
@@ -1904,6 +1911,7 @@ ginInsertItemPointers(Relation index, BlockNumber rootBlkno,
 		btree.itemptr = insertdata.items[insertdata.curitem];
 		stack = ginFindLeafPage(&btree, false, NULL);
 
+		GinCheckForSerializableConflictIn(btree.index, NULL, stack->buffer);
 		ginInsertValue(&btree, stack, &insertdata, buildStats);
 	}
 }
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6fe67f346d..63603859bc 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -17,8 +17,10 @@
 #include "access/gin_private.h"
 #include "access/relscan.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
 
 /* GUC parameter */
 int			GinFuzzySearchLimit = 0;
@@ -33,11 +35,18 @@ typedef struct pendingPosition
 } pendingPosition;
 
 
+static void
+GinPredicateLockPage(Relation index, BlockNumber blkno, Snapshot snapshot)
+{
+	if (!GinGetUseFastUpdate(index))
+			

Re: Re: csv format for psql

2018-03-28 Thread Pavel Stehule
2018-03-28 10:24 GMT+02:00 Fabien COELHO :

>
> Hello Pavel,
>
> I'd like to convince you to compromise, because otherwise I'm afraid we
>>> will not get this feature.
>>>
>>
>> [...]
>>
>>>
>>> The only no-surprise, no-behavioral-change, alternative I see is to have
>>> a
>>> csv-specific fieldsep. I'm not keen on that one because it is yet another
>>> variable, one has to remember it, and then it asks the question about
>>> recordsep... and I think there are already too many parameters, and more
>>> is
>>> not desirable, although I probably could live with that if I can live
>>> with
>>> "fieldsep_zero":-)
>>>
>>>
>> I don't share your opinion; introducing a csv-specific fieldsep is the
>> less surprising option. The current design is wrong; this thread is
>> evidence of that.
>>
>
> Wrong is too strong a word. Current design is not perfect, sure. Proposed
> alternatives are somehow worse, at least to some people mind, which
> explains why we arrived on this version after a few iterations.
>
>> And if we introduce a csv-specific fieldsep, then we multiply this wrong
>> design. The fix in that direction would be renaming fieldsep to
>> fieldsep-unaligned, but that is probably too big a change as well. So this
>> design is not something I can mark as a good solution.
>>
>
> Good, we somehow agree on something!
>
>> I can live with it because it is much better than using a pipe as the
>> separator for csv, and because the real impact is small - for most people
>> it will be invisible - but I don't have a good feeling about it.
>>
>
> Are there some possible alternatives?
>>
>
> Given the date and the fact that the cf end is 3 days away, the proposed
> short-term alternative is Daniel's version, which I feel is reasonable. Ok,
> people have to issue two \pset commands to get comma-separated csv;
> otherwise they get pipe-separated csv with one \pset.
>

I dislike Daniel's latest design. I am sorry.


> You could compromise on that for now, and submit an improvement patch for
> a later version if you wish.
>

I can accept a csv-specific field separator, independent of the unaligned
field separator.

I don't have any other idea, and there are no other good (acceptable) ideas.
It is hard to expect that this will change next year. This design space is
small, and there are not many possible variants. We checked every possibility
without reaching agreement.


>
> Otherwise, ISTM that the current result of this thread is that there will
> be no simple CSV in pg11:-(
>

Could be, if there is no good enough solution now. If we merge it now, it
will be hard to change later due to possible compatibility issues.


>
> Or maybe I can mark Daniel's latest version as "ready" and point out that
> there is some disagreement on the thread, so it is not a full consensus.
> Then a committer can decide whether it is better in like that or that it
> should be put back in some design stage, possibly with their preference wrt
> to the kind of solution they think best.


You can do that. But in my view, Daniel's latest version (with all respect
to his work) is the worst variant :(. So I am against committing this
version.

Regards

Pavel



>
> --
> Fabien.
>


Re: [HACKERS] path toward faster partition pruning

2018-03-28 Thread Jesper Pedersen

Hi,

On 03/28/2018 06:30 AM, Amit Langote wrote:

On 2018/03/28 18:29, Amit Langote wrote:

Attached is the updated set of patches, which contains other miscellaneous
changes such as updated comments, beside the main changes described above.


Sorry, one of those miscellaneous changes was a typo that would cause
compilation to fail... Sigh.   Fixed in the updated version.



Just some trivial changes.

However,

explain (costs off) select * from mc2p where a = 2 and b < 1;

is picking up

   ->  Seq Scan on mc2p2
 Filter: ((b < 1) AND (a = 2))

which doesn't seem right, as its definition is

create table mc2p2 partition of mc2p for values from (1, 1) to (2, 
minvalue);
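The bound check is easy to reproduce with a small sketch (Python tuples as a stand-in for multi-column range keys; MINVALUE modeled as minus infinity, matching its sort-below-everything behavior; upper bound exclusive). Under those assumptions, no row with a = 2 can satisfy (1, 1) <= key < (2, MINVALUE), which supports the point that scanning mc2p2 for a = 2 AND b < 1 is a pruning miss.

```python
# Toy lexicographic bound check; not the planner's code.  A range partition
# FOR VALUES FROM (1, 1) TO (2, MINVALUE) admits a key k iff
# (1, 1) <= k < (2, MINVALUE), and MINVALUE sorts below every real value.

MINVALUE = float("-inf")  # stand-in for the MINVALUE sentinel

def in_partition(key, lower, upper):
    # Python tuple comparison is lexicographic, like partition bound compare
    return lower <= key < upper

lower = (1, 1)
upper = (2, MINVALUE)

assert in_partition((1, 5), lower, upper)       # a = 1 rows belong here
assert not in_partition((1, 0), lower, upper)   # below the lower bound
assert not in_partition((2, 0), lower, upper)   # a = 2, b < 1: excluded
assert not in_partition((2, -100), lower, upper)
```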


Best regards,
 Jesper
From 8bb5f25d31471910db2e7907b4c13029edd7bbdf Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Wed, 28 Mar 2018 12:39:42 -0400
Subject: [PATCH] Trivial changes

---
 src/backend/catalog/partition.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 3c26daa098..2ee5ce605c 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1774,7 +1774,7 @@ perform_pruning_base_step(PartitionPruneContext *context,
 	Datum		values[PARTITION_MAX_KEYS];
 	FmgrInfo	partsupfunc[PARTITION_MAX_KEYS];
 
-	/* There better be same number of expressions and compare functions. */
+	/* There better be the same number of expressions and compare functions. */
 	Assert(list_length(opstep->exprs) == list_length(opstep->cmpfns));
 
 	nvalues = 0;
@@ -1783,7 +1783,7 @@ perform_pruning_base_step(PartitionPruneContext *context,
 
 	/*
 	 * Generate the partition look-up key that will be used by one of
-	 * get_partitions_from_keys_* functions called below.
+	 * the get_partitions_from_keys_* functions called below.
 	 */
 	for (keyno = 0; keyno < context->partnatts; keyno++)
 	{
@@ -1969,7 +1969,7 @@ perform_pruning_combine_step(PartitionPruneContext *context,
 	PruneStepResult *result = NULL;
 
 	/*
-	 * In some cases, planner generates a combine step that doesn't contain
+	 * In some cases, the planner generates a combine step that doesn't contain
 	 * any argument steps, to signal us to not prune any partitions.  So,
 	 * return indexes of all datums in that case, including null and/or
 	 * default partition, if any.
@@ -1994,7 +1994,7 @@ perform_pruning_combine_step(PartitionPruneContext *context,
 			PruneStepResult *step_result;
 
 			/*
-			 * step_results[step_id] must contain valid result, which is
+			 * step_results[step_id] must contain a valid result, which is
 			 * confirmed by the fact that cstep's ID is greater than
 			 * step_id.
 			 */
@@ -2147,10 +2147,10 @@ get_partitions_for_keys_hash(PartitionPruneContext *context,
  * If special partitions (null and default) need to be scanned for given
  * values, set scan_null and scan_default in result if present.
  *
- * 'nvalues', if non-zero, should be exactly 1, because list partitioning.
+ * 'nvalues', if non-zero, should be exactly 1, because of list partitioning.
  * 'value' contains the value to use for pruning
  * 'opstrategy' if non-zero must be a btree strategy number
- * 'partsupfunc' contains list partitioning comparison function to be used to
+ * 'partsupfunc' contains the list partitioning comparison function to be used to
  * perform partition_list_bsearch
  * 'nullkeys' is the set of partition keys that are null.
  */
@@ -2161,7 +2161,6 @@ get_partitions_for_keys_list(PartitionPruneContext *context,
 {
 	PruneStepResult *result;
 	PartitionBoundInfo boundinfo = context->boundinfo;
-	int		   *partindices = boundinfo->indexes;
 	int			off,
 minoff,
 maxoff;
@@ -2272,7 +2271,7 @@ get_partitions_for_keys_list(PartitionPruneContext *context,
 			{
 /*
  * This case means all partition bounds are greater, which in
- * turn means that all partition satisfy this key.
+ * turn means that all partitions satisfy this key.
  */
 off = 0;
 			}
@@ -2333,7 +2332,7 @@ get_partitions_for_keys_list(PartitionPruneContext *context,
  * 'values' contains values for partition keys (or a prefix) to be used for
  * pruning
  * 'opstrategy' if non-zero must be a btree strategy number
- * 'partsupfunc' contains range partitioning comparison function to be used to
+ * 'partsupfunc' contains the range partitioning comparison function to be used to
  * perform partition_range_datum_bsearch or partition_rbound_datum_cmp
  * 'nullkeys' is the set of partition keys that are null.
  */
@@ -2404,7 +2403,7 @@ get_partitions_for_keys_range(PartitionPruneContext *context,
 			{
 if (nvalues == partnatts)
 {
-	/* There can only be zero or one matching partitions. */
+	/* There can only be zero or one matching partition. */
 	if (partindices[off + 1] >= 0)
 	{
 		result->datum_offsets = bms_make_singleton(off + 1);
@@ -2532,7 +2531,7 @@ 

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-08 13:46:53 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane  wrote:
> >> FWIW, I would also vote for (1), especially if the only way to do (2)
> >> is stuff as outright scary as this.  I would far rather have (3) than
> >> this, because IMO, what we are looking at right now is going to make
> >> the fallout from multixacts look like a pleasant day at the beach.
> 
> > Whoa.  Well, that would clearly be bad, but I don't understand why you
> > find this so scary.  Can you explain further?
> 
> Possibly I'm crying wolf; it's hard to be sure.  But I recall that nobody
> was particularly afraid of multixacts when that went in, and look at all
> the trouble we've had with that.  Breaking fundamental invariants like
> "ctid points to this tuple or its update successor" is going to cause
> trouble.  There's a lot of code that knows that; more than knows the
> details of what's in xmax, I believe.

Given that, as explained nearby, we already store transient data in the
ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
even a whiff of trouble, I'm currently not inclined to see a huge issue
here.  It'd be great if you could expand on your concerns here a bit; we
gotta figure out a way forward.

I think the proposed patch needs some polish (I'm e.g. unhappy with the
naming of the new macros etc), but I think otherwise it's a reasonable
attempt at solving the problem.


> I would've been happier about expending an infomask bit towards this
> purpose.  Just eyeing what we've got, I can't help noticing that
> HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
> in a partitioned table.  Perhaps making these tests depend on
> partitioned-ness would be unworkably messy, but it's worth thinking
> about.

They previously couldn't be set together IIRC, so we could test (mask &
(HEAP_MOVED_OFF | HEAP_MOVED_IN)) == (HEAP_MOVED_OFF | HEAP_MOVED_IN), but
that'd be permanently eating two infomask bits for something that doesn't,
in general, have to be able to live on all tuples, just on the "deleted
tuple at the end of a chain".

Greetings,

Andres Freund



Re: [HACKERS] Pluggable storage

2018-03-28 Thread David Steele
On 2/26/18 3:19 AM, Alexander Korotkov wrote:
> On Fri, Feb 23, 2018 at 2:20 AM, Robert Haas  > wrote:
> 
> On Fri, Feb 16, 2018 at 5:56 AM, Alexander Korotkov
> > wrote:
> > BTW, EnterpriseDB announces zheap table access method (heap with undo 
> log)
> > [2].  I think this is great, and I'm looking forward for publishing 
> zheap in
> > mailing lists.  But I'm concerning about its compatibility with 
> pluggable
> > table access methods API.  Does zheap use table AM API from this 
> thread?  Or
> > does it just override current heap and needs to be adopted to use table 
> AM
> > API?  Or does it implements own API?
> 
> Right now it just hacks the code.  The plan is to adapt it to whatever
> API we settle on in this thread.
> 
> 
> Great, thank you for clarification.  I'm looking forward reviewing zheap :)
I think this entry should be moved to the next CF.  I'll do that
tomorrow unless there are objections.

Regards,
-- 
-David
da...@pgmasters.net



Re: [HACKERS] [PATCH] Incremental sort

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 16:28:01 +0300, Teodor Sigaev wrote:
> > BTW, patch had conflicts with master.  Please, find rebased version 
> > attached.
> 
> Despite the patch conflicts, the patch looks committable; does anybody
> object to committing it?

> The patch received several rounds of review over 2 years, and it seems to
> me that keeping it out of the sources risks losing it, although it offers
> a performance improvement in rather wide use cases.

My impression is that it has *NOT* received enough review to be RFC. I'm not
saying it's impossible to get there this release, but just committing it
doesn't seem wise.

Greetings,

Andres Freund



Re: Proposal: http2 wire format

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 09:59:34 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > A few random, very tired, points:
> 
> > - consolidated message for common tasks:
> >   - (bind, [describe?,] execute) to reduce overhead of prepared
> > statement execution (both in messages, as well as branches)
> >   - (anonymous parse, bind, describe, execute) to make it cheaper to
> > send statements with out-of-line parameters
> 
> I do not see a need for this; you can already send those combinations of
> messages in a single network packet if you have a mind to.

The simple protocol right now is *considerably* faster than the extended
protocol. The extended protocol sends more data overall, we do more
memory context resets, there's more switches between protocol messages
in both backend and client. All of those aren't free.

https://www.postgresql.org/message-id/12500.1470002232%40sss.pgh.pa.us

I've previously wondered whether we can peek ahead in the stream, recognize
that we got a bind/describe/execute or parse/bind/describe/execute sequence,
and execute them all together if all the necessary data is there, to avoid
new protocol messages.
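A toy of that peek-ahead idea (assuming only the v3 framing of one type byte plus a 4-byte self-inclusive length; everything else here is made up for illustration): scan the buffered bytes without consuming them and report whether a complete pipeline is already available.

```python
# Peek at a receive buffer and detect a fully-buffered Bind/Describe/Execute
# (optionally Parse-prefixed) run.  Framing per the v3 protocol: 1 type byte,
# then a 4-byte big-endian length that counts itself but not the type byte.

import struct

def frame(msgtype, payload=b""):
    return msgtype + struct.pack("!I", 4 + len(payload)) + payload

def peek_pipeline(buf):
    types = []
    off = 0
    while off + 5 <= len(buf):
        msgtype = buf[off:off + 1]
        (length,) = struct.unpack("!I", buf[off + 1:off + 5])
        if off + 1 + length > len(buf):
            break                      # message not fully received yet
        types.append(msgtype)
        off += 1 + length
    seq = b"".join(types)
    return seq in (b"BDE", b"PBDE")

buf = frame(b"P") + frame(b"B") + frame(b"D") + frame(b"E")
assert peek_pipeline(buf)
assert not peek_pipeline(buf[:-3])     # Execute message still incomplete
```

A real implementation would of course dispatch on the recognized run rather than just return a boolean, but the buffering question is the interesting part.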


> Your other points are sound, except I have no idea what this means:
> 
> > - nested table support

Yea, not the most descriptive... I mean returning multiple different result
sets from a function / procedure. The inability to do so is a serious
limitation of Postgres in comparison to some other languages with procedures.


Greetings,

Andres Freund



Re: Proposal: http2 wire format

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 16:29:37 +0800, Craig Ringer wrote:
> > - allow *streaming* of large datums
> 
> Yes, very much +1 there. That's already on the wiki. Yeah:
> 
> * Permit lazy fetches of large values, at least out-of-line TOASTED values
> http://www.postgresql.org/message-id/53ff0ef8@2ndquadrant.com

That's not necessarily the same though. What I think we need is the
ability to have "chunked" encoding with *optional* length for the
overall datum. And then the backend infrastructure to be able to send
*to the wire* partial datums.  Probably with some callback-based
StringInfo-like buffer.
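Chunked encoding with an optional overall length could look roughly like this (framing entirely illustrative, not a concrete proposal): each chunk carries its own length, and a zero-length chunk terminates the datum, so the sender can stream a large TOASTed value without knowing its total size up front.

```python
# Sketch of "chunked" datum streaming.  Each chunk is a 4-byte big-endian
# length followed by that many bytes; a zero-length chunk ends the datum.
# The sender never needs the overall length in advance.

import struct

def send_chunked(write, stream, chunk_size=4):
    for off in range(0, len(stream), chunk_size):
        chunk = stream[off:off + chunk_size]
        write(struct.pack("!I", len(chunk)) + chunk)
    write(struct.pack("!I", 0))        # terminator chunk

def recv_chunked(buf):
    out, off = b"", 0
    while True:
        (n,) = struct.unpack("!I", buf[off:off + 4])
        off += 4
        if n == 0:
            return out
        out += buf[off:off + n]
        off += n

wire = bytearray()
send_chunked(wire.extend, b"hello, toasted world")
assert recv_chunked(bytes(wire)) == b"hello, toasted world"
```

An optional total-length field could be prepended when the sender does know it, letting the receiver preallocate; the chunk framing stays the same either way.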

> - nested table support
> >
> >
> Can you elaborate on that one?

Nested recordsets, e.g. an SRF or procedure returning multiple query results.


> * room for other resultset formats later. Like Damir, I really want to add
> protobuf or json serializations of result sets at some point, mainly so we
> can return "entity graphs" in graph representation rather than left-join
> projection.

-1. I don't think this belongs in postgres.

Greetings,

Andres Freund



Re: [HACKERS] [PATCH] Incremental sort

2018-03-28 Thread Alvaro Herrera
Teodor Sigaev wrote:
> > BTW, patch had conflicts with master.  Please, find rebased version 
> > attached.
> 
> Despite the patch conflicts, the patch looks committable; does anybody
> object to committing it?
> 
> The patch received several rounds of review over 2 years, and it seems to
> me that keeping it out of the sources risks losing it, although it offers
> a performance improvement in rather wide use cases.

Can we have a recap on what the patch *does*?  I see there's a
description in Alexander's first email
https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=gon-...@mail.gmail.com
but that was a long time ago, and the patch has likely changed in the
meantime ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel Aggregates for string_agg and array_agg

2018-03-28 Thread David Rowley
On 27 March 2018 at 09:27, Tom Lane  wrote:
> The main thing that remains undone is to get some test coverage ---
> AFAICS, none of these new functions get exercised in the standard
> regression tests.

I've attached a patch which implements some tests for the new functions.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


parallel_string_agg_array_agg_tests.patch
Description: Binary data


Re: committing inside cursor loop

2018-03-28 Thread Ildus Kurbangaliev
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

I have checked the new version. Although I may have missed something, the
patch looks good to me.

The new status of this patch is: Ready for Committer


Re: WIP: Covering + unique indexes.

2018-03-28 Thread Erik Rijkers

On 2018-03-28 16:59, Anastasia Lubennikova wrote:

Here is the new version of the patch set.


I can't get these to apply:

patch -b -l -F 25 -p 1 < 
/home/aardvark/download/pgpatches/0110/covering_indexes/20180328/0001-Covering-core-v8.patch



1 out of 19 hunks FAILED -- saving rejects to file 
src/backend/utils/cache/relcache.c.rej



$ cat src/backend/utils/cache/relcache.c.rej
--- src/backend/utils/cache/relcache.c
+++ src/backend/utils/cache/relcache.c
@@ -542,7 +542,7 @@
attp = (Form_pg_attribute) 
GETSTRUCT(pg_attribute_tuple);


if (attp->attnum <= 0 ||
-   attp->attnum > relation->rd_rel->relnatts)
+   attp->attnum > 
RelationGetNumberOfAttributes(relation))
elog(ERROR, "invalid attribute number %d for 
%s",
 attp->attnum, 
RelationGetRelationName(relation));






Erik Rijkers




Re: [HACKERS] [PATCH] Incremental sort

2018-03-28 Thread Tomas Vondra


On 03/28/2018 05:12 PM, Alexander Korotkov wrote:
> On Wed, Mar 28, 2018 at 4:44 PM, Tomas Vondra
> > wrote:
> 
> On 03/28/2018 03:28 PM, Teodor Sigaev wrote:
> >> BTW, patch had conflicts with master.  Please, find rebased version
> >> attached.
> >
> > Despite the patch conflicts, the patch looks committable; does anybody
> > object to committing it?
> >
> > The patch received several rounds of review over 2 years, and it seems to
> > me that keeping it out of the sources risks losing it, although it offers
> > a performance improvement in rather wide use cases.
> >
> 
> No objections from me - if you want me to do one final round of review
> after the rebase (not sure how invasive it'll turn out), let me know.
> 
> 
> The rebased patch is attached.  Incremental sort gets used in multiple
> places in the partition_aggregate regression test.  I've checked those
> cases, and it seems that incremental sort was selected correctly.
> 

OK, I'll take a look.

> BTW, one detail I'd change is the name of the GUC variable. enable_incsort
> seems unnecessarily terse - let's go for enable_incremental_sort or
> something like that.
> 
> 
> enable_incsort was already renamed to enable_incrementalsort in [1].
> 
> 1.
> https://www.postgresql.org/message-id/CAPpHfduAVmiGDZC%2BdfNL1rEGu0mt45Rd_mxwjY57uqwWhrvQzg%40mail.gmail.com
>  
> 

Ah, apologies. I've been looking at the wrong version.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Implementing SQL ASSERTION

2018-03-28 Thread David Fetter
On Sun, Mar 18, 2018 at 12:29:50PM +, Joe Wildish wrote:
> > 
> >> 
> >> This patch no longer applies.  Any chance of a rebase?
> 
> Attached is a rebased version of this patch. It takes into account
> the ACL checking changes and a few other minor amendments.

Sorry to bother you again, but this now doesn't compile atop master.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-28 Thread Dmitry Ivanov

I'd like to see fastupdate=on in the tests too; right now the tests cover
only the case without fastupdate. Please add them.


Here's a couple of tests for pending list (fastupdate = on).


By the way, isn't it strange that permutation "fu1" "rxy3" "wx3" "rxy4" 
"c1" "wy4" "c2" doesn't produce an ERROR?


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: segfault due to invalid cached plan

2018-03-28 Thread Nicolas Thauvin
On Wed, 28 Mar 2018 16:14:04 +0200
Tomas Vondra  wrote:

> On 03/28/2018 03:54 PM, Nicolas Thauvin wrote:
> > Hello,
> > 
> > A customer sent us a core dump of the crash of the background
> > worker of the powa extension, running on 9.6.8 along side with
> > cstore_fdw.
> > 
> > The background worker loops on a plpgsql function, which then
> > execute another plpgsql function that issues the query "TRUNCATE
> > powa_statements_history_current".
> > 
> > Here is the backtrace :
> > 
> > ...
> >
> > The core shows that ProcessUtility hook of cstore_fdw produces a
> > segfault when trying to copy the relation list of the truncate
> > statement plan.
> > 
> > In frame 2:
> > (gdb) p *((TruncateStmt *)parseTree)
> > $26 = {type = T_TruncateStmt, relations = 0x2f106a8, restart_seqs =
> > 0 '\000', behavior = DROP_RESTRICT} (gdb) p *(((TruncateStmt
> > *)parseTree)->relations) $27 = {type = 49574520, length = 0, head =
> > 0x0, tail = 0x2faaab8}
> > 
> > With this invalid list, standard_ProcessUtility would not have
> > failed but the relation would not have been truncated.
> > 
> > ...
> > 
> > Could you please give me some pointers to further investigate this
> > crash?
> >   
> 
> Sounds like a bug in CStoreProcessUtility, which is part of
> cstore_fdw, not PostgreSQL. So I guess the right place to report the
> issue is
> 
>https://github.com/citusdata/cstore_fdw
> 
> 

Having a second look at the cstore_fdw code: it does modify the data.
I'll investigate this code further.

Thank you
-- 
Nicolas Thauvin
+33 (0)1 84 16 92 09
http://dalibo.com - http://dalibo.org



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-03-28 Thread Tomas Vondra
On 03/28/2018 04:12 PM, Dean Rasheed wrote:
> On 28 March 2018 at 01:34, Tomas Vondra  wrote:
>> Attached is a patch fixing this. In the end I've decided to keep both
>> branches - one handling boolean Vars and one for NOT clauses. I think
>> you're right we can only see (NOT var) cases, but I'm not sure about that.
>>
>> For example, what if an operator does not have a negator? Then we can't
>> transform NOT (a AND b) => (NOT a OR NOT b), I guess. So I kept this for
>> now, and we can remove this later.
>>
> 
> OK, but it's going to have to work harder to set "fullmatch"
> correctly. If you have a boolean Var clause, which is identical to
> "bool_var = true", it ought to add to "eqmatches" if true is found in
> the MCV list. Likewise a boolean Var under a NOT clause is identical
> to "bool_var = false", so it ought to add to "eqmatches" if false is
> found in the MCV list. Both those cases would be easy to handle, if
> general NOT support wasn't required, and you just special-cased "NOT
> bool_var".
> 
> If you're going to handle the general case of an arbitrary clause
> under a NOT, then the recursive call to mcv_update_match_bitmap()
> would seem to need to know that it's under a NOT (a new "is_not"
> parameter?), to invert the logic around adding to "eqmatches". That
> applies to other general OpExpr's too -- for example, "NOT (box_var =
> ?)" won't be rewritten because there is no box_ne operator, but when
> mcv_update_match_bitmap() is called recursively with the "box_var =
> ?", it shouldn't add to "eqmatches", despite this being an EQSEL
> operator.
> 
> As mentioned before, I think this whole thing only works if
> mcv_update_match_bitmap() returns the "eqmatches" list, so that if it
> is called recursively, it can be merged with the caller's list. What
> isn't immediately obvious to me is what happens to a NOT clause under
> another NOT clause, possibly with an AND or OR in-between. Would the
> "is_not" parameter just flip back to false again?
> 

After thinking about this a bit more, I'm not sure if updating the info
based on recursive calls makes sense. The fullmatch flag was supposed to
answer a simple question - can there be just a single matching item?

If there are equality conditions on all columns, there can be just a
single matching item - if we have found it in the MCV (i.e. s1 > 0.0),
then we don't need to inspect the non-MCV part.

But handling this in recursive manner breaks this entirely, because with
something like

   (a=1) AND (b=1 OR b=2)

you suddenly can have multiple matching items. Which makes the fullmatch
flag somewhat useless.

So I think we should be looking at top-level equality clauses only, just
like number_of_groups() does.
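The (a=1) AND (b=1 OR b=2) case above is easy to check with a toy clause evaluator (plain Python, illustrative only): every column has an equality somewhere in the tree, yet two MCV items match, so the "at most one matching item" conclusion only follows from top-level equality clauses.

```python
# Evaluate a small clause tree against a toy MCV list of (a, b) items and
# show that equality on every column does NOT imply a single matching item
# once OR is involved.  Names and structure are illustrative, not the
# extended-statistics code.

mcv = [(1, 1), (1, 2), (2, 1)]

def matches(item, clause):
    op = clause[0]
    if op == "and":
        return all(matches(item, c) for c in clause[1:])
    if op == "or":
        return any(matches(item, c) for c in clause[1:])
    if op == "eq":
        _, col, val = clause
        return item[col] == val
    raise ValueError(op)

# (a = 1) AND (b = 1 OR b = 2)
clause = ("and", ("eq", 0, 1), ("or", ("eq", 1, 1), ("eq", 1, 2)))
hits = [item for item in mcv if matches(item, clause)]
assert len(hits) == 2               # more than one matching item
```

With top-level equalities only, e.g. (a = 1) AND (b = 1), the same evaluator yields exactly one hit, which is the case the fullmatch reasoning was designed for.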


> There's also an interesting question around the NullTest clause. Since
> NULLs are being recorded in the MCV list, shouldn't "IS NULL" be
> treated as semantically like an equality clause, and cause that
> attribute to be added to "eqmatches" if NULL is found in the MCV list?
> 
> 
>> I've also realized that the "fullmatch" flag is somewhat confused,
>> because some places interpreted it as "there is equality on each
>> attribute" but in fact it also required an actual MCV match.
> 
> Yes, I was having similar thoughts. I think "eqmatches" / "fullmatch"
> probably just wants to track whether there was an exact comparison on
> all the attributes, not whether or not the value was in the MCV list,
> because the latter is already available in the "matches" bitmap.
> Knowing that complete, exact comparisons happened, and it wasn't in
> the MCV list, makes the "(1 - mcv_totalsel)) / otherdistinct" estimate
> reasonable.
> 

I think we can remove the fullmatch flag from mcv_update_bitmap
entirely. All we need to know is the presence of equality clauses and
whether there was a match in MCV (which we know from s1 > 0.0).

> However, I don't think that tracking "eqmatches" or "fullmatch" is
> sufficient for the general case. For example, for other operators like
> "!=", "<", "<=", all (or maybe half) the "1 - mcv_totalsel" ought to
> count towards the selectivity, plus possibly part of the MCV list
> (e.g., for "<=", using the sum of the matching MCV frequencies plus
> half the sum of the non-MCV frequencies might be reasonable -- c.f.
> scalarineqsel()). For an OR clause, you might want to count the number
> of non-MCV matches, because logically each one adds another "(1 -
> mcv_totalsel)) / otherdistinct" to the total selectivity. It's not
> immediately obvious how that can be made to fit into the current code
> structure. Perhaps it could be made to work by tracking the overall
> selectivity as it goes along. Or perhaps it could track the
> count/proportion of non-MCV matches.
> 

Yes, ignoring the non-equality clauses in 0002 is wrong - that's pretty
much why it's WIP and not merged into 0001.

thanks for the feedback

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, 

Re: segfault due to invalid cached plan

2018-03-28 Thread Tom Lane
Nicolas Thauvin  writes:
> A customer sent us a core dump of the crash of the background worker of
> the powa extension, running on 9.6.8 along side with cstore_fdw.
> The background worker loops on a plpgsql function, which then execute
> another plpgsql function that issues the query "TRUNCATE
> powa_statements_history_current".

I think the odds are very high that this is a cstore_fdw bug.  The
core backend simply doesn't do anything very interesting with TRUNCATE.

regards, tom lane



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-28 Thread Dmitry Ivanov

I'd like to see fastupdate=on in the tests too; right now the tests cover
only the case without fastupdate. Please add them.


Here's a couple of tests for pending list (fastupdate = on).

--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companydiff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 37070b3b72..095b1192cb 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/xloginsert.h"
+#include "storage/predicate.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -515,6 +516,19 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			btree->fillRoot(btree, newrootpg,
 			BufferGetBlockNumber(lbuffer), newlpage,
 			BufferGetBlockNumber(rbuffer), newrpage);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(lbuffer));
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
+
 		}
 		else
 		{
@@ -524,6 +538,14 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			GinPageGetOpaque(newrpage)->rightlink = savedRightLink;
 			GinPageGetOpaque(newlpage)->flags |= GIN_INCOMPLETE_SPLIT;
 			GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
 		}
 
 		/*
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index f9daaba52e..3fb4fc8264 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -19,6 +19,7 @@
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/rel.h"
 
 /*
@@ -1423,7 +1424,7 @@ disassembleLeaf(Page page)
  * Any segments that acquire new items are decoded, and the new items are
  * merged with the old items.
  *
- * Returns true if any new items were added. False means they were all
+ * Returns true if any new items were added. false means they were all
  * duplicates of existing items on the page.
  */
 static bool
@@ -1759,7 +1760,7 @@ leafRepackItems(disassembledLeaf *leaf, ItemPointer remaining)
  */
 BlockNumber
 createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
-  GinStatsData *buildStats)
+  GinStatsData *buildStats, Buffer entrybuffer)
 {
 	BlockNumber blkno;
 	Buffer		buffer;
@@ -1810,6 +1811,12 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	page = BufferGetPage(buffer);
 	blkno = BufferGetBlockNumber(buffer);
 
+	/*
+	 * Copy a predicate lock from the entry tree leaf (containing the
+	 * posting list) to the posting tree.
+	 */
+	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
@@ -1904,6 +1911,7 @@ ginInsertItemPointers(Relation index, BlockNumber rootBlkno,
 		btree.itemptr = insertdata.items[insertdata.curitem];
 		stack = ginFindLeafPage(&btree, false, NULL);
 
+		GinCheckForSerializableConflictIn(btree.index, NULL, stack->buffer);
 		ginInsertValue(&btree, stack, &insertdata, buildStats);
 	}
 }
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6fe67f346d..63603859bc 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -17,8 +17,10 @@
 #include "access/gin_private.h"
 #include "access/relscan.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
 
 /* GUC parameter */
 int			GinFuzzySearchLimit = 0;
@@ -33,11 +35,18 @@ typedef struct pendingPosition
 } pendingPosition;
 
 
+static void
+GinPredicateLockPage(Relation index, BlockNumber blkno, Snapshot snapshot)
+{
+	if (!GinGetUseFastUpdate(index))
+		PredicateLockPage(index, blkno, snapshot);
+}
+
 /*
  * Goes to the next page if current offset is outside of bounds
  */
 static bool
-moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack)
+moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack, Snapshot snapshot)
 {
 	Page		page = BufferGetPage(stack->buffer);
 
@@ -73,6 +82,9 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
 	/* Descend to the leftmost leaf page */
 	stack = ginScanBeginPostingTree(, index, rootPostingTree, snapshot);
 	buffer = stack->buffer;
+
+	GinPredicateLockPage(index, BufferGetBlockNumber(buffer), snapshot);
+
 	IncrBufferRefCount(buffer); /* prevent unpin in freeGinBtreeStack */
 
 	freeGinBtreeStack(stack);
@@ -94,6 +106,8 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
 			break;				/* no more pages */
 
 		buffer = 

Re: segfault due to invalid cached plan

2018-03-28 Thread Tomas Vondra
On 03/28/2018 03:54 PM, Nicolas Thauvin wrote:
> Hello,
> 
> A customer sent us a core dump of the crash of the background worker of
> the powa extension, running on 9.6.8 alongside cstore_fdw.
> 
> The background worker loops on a plpgsql function, which then executes
> another plpgsql function that issues the query "TRUNCATE
> powa_statements_history_current".
> 
> Here is the backtrace:
> 
> ...
>
> The core shows that the ProcessUtility hook of cstore_fdw produces a
> segfault when trying to copy the relation list of the truncate
> statement plan.
> 
> In frame 2:
> (gdb) p *((TruncateStmt *)parseTree)
> $26 = {type = T_TruncateStmt, relations = 0x2f106a8, restart_seqs = 0 '\000', 
> behavior = DROP_RESTRICT}
> (gdb) p *(((TruncateStmt *)parseTree)->relations)
> $27 = {type = 49574520, length = 0, head = 0x0, tail = 0x2faaab8}
> 
> With this invalid list, standard_ProcessUtility would not have failed
> but the relation would not have been truncated.
> 
> ...
> 
> Could you please give me some pointers to further investigate this
> crash?
> 

Sounds like a bug in CStoreProcessUtility, which is part of cstore_fdw,
not PostgreSQL. So I guess the right place to report the issue is

   https://github.com/citusdata/cstore_fdw


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-03-28 Thread Dean Rasheed
On 28 March 2018 at 01:34, Tomas Vondra  wrote:
> Attached is a patch fixing this. In the end I've decided to keep both
> branches - one handling boolean Vars and one for NOT clauses. I think
> you're right that we can only see (NOT var) cases, but I'm not certain.
>
> For example, what if an operator does not have a negator? Then we can't
> transform NOT (a AND b) => (NOT a OR NOT b), I guess. So I kept this for
> now, and we can remove this later.
>

OK, but it's going to have to work harder to set "fullmatch"
correctly. If you have a boolean Var clause, which is identical to
"bool_var = true", it ought to add to "eqmatches" if true is found in
the MCV list. Likewise a boolean Var under a NOT clause is identical
to "bool_var = false", so it ought to add to "eqmatches" if false is
found in the MCV list. Both those cases would be easy to handle, if
general NOT support wasn't required, and you just special-cased "NOT
bool_var".

If you're going to handle the general case of an arbitrary clause
under a NOT, then the recursive call to mcv_update_match_bitmap()
would seem to need to know that it's under a NOT (a new "is_not"
parameter?), to invert the logic around adding to "eqmatches". That
applies to other general OpExpr's too -- for example, "NOT (box_var =
?)" won't be rewritten because there is no box_ne operator, but when
mcv_update_match_bitmap() is called recursively with the "box_var =
?", it shouldn't add to "eqmatches", despite this being an EQSEL
operator.

As mentioned before, I think this whole thing only works if
mcv_update_match_bitmap() returns the "eqmatches" list, so that if it
is called recursively, it can be merged with the caller's list. What
isn't immediately obvious to me is what happens to a NOT clause under
another NOT clause, possibly with an AND or OR in-between. Would the
"is_not" parameter just flip back to false again?

There's also an interesting question around the NullTest clause. Since
NULLs are being recorded in the MCV list, shouldn't "IS NULL" be
treated as semantically like an equality clause, and cause that
attribute to be added to "eqmatches" if NULL is found in the MCV list?


> I've also realized that the "fullmatch" flag is somewhat confused,
> because some places interpreted it as "there is equality on each
> attribute" but in fact it also required an actual MCV match.

Yes, I was having similar thoughts. I think "eqmatches" / "fullmatch"
probably just wants to track whether there was an exact comparison on
all the attributes, not whether or not the value was in the MCV list,
because the latter is already available in the "matches" bitmap.
Knowing that complete, exact comparisons happened, and it wasn't in
the MCV list, makes the "(1 - mcv_totalsel)) / otherdistinct" estimate
reasonable.

However, I don't think that tracking "eqmatches" or "fullmatch" is
sufficient for the general case. For example, for other operators like
"!=", "<", "<=", all (or maybe half) the "1 - mcv_totalsel" ought to
count towards the selectivity, plus possibly part of the MCV list
(e.g., for "<=", using the sum of the matching MCV frequencies plus
half the sum of the non-MCV frequencies might be reasonable -- cf.
scalarineqsel()). For an OR clause, you might want to count the number
of non-MCV matches, because logically each one adds another "(1 -
mcv_totalsel)) / otherdistinct" to the total selectivity. It's not
immediately obvious how that can be made to fit into the current code
structure. Perhaps it could be made to work by tracking the overall
selectivity as it goes along. Or perhaps it could track the
count/proportion of non-MCV matches.

Regards,
Dean


