Vacuuming the operating system documentation

2020-06-05 Thread Thomas Munro
Hi,

We're carrying a bunch of obsolete and in one case insecure advice on
kernel settings.  Here's an attempt to clean some of that up.

Linux:
 * Drop reference to ancient systems that didn't have a sysctl command.
 * Drop references to Linux before 2.6.
 * I was tempted to remove the reference to oom_adj, which was
apparently deprecated from 2.6.36, but that's probably recent enough
to keep (RHEL6 may outlive humanity).

macOS:
 * Drop reference to 10.2 and 10.3 systems.  That's 15-16 years ago.
Even the ancient PPC systems in the build farm run 10.4+.

FreeBSD:
 * Drop insecure and outdated jail instructions.  I moved the
pre-FreeBSD 11 behaviour into a brief note in parentheses, because
FreeBSD 11 is the oldest release of that OS that is still in support.
In that parenthetical note, I dropped the reference to port numbers
and UIDs in shmem keys since we now use pgdata inode numbers instead.
 * Drop SysV semaphore instructions.  We switched to POSIX on this
platform in PostgreSQL 10, and we don't bother to give the redundant
instructions about semaphores for Linux so we might as well drop this
noise for FreeBSD too.
 * Clarify that kern.ipc.shm_use_phys only has a useful effect if
shared_memory_type=sysv, which is not the default.
 * Drop some stuff about pre-4.0 systems.  That was 20 years ago.

NetBSD:
 * Drop reference to pre-5.0 systems.  That was 11 years ago.  Maybe
someone wants to argue with me on this one?

OpenBSD:
 * Drop instruction on recompiling the kernel on pre-3.3 systems.
That was 17 years ago.

Solaris/illumos:
 * Drop instructions on Solaris 6-9 systems.  10 came out 15 years
ago, 9 was fully desupported 6 years ago.  The last person to mention
Solaris 9 on the mailing list was ... me.  That machine had cobwebs
even then.
 * Drop reference to OpenSolaris, which was cancelled ten years ago;
the surviving project goes by illumos, so use that name.

AIX:
 * Drop reference to 5.1, since there is no way systems older than
that are going to be running new PostgreSQL releases.  5.1 itself was
desupported by IBM 14 years ago.

HP-UX:
 * Drop advice for v10.  11.x came out 23 years ago.

It's a bit inconsistent that we bother to explain the SysV shmem
sysctls on some systems but not others, just because once upon a time
it was necessary to tweak them on some systems and not others due to
defaults.  You shouldn't need that anywhere now IIUC, unless you run a
lot of clusters or use shared_memory_type=sysv.  I'm not proposing to
add it where it's missing, as I don't have the information and I doubt
it's really useful anyway; you can find that stuff elsewhere if you
really need it.
From be61fe032a2b8ac3b6130c1f7a38c918d9423ec8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 6 Jun 2020 15:20:14 +1200
Subject: [PATCH] doc: Clean up references to obsolete OS versions.

Modernize the documentation to remove insecure and/or obsolete
instructions about old operating systems.
---
 doc/src/sgml/runtime.sgml | 152 +-
 1 file changed, 19 insertions(+), 133 deletions(-)

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index a8bb85e6f5..c0d1860bf2 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -872,7 +872,7 @@ psql: could not connect to server: No such file or directory
   
   

-At least as of version 5.1, it should not be necessary to do
+It should not be necessary to do
 any special configuration for such parameters as
 SHMMAX, as it appears this is configured to
 allow all memory to be used as shared memory.  That is the
@@ -907,41 +907,24 @@ psql: could not connect to server: No such file or directory
 /etc/sysctl.conf.

 
-   
-These semaphore-related settings are read-only as far as
-sysctl is concerned, but can be set in
-/boot/loader.conf:
-
-kern.ipc.semmni=256
-kern.ipc.semmns=512
-
-After modifying that file, a reboot is required for the new
-settings to take effect.
-   
-
-   
-You might also want to configure your kernel to lock System V shared
+  
+If you have set shared_memory_type to
+sysv (not the default, see ),
+you might also want to configure your kernel to lock System V shared
 memory into RAM and prevent it from being paged out to swap.
 This can be accomplished using the sysctl
 setting kern.ipc.shm_use_phys.

 

-If running in FreeBSD jails by enabling sysctl's
-security.jail.sysvipc_allowed, postmasters
-running in different jails should be run by different operating system
-users.  This improves security because it prevents non-root users
-from interfering with shared memory or semaphores in different jails,
-and it allows the PostgreSQL IPC cleanup code to function properly.
-(In FreeBSD 6.0 and later the IPC cleanup code does not 

Re: libpq copy error handling busted

2020-06-05 Thread Andres Freund
Hi,

On 2020-06-03 18:41:28 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > When libpq is used to COPY data to the server, it doesn't properly
> > handle errors.
> > This is partially an old problem, and partially got recently
> > worse. Before the below commit we detected terminated connections, but
> > we didn't handle copy failing.
> 
> Yeah.  After poking at that for a little bit, there are at least three
> problems:
> 
> * pqSendSome() is responsible not only for pushing out data, but for
> calling pqReadData in any situation where it can't get rid of the data
> promptly.  1f39a1c06 overlooked that requirement, and the upshot is
> that we don't necessarily notice that the connection is broken (it's
> pqReadData's job to detect that).  Putting a pqReadData call into
> the early-exit path helps, but doesn't fix things completely.

Is that fully necessary? Couldn't we handle at least the case I had by
looking at write_failed in additional places?

It might still be the right thing to continue to call pqReadData() from
pqSendSome(), don't get me wrong.


> * The more longstanding problem is that the PQputCopyData code path
> doesn't have any mechanism for consuming an 'E' (error) message
> once pqReadData has collected it.  AFAICS that's ancient.

Yea, I looked back quite a bit, and it looked that way for a long
time. I thought for a moment that it might be related to the copy-both
introduction, but it wasn't.


> I think that the idea was to let the client dump all its copy data and
> then report the error message when PQendCopy is called, but as you
> say, that's none too friendly when there's gigabytes of data involved.
> I'm not sure we can do anything about this without effectively
> changing the client API for copy errors, though.

Hm. Why would it *really* be an API change? Until recently connection
failures etc were returned from PQputCopyData(), and it is documented
that way:

/*
 * PQputCopyData - send some data to the backend during COPY IN or COPY BOTH
 *
 * Returns 1 if successful, 0 if data could not be sent (only possible
 * in nonblock mode), or -1 if an error occurs.
 */
int
PQputCopyData(PGconn *conn, const char *buffer, int nbytes)

So consuming 'E' when in copy mode doesn't seem like a crazy change to
me. Probably a bit too big to backpatch though. But given that this
hasn't been complained about much in however many years...

Greetings,

Andres Freund




hashagg slowdown due to spill changes

2020-06-05 Thread Andres Freund
Hi,

While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.

Sample benchmark:

config:
-c huge_pages=on -c shared_buffers=32GB -c jit=0 -c max_parallel_workers_per_gather=0
(largely just to reduce variance)

data prep:
CREATE TABLE fewgroups_many_rows AS SELECT (random() * 4)::int cat, 
(random()*1)::int val FROM generate_series(1, 1);
VACUUM (FREEZE, ANALYZE) fewgroups_many_rows;

test prep:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('fewgroups_many_rows', 'buffer');

test:
SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;

Using best-of-three timing:

12  12221.031 ms
master  13855.129 ms

While not the end of the world, that's a definitely noticeable and
reproducible slowdown (~12%).

I don't think this is actually an inherent cost, but a question of how
the code ended up being organized. Here's a perf diff of profiles for
both versions:

# Baseline  Delta Abs  Shared Object  Symbol
# ........  .........  .............  ......................................
#
   +6.70%  postgres  [.] LookupTupleHashEntryHash
   +6.37%  postgres  [.] prepare_hash_slot
   +4.74%  postgres  [.] TupleHashTableHash_internal.isra.0
20.36% -2.89%  postgres  [.] ExecInterpExpr
 6.31% -2.73%  postgres  [.] lookup_hash_entries
   +2.36%  postgres  [.] lookup_hash_entry
   +2.14%  postgres  [.] ExecJustAssignScanVar
 2.28% +1.97%  postgres  [.] ExecScan
 2.54% +1.93%  postgres  [.] MemoryContextReset
 3.84% -1.86%  postgres  [.] SeqNext
10.19% -1.50%  postgres  [.] tts_buffer_heap_getsomeattrs
   +1.42%  postgres  [.] hash_bytes_uint32
   +1.39%  postgres  [.] TupleHashTableHash
   +1.10%  postgres  [.] tts_virtual_clear
 3.36% -0.74%  postgres  [.] ExecAgg
   +0.45%  postgres  [.] CheckForSerializableConflictOutNeeded
 0.25% +0.44%  postgres  [.] hashint4
 5.80% -0.35%  postgres  [.] tts_minimal_getsomeattrs
 1.91% -0.33%  postgres  [.] heap_getnextslot
 4.86% -0.32%  postgres  [.] heapgettup_pagemode
 1.46% -0.32%  postgres  [.] tts_minimal_clear

While some of this is likely just noise, it's pretty clear that we
spend a substantial amount of additional time below
lookup_hash_entries().

And looking at the code, I'm not too surprised:

Before there was basically one call from nodeAgg.c to execGrouping.c for
each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()

For each of these, data needs to be peeled out of one or more of AggState
/ AggStatePerHashData / TupleHashTable. There's no way the compiler can
know that nothing inside those changes, therefore it has to reload the
contents repeatedly.  By my look at the profiles, that's where most of
the time is going.

There's also the issue that the signalling whether to insert / not to
insert got unnecessarily complicated. There's several checks:
1) lookup_hash_entry() (p_isnew = aggstate->hash_spill_mode ? NULL : &isnew)
2) LookupTupleHashEntry_internal() (if (isnew))
3) lookup_hash_entry() (if (entry == NULL) and if (isnew))
4) lookup_hash_entries() if (!in_hash_table)

Not performance related: I am a bit confused why the new per-hash stuff
in lookup_hash_entries() isn't in lookup_hash_entry()? I assume that's
because of agg_refill_hash_table()?


Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(, ); if (isnew) initialize(entry)

That way there's again exactly one call to execGrouping.c, there's no
need for nodeAgg to separately compute the hash, there's far fewer
branches...

Doing things this way might perhaps make agg_refill_hash_table() a tiny
bit more complicated, but it'll also avoid the slowdown for the vast
majority of cases where we're not spilling.

Greetings,

Andres Freund




Re: Atomic operations within spinlocks

2020-06-05 Thread Andres Freund
Hi,

On 2020-06-05 22:52:47 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
> >> At some point I think we'll have to give up --disable-spinlocks; it's
> >> really of pretty marginal use (how often does anyone port PG to a new
> >> CPU type?) and the number of weird interactions it adds in this area
> >> seems like more than it's worth.
>
> > Indeed. And any new architecture one would port PG to would have good
> > enough compiler intrinsics to make that trivial. I still think it'd make
> > sense to have a fallback implementation using compiler intrinsics...
>
> > And I think we should just require 32bit atomics at the same time. Would
> > probably kill gaur though.
>
> Not only gaur.  A quick buildfarm survey finds these active members
> reporting not having 32-bit atomics:

Hm, I don't think that's the right test. We have bespoke code to support
most of these, I think:


>  anole | 2020-06-05 11:20:17 | pgac_cv_gcc_atomic_int32_cas=no

Has support via aCC-specific intrinsics.


>  chipmunk  | 2020-05-29 22:27:56 | pgac_cv_gcc_atomic_int32_cas=no

Doesn't have support for __atomic, but does have support for 32bit
__sync.


>  gharial   | 2020-06-05 12:41:14 | pgac_cv_gcc_atomic_int32_cas=no

__sync support for both 32 and 64 bit.


>  curculio  | 2020-06-05 22:30:06 | pgac_cv_gcc_atomic_int32_cas=no
>  frogfish  | 2020-05-31 13:00:25 | pgac_cv_gcc_atomic_int32_cas=no

__sync support for both 32 and 64 bit.


>  mandrill  | 2020-06-05 09:20:03 | pgac_cv_gcc_atomic_int32_cas=no

__sync support for 32 bit, as well as inline asm for 32bit atomics
(although we might be able to add 64 bit).


>  hornet| 2020-06-05 09:11:26 | pgac_cv_gcc_atomic_int32_cas=no
>  hoverfly  | 2020-06-05 22:06:14 | pgac_cv_gcc_atomic_int32_cas=no

__sync support for both 32 and 64 bit, and we have open coded ppc asm.


>  locust| 2020-06-05 10:14:29 | pgac_cv_gcc_atomic_int32_cas=no
>  prairiedog| 2020-06-05 09:55:49 | pgac_cv_gcc_atomic_int32_cas=no

Wee, these don't have __sync? But I think it should be able to use the
asm ppc implementation for 32 bit atomics.



>  gaur  | 2020-05-19 13:33:25 | pgac_cv_gcc_atomic_int32_cas=no

As far as I understand pa-risc doesn't have any atomic instructions
except for TAS.


So I think gaur is really the only one that'd drop.



> It looks to me like this is mostly about compiler support not the
> hardware; that doesn't make it not a problem, though.  (I also
> remain skeptical about the quality of the compiler intrinsics
> on non-mainstream hardware.)

I think that's fair enough for really old platforms, but at least for
gcc / clang I don't think it's a huge concern for newer ones. Even if
not mainstream. For gcc/clang the intrinsics basically back the
C11/C++11 "language level" atomics support. And those are extremely
widely used these days.

Greetings,

Andres Freund




Re: valgrind error

2020-06-05 Thread Tom Lane
Noah Misch  writes:
> On Fri, Jun 05, 2020 at 12:17:54PM -0400, Tom Lane wrote:
>> as you have it, I'd prefer to use
>> -   fun:pg_comp_crc32c
>> +   fun:pg_comp_crc32c_sb8
>> which precisely matches what 4f700bc did.  The other way seems like
>> it's giving a free pass to problems that could lurk in unrelated CRC
>> implementations.

> The undefined data is in the CRC input, namely the padding bytes in xl_*
> structs.

Oh, I see.  Objection withdrawn.

> Apparently, valgrind-3.15.0 doesn't complain about undefined input
> to _mm_crc32_u* functions.  We should not be surprised if Valgrind gains the
> features necessary to complain about the other implementations.

Perhaps it already has ... I wonder if anyone's tried this on ARMv8
lately.

regards, tom lane




Re: valgrind error

2020-06-05 Thread Noah Misch
On Fri, Jun 05, 2020 at 12:17:54PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > I can reproduce this on a 2017-vintage CPU with ./configure
> > ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
> > under valgrind-3.15.0 (as packaged by RHEL 7.8).  valgrind.supp has a
> > suppression for CRC calculations, but it didn't get the memo when commit
> > 4f700bc renamed the function.  The attached patch fixes the suppression.
> 
> I can also reproduce this, on RHEL 8.2 which likewise has valgrind-3.15.0,
> using the same configuration to force use of that CRC function.  I concur
> with your diagnosis that this is just a missed update of the pre-existing
> suppression rule.  However, rather than
> 
> -   fun:pg_comp_crc32c
> +   fun:pg_comp_crc32c*
> 
> as you have it, I'd prefer to use
> 
> -   fun:pg_comp_crc32c
> +   fun:pg_comp_crc32c_sb8
> 
> which precisely matches what 4f700bc did.  The other way seems like
> it's giving a free pass to problems that could lurk in unrelated CRC
> implementations.

The undefined data is in the CRC input, namely the padding bytes in xl_*
structs.  Apparently, valgrind-3.15.0 doesn't complain about undefined input
to _mm_crc32_u* functions.  We should not be surprised if Valgrind gains the
features necessary to complain about the other implementations.

Most COMP_CRC32C callers don't have a suppression, so Valgrind still studies
each CRC implementation via those other callers.




Re: Atomic operations within spinlocks

2020-06-05 Thread Tom Lane
Andres Freund  writes:
> On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
>> At some point I think we'll have to give up --disable-spinlocks; it's
>> really of pretty marginal use (how often does anyone port PG to a new
>> CPU type?) and the number of weird interactions it adds in this area
>> seems like more than it's worth.

> Indeed. And any new architecture one would port PG to would have good
> enough compiler intrinsics to make that trivial. I still think it'd make
> sense to have a fallback implementation using compiler intrinsics...

> And I think we should just require 32bit atomics at the same time. Would
> probably kill gaur though.

Not only gaur.  A quick buildfarm survey finds these active members
reporting not having 32-bit atomics:

 anole | 2020-06-05 11:20:17 | pgac_cv_gcc_atomic_int32_cas=no
 chipmunk  | 2020-05-29 22:27:56 | pgac_cv_gcc_atomic_int32_cas=no
 curculio  | 2020-06-05 22:30:06 | pgac_cv_gcc_atomic_int32_cas=no
 frogfish  | 2020-05-31 13:00:25 | pgac_cv_gcc_atomic_int32_cas=no
 gaur  | 2020-05-19 13:33:25 | pgac_cv_gcc_atomic_int32_cas=no
 gharial   | 2020-06-05 12:41:14 | pgac_cv_gcc_atomic_int32_cas=no
 hornet| 2020-06-05 09:11:26 | pgac_cv_gcc_atomic_int32_cas=no
 hoverfly  | 2020-06-05 22:06:14 | pgac_cv_gcc_atomic_int32_cas=no
 locust| 2020-06-05 10:14:29 | pgac_cv_gcc_atomic_int32_cas=no
 mandrill  | 2020-06-05 09:20:03 | pgac_cv_gcc_atomic_int32_cas=no
 prairiedog| 2020-06-05 09:55:49 | pgac_cv_gcc_atomic_int32_cas=no

It looks to me like this is mostly about compiler support not the
hardware; that doesn't make it not a problem, though.  (I also
remain skeptical about the quality of the compiler intrinsics
on non-mainstream hardware.)

regards, tom lane




Re: v13: Performance regression related to FORTIFY_SOURCE

2020-06-05 Thread Andres Freund
Hi,

On 2020-06-05 18:39:28 -0700, Jeff Davis wrote:
> On Fri, 2020-06-05 at 14:49 -0700, Andres Freund wrote:
> > FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a
> > profile shows me:
> 
> ...
> 
> >   4.65 │   → callq  memcpy@plt
> >        │   LogicalTapeWrite():
> > 
> > I.e. normal memcpy is getting called.
> > 
> > That's with -D_FORTIFY_SOURCE=2
> 
> That's good news, although people will be using ubuntu 18.04 for a
> while.
> 
> Just to confirm, would you mind trying the example programs in the GCC
> bug report?
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

I get "call memcpy@PLT" for both files. With various debian versions of
gcc (7,8,9,10). But, very curiously, I do see the difference when
compiling with gcc-snapshot (which is a debian package wrapping a recent
snapshot from upstream gcc).

Greetings,

Andres Freund




Re: Atomic operations within spinlocks

2020-06-05 Thread Andres Freund
Hi,

On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
> > I'm not really sure what to do about that issue. The easiest thing
> > would probably be to change the barrier generation to 32bit (which
> > doesn't have to use locks for reads in any situation).
>
> Yeah, I think we need a hard rule that you can't use a spinlock in
> an interrupt handler --- which means no atomics that don't have
> non-spinlock implementations on every platform.

Yea, that might be the easiest thing to do.  The only other thing I can
think of would be to mask all signals for the duration of the
atomic-using-spinlock operation. That'd make the fallback noticeably more
expensive, but otoh, do we care enough?

I think a SIGNAL_HANDLER_BEGIN(); SIGNAL_HANDLER_END(); to back an
Assert(!InSignalHandler()); could be quite useful.  Could also save
errno etc.


> At some point I think we'll have to give up --disable-spinlocks; it's
> really of pretty marginal use (how often does anyone port PG to a new
> CPU type?) and the number of weird interactions it adds in this area
> seems like more than it's worth.

Indeed. And any new architecture one would port PG to would have good
enough compiler intrinsics to make that trivial. I still think it'd make
sense to have a fallback implementation using compiler intrinsics...

And I think we should just require 32bit atomics at the same time. Would
probably kill gaur though.


I did just find a longstanding bug in the spinlock emulation code:

void
s_init_lock_sema(volatile slock_t *lock, bool nested)
{
static int  counter = 0;

*lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
}

void
s_unlock_sema(volatile slock_t *lock)
{
int lockndx = *lock;

if (lockndx <= 0 || lockndx > NUM_SPINLOCK_SEMAPHORES)
elog(ERROR, "invalid spinlock number: %d", lockndx);
PGSemaphoreUnlock(SpinlockSemaArray[lockndx - 1]);
}


I don't think it's ok that counter is a signed integer... While it may
once have been unlikely that we'd ever have that many spinlocks, that's
not hard to reach anymore, because we dynamically allocate them for a
lot of parallel query stuff.  A small regression test that initializes
enough spinlocks indeed errors out with
2020-06-05 18:08:29.110 PDT [734946][3/2:0] ERROR:  invalid spinlock number: -126
2020-06-05 18:08:29.110 PDT [734946][3/2:0] STATEMENT:  SELECT test_atomic_ops();



> > Randomly noticed while looking at the code:
> > uint64  flagbit = UINT64CONST(1) << (uint64) type;
>
> I'm surprised we didn't get any compiler warnings about that.

Unfortunately I don't think one can currently compile postgres with
warnings for "implicit casts" enabled :(.


> But of course requiring 64-bit atomics is still a step too far.

If we had a 32bit compare-exchange it ought to be possible to write a
signal-safe emulation of 64bit atomics. I think. Something *roughly*
like:


typedef struct pg_atomic_uint64
{
/*
 * Meaning of state bits:
 * 0-1: current valid
 * 2-4: current proposed
 * 5: in signal handler
 * 6-31: pid of proposer
 */
pg_atomic_uint32 state;

/*
 * One current value, two different proposed values.
 */
uint64 value[3];
} pg_atomic_uint64;

The update protocol would be something roughly like:

bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 *expected, uint64 newval)
{
while (true)
{
uint32 old_state = pg_atomic_read_u32(&ptr->state);
uint32 updater_pid = PID_FROM_STATE(old_state);
uint32 new_state;
uint64 old_value;

int proposing;

/*
 * Value changed, so fail. This is obviously racy, but we'll
 * notice concurrent updates later.
 */
if (ptr->value[VALID_FIELD(old_state)] != *expected)
{
return false;
}

if (updater_pid == INVALID_PID)
{

new_state = old_state;

/* signal that current process is updating */
new_state |= MyProcPid << PID_STATE_SHIFT;
if (InSignalHandler)
new_state |= PROPOSER_IN_SIGNAL_HANDLER_BIT;

/* set which index is being proposed */
new_state = (new_state & ~PROPOSER_BITS) |
NEXT_PROPOSED_FIELD(old_state, &proposing);

/*
 * If we successfully can update state to contain our new
 * value, we have a right to do so, and can only be
 * interrupted by ourselves, in a signal handler.
 */
if (!pg_atomic_compare_exchange(&ptr->state, &old_state, new_state))
{
/* somebody else updated, restart */
continue;
}

old_state = new_state;

/*
 * It's ok to compare the values now. If we are interrupted
 * by a signal handler, we'll notice when updating
 * state. There's no danger updating the same proposed 

Re: v13: Performance regression related to FORTIFY_SOURCE

2020-06-05 Thread Jeff Davis
On Fri, 2020-06-05 at 21:50 -0400, Tom Lane wrote:
> Or possibly casting the whole thing to int or unsigned int would be
> better.  Point being that I bet it's int vs long that is making the
> difference.

That did it, and it's much more tolerable as a workaround. Thank you.

I haven't tested end-to-end that it solves the problem, but I'm pretty
sure it will.

Regards,
Jeff Davis






Re: v13: Performance regression related to FORTIFY_SOURCE

2020-06-05 Thread Tom Lane
Jeff Davis  writes:
> On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote:
>> If it is something worth worrying about, let's discuss what's a good
>> fix for it.

> While making a minimal test case for the GCC bug report, I found
> another surprisingly-small workaround. Patch attached.

Ugh :-( ... but perhaps you could get the same result like this:

-#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockPayloadSize  (BLCKSZ - (int) sizeof(TapeBlockTrailer))

Or possibly casting the whole thing to int or unsigned int would be
better.  Point being that I bet it's int vs long that is making the
difference.

regards, tom lane




Re: v13: Performance regression related to FORTIFY_SOURCE

2020-06-05 Thread Jeff Davis
On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote:
> If it is something worth worrying about, let's discuss what's a good
> fix for it.

While making a minimal test case for the GCC bug report, I found
another surprisingly-small workaround. Patch attached.

Regards,
Jeff Davis

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 666a7c0e81c..c09e1e308ae 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -99,7 +99,7 @@ typedef struct TapeBlockTrailer
  * bytes on last block (if < 0) */
 } TapeBlockTrailer;
 
-#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockPayloadSize  (BLCKSZ - 16)
 #define TapeBlockGetTrailer(buf) \
 	((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
 


Re: v13: Performance regression related to FORTIFY_SOURCE

2020-06-05 Thread Jeff Davis
On Fri, 2020-06-05 at 14:49 -0700, Andres Freund wrote:
> FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a
> profile shows me:

...

>   4.65 │   → callq  memcpy@plt
>        │   LogicalTapeWrite():
> 
> I.e. normal memcpy is getting called.
> 
> That's with -D_FORTIFY_SOURCE=2

That's good news, although people will be using ubuntu 18.04 for a
while.

Just to confirm, would you mind trying the example programs in the GCC
bug report?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

> With which compiler / libc versions did you encounter this?

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
gcc-9 (Ubuntu 9.2.1-19ubuntu1~18.04.york0) 9.2.1 20191109
libc-dev-bin/bionic,now 2.27-3ubuntu1 amd64 [installed]

Regards,
Jeff Davis






Re: v13: Performance regression related to FORTIFY_SOURCE

2020-06-05 Thread Jeff Davis
On Thu, 2020-06-04 at 21:41 -0400, Tom Lane wrote:
> I think what'd make more sense is to file this as a gcc bug ("why
> doesn't
> it remove the useless object size check?") 

Filed:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

Regards,
Jeff Davis






Re: Atomic operations within spinlocks

2020-06-05 Thread Tom Lane
Andres Freund  writes:
> I wrote a patch for this, and when I got around to testing it, I
> found that our tests currently don't pass when using both
> --disable-spinlocks and --disable-atomics. Turns out to not be related
> to the issue above, but the global barrier support added in 13.
> That *reads* two 64 bit atomics in a signal handler. Which is normally
> fine, but not at all cool when atomics (or just 64 bit atomics) are
> backed by spinlocks. Because we can "self interrupt" while already
> holding the spinlock.

This is the sort of weird platform-specific problem that I'd prefer to
avoid by minimizing our expectations of what spinlocks can be used for.

> I'm not really sure what to do about that issue. The easiest thing
> would probably be to change the barrier generation to 32bit (which
> doesn't have to use locks for reads in any situation).

Yeah, I think we need a hard rule that you can't use a spinlock in
an interrupt handler --- which means no atomics that don't have
non-spinlock implementations on every platform.

At some point I think we'll have to give up --disable-spinlocks;
it's really of pretty marginal use (how often does anyone port PG
to a new CPU type?) and the number of weird interactions it adds
in this area seems like more than it's worth.  But of course
requiring 64-bit atomics is still a step too far.

> Randomly noticed while looking at the code:
>   uint64  flagbit = UINT64CONST(1) << (uint64) type;

I'm surprised we didn't get any compiler warnings about that.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2020-06-05 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> I wonder if we can collect some stats to measure how effective the
> prefetching actually is. Ultimately we want something like cache hit
> ratio, but we're only preloading into page cache, so we can't easily
> measure that. Perhaps we could measure I/O timings in redo, though?

That would certainly be interesting, particularly as this optimization
seems likely to be useful on some platforms (eg, zfs, where the
filesystem block size is larger than ours..) and less on others
(traditional systems which have a smaller block size).

Thanks,

Stephen




Re: Atomic operations within spinlocks

2020-06-05 Thread Andres Freund
Hi,

On 2020-06-04 19:33:02 -0700, Andres Freund wrote:
> But it looks like that code is currently buggy (and looks like it always
> has been), because we don't look at the nested argument when
> initializing the semaphore.  So we currently allocate too many
> semaphores, without benefiting from them :(.

I wrote a patch for this, and when I got around to testing it, I
found that our tests currently don't pass when using both
--disable-spinlocks and --disable-atomics. Turns out to not be related
to the issue above, but the global barrier support added in 13.

That *reads* two 64 bit atomics in a signal handler. Which is normally
fine, but not at all cool when atomics (or just 64 bit atomics) are
backed by spinlocks. Because we can "self interrupt" while already
holding the spinlock.

It looks to me that that's a danger whenever 64bit atomics are backed by
spinlocks, not just when both --disable-spinlocks and --disable-atomics
are used. But I suspect that it's really hard to hit the tiny window of
danger when those options aren't used. While we have buildfarm animals
testing each of those separately, we don't have one that tests both
together...

I'm not really sure what to do about that issue. The easiest thing
would probably be to change the barrier generation to 32bit (which
doesn't have to use locks for reads in any situation).  I tested doing
that, and it fixes the hangs for me.


Randomly noticed while looking at the code:
uint64  flagbit = UINT64CONST(1) << (uint64) type;

that shouldn't be 64bit, right?

Greetings,

Andres Freund




Re: Trouble with hashagg spill I/O pattern and costing

2020-06-05 Thread Tomas Vondra

On Fri, Jun 05, 2020 at 06:51:34PM -0400, Alvaro Herrera wrote:
> On 2020-Jun-06, Tomas Vondra wrote:
> > On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote:
> > > Is this patch the only thing missing before this open item can be
> > > considered closed?
> >
> > I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0,
> > sorry for not mentioning it in this thread explicitly.
>
> That's great to know, thanks.  The other bit necessary to answer my
> question is whether we need to do anything else in this area -- if
> not, then we can mark the open item as closed:
> https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Open_Issues



Hmmm, good question.

There was some discussion about maybe tweaking the costing model to make
it a bit more pessimistic (assuming more random I/O or something like
that), but I'm not sure it's still needed. Increasing random_page_cost
for the temp tablespace did the trick for me.

So I'd say we can mark it as closed ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Trouble with hashagg spill I/O pattern and costing

2020-06-05 Thread Alvaro Herrera
On 2020-Jun-06, Tomas Vondra wrote:

> On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote:
>
> > Is this patch the only thing missing before this open item can be
> > considered closed?
> 
> I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0,
> sorry for not mentioning it in this thread explicitly.

That's great to know, thanks.  The other bit necessary to answer my
question is whether we need to do anything else in this area -- if
no, then we can mark the open item as closed:
https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Open_Issues

Thanks

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: WAL prefetch (another approach)

2020-06-05 Thread Tomas Vondra

Hi,

I wonder if we can collect some stats to measure how effective the
prefetching actually is. Ultimately we want something like cache hit
ratio, but we're only preloading into page cache, so we can't easily
measure that. Perhaps we could measure I/O timings in redo, though?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Trouble with hashagg spill I/O pattern and costing

2020-06-05 Thread Tomas Vondra

On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote:

Hello

Is this patch the only thing missing before this open item can be
considered closed?



I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0,
sorry for not mentioning it in this thread explicitly.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: v13: Performance regression related to FORTIFY_SOURCE

2020-06-05 Thread Andres Freund
Hi,

On 2020-04-19 15:07:22 -0700, Jeff Davis wrote:
> I brought up an issue where GCC in combination with FORTIFY_SOURCE[2]
> causes a perf regression for logical tapes after introducing
> LogicalTapeSetExtend()[3]. Unfortunately, FORTIFY_SOURCE is used by
> default on ubuntu. I have not observed the problem with clang.
>
> There is no reason why the change should trigger the regression, but it
> does. The slowdown is due to GCC switching to an inlined version of
> memcpy() for LogicalTapeWrite() at logtape.c:768. The change[3] seems
> to have little if anything to do with that.

FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a
profile shows me:

   │   nthistime = TapeBlockPayloadSize - lt->pos;
   │   if (nthistime > size)
  3.01 │1  b0:   cmp  %rdx,%r12
  1.09 │ cmovbe   %r12,%rdx
   │   memcpy():
   │
   │   __fortify_function void *
   │   __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
   │   size_t __len))
   │   {
   │   return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
  2.44 │ mov  %r13,%rsi
   │   LogicalTapeWrite():
   │   nthistime = size;
   │   Assert(nthistime > 0);
   │
   │   memcpy(lt->buffer + lt->pos, ptr, nthistime);
  2.49 │ add  0x28(%rbx),%rdi
  0.28 │ mov  %rdx,%r15
   │   memcpy():
  4.65 │   → callq  memcpy@plt
   │   LogicalTapeWrite():
I.e. normal memcpy is getting called.

That's with -D_FORTIFY_SOURCE=2

With which compiler / libc versions did you encounter this?

Greetings,

Andres Freund




Re: Trouble with hashagg spill I/O pattern and costing

2020-06-05 Thread Alvaro Herrera
Hello

Is this patch the only thing missing before this open item can be
considered closed?

Thanks

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: WAL prefetch (another approach)

2020-06-05 Thread Thomas Munro
On Sat, Jun 6, 2020 at 8:41 AM Tomas Vondra wrote:
> BTW in all three cases it happens right after the first restart point in
> the WAL stream:
>
> LOG:  restored log file "0001010800FD" from archive
> LOG:  restartpoint starting: time
> LOG:  restored log file "0001010800FE" from archive
> LOG:  restartpoint complete: wrote 236092 buffers (22.5%); 0 WAL ...
> LOG:  recovery restart point at 108/FC28
> DETAIL:  Last completed transaction was at log time 2020-06-04
>  15:27:00.95139+02.
> LOG:  recovery no longer prefetching: unexpected pageaddr
>   108/5700 in log segment 0001010800FF, offset 0
> LOG:  restored log file "0001010800FF" from archive
>
> It looks exactly like this in case of all 3 failures ...

Huh.  Thanks!  I'll try to reproduce this here.




Re: repeat() function, CHECK_FOR_INTERRUPTS(), and unlikely()

2020-06-05 Thread Joe Conway
On 6/4/20 5:20 PM, Alvaro Herrera wrote:
> On 2020-May-28, Joe Conway wrote:
> 
>> I backpatched and pushed the changes to the repeat() function. Any other
>> opinions regarding backpatch of the unlikely() addition to 
>> CHECK_FOR_INTERRUPTS()?
> 
> We don't use unlikely() in 9.6 at all, so I would stop that backpatching
> at 10 anyhow.  (We did backpatch unlikely()'s definition afterwards.)


Correct you are -- thanks for the heads up! Pushed to REL_10_STABLE and later.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: WIP: WAL prefetch (another approach)

2020-06-05 Thread Tomas Vondra

On Fri, Jun 05, 2020 at 10:04:14PM +0200, Tomas Vondra wrote:

On Fri, Jun 05, 2020 at 05:20:52PM +0200, Tomas Vondra wrote:


...

which is not particularly great, I guess. There however seems to be
something wrong, because with the prefetching I see this in the log:

prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG:  recovery no
longer prefetching: unexpected pageaddr 108/E800 in log segment
0001010800FF, offset 0

prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG:  recovery no
longer prefetching: unexpected pageaddr 108/E800 in log segment
000101090001, offset 0

Which seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine. I haven't seen it on NVME (which processes much more WAL).
And the SATA baseline (no prefetching) also worked fine.

Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (but just one segment apart). Seems strange.



I suspected it might be due to a somewhat hackish restore_command that
prefetches some of the WAL segments,  so I tried again with a much
simpler restore_command - essentially just:

 restore_command = 'cp /archive/%f %p.tmp && mv %p.tmp %p'

which I think should be fine for testing purposes. And I got this:

 LOG:  recovery no longer prefetching: unexpected pageaddr 108/5700
   in log segment 0001010800FF, offset 0
 LOG:  restored log file "0001010800FF" from archive

which is the same segment as in the earlier examples, but with a
different pageaddr value. Of course, there's no such pageaddr in the WAL
segment (and recovery of that segment succeeds).

So I think there's something broken ...



BTW in all three cases it happens right after the first restart point in
the WAL stream:

   LOG:  restored log file "0001010800FD" from archive
   LOG:  restartpoint starting: time
   LOG:  restored log file "0001010800FE" from archive
   LOG:  restartpoint complete: wrote 236092 buffers (22.5%); 0 WAL ...
   LOG:  recovery restart point at 108/FC28
   DETAIL:  Last completed transaction was at log time 2020-06-04
15:27:00.95139+02.
   LOG:  recovery no longer prefetching: unexpected pageaddr
 108/5700 in log segment 0001010800FF, offset 0
   LOG:  restored log file "0001010800FF" from archive

It looks exactly like this in case of all 3 failures ...

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: WAL prefetch (another approach)

2020-06-05 Thread Tomas Vondra

On Fri, Jun 05, 2020 at 05:20:52PM +0200, Tomas Vondra wrote:


...

which is not particularly great, I guess. There however seems to be
something wrong, because with the prefetching I see this in the log:

prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG:  recovery no
longer prefetching: unexpected pageaddr 108/E800 in log segment
0001010800FF, offset 0

prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG:  recovery no
longer prefetching: unexpected pageaddr 108/E800 in log segment
000101090001, offset 0

Which seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine. I haven't seen it on NVME (which processes much more WAL).
And the SATA baseline (no prefetching) also worked fine.

Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (but just one segment apart). Seems strange.



I suspected it might be due to a somewhat hackish restore_command that
prefetches some of the WAL segments,  so I tried again with a much
simpler restore_command - essentially just:

  restore_command = 'cp /archive/%f %p.tmp && mv %p.tmp %p'

which I think should be fine for testing purposes. And I got this:

  LOG:  recovery no longer prefetching: unexpected pageaddr 108/5700
in log segment 0001010800FF, offset 0
  LOG:  restored log file "0001010800FF" from archive

which is the same segment as in the earlier examples, but with a
different pageaddr value. Of course, there's no such pageaddr in the WAL
segment (and recovery of that segment succeeds).

So I think there's something broken ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-05 Thread Andres Freund
Hi,

On 2020-06-05 15:25:26 +0200, Tomas Vondra wrote:
> I think you're right. I think I was worried about having to resize the
> hash table in case of an under-estimate, and it seemed fine to waste a
> tiny bit more memory to prevent that.

It's pretty cheap to resize a hashtable with a handful of entries, so I'm not
worried about that. It's also how it has worked for a *long* time, so I think
unless we have some good reason to change that, I wouldn't.


> But this example shows we may need to scan the hash table
> sequentially, which means it's not just about memory consumption.

We *always* scan the hashtable sequentially, no? Otherwise there's no way to
get at the aggregated data.


> So in hindsight we either don't need the limit at all, or maybe it
> could be much lower (IIRC it reduces probability of collision, but
> maybe dynahash does that anyway internally).

This is code using simplehash, which resizes at a load factor of 0.9.


> I wonder if hashjoin has the same issue, but probably not - I don't
> think we'll ever scan that internal hash table sequentially.

I think we do for some outer joins (c.f. ExecPrepHashTableForUnmatched()), but
it's probably not relevant performance-wise.

Greetings,

Andres Freund




Re: minor doc fix - garbage in example of result of unnest

2020-06-05 Thread Tom Lane
"David G. Johnston"  writes:
>> On Fri, Jun 05, 2020 at 09:56:54AM -0400, Tom Lane wrote:
>>> That's not "garbage", I put it there intentionally.

> I don't see where having the row counts in the output adds clarity; it's
> just more text for the reader to mentally filter out.

Seems like I'm getting outvoted here.  If there aren't other votes,
I'll change it.

regards, tom lane




Re: valgrind error

2020-06-05 Thread Tom Lane
Noah Misch  writes:
> I can reproduce this on a 2017-vintage CPU with ./configure
> ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
> under valgrind-3.15.0 (as packaged by RHEL 7.8).  valgrind.supp has a
> suppression for CRC calculations, but it didn't get the memo when commit
> 4f700bc renamed the function.  The attached patch fixes the suppression.

I can also reproduce this, on RHEL 8.2 which likewise has valgrind-3.15.0,
using the same configuration to force use of that CRC function.  I concur
with your diagnosis that this is just a missed update of the pre-existing
suppression rule.  However, rather than

-   fun:pg_comp_crc32c
+   fun:pg_comp_crc32c*

as you have it, I'd prefer to use

-   fun:pg_comp_crc32c
+   fun:pg_comp_crc32c_sb8

which precisely matches what 4f700bc did.  The other way seems like
it's giving a free pass to problems that could lurk in unrelated CRC
implementations.
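For readers unfamiliar with the file, a Memcheck suppression entry in valgrind.supp has roughly this shape (the entry name and error kind here are illustrative, not the exact rule from the tree):

```text
{
	padding_pg_comp_crc32c
	Memcheck:Value8
	fun:pg_comp_crc32c_sb8
}
```

Matching on the exact symbol (`fun:pg_comp_crc32c_sb8`) rather than a wildcard is what keeps the suppression from hiding problems in unrelated CRC implementations, which is the point being made above.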

regards, tom lane




Improve planner cost estimations for alternative subplans

2020-06-05 Thread Alexey Bashtanov

Hello,

Currently, in the case of alternative subplans that do hashed vs per-row
lookups, the per-row estimate is used when planning the rest of the query.
It's also coded in not quite an explicit way.

In [1] we found a situation where this leads to a suboptimal plan: it
bloats the overall cost into large figures, decisions related to an outer
part of the plan look negligible to the planner, and as a result it
doesn't elaborate on choosing the optimal one.

The patch is to fix it. Our linear model for costs cannot quite accommodate
the piecewise linear matter of alternative subplans,
so it is based on ugly heuristics and still cannot be very precise,
but I think it's better than the current one.

Thoughts?

Best, Alex

[1] 
https://www.postgresql.org/message-id/flat/ff42b25b-ff03-27f8-ed11-b8255d658cd5%40imap.cc
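To make the heuristic concrete, here is a small Python paraphrase of the blending the patch below implements (my own sketch, not the patch itself): geometric means for the per-tuple and first-tuple costs, with the derived startup cost clamped between the smallest and largest startup cost among the alternatives.

```python
import math

def blend_alternative_costs(alternatives):
    """alternatives: list of (startup_cost, per_tuple_cost) pairs.
    Returns one (startup, per_tuple) estimate for the AlternativeSubPlan."""
    n = len(alternatives)
    # Geometric mean of per-tuple costs avoids biasing towards the slower
    # alternative when the costs differ by orders of magnitude.
    per_tuple = math.prod(pt for _, pt in alternatives) ** (1.0 / n)
    # Startup can be zero (breaking a geometric mean), so instead take the
    # geometric mean of the cost of producing the *first* tuple, then back
    # out a startup estimate from it.
    first_tuple = math.prod(s + pt for s, pt in alternatives) ** (1.0 / n)
    startup = first_tuple - per_tuple
    # Clamp to the observed range so the result stays sane.
    startup = max(startup, min(s for s, _ in alternatives))
    startup = min(startup, max(s for s, _ in alternatives))
    return startup, per_tuple

# Hashed lookup: expensive startup, cheap per tuple; per-row: the reverse.
startup, per_tuple = blend_alternative_costs([(100.0, 0.1), (0.0, 50.0)])
```

With these example numbers, the per-tuple estimate lands near the geometric mean of 0.1 and 50 rather than near their arithmetic mean of ~25, which is the point of the heuristic.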
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b976afb69d..5e2c5ec822 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -92,6 +92,7 @@
 #include "optimizer/planmain.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/float.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -4283,15 +4284,57 @@ cost_qual_eval_walker(Node *node, cost_qual_eval_context *context)
 	else if (IsA(node, AlternativeSubPlan))
 	{
 		/*
-		 * Arbitrarily use the first alternative plan for costing.  (We should
-		 * certainly only include one alternative, and we don't yet have
-		 * enough information to know which one the executor is most likely to
-		 * use.)
+		 * Estimate the cost as some mean of the alternatives.
+		 * It's not uncommon for the alternative costs to be of different
+		 * orders of magnitude, so arithmetic mean would be too biased towards
+		 * the slower one. That's why for per-tuple coefficient we use geometric
+		 * mean.
+		 *
+		 * It's tempting to use geometric mean for startup costs too, but
+		 * one of them can be zero, which will result in substantial
+		 * underestimation. So instead we estimate the cost of returning *first*
+		 * tuple as geometric mean of single-tuple costs for the alternatives.
 		 */
-		AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
+		List   *subplans = ((AlternativeSubPlan *) node)->subplans;
+		Cost 	per_tuple_mean = 1;
+		Cost 	first_tuple_mean = 1;
+		Cost 	startup_min = get_float8_infinity();
+		Cost 	startup_max = 0;
+		Cost 	startup_mean;
+		ListCell   *lc;
+
+		foreach(lc, subplans)
+		{
+			cost_qual_eval_context subplanContext;
+
+			subplanContext.root = context->root;
+			subplanContext.total.startup = 0;
+			subplanContext.total.per_tuple = 0;
 
-		return cost_qual_eval_walker((Node *) linitial(asplan->subplans),
-									 context);
+			cost_qual_eval_walker((Node *) lfirst(lc), &subplanContext);
+
+			Assert(subplanContext.total.startup >= 0);
+			Assert(subplanContext.total.per_tuple >= 0);
+
+			per_tuple_mean *= subplanContext.total.per_tuple;
+			first_tuple_mean *= subplanContext.total.startup + subplanContext.total.per_tuple;
+			startup_min = float8_min(startup_min, subplanContext.total.startup);
+			startup_max = float8_max(startup_max, subplanContext.total.startup);
+		}
+		per_tuple_mean =
+			pow(per_tuple_mean, 1.0 / list_length(subplans));
+		first_tuple_mean =
+			pow(first_tuple_mean, 1.0 / list_length(subplans));
+		startup_mean = first_tuple_mean - per_tuple_mean;
+		/* make sure this calculation makes a sane value */
+		startup_mean = float8_max(startup_mean, startup_min);
+		startup_mean = float8_min(startup_mean, startup_max);
+
+		context->total.startup += startup_mean;
+		context->total.per_tuple += per_tuple_mean;
+
+		/* We've already recursed into children, skip the generic recursion */
+		return false;
 	}
 	else if (IsA(node, PlaceHolderVar))
 	{
diff --git a/src/test/regress/expected/updatable_views.out b/src/test/regress/expected/updatable_views.out
index 5de53f2782..98787e75e0 100644
--- a/src/test/regress/expected/updatable_views.out
+++ b/src/test/regress/expected/updatable_views.out
@@ -2301,17 +2301,16 @@ SELECT * FROM v1 WHERE a=8;
 
 EXPLAIN (VERBOSE, COSTS OFF)
 UPDATE v1 SET a=100 WHERE snoop(a) AND leakproof(a) AND a < 7 AND a != 6;
-                               QUERY PLAN
--------------------------------------------------------------------------
+                            QUERY PLAN
+-----------------------------------------------------------------------
  Update on public.t1
Update on public.t1
Update on public.t11 t1_1
Update on public.t12 t1_2
Update on public.t111 t1_3
-   ->  Index Scan 

Re: Make more use of RELKIND_HAS_STORAGE()

2020-06-05 Thread Tom Lane
Peter Eisentraut  writes:
> This is a patch to make use of RELKIND_HAS_STORAGE() where appropriate, 
> instead of listing out the relkinds individually.  No behavior change is 
> intended.
> This was originally part of the patch from [0], but it seems worth 
> moving forward independently.

Passes eyeball examination.  I did not try to search for other places
where RELKIND_HAS_STORAGE should be used.

regards, tom lane




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-05 Thread Alvaro Herrera
On 2020-Jun-05, Dave Cramer wrote:

> On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera wrote:

> > Ouch ... so they made IDENT in the replication grammar be a trigger to
> > enter the regular grammar.  Crazy.  No way to put those worms back in
> > the tin now, I guess.
> 
> Is that documented ?

I don't think it is.

> > It is still my opinion that we should prohibit a logical replication
> > connection from being used to do physical replication.  Horiguchi-san,
> > Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
> > the JDBC team) is not opposed to the change -- he says they're just
> > using it because they didn't realize they should be doing differently.
> 
> I think my exact words were
> 
> "I don't see this is a valid reason to keep doing something. If it is
> broken then fix it.
> Clients can deal with the change."
> 
> in response to:
> 
> > Well, I don't really think that we can just break a behavior that
> > exists since 9.4 as you could break applications relying on the
> > existing behavior, and that's also the point of Vladimir upthread.
> 
> Which is different than not being opposed to the change. I don't see this
> as broken, and it's quite possible that some of our users are using
> it.

Apologies for misinterpreting.

> It certainly needs to be documented

I'd rather not.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: minor doc fix - garbage in example of result of unnest

2020-06-05 Thread David G. Johnston
On Fri, Jun 5, 2020 at 8:38 AM Tomas Vondra wrote:

> On Fri, Jun 05, 2020 at 09:56:54AM -0400, Tom Lane wrote:
> >Pavel Stehule  writes:
> >> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> >> index 7c06afd3ea..3b810c0eb4 100644
> >> --- a/doc/src/sgml/func.sgml
> >> +++ b/doc/src/sgml/func.sgml
> >> @@ -17821,7 +17821,6 @@ SELECT NULLIF(value, '(none)') ...
> >>   1
> >>   2
> >>  
> >> -(2 rows in result)
> >> 
> >>
> >
> >That's not "garbage", I put it there intentionally.
> >
>
> Why is it outside the  though? Also, the next unnest
> example does not include the number of rows - why the difference?
>

More generally I think the final unnest listing is probably the best
presentation for a set-returning function, and the one under discussion
should be consistent with it.  The function reference documentation doesn't
seem like a place to introduce strictly psql output variations like \pset
footer on|off.  A set-returning function should be displayed in the example
output as a structured table like the last unnest example.  I don't think
it's necessary for "single column" sets to be an exception.  Since the
examples convert null to (none) there isn't even whitespace concerns in the
output where you want to display a bottom boundary for clarity.

Relatedly, it seems arbitrary that the case-related full block examples do
not have row counts but the aggregate full block examples do.

I don't see where having the row counts in the output adds clarity; it's
just more text for the reader to mentally filter out.

David J.


Re: minor doc fix - garbage in example of result of unnest

2020-06-05 Thread Tomas Vondra

On Fri, Jun 05, 2020 at 09:56:54AM -0400, Tom Lane wrote:

Pavel Stehule  writes:

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c06afd3ea..3b810c0eb4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17821,7 +17821,6 @@ SELECT NULLIF(value, '(none)') ...
  1
  2
 
-(2 rows in result)

   


That's not "garbage", I put it there intentionally.



Why is it outside the  though? Also, the next unnest
example does not include the number of rows - why the difference?


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: WAL prefetch (another approach)

2020-06-05 Thread Tomas Vondra

Hi,

I've spent some time testing this, mostly from the performance point of
view. I've done a very simple thing, in order to have reproducible test:

1) I've initialized pgbench with scale 8000 (so ~120GB on a machine with
   only 64GB of RAM)

2) created a physical backup, enabled WAL archiving

3) did 1h pgbench run with 32 clients

4) disabled full-page writes and did another 1h pgbench run

Once I had this, I did a recovery using the physical backup and WAL
archive, measuring how long it took to apply each WAL segment. First
without any prefetching (current master), then twice with prefetching.
First with default values (m_io_c=10, distance=256kB) and then with
higher values (100 + 2MB).

I did this on two storage systems I have in the system - NVME SSD and
SATA RAID (3 x 7.2k drives). So, a fast one and slow one.


1) NVME

On the NVME, this generates ~26k WAL segments (~400GB), and each of the
pgbench runs generates ~120M transactions (~33k tps). Of course, the vast
majority of the WAL segments (~16k) comes from the first run, because
there's a lot of FPI due to the random nature of the workload.

I have not expected a significant improvement from the prefetching, as
the NVME is pretty good in handling random I/O. The total duration looks
like this:

    no prefetch    prefetch    prefetch2
          10618       10385         9403

So the default is a tiny bit faster, and the more aggressive config
makes it about 10% faster. Not bad, considering the expectations.
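As a quick sanity check of the "about 10%" figure, using the totals from the table above (reading the three durations as 10618, 10385, and 9403 seconds):

```python
# Total recovery durations in seconds, taken from the table above.
no_prefetch, prefetch, prefetch2 = 10618, 10385, 9403

default_gain = (no_prefetch - prefetch) / no_prefetch
aggressive_gain = (no_prefetch - prefetch2) / no_prefetch
print(f"default: {default_gain:.1%}, aggressive: {aggressive_gain:.1%}")
# → default: 2.2%, aggressive: 11.4%
```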

Attached is a chart comparing the three runs. There are three clearly
visible parts - first the 1h run with f_p_w=on, with two checkpoints.
That's first ~16k segments. Then there's a bit of a gap before the
second pgbench run was started - I think it's mostly autovacuum etc. And
then at segment ~23k the second pgbench (f_p_w=off) starts.

I think this shows the prefetching starts to help as the number of FPIs
decreases. It's subtle, but it's there.


2) SATA

On SATA it's just ~550 segments (~8.5GB), and the pgbench runs generate
only about 1M transactions. Again, the vast majority of the segments comes
from the first run, due to FPI.

In this case, I don't have complete results, but after processing 542
segments (out of the ~550) it looks like this:

    no prefetch    prefetch    prefetch2
           6644        6635         8282

So the no prefetch and "default" prefetch are roughly on par, but the
"aggressive" prefetch is way slower. I'll get back to this shortly, but
I'd like to point out this is entirely due to the "no FPI" pgbench,
because after the first ~525 initial segments it looks like this:

    no prefetch    prefetch    prefetch2
             58          65           57

So it goes very fast by the initial segments with plenty of FPIs, and
then we get to the "no FPI" segments and the prefetch either does not
help or makes it slower.

Looking at how long it takes to apply the last few segments, it looks
like this:

    no prefetch    prefetch    prefetch2
            280         298          478

which is not particularly great, I guess. There however seems to be
something wrong, because with the prefetching I see this in the log:

prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG:  recovery no
longer prefetching: unexpected pageaddr 108/E800 in log segment
0001010800FF, offset 0

prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG:  recovery no
longer prefetching: unexpected pageaddr 108/E800 in log segment
000101090001, offset 0

Which seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine. I haven't seen it on NVME (which processes much more WAL).
And the SATA baseline (no prefetching) also worked fine.

Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (but just one segment apart). Seems strange.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: minor doc fix - garbage in example of result of unnest

2020-06-05 Thread Tom Lane
Pavel Stehule  writes:
> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> index 7c06afd3ea..3b810c0eb4 100644
> --- a/doc/src/sgml/func.sgml
> +++ b/doc/src/sgml/func.sgml
> @@ -17821,7 +17821,6 @@ SELECT NULLIF(value, '(none)') ...
>   1
>   2
>  
> -(2 rows in result)
> 
>

That's not "garbage", I put it there intentionally.

regards, tom lane




Re: Internal key management system

2020-06-05 Thread Fabien COELHO


Hello Bruce,

Hmmm. This seems to suggest that interacting with something outside
should be an option.


Our goal is not to implement every possible security idea someone has,
because we will never finish, and the final result would be too complex
to be usable.


Sure. I'm trying to propose something both simple and extensible, so that 
other people could plug their own KMS if they are not fully satisfied with 
the way the internal pg KMS works, which IMHO should be the case if 
someone is motivated and paranoid enough to setup a KMS in the first 
place.


You will need to explain exactly why having a separate process has value 
over coding/user complexity, and you will need to get agreement from a 
sufficient number of people to move that idea forward.


ISTM that the value is simple: The whole KMS idea turns around a "KEK",
which is a secret key which allows one to unlock/retrieve/recompute many
data keys, aka DEKs. Losing the KEK basically means losing all data keys,
past, present and possibly future, depending on how the KEK/DEK mechanism
operates internally.


So the thing you should not want is to lose your KEK.

Keeping it inside the pg process means that any pg process compromise
would result in the KEK being compromised as well, while the whole point
of doing this KMS business was to provide security by isolating realms of
data encryption.


If you provide an interface instead, which I'm advocating, then where the
KEK is does not concern pg, which has just to ask for DEKs. A compromise
of pg would compromise local DEKs, but not the KEK "master key". The KEK
may be somewhere on the same host, in another process, or possibly on
another host, on some attached specialized quantum hardware inaccessible
to human beings. Postgres should not decide where the user should put its
KEK, because that depends on the threat model being addressed, which you
do not know.


From an implementation point of view, what I suggest is reasonably simple
and allows people to interface with the KMS of their choice, including one
based on the patch, which would be a demo of what can be done, but other
systems would be accessible just as well. The other software engineering
aspect is that a KMS is a complex/sensitive thing, so reinventing our own
and forcing it on users seems like a bad idea.


From what I understood from the code, the KEK is loaded into the postgres
process. That is what I'm disagreeing with; only the needed DEKs should be
there.


One option would be to send the data needing to be encrypted to an
external command, and get the decrypted data back.  In that way, the KEK
is never on the Postgres server.  However, the API for doing such an
interface seems very complex and could lead to failures.


I was more thinking of an interface to retrieve DEKs, while still keeping
encryption/decryption inside postgres to limit traffic, but what you
suggest could be a valid option too, so maybe it should be allowed.


I disagree with the implementation complexity, though.

Basically the protocol's only function is sending "GET
" and retrieving a response which is either the DEK
or an error, which looks like manageable complexity. Attached is a
simplistic server-side implementation of that for illustration.


If you want externalized DEK as well, it would be sending "ENC/DEC 
 " and the response is an error, or the translated 
data. Looks manageable as well. Allowing both approaches looks ok.


Obviously it requires some more thinking and design, but my point is that
postgres should not hold a KEK, ever, nor presume how DEKs are to be
managed by a KMS, and that is not very difficult to achieve by putting it
outside of pg and defining how interactions take place. Providing a
reference/example implementation would be nice as well, and Masahiko-san's
code can be rewrapped quite easily.


--
Fabien.

#! /usr/bin/env python3
#
# This (powerful) code is public domain.

import sys
from hashlib import sha256 as H

# obviously the KEK should be retrieved from somewhere else
KEK = b"super secret default KEK"

while True:
    try:  # parse a line like: GET 0xdeadbeef
        req = sys.stdin.readline()
        if not req:
            break  # EOF
        assert req[0:6] == "GET 0x" and req[-1] == "\n"
        hdata = bytes.fromhex(req[6:-1])
        print(f"DEK 0x{H(hdata + KEK).hexdigest()}", flush=True)
    except Exception as e:
        print(f"ERR {e}", flush=True)
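To show the exchange from the other side, here is a matching sketch of the parsing and derivation a client could check against (same placeholder KEK as the server above; the protocol and the `answer` helper are just illustrations of this thread's idea, not an agreed design):

```python
from hashlib import sha256 as H

# Placeholder, as in the server sketch; a real KEK lives elsewhere.
KEK = b"super secret default KEK"

def answer(req: str) -> str:
    """Compute the server's response to one request line."""
    if not (req.startswith("GET 0x") and req.endswith("\n")):
        return "ERR malformed request"
    hdata = bytes.fromhex(req[6:-1])
    return f"DEK 0x{H(hdata + KEK).hexdigest()}"

resp = answer("GET 0xdeadbeef\n")
print(resp[:6])       # → DEK 0x
print(len(resp) - 6)  # → 64  (a sha256 digest is 64 hex characters)
```

The point is only that both sides of the exchange fit in a dozen lines, supporting the "manageable complexity" claim above.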


Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-05 Thread Tomas Vondra

On Thu, Jun 04, 2020 at 06:57:58PM -0700, Andres Freund wrote:

Hi,

On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:

On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
> +/* minimum number of initial hash table buckets */
> +#define HASHAGG_MIN_BUCKETS 256
>
>
> I don't really see much explanation for that part in the commit,
> perhaps
> Jeff can chime in?

I did this in response to a review comment (point #5):


https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development

Tomas suggested a min of 1024, and I thought I was being more
conservative choosing 256. Still too high, I guess?



I can lower it. What do you think is a reasonable minimum?


I don't really see why there needs to be a minimum bigger than 1?



I think you're right. I think I was worried about having to resize the
hash table in case of an under-estimate, and it seemed fine to waste a
tiny bit more memory to prevent that. But this example shows we may need
to scan the hash table sequentially, which means it's not just about
memory consumption. So in hindsight we either don't need the limit at
all, or maybe it could be much lower (IIRC it reduces probability of
collision, but maybe dynahash does that anyway internally).

I wonder if hashjoin has the same issue, but probably not - I don't
think we'll ever scan that internal hash table sequentially.
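To see why an oversized floor hurts when the hash table is scanned sequentially, consider this rough sketch (the function name and constants are illustrative, not PostgreSQL code):

```python
def init_nbuckets(estimated_groups, min_buckets):
    # dynahash-style floor: never start below min_buckets,
    # whatever the group estimate says
    return max(min_buckets, estimated_groups)

# with HASHAGG_MIN_BUCKETS = 256, a 2-group aggregate still walks 256
# buckets on every sequential pass over the hash table
for floor in (1, 256, 1024):
    print(floor, init_nbuckets(2, floor))
```

With a floor of 1 the sequential scan touches only as many buckets as there are groups; with a floor of 256 or 1024, small aggregates pay for empty buckets on every pass.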


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-05 Thread Inoue, Hiroshi



On 2020/06/05 15:22, Michael Paquier wrote:

On Wed, Jun 03, 2020 at 10:10:21PM +0900, Inoue, Hiroshi wrote:

On 2020/06/03 11:14, Michael Paquier wrote:

I have been looking at the ODBC driver and the need for currtid() as
well as currtid2(), and as mentioned already in [1], matching with my
lookup of things, these are actually not needed by the driver as long
as we connect to a server newer than 8.2 able to support RETURNING.

Though currtid2() is necessary even for servers which support RETURNING,
I don't object to removing it.

In which cases is it getting used then?


Keyset-driven cursors always detect changes made by other applications
(and themselves). currtid() is necessary to detect the changes.
CTIDs are changed by updates unfortunately.

regards,
Hiroshi Inoue


   From what I can see there is
zero coverage for that part in the tests.  And based on a rough read
of the code, this would get called with LATEST_TUPLE_LOAD being set,
where there is some kind of bulk deletion involved.  Couldn't that be
a problem?
--
Michael




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-05 Thread Dave Cramer
On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera 
wrote:

> On 2020-Jun-04, Andres Freund wrote:
>
> > postgres[52656][1]=# SELECT 1;
> > ┌──┐
> > │ ?column? │
> > ├──┤
> > │1 │
> > └──┘
> > (1 row)
> >
> >
> > I am very much not in love with the way that was implemented, but it's
> > there, and it's used as far as I know (cf tablesync.c).
>
> Ouch ... so they made IDENT in the replication grammar be a trigger to
> enter the regular grammar.  Crazy.  No way to put those worms back in
> the tin now, I guess.
>

Is that documented?

>
> It is still my opinion that we should prohibit a logical replication
> connection from being used to do physical replication.  Horiguchi-san,
> Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
> the JDBC team) is not opposed to the change -- he says they're just
> using it because they didn't realize they should be doing differently.


I think my exact words were

"I don't see this is a valid reason to keep doing something. If it is
broken then fix it.
Clients can deal with the change."

in response to:

Well, I don't really think that we can just break a behavior that
> exists since 9.4 as you could break applications relying on the
> existing behavior, and that's also the point of Vladimir upthread.
>

Which is different from not being opposed to the change. I don't see this
as broken, and it's quite possible that some of our users are using it.
It certainly needs to be documented.

Dave


Re: [PATCH] pg_dump: Add example and link for --encoding option

2020-06-05 Thread 이동욱
I've modified my previous patch because it linked the wrong document, so I
fixed that. I also added a highlight to the encoding list for readability.

What do you think about this change?

Thanks,
Dong Wook

2020년 6월 4일 (목) 오후 10:20, 이동욱 님이 작성:

> To let users know what kind of character sets
> can be used, add examples and a link to the --encoding option.
>
> Thanks,
> Dong wook
>


v2-0001-pg_dump-Add-example-and-link-for-encoding-option.patch
Description: Binary data


Re: [Proposal] Page Compression for OLTP

2020-06-05 Thread chenhj
Hi hackers,


> # Page storage(Plan C)
>
> Further, since the size of the compress address file is fixed, the above 
> address file and data file can also be combined into one file
>
> 0   1   2 1230710 1 2
> +===+===+===+ +===+=+=+
> | head  |  1|2  | ... |   | data1   | data2   | ...  
> +===+===+===+ +===+=+=+
>   head  |  address|  data  |

I made a prototype according to the above storage method. Any suggestions are 
welcome.

# Page compress file storage related definitions

/*
 * layout of Page Compress file:
 *
 * - PageCompressHeader
 * - PageCompressAddr[]
 * - chunks of PageCompressData
 */
typedef struct PageCompressHeader
{
	pg_atomic_uint32	nblocks;			/* number of total blocks in this segment */
	pg_atomic_uint32	allocated_chunks;	/* number of total allocated chunks in data area */
	uint16				chunk_size;			/* size of each chunk, must be 1/2, 1/4 or 1/8 of BLCKSZ */
	uint8				algorithm;			/* compress algorithm, 1=pglz, 2=lz4 */
} PageCompressHeader;

typedef struct PageCompressAddr
{
	uint8				nchunks;			/* number of chunks for this block */
	uint8				allocated_chunks;	/* number of allocated chunks for this block */

	/* variable-length field: 1-based chunk numbers for this block; array size must be 2, 4 or 8 */
	pc_chunk_number_t	chunknos[FLEXIBLE_ARRAY_MEMBER];
} PageCompressAddr;

typedef struct PageCompressData
{
	char	page_header[SizeOfPageHeaderData];	/* page header */
	uint32	size;								/* size of compressed data */
	char	data[FLEXIBLE_ARRAY_MEMBER];		/* compressed page, except for the page header */
} PageCompressData;


# Usage

Set whether to use compression through storage parameters of tables and indexes

- compress_type
 Set whether to compress and which compression algorithm to use; supported
 values: none, pglz, zstd

- compress_chunk_size

 A chunk is the smallest unit of storage space allocated for compressed pages.
 The size of a chunk can only be 1/2, 1/4 or 1/8 of BLCKSZ.

- compress_prealloc_chunks

  The number of chunks pre-allocated for each page. The maximum value allowed
  is BLCKSZ/compress_chunk_size - 1.
  If the number of chunks required for a compressed page is less than
  `compress_prealloc_chunks`, it allocates `compress_prealloc_chunks` chunks
  to avoid future storage fragmentation when the page needs more storage space.
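A minimal sketch of the chunk-allocation arithmetic these parameters imply (the function name and the constraints encoded in the assertions follow the description above; this is an illustration, not the patch's C code):

```python
import math

BLCKSZ = 8192  # assumed standard PostgreSQL block size

def chunks_for_page(compressed_size, chunk_size, prealloc_chunks):
    # chunk_size must be 1/2, 1/4 or 1/8 of BLCKSZ
    assert chunk_size in (BLCKSZ // 2, BLCKSZ // 4, BLCKSZ // 8)
    # prealloc is capped at BLCKSZ/compress_chunk_size - 1
    assert 0 <= prealloc_chunks <= BLCKSZ // chunk_size - 1
    needed = math.ceil(compressed_size / chunk_size)
    # allocate at least compress_prealloc_chunks to limit fragmentation
    return max(needed, prealloc_chunks)

# a 2.5 KB compressed page with 1 KB chunks and prealloc of 4 gets 4 chunks
print(chunks_for_page(2500, 1024, 4))  # -> 4
print(chunks_for_page(2500, 1024, 0))  # -> 3
```

The over-allocation in the first case is the deliberate trade-off: a page that later grows can stay within its pre-allocated chunks instead of fragmenting.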


# Sample

## requirement

- zstd

## build

./configure --with-zstd
make
make install

## create compressed table and index

create table tb1(id int,c1 text);
create table tb1_zstd(id int,c1 text) 
with(compress_type=zstd,compress_chunk_size=1024);
create table tb1_zstd_4(id int,c1 text) 
with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_id on tb1(id);
create index tb1_idx_id_zstd on tb1(id) 
with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_id_zstd_4 on tb1(id) 
with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_c1 on tb1(c1);
create index tb1_idx_c1_zstd on tb1(c1) 
with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_c1_zstd_4 on tb1(c1) 
with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

insert into tb1 select generate_series(1,100),md5(random()::text);
insert into tb1_zstd select generate_series(1,100),md5(random()::text);
insert into tb1_zstd_4 select generate_series(1,100),md5(random()::text);

## show size of table and index

postgres=# \d+
List of relations
Schema |Name| Type  |  Owner   | Persistence | Size  | Description
++---+--+-+---+-
public | tb1| table | postgres | permanent   | 65 MB |
public | tb1_zstd   | table | postgres | permanent   | 37 MB |
public | tb1_zstd_4 | table | postgres | permanent   | 37 MB |
(3 rows)

postgres=# \di+
List of relations
Schema |   Name| Type  |  Owner   | Table | Persistence | Size  | 
Description
+---+---+--+---+-+---+-
public | tb1_idx_c1| index | postgres | tb1   | permanent   | 73 MB |
public | tb1_idx_c1_zstd   | index | postgres | tb1   | permanent   | 36 MB |
public | tb1_idx_c1_zstd_4 | index | postgres | tb1   | permanent   | 41 MB |
public | tb1_idx_id| index | postgres | tb1   | permanent   | 21 MB |
public | tb1_idx_id_zstd   | index | postgres | tb1   | permanent   | 13 MB |
public | tb1_idx_id_zstd_4 | index | postgres | tb1   | permanent   | 15 MB |
(6 rows)


# pgbench performance testing(TPC-B)


minor doc fix - garbage in example of result of unnest

2020-06-05 Thread Pavel Stehule
Hi

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c06afd3ea..3b810c0eb4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17821,7 +17821,6 @@ SELECT NULLIF(value, '(none)') ...
  1
  2
 
-(2 rows in result)

   

Regards

Pavel


Re: proposal - function string_to_table

2020-06-05 Thread Pavel Stehule
Hi

čt 4. 6. 2020 v 11:49 odesílatel movead...@highgo.ca 
napsal:

> +{ oid => '2228', descr => 'split delimited text',
> +  proname => 'string_to_table', prorows => '1000', proretset => 't',
> +  prorettype => 'text', proargtypes => 'text text',
> +  prosrc => 'text_to_table' },
> +{ oid => '2282', descr => 'split delimited text with null string',
> +  proname => 'string_to_table', prorows => '1000', proretset => 't',
> +  prorettype => 'text', proargtypes => 'text text text',
> +  prosrc => 'text_to_table_null' },
>
> I go through the patch, and everything looks good to me. But I do not know
> why it needs a 'text_to_table_null()', it's ok to put a 'text_to_table'
> there, I think.
>

It is a convention in Postgres - every SQL unique signature has its own
unique internal C function.

I am sending a refreshed patch.

Regards

Pavel




>
> --
> Regards,
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
> EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c06afd3ea..974499ff2e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3437,6 +3437,27 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ string_to_table
+
+string_to_table ( string text, delimiter text  , nullstr text   )
+set of text
+   
+   
+splits string into table using supplied delimiter and
+optional null string.
+   
+   
+string_to_table('xx~^~yy~^~zz', '~^~', 'yy')
+
+xx
+yy,
+zz
+   
+  
+
   

 
@@ -16717,21 +16738,6 @@ strict $.track.segments[*].location

   
 
-  
-   
-string starts with string
-boolean
-   
-   
-Tests whether the second operand is an initial substring of the first
-operand.
-   
-   
-jsonb_path_query('["John Smith", "Mary Stone", "Bob Johnson"]', '$[*] ? (@ starts with "John")')
-"John Smith"
-   
-  
-
   

 exists ( path_expression )
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2eaabd6231..650813d8b8 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -26,6 +26,7 @@
 #include "lib/hyperloglog.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
+#include "nodes/execnodes.h"
 #include "parser/scansup.h"
 #include "port/pg_bswap.h"
 #include "regex/regex.h"
@@ -35,6 +36,7 @@
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
 #include "utils/sortsupport.h"
+#include "utils/tuplestore.h"
 #include "utils/varlena.h"
 
 
@@ -92,6 +94,16 @@ typedef struct
 	pg_locale_t locale;
 } VarStringSortSupport;
 
+/*
+ * Holds target metadata used for split string to array or to table.
+ */
+typedef struct
+{
+	ArrayBuildState	*astate;
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+} SplitStringTargetData;
+
 /*
  * This should be large enough that most strings will fit, but small enough
  * that we feel comfortable putting it on the stack
@@ -139,7 +151,7 @@ static bytea *bytea_substring(Datum str,
 			  bool length_not_specified);
 static bytea *bytea_overlay(bytea *t1, bytea *t2, int sp, int sl);
 static void appendStringInfoText(StringInfo str, const text *t);
-static Datum text_to_array_internal(PG_FUNCTION_ARGS);
+static bool text_to_array_internal(FunctionCallInfo fcinfo, SplitStringTargetData *tstate);
 static text *array_to_text_internal(FunctionCallInfo fcinfo, ArrayType *v,
 	const char *fldsep, const char *null_string);
 static StringInfo makeStringAggState(FunctionCallInfo fcinfo);
@@ -4679,7 +4691,19 @@ text_isequal(text *txt1, text *txt2, Oid collid)
 Datum
 text_to_array(PG_FUNCTION_ARGS)
 {
-	return text_to_array_internal(fcinfo);
+	SplitStringTargetData tstate;
+
+	/* reset tstate */
+	memset(&tstate, 0, sizeof(tstate));
+
+	if (!text_to_array_internal(fcinfo, &tstate))
+		PG_RETURN_NULL();
+
+	if (!tstate.astate)
+		PG_RETURN_ARRAYTYPE_P(construct_empty_array(TEXTOID));
+
+	PG_RETURN_ARRAYTYPE_P(makeArrayResult(tstate.astate,
+		  CurrentMemoryContext));
 }
 
 /*
@@ -4693,16 +4717,98 @@ text_to_array(PG_FUNCTION_ARGS)
 Datum
 text_to_array_null(PG_FUNCTION_ARGS)
 {
-	return text_to_array_internal(fcinfo);
+	return text_to_array(fcinfo);
+}
+
+/*
+ * text_to_table
+ * Parse input string and returns substrings as a table.
+ */
+Datum
+text_to_table(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo	   *rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+	SplitStringTargetData tstate;
+	MemoryContext		old_cxt;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsi == NULL || !IsA(rsi, ReturnSetInfo))
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsi->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+		
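The patch excerpt above is truncated, but the user-visible semantics being added can be sketched in a few lines (illustrative Python only; the NULL substitution is assumed to match text_to_array's nullstr behavior):

```python
def string_to_table(string, delimiter, nullstr=None):
    # split on delimiter, one value per "row"; fields equal to
    # nullstr come back as None (SQL NULL)
    for field in string.split(delimiter):
        yield None if nullstr is not None and field == nullstr else field

print(list(string_to_table('xx~^~yy~^~zz', '~^~', 'yy')))
# -> ['xx', None, 'zz']
```

The difference from string_to_array is only in the result shape: a set of rows rather than a single array value.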

Re: OpenSSL 3.0.0 compatibility

2020-06-05 Thread Peter Eisentraut

On 2020-05-30 11:29, Peter Eisentraut wrote:

I suggest to update the test data in PG13+, since we require OpenSSL
1.0.1 there.  For the older branches, I would look into changing the
test driver setup so that it loads a custom openssl.cnf that loads the
legacy providers.


I have pushed your 0002 patch (the one that updates the test data) to 
master.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Make more use of RELKIND_HAS_STORAGE()

2020-06-05 Thread Peter Eisentraut
This is a patch to make use of RELKIND_HAS_STORAGE() where appropriate, 
instead of listing out the relkinds individually.  No behavior change is 
intended.


This was originally part of the patch from [0], but it seems worth 
moving forward independently.



[0]: 
https://www.postgresql.org/message-id/flat/dc35a398-37d0-75ce-07ea-1dd71d98f8ec%402ndquadrant.com


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From e5135e2324ff38dc03998af00268f96b6725cd23 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 5 Jun 2020 08:57:28 +0200
Subject: [PATCH] Make more use of RELKIND_HAS_STORAGE()

Make use of RELKIND_HAS_STORAGE() where appropriate, instead of
listing out the relkinds individually.  No behavior change intended.
---
 src/backend/catalog/heap.c  |  7 +---
 src/backend/postmaster/pgstat.c |  6 +--
 src/backend/utils/adt/dbsize.c  | 73 +
 3 files changed, 31 insertions(+), 55 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index e393c93a45..9c45544815 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1943,13 +1943,8 @@ heap_drop_with_catalog(Oid relid)
/*
 * Schedule unlinking of the relation's physical files at commit.
 */
-   if (rel->rd_rel->relkind != RELKIND_VIEW &&
-   rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE &&
-   rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE &&
-   rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-   {
+   if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
RelationDropStorage(rel);
-   }
 
/*
 * Close relcache entry, but *keep* AccessExclusiveLock on the relation
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..166d8e3d15 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1807,11 +1807,7 @@ pgstat_initstats(Relation rel)
charrelkind = rel->rd_rel->relkind;
 
/* We only count stats for things that have storage */
-   if (!(relkind == RELKIND_RELATION ||
- relkind == RELKIND_MATVIEW ||
- relkind == RELKIND_INDEX ||
- relkind == RELKIND_TOASTVALUE ||
- relkind == RELKIND_SEQUENCE))
+   if (!RELKIND_HAS_STORAGE(relkind))
{
rel->pgstat_info = NULL;
return;
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 840664429e..2320c06a9b 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -874,25 +874,18 @@ pg_relation_filenode(PG_FUNCTION_ARGS)
PG_RETURN_NULL();
relform = (Form_pg_class) GETSTRUCT(tuple);
 
-   switch (relform->relkind)
+   if (RELKIND_HAS_STORAGE(relform->relkind))
{
-   case RELKIND_RELATION:
-   case RELKIND_MATVIEW:
-   case RELKIND_INDEX:
-   case RELKIND_SEQUENCE:
-   case RELKIND_TOASTVALUE:
-   /* okay, these have storage */
-   if (relform->relfilenode)
-   result = relform->relfilenode;
-   else/* Consult the relation 
mapper */
-   result = RelationMapOidToFilenode(relid,
-   
  relform->relisshared);
-   break;
-
-   default:
-   /* no storage, return NULL */
-   result = InvalidOid;
-   break;
+   if (relform->relfilenode)
+   result = relform->relfilenode;
+   else/* Consult the relation mapper 
*/
+   result = RelationMapOidToFilenode(relid,
+   
  relform->relisshared);
+   }
+   else
+   {
+   /* no storage, return NULL */
+   result = InvalidOid;
}
 
ReleaseSysCache(tuple);
@@ -951,38 +944,30 @@ pg_relation_filepath(PG_FUNCTION_ARGS)
PG_RETURN_NULL();
relform = (Form_pg_class) GETSTRUCT(tuple);
 
-   switch (relform->relkind)
+   if (RELKIND_HAS_STORAGE(relform->relkind))
+   {
+   /* This logic should match RelationInitPhysicalAddr */
+   if (relform->reltablespace)
+   rnode.spcNode = relform->reltablespace;
+   else
+   rnode.spcNode = MyDatabaseTableSpace;
+   if (rnode.spcNode == GLOBALTABLESPACE_OID)
+   rnode.dbNode = InvalidOid;
+   else
+   rnode.dbNode 

Re: BufFileRead() error signalling

2020-06-05 Thread Michael Paquier
On Fri, Jun 05, 2020 at 06:03:59PM +1200, Thomas Munro wrote:
> I didn't change BufFileWrite() to be void, to be friendly to existing
> callers outside the tree (if there are any), though I removed all the
> code that checks the return code.  We can make it void later.

Missing one entry in AppendStringToManifest().  It sounds right to not
change the signature of the routine on back-branches to avoid any ABI
breakage.  I think that it could change on HEAD.

Anyway, why are we sure that it is fine to not complain even if
BufFileWrite() does a partial write?  fwrite() is mentioned at the top
of the thread, but why is that OK?

> For the future: it feels a bit like we're missing a one line way to
> say "read this many bytes and error out if you run out".
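Such a one-liner could look like this minimal sketch (hypothetical name; the real thing would be a C BufFile routine, not Python):

```python
import io

def read_exact(f, n):
    # read exactly n bytes from file object f, or error out if data runs out
    data = f.read(n)
    if len(data) != n:
        raise EOFError(f"could not read {n} bytes: got {len(data)}")
    return data

buf = io.BytesIO(b"abcdef")
print(read_exact(buf, 4))  # -> b'abcd'
```

The value is that every caller gets the short-read check for free instead of each one open-coding the error path.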

-   ereport(ERROR,
-   (errcode_for_file_access(),
-errmsg("could not write block %ld of temporary file:
-   %m",
-   blknum)));
-   }
+   elog(ERROR, "could not seek block %ld temporary file", blknum);

You mean "in temporary file" in the new message, no?

+   ereport(ERROR,
+   (errcode_for_file_access(),
+errmsg("could not write to \"%s\" : %m",
+   FilePathName(thisfile;

Nit: "could not write [to] file \"%s\": %m" is a more common error
string.
--
Michael




Re: valgrind error

2020-06-05 Thread Noah Misch
On Sun, May 10, 2020 at 09:29:05AM -0400, Andrew Dunstan wrote:
> On 4/18/20 9:15 AM, Andrew Dunstan wrote:
> > I was just trying to revive lousyjack, my valgrind buildfarm animal
> > which has been offline for 12 days, after having upgraded the machine
> > (fedora 31, gcc 9.3.1, valgrind 3.15) and noticed lots of errors like this:

> > {
> >    
> >    Memcheck:Value8
> >    fun:pg_comp_crc32c_sb8
> >    fun:XLogRecordAssemble
> >    fun:XLogInsert
> >    fun:LogCurrentRunningXacts
> >    fun:LogStandbySnapshot
> >    fun:CreateCheckPoint
> >    fun:CheckpointerMain
> >    fun:AuxiliaryProcessMain
> >    fun:StartChildProcess
> >    fun:reaper
> >    obj:/usr/lib64/libpthread-2.30.so
> >    fun:select
> >    fun:ServerLoop
> >    fun:PostmasterMain
> >    fun:main
> > }

> After many hours of testing I have a culprit for this. The error appears
> with valgrind 3.15.0  with everything else held constant. 3.14.0  does
> not produce the problem.

I suspect 3.15.0 is just better at tracking the uninitialized data.  A
more-remote possibility is valgrind-3.14.0 emulating sse42.  That would make
pg_crc32c_sse42_available() return true, avoiding the pg_comp_crc32c_sb8().

> andrew@freddo:bf (master)*$ lscpu
...
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
> fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
> nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy
> svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
> skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
> 
> 
> I did not manage to reproduce this anywhere else, tried on various
> physical, Virtualbox and Docker instances.

I can reproduce this on a 2017-vintage CPU with ./configure
... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel"
under valgrind-3.15.0 (as packaged by RHEL 7.8).  valgrind.supp has a
suppression for CRC calculations, but it didn't get the memo when commit
4f700bc renamed the function.  The attached patch fixes the suppression.
Author: Noah Misch 
Commit: Noah Misch 

Refresh function name in CRC-associated Valgrind suppressions.

Back-patch to 9.5, where commit 4f700bcd20c087f60346cb8aefd0e269be8e2157
first appeared.

Reviewed by FIXME.  Reported by Andrew Dunstan.

Discussion: 
https://postgr.es/m/4dfabec2-a3ad-0546-2d62-f816c97ed...@2ndquadrant.com

diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..acdb620 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -45,7 +45,7 @@
padding_XLogRecData_CRC
Memcheck:Value8
 
-   fun:pg_comp_crc32c
+   fun:pg_comp_crc32c*
fun:XLogRecordAssemble
 }
 
@@ -89,7 +89,7 @@
 {
padding_twophase_CRC
Memcheck:Value8
-   fun:pg_comp_crc32c
+   fun:pg_comp_crc32c*
fun:EndPrepare
 }
 


Re: OpenSSL 3.0.0 compatibility

2020-06-05 Thread Michael Paquier
On Mon, Jun 01, 2020 at 10:44:17AM +0200, Peter Eisentraut wrote:
> Then we can stick a OPENSSL_API_COMPAT=908 into at least PG10, 11, and 12,
> and probably also into PG9.5 and 9.6 without harm.

FWIW, I am fine with that for PG >= 10, and I don't think that it is worth
bothering with PG <= 9.6.
--
Michael




Re: A wrong index choose issue because of inaccurate statistics

2020-06-05 Thread Pavel Stehule
pá 5. 6. 2020 v 8:19 odesílatel David Rowley  napsal:

> On Mon, 1 Jun 2020 at 01:24, Andy Fan  wrote:
> > The one-line fix describe the exact idea in my mind:
> >
> > +++ b/src/backend/optimizer/path/costsize.c
> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root,
> double loop_count,
> >
> > cpu_run_cost += cpu_per_tuple * tuples_fetched;
> >
> > +   /*
> > +    * To make the planner more robust in handling some inaccurate
> > +    * statistics issues, we will add an extra cost to qpquals so that
> > +    * the fewer qpquals, the lower the cost.
> > +*/
> > +   cpu_run_cost += 0.01 * list_length(qpquals);
> > +
> >
> > This change does fix the issue above, but will it make some other cases
> > worse? My answer is no based on my current knowledge, and this is the most
> > important point I want advice on. The main impact I can see is that it not
> > only changes the cost difference between (a, b) and (a, c), it also changes
> > the cost difference between an index scan on (a, c) and a seq scan. However,
> > the cost difference between an index scan and a seq scan is huge in
> > practice, so I don't think this impact is harmful.
>
> Didn't that idea already get shot down in the final paragraph on [1]?
>
> I understand that you wish to increase the cost by some seemingly
> innocent constant to fix your problem case.  Here are my thoughts
> about that: Telling lies is not really that easy to pull off. Bad
> liers think it's easy and good ones know it's hard. The problem is
> that the lies can start small, but then at some point the future you
> must fashion some more lies to account for your initial lie. Rinse and
> repeat that a few times and before you know it, your course is set
> well away from the point of truth.  I feel the part about "rinse and
> repeat" applies reasonably well to how join costing works.  The lie is
> likely to be amplified as the join level gets deeper.
>
> I think you need to think of a more generic solution and propose that
> instead.  There are plenty of other quirks in the planner that can
> cause suffering due to inaccurate or non-existing statistics. For
> example, due to how we multiply individual selectivity estimates,
> having a few correlated columns in a join condition can cause the
> number of join rows to be underestimated. Subsequent joins can then
> end up making bad choices on which join operator to use based on those
> inaccurate row estimates.  There's also a problem with WHERE  ORDER
> BY col LIMIT n; sometimes choosing an index that provides pre-sorted
> input to the ORDER BY but cannot use  as an indexable condition.
> We don't record any stats to make better choices there, maybe we
> should, but either way, we're taking a big risk there as all the rows
> matching  might be right at the end of the index and we might need
> to scan the entire thing before hitting the LIMIT. For now, we just
> assume completely even distribution of rows. i.e. If there are 50 rows
> estimated in the path and the limit is for 5 rows, then we'll assume
> we need to read 10% of those before finding all the ones we need. In
> reality, everything matching  might be 95% through the index and we
> could end up reading 100% of rows. That particular problem is not just
> caused by the uneven distribution of rows in the index, but also from
> selectivity underestimation.
>
> I'd more recently wondered if we shouldn't have some sort of "risk"
> factor to the cost model. I really don't have ideas on how exactly we
> would calculate the risk factor in all cases, but in this case,  say
> the risk factor always starts out as 1. We could multiply that risk
> factor by some >1 const each time we find another index filter qual.
> add_path() can prefer lower risk paths when the costs are similar.
> Unsure what the exact add_path logic would be. Perhaps a GUC would
> need to assist with the decision there.   Likewise, with
> NestedLoopPaths which have a large number of join quals, the risk
> factor could go up a bit with those so that we take a stronger
> consideration for hash or merge joins instead.
>
>
I thought about these ideas too. And I am not alone.

https://hal.archives-ouvertes.fr/hal-01316823/document

Regards

Pavel

Anyway, it's pretty much a large research subject which would take a
> lot of work to iron out even just the design. It's likely not a
> perfect idea, but I think it has a bit more merit than trying to
> introduce lies to the cost model to account for a single case where
> there is a problem.
>
> David
>
> [1]
> https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development
>
>
>


Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-05 Thread Michael Paquier
On Wed, Jun 03, 2020 at 10:10:21PM +0900, Inoue, Hiroshi wrote:
> On 2020/06/03 11:14, Michael Paquier wrote:
>> I have been looking at the ODBC driver and the need for currtid() as
>> well as currtid2(), and as mentioned already in [1], matching with my
>> lookup of things, these are actually not needed by the driver as long
>> as we connect to a server newer than 8.2 able to support RETURNING.
> 
> Though currtid2() is necessary even for servers which support RETURNING,
> I don't object to removing it.

In which cases is it getting used then?  From what I can see there is
zero coverage for that part in the tests.  And based on a rough read
of the code, this would get called with LATEST_TUPLE_LOAD being set,
where there is some kind of bulk deletion involved.  Couldn't that be
a problem?
--
Michael




Re: A wrong index choose issue because of inaccurate statistics

2020-06-05 Thread David Rowley
On Mon, 1 Jun 2020 at 01:24, Andy Fan  wrote:
> The one-line fix describe the exact idea in my mind:
>
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double 
> loop_count,
>
> cpu_run_cost += cpu_per_tuple * tuples_fetched;
>
> +   /*
> +    * To make the planner more robust in handling some inaccurate
> +    * statistics issues, we will add an extra cost to qpquals so that
> +    * the fewer qpquals, the lower the cost.
> +*/
> +   cpu_run_cost += 0.01 * list_length(qpquals);
> +
>
> This change does fix the issue above, but will it make some other cases
> worse? My answer is no based on my current knowledge, and this is the most
> important point I want advice on. The main impact I can see is that it not
> only changes the cost difference between (a, b) and (a, c), it also changes
> the cost difference between an index scan on (a, c) and a seq scan. However,
> the cost difference between an index scan and a seq scan is huge in
> practice, so I don't think this impact is harmful.

Didn't that idea already get shot down in the final paragraph on [1]?

I understand that you wish to increase the cost by some seemingly
innocent constant to fix your problem case.  Here are my thoughts
about that: Telling lies is not really that easy to pull off. Bad
liars think it's easy and good ones know it's hard. The problem is
that the lies can start small, but then at some point the future you
must fashion some more lies to account for your initial lie. Rinse and
repeat that a few times and before you know it, your course is set
well away from the point of truth.  I feel the part about "rinse and
repeat" applies reasonably well to how join costing works.  The lie is
likely to be amplified as the join level gets deeper.

I think you need to think of a more generic solution and propose that
instead.  There are plenty of other quirks in the planner that can
cause suffering due to inaccurate or non-existing statistics. For
example, due to how we multiply individual selectivity estimates,
having a few correlated columns in a join condition can cause the
number of join rows to be underestimated. Subsequent joins can then
end up making bad choices on which join operator to use based on those
inaccurate row estimates.  There's also a problem with WHERE  ORDER
BY col LIMIT n; sometimes choosing an index that provides pre-sorted
input to the ORDER BY but cannot use  as an indexable condition.
We don't record any stats to make better choices there, maybe we
should, but either way, we're taking a big risk there as all the rows
matching  might be right at the end of the index and we might need
to scan the entire thing before hitting the LIMIT. For now, we just
assume completely even distribution of rows. i.e. If there are 50 rows
estimated in the path and the limit is for 5 rows, then we'll assume
we need to read 10% of those before finding all the ones we need. In
reality, everything matching  might be 95% through the index and we
could end up reading 100% of rows. That particular problem is not just
caused by the uneven distribution of rows in the index, but also from
selectivity underestimation.

I'd more recently wondered if we shouldn't have some sort of "risk"
factor to the cost model. I really don't have ideas on how exactly we
would calculate the risk factor in all cases, but in this case,  say
the risk factor always starts out as 1. We could multiply that risk
factor by some >1 const each time we find another index filter qual.
add_path() can prefer lower risk paths when the costs are similar.
Unsure what the exact add_path logic would be. Perhaps a GUC would
need to assist with the decision there.   Likewise, with
NestedLoopPaths which have a large number of join quals, the risk
factor could go up a bit with those so that we take a stronger
consideration for hash or merge joins instead.

Anyway, it's pretty much a large research subject which would take a
lot of work to iron out, even just for the design.  It's likely not a
perfect idea, but I think it has a bit more merit than trying to
introduce lies into the cost model to account for a single case where
there is a problem.

David

[1] 
https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-05 Thread Amit Kapila
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar  wrote:
>
> Apart from this one more fix in 0005,  basically, CheckLiveXid was
> never reset, so I have fixed that as well.
>

I have made a number of modifications in the 0001 patch and attached
is the result.  I have changed/added comments, done some cosmetic
cleanup, and ran pgindent.  The most notable change is to remove the
below code change:
DecodeXactOp()
{
..
- * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+ * However, it's critical to process records with subxid assignment even
  * when the snapshot is being built: it is possible to get later records
  * that require subxids to be properly assigned.
  */
  if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(XLogRecGetTopXid(r)))
..
}

I have not only removed the change done by the patch but the check
related to XLOG_XACT_ASSIGNMENT as well.  That check has been added by
commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even
if snapshot state is not SNAPBUILD_FULL_SNAPSHOT.  Now, with this
patch that is not required, because we WAL-log the association between
the subtransaction and the top-level transaction much earlier than
this.  I have verified
that it doesn't reopen the bug by running the test provided in the
original report [1].

Let me know what you think of the changes.  If you find them okay,
then feel free to include them in the next patch-set.

[1] - 
https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patch
Description: Binary data


Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-05 Thread Michael Paquier
On Thu, Jun 04, 2020 at 10:59:05AM -0700, Andres Freund wrote:
> What's the point of that change? I think the differentiation between
> heapam_handler.c and heapam.c could be clearer, but if anything, I'd
> argue that heap_get_latest_tid is sufficiently low-level that it'd
> belong in heapam.c.

Well, heap_get_latest_tid() is only called from heapam_handler.c, as
it is not used elsewhere and we do not publish it.  And IMO we should
try to encourage using table_get_latest_tid() instead if some plugins
need that.  Anyway, if you are opposed to this change, I won't push
hard for it either.
--
Michael




Re: BufFileRead() error signalling

2020-06-05 Thread Thomas Munro
On Thu, May 28, 2020 at 7:10 PM Michael Paquier  wrote:
> On Wed, May 27, 2020 at 11:59:59AM -0400, Alvaro Herrera wrote:
> > In the discussion that led to 811b6e36a9e2 I did suggest to use "read
> > only M of N" instead, but there wasn't enough discussion on that fine
> > point so we settled on what you now call prevalent without a lot of
> > support specifically on that.  I guess it was enough of an improvement
> > over what was there.  But like Robert, I too prefer the wording that
> > includes "only" and "bytes" over the wording that doesn't.
> >
> > I'll let it be known that from a translator's point of view, it's a
> > ten-seconds job to update a fuzzy string from not including "only" and
> > "bytes" to one that does.  So let's not make that an argument for not
> > changing.
>
> Using "only" would be fine by me, though I tend to prefer the existing
> one.  Now I think that we should avoid "bytes" to not have to worry
> about pluralization of error messages.  This has been a concern in the
> past (see e5d11b9 and the likes).

Sorry for the delay in producing a new patch.  Here's one that tries
to take into account the various feedback in this thread:

Ibrar said:
> You are checking file->dirty twice, first before calling the function
> and within the function too. Same for the Assert.

Fixed.

Robert, Melanie, Michael and Alvaro put forward various arguments
about the form of "short read" and block seek error message.  While
removing bogus cases of "%m", I changed them all to say "read only %zu
of %zu bytes" in the same place.  I removed the bogus "%m" from
BufFileSeek() and BufFileSeekBlock() callers (its call to
BufFileFlush() already throws).  I added block numbers to the error
messages about failure to seek by block.

Following existing practice, I made write failure assume ENOSPC if
errno is 0, so that %m would make sense (I am not sure what platform
reports out-of-space that way, but there are certainly a lot of copies
of that code in the tree so it seemed to be missing here).
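The convention being followed can be mirrored outside PostgreSQL in a
few lines (a sketch only; normalize_write_errno is a made-up name, not
a function in the tree):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Sketch of the convention: if a write comes up short without setting
 * errno, assume the disk filled up, so that a later %m (strerror)
 * produces a sensible "No space left on device" message instead of
 * "Success".
 */
static int
normalize_write_errno(size_t written, size_t expected, int saved_errno)
{
	if (written < expected && saved_errno == 0)
		return ENOSPC;
	return saved_errno;
}
```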

I didn't change BufFileWrite() to be void, to be friendly to existing
callers outside the tree (if there are any), though I removed all the
code that checks the return code.  We can make it void later.

For the future: it feels a bit like we're missing a one line way to
say "read this many bytes and error out if you run out".
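For what it's worth, the kind of one-liner wished for here might look
like the following, sketched against stdio rather than buffile.c
(read_exact is a hypothetical name; a real version would presumably
ereport the "read only %zu of %zu bytes" message itself rather than
return a status):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read exactly size bytes, reporting how many
 * were actually read so the caller can complain about a short read.
 * Returns 0 on success, -1 on a short read.
 */
static int
read_exact(FILE *f, void *ptr, size_t size, size_t *nread)
{
	*nread = fread(ptr, 1, size, f);
	return (*nread == size) ? 0 : -1;
}
```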
From ab6f90f5e5f3b6e70de28a4ee734e732c2a8c30b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 30 Nov 2019 12:44:47 +1300
Subject: [PATCH] Redesign error handling in buffile.c.

Convert error handling in buffile.c functions to raise errors rather
than expecting callers to check the return value, and improve the error
messages.

This fixes various cases that forgot to check for errors, or couldn't
check for errors because EOF was indistinguishable from error, or
checked for errors unless they fell on certain boundaries which they
assumed to be EOF.

This had been identified by code inspection, but was later identified as
the probable cause of a crash report involving a corrupted hash join
spill file when disk space ran out.

Back-patch to all releases.  For now, BufFileWrite() still returns a
redundant value so as not to change the function prototype for the
benefit of any extensions that might be using it.

Reported-by: Amit Khandekar 
Reported-by: Alvaro Herrera 
Reviewed-by: Melanie Plageman 
Reviewed-by: Alvaro Herrera 
Reviewed-by: Robert Haas 
Reviewed-by: Ibrar Ahmed 
Reviewed-by: Michael Paquier 
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
---
 src/backend/access/gist/gistbuildbuffers.c | 23 -
 src/backend/executor/nodeHashjoin.c| 24 --
 src/backend/replication/backup_manifest.c  |  2 +-
 src/backend/storage/file/buffile.c | 51 +++-
 src/backend/utils/sort/logtape.c   | 18 ---
 src/backend/utils/sort/sharedtuplestore.c  | 21 
 src/backend/utils/sort/tuplestore.c| 56 ++
 7 files changed, 91 insertions(+), 104 deletions(-)

diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4b562d8d3f..7ff513a5fe 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -757,26 +757,19 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 static void
 ReadTempFileBlock(BufFile *file, long blknum, void *ptr)
 {
+	size_t		nread;
+
 	if (BufFileSeekBlock(file, blknum) != 0)
-		elog(ERROR, "could not seek temporary file: %m");
-	if (BufFileRead(file, ptr, BLCKSZ) != BLCKSZ)
-		elog(ERROR, "could not read temporary file: %m");
+		elog(ERROR, "could not seek block %ld in temporary file", blknum);
+	if ((nread = BufFileRead(file, ptr, BLCKSZ)) != BLCKSZ)
+		elog(ERROR, "could not read temporary file: read only %zu of %zu bytes",
+			 nread, (size_t) BLCKSZ);
 }
 
 static void
 WriteTempFileBlock(BufFile *file, long blknum,