Re: In PG12, query with float calculations is slower than PG11

2020-02-06 Thread Andres Freund
Hi, 

On February 6, 2020 11:42:30 PM PST, keisuke kuroda 
 wrote:
>Hi,
>
>I have been testing with newer compiler (clang-7)
>and result is a bit different at least with clang-7.
>Compiling PG 12.1 (even without patch) with clang-7
>results in __isinf() no longer being a bottleneck,
>that is, you don't see it in profiler at all.

I don't think that's necessarily the right conclusion. What's quite possibly 
happening is that you no longer see the external isinf function because it is 
implemented as an intrinsic, but that more computations are still being done, 
due to the changed order of the isinf checks. You'd have to compare with 11 
using the same compiler.
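As an illustrative sketch of the two check orderings being discussed (this is the editor's hypothetical C, not PostgreSQL's actual float8mul code; function names are made up):

```c
#include <assert.h>
#include <math.h>

/* "Eager" order: the inputs are tested with isinf() on every call,
 * even when nothing overflowed -- extra work in the common case. */
static int mul_overflowed_eager(double a, double b, double *out)
{
    int inf_inputs = isinf(a) || isinf(b);  /* always evaluated */
    *out = a * b;
    return isinf(*out) && !inf_inputs;      /* 1 = genuine overflow */
}

/* "Lazy" order: the inputs are only inspected once the result looks
 * infinite, so the common case does a single isinf() on the result. */
static int mul_overflowed_lazy(double a, double b, double *out)
{
    *out = a * b;
    if (!isinf(*out))
        return 0;                           /* common, cheap path */
    return !isinf(a) && !isinf(b);
}
```

Both return the same answers; the difference is only how much work the non-overflowing path does, which is the cost being discussed here.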

Andres


>* result(PostgreSQL 12.1 (even without patch))
>
>postgres=# EXPLAIN (ANALYZE on, VERBOSE on, BUFFERS on)
> select (2 * a) , (2 * b) , (2 * c), (2 * d),  (2 * e)
> from realtest;
>
>QUERY PLAN
>---
>Seq Scan on public.realtest  (cost=0.00..288697.59 rows=1115
>width=40)
>(actual time=0.012..3878.284 rows=1001 loops=1)
>   Output: ('2'::double precision * a), ('2'::double precision * b),
>('2'::double precision * c), ('2'::double precision * d), ('2'::double
>precision * e)
>   Buffers: shared hit=63695
> Planning Time: 0.038 ms
> Execution Time: 4533.767 ms
>(5 rows)
>
>Samples: 5K of event 'cpu-clock', Event count (approx.): 127500
>Overhead  Command   Shared Object  Symbol
>  33.92%  postgres  postgres   [.] ExecInterpExpr
>  13.27%  postgres  postgres   [.] float84mul
>  10.86%  postgres  [vdso] [.] __vdso_clock_gettime
>   5.49%  postgres  postgres   [.] tts_buffer_heap_getsomeattrs
>   3.96%  postgres  postgres   [.] ExecScan
>   3.25%  postgres  libc-2.17.so   [.] __clock_gettime
>   3.16%  postgres  postgres   [.] heap_getnextslot
>   2.41%  postgres  postgres   [.] tts_virtual_clear
>   2.39%  postgres  postgres   [.] SeqNext
>   2.22%  postgres  postgres   [.] InstrStopNode
>
>Best Regards,
>Keisuke Kuroda
>
>2020年2月7日(金) 3:48 Andres Freund :
>
>> Hi,
>>
>> On 2020-02-06 11:03:51 -0500, Tom Lane wrote:
>> > Andres seems to be of the opinion that the compiler should be
>willing
>> > to ignore the semantic requirements of the C standard in order
>> > to rearrange the code back into the cheaper order.  That sounds
>like
>> > wishful thinking to me ... even if it actually works on his
>compiler,
>> > it certainly isn't going to work for everyone.
>>
>> Sorry, but, uh, what are you talking about?  Please tell me which
>single
>> standards violation I'm advocating for?
>>
>> I was asking about the inlining bit because the first email of the
>topic
>> explained that as the problem, which I don't believe can be the full
>> explanation - and it turns out it isn't. As Amit Langote's followup
>> email explained, there's the whole issue of the order of checks being
>> inverted - which is clearly bad. And wholly unrelated to inlining.
>>
>> And I asked about __isinf() being used because there are issues with
>> accidentally ending up with the non-intrinsic version of isinf() when
>> not using gcc, due to badly written standard library headers.
>>
>>
>> > The patch looks unduly invasive to me, but I think that it might be
>> > right that we should go back to a macro-based implementation,
>because
>> > otherwise we don't have a good way to be certain that the function
>> > parameter won't get evaluated first.
>>
>> I'd first like to see some actual evidence of this being a problem,
>> rather than just the order of the checks.
>>
>>
>> > (Another reason to do so is so that the file/line numbers generated
>> > for the error reports go back to being at least a little bit
>useful.)
>> > We could use local variables within the macro to avoid double
>evals,
>> > if anyone thinks that's actually important --- I don't.
>>
>> I don't think that's necessarily a good idea. In fact, I think we
>should
>> probably do the exact opposite, and move the error messages further
>out
>> of line. All these otherwise very small functions having their own
>> ereports makes them much bigger. Our low code density, and the
>resulting
>> rate of itlb misses, is pretty significant cost (cf [1]).
>>
>> master:
>>textdata bss dec hex filename
>>   36124  44  65   362338d89 float.o
>> error messages moved out of line:
>>textdata bss dec hex filename
>>   32883  44  65   3299280e0 float.o
>>
>> Taking int4pl as an example - solely because it is simpler assembly
>to
>> look at - we get:
>>
>> master:
>>0x004ac190 <+0>: mov0x30(%rdi),%rax
>>0x004ac194 <+4>: add0x20(%rdi),%eax
>>0x004ac197 <+7>: jo 0x4ac19c 
>>0x004ac199 <+9>: cltq
>>0x004ac19b <+11>:retq
>>0x004ac19c <+12>:push   %rbp
>> 
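The quoted int4pl disassembly boils down to an add followed by a jo (jump-on-overflow). A self-contained sketch of that pattern, using the GCC/Clang __builtin_add_overflow intrinsic (the editor's illustration, not PostgreSQL's actual implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Overflow-checked 32-bit addition. With GCC or Clang,
 * __builtin_add_overflow typically compiles down to the add + jo
 * sequence seen in the quoted disassembly. Returns 0 on overflow
 * (where the real function would ereport an error), 1 on success. */
static int add_s32_checked(int32_t a, int32_t b, int32_t *sum)
{
    int32_t r;
    if (__builtin_add_overflow(a, b, &r))
        return 0;       /* overflow detected by the intrinsic */
    *sum = r;
    return 1;
}
```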

Re: In PG12, query with float calculations is slower than PG11

2020-02-06 Thread keisuke kuroda
Hi,

I have been testing with a newer compiler (clang-7),
and the result is a bit different, at least with clang-7.
Compiling PG 12.1 (even without the patch) with clang-7
results in __isinf() no longer being a bottleneck;
that is, you don't see it in the profiler at all.

So there is no issue for people who use a modern clang toolchain,
but that may not be everyone, so there may still be some interest
in doing something about this.

* clang

bash-4.2$ which clang
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang

bash-4.2$ clang -v
clang version 7.0.1 (tags/RELEASE_701/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rh/llvm-toolset-7.0/root/usr/bin
Found candidate GCC installation:
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7
Found candidate GCC installation:
/opt/rh/devtoolset-8/root/usr/lib/gcc/x86_64-redhat-linux/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.2
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Selected GCC installation:
/opt/rh/devtoolset-8/root/usr/lib/gcc/x86_64-redhat-linux/8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64

** pg_config

---
CONFIGURE = '--prefix=/var/lib/pgsql/pgsql/12.1'
'CC=/opt/rh/llvm-toolset-7.0/root/usr/bin/clang'
'PKG_CONFIG_PATH=/opt/rh/llvm-toolset-7.0/root/usr/lib64/pkgconfig'
CC = /opt/rh/llvm-toolset-7.0/root/usr/bin/clang
---

* result(PostgreSQL 12.1 (even without patch))

postgres=# EXPLAIN (ANALYZE on, VERBOSE on, BUFFERS on)
 select (2 * a) , (2 * b) , (2 * c), (2 * d),  (2 * e)
 from realtest;

QUERY PLAN
---
 Seq Scan on public.realtest  (cost=0.00..288697.59 rows=1115 width=40)
(actual time=0.012..3878.284 rows=1001 loops=1)
   Output: ('2'::double precision * a), ('2'::double precision * b),
('2'::double precision * c), ('2'::double precision * d), ('2'::double
precision * e)
   Buffers: shared hit=63695
 Planning Time: 0.038 ms
 Execution Time: 4533.767 ms
(5 rows)

Samples: 5K of event 'cpu-clock', Event count (approx.): 127500
Overhead  Command   Shared Object  Symbol
  33.92%  postgres  postgres   [.] ExecInterpExpr
  13.27%  postgres  postgres   [.] float84mul
  10.86%  postgres  [vdso] [.] __vdso_clock_gettime
   5.49%  postgres  postgres   [.] tts_buffer_heap_getsomeattrs
   3.96%  postgres  postgres   [.] ExecScan
   3.25%  postgres  libc-2.17.so   [.] __clock_gettime
   3.16%  postgres  postgres   [.] heap_getnextslot
   2.41%  postgres  postgres   [.] tts_virtual_clear
   2.39%  postgres  postgres   [.] SeqNext
   2.22%  postgres  postgres   [.] InstrStopNode

Best Regards,
Keisuke Kuroda

2020年2月7日(金) 3:48 Andres Freund :

> Hi,
>
> On 2020-02-06 11:03:51 -0500, Tom Lane wrote:
> > Andres seems to be of the opinion that the compiler should be willing
> > to ignore the semantic requirements of the C standard in order
> > to rearrange the code back into the cheaper order.  That sounds like
> > wishful thinking to me ... even if it actually works on his compiler,
> > it certainly isn't going to work for everyone.
>
> Sorry, but, uh, what are you talking about?  Please tell me which single
> standards violation I'm advocating for?
>
> I was asking about the inlining bit because the first email of the topic
> explained that as the problem, which I don't believe can be the full
> explanation - and it turns out it isn't. As Amit Langote's followup
> email explained, there's the whole issue of the order of checks being
> inverted - which is clearly bad. And wholly unrelated to inlining.
>
> And I asked about __isinf() being used because there are issues with
> accidentally ending up with the non-intrinsic version of isinf() when
> not using gcc, due to badly written standard library headers.
>
>
> > The patch looks unduly invasive to me, but I think that it might be
> > right that we should go back to a macro-based implementation, because
> > otherwise we don't have a good way to be certain that the function
> > parameter won't get evaluated first.
>
> I'd first like to see some actual evidence of this being a problem,
> rather than just the order of the checks.
>
>
> > (Another reason to do so is so that the file/line numbers generated
> > for the error reports go back to being at least a little bit useful.)
> > We could use local variables within the macro to avoid double evals,
> > if anyone thinks that's actually important --- I don't.
>
> I don't think that's necessarily a good idea. In fact, I think we should
> probably do the exact opposite, and move the error messages further out
> of line. All these otherwise very small functions having their own
> ereports makes them much bigger. Our low code density, and the resulting
> rate of itlb misses, is pretty significant cost (cf [1]).
>
> master:
>textdata 

Re: Adding a test for speculative insert abort case

2020-02-06 Thread Andres Freund
Hi,

On 2019-08-21 13:59:00 -0700, Melanie Plageman wrote:
> >> > +step "controller_print_speculative_locks" { SELECT
> >> locktype,classid,objid,mode,granted FROM pg_locks WHERE
> >> locktype='speculative
> >> > +token' ORDER BY granted; }
> >>
> >> I think showing the speculative locks is possibly going to be unreliable
> >> - the release time of speculative locks is IIRC not that reliable. I
> >> think it could e.g. happen that speculative locks are held longer
> >> because autovacuum spawned an analyze in the background.
> >>
> >>
> > I actually think having the "controller_print_speculative_locks"
> > wouldn't be a problem because we have not released the advisory lock
> > on 4 in s2 that allows it to complete its speculative insertion and so
> > s1 will still be in speculative wait.

Hm. At the very least it'd have to be restricted to only match locks in
the same database - e.g. for parallel installcheck it is common for
there to be other concurrent tests.  I'll add that when committing, no
need for a new version.
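One way to scope that step to the current database (an editor's sketch, not the wording of any actual commit; since speculative-token lock rows may not expose a database OID directly, this goes through pg_stat_activity):

```sql
-- Hypothetical version of the step, restricted to backends connected
-- to the current database so concurrent tests elsewhere don't match.
SELECT locktype, classid, objid, mode, granted
  FROM pg_locks
 WHERE locktype = 'speculative token'
   AND pid IN (SELECT pid FROM pg_stat_activity
               WHERE datname = current_database())
 ORDER BY granted;
```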

I'm also a bit concerned that adding the pg_locks query would mean we can't
run the test in parallel with others, if we ever finally get around to
adding a parallel isolationtester schedule (which is really needed, it's
too slow as is).
https://postgr.es/m/20180124231006.z7spaz5gkzbdvob5@alvherre.pgsql
But I guess we'll just deal with not running this test in parallel.


> > The step that might be a problem if autovacuum delays release of the
> > speculative locks is the "controller_show" step, because, at that
> > point, if the lock wasn't released, then s1 would still be waiting and
> > wouldn't have updated.
> >
> So, what should we do about this? Do you agree that the "controller_show"
> step would be a problem?

It hasn't caused failures so far, I think. Or are you saying you think
it's more likely to cause failures in the added tests?

Had planned to commit now, but I'm not able to think through the state
transitions at this hour, apparently :(. I'll try to do it tomorrow
morning.

Greetings,

Andres Freund




Re: WIP: expression evaluation improvements

2020-02-06 Thread Andres Freund
Hi,

On 2019-10-28 23:58:11 -0700, Soumyadeep Chakraborty wrote:
> > Sorry for not replying to that earlier.  I'm not quite sure it's
> > actually worthwhile doing so - did you try to measure any memory / cpu
> > savings?
> 
> No problem, thanks for the reply! Unfortunately, I did not do anything
> significant in terms of mem/cpu measurements. However, I have noticed
> non-trivial differences between optimized and unoptimized .bc files
> that were dumped from time to time.

Could you expand on what you mean here? Are you saying that you got
significantly better optimization results by doing function optimization
early on?  That'd be surprising imo?

Greetings,

Andres Freund




Re: WIP: expression evaluation improvements

2020-02-06 Thread Andres Freund
Hi,

On 2019-10-28 23:58:11 -0700, Soumyadeep Chakraborty wrote:
> > Cool! I'll probably merge that into my patch (with attribution of
> > course).
> >
> > I wonder if it'd nicer to not have separate C variables for all of
> > these, and instead look them up on-demand from the module loaded in
> > llvm_create_types(). Not sure.
> 
> Great! It is much nicer indeed. Attached version 2 with your suggested
> changes.
> (v2-0001-Rely-on-llvmjit_types-for-building-EvalFunc-calls.patch)
> Used the same testing method as above.

I've committed a (somewhat evolved) version of this patch. I think it
really improves the code!

My changes largely were to get rid of the LLVMGetNamedFunction() added
to each opcode implementation, to also convert the ExecEval* functions
we were calling directly, to remove the other functions in llvmjit.h,
and finally to rebase it onto master, from the patch series in this
thread.

I do wonder about adding a variadic wrapper like the one introduced here
more widely; it seems like it could simplify a number of places. If we then
redirected all function calls through a common wrapper for LLVMBuildCall
that also validated parameter count (and perhaps types), I think it'd be
easier to develop...

Thanks!

Andres




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-06 Thread Kasahara Tatsuhito
Hi,

On Fri, Feb 7, 2020 at 1:27 PM Kyotaro Horiguchi
 wrote:
> It seems that nkeys and key are useless. Since every table_beginscan_*
> functions have distinct parameter sets, don't we remove them from
> table_beginscan_tid?
Yeah, actually, when calling table_beginscan_tid(), nkeys is set to 0
and *key is set to NULL,
so these are not used at the moment.

I removed unnecessary arguments from table_beginscan_tid().

Attached is the v5 patch.


-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com


fix_tidscan_issues_v5.patch
Description: Binary data


Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-06 Thread Kyotaro Horiguchi
At Fri, 7 Feb 2020 12:27:26 +0900, Kasahara Tatsuhito 
 wrote in 
> > IIUC setting SO_ALLOW_STRAT, SO_ALLOW_SYNC and SO_ALLOW_PAGEMODE has
> > no meaning during tid scan. I think v11 also should be the same.
> Thanks for your check, and useful advise.
> I was wondering if I should keep these flags, but I confirmed that I
> can remove these from TidScan's flags.
> 
> > Why did you remove SO_TYPE_TIDSCAN from the previous version patch? I
> > think it's better to add it and then we can set only SO_TYPE_TIDSCAN
> > to the scan option of tid scan.
> Because, currently SO_TYPE_TIDSCAN is not used anywhere.
> So I thought it might be better to avoid adding a new flag.
> However, as you said, this flag may be useful for the future tid scan
> feature (like [1])
> 
> Attach v4 patch. I re-added SO_TYPE_TIDSCAN to ScanOptions and
> table_beginscan_tid.
> And I remove unnecessary SO_ALLOW_* flags.

+table_beginscan_tid(Relation rel, Snapshot snapshot,
+   int nkeys, struct ScanKeyData *key)
+{
+   uint32  flags = SO_TYPE_TIDSCAN;
+
+   return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, 
flags);

It seems that nkeys and key are useless. Since every table_beginscan_*
function has a distinct parameter set, shouldn't we remove them from
table_beginscan_tid?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: typos in comments and user docs

2020-02-06 Thread Justin Pryzby
On Fri, Feb 07, 2020 at 09:26:04AM +0530, Amit Kapila wrote:
> On Fri, Feb 7, 2020 at 8:41 AM Justin Pryzby  wrote:
> >
> > On Fri, Feb 07, 2020 at 08:33:40AM +0530, Amit Kapila wrote:
> > > On Thu, Feb 6, 2020 at 7:26 PM Justin Pryzby  wrote:
> > > >
> > > > On Thu, Feb 06, 2020 at 04:43:18PM +0530, Amit Kapila wrote:
> > > > > On Thu, Feb 6, 2020 at 10:45 AM Michael Paquier  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Feb 06, 2020 at 08:47:14AM +0530, Amit Kapila wrote:
> > > > > > > Your changes look fine to me on the first read.  I will push this 
> > > > > > > to
> > > > > > > HEAD unless there are any objections.   If we want them in
> > > > > > > back-branches, we might want to probably segregate the changes 
> > > > > > > based
> > > > > > > on the branch until those apply.
> > > > > >
> > > > > > +1.  It would be nice to back-patch the user-visible changes in the
> > > > > > docs.
> > > > > >
> > > > >
> > > > > Fair enough, Justin, is it possible for you to segregate the changes
> > > > > that can be backpatched?
> > > >
> > > > Looks like the whole patch can be applied to master and v12 [0].
> > >
> 
> I tried your patch master and it failed to apply.
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file doc/src/sgml/bloom.sgml
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file doc/src/sgml/config.sgml
> Hunk #1 FAILED at 4318.
> 1 out of 1 hunk FAILED -- saving rejects to file doc/src/sgml/config.sgml.rej

I think you applied the first patch, which I corrected here.
https://www.postgresql.org/message-id/20200206135640.GG403%40telsasoft.com

Just rechecked it works for master and v12.

$ git checkout -b test2  origin/master
Branch test2 set up to track remote branch master from origin.
Switched to a new branch 'test2'
$ patch -p1 <0001-spelling-and-typos.patch
patching file doc/src/sgml/bloom.sgml
patching file doc/src/sgml/ref/alter_table.sgml
patching file doc/src/sgml/sources.sgml
patching file src/backend/access/transam/README.parallel
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/storage/sync/sync.c
patching file src/include/access/tableam.h

$ patch -p1 <0001-spelling-and-typos.patch
patching file doc/src/sgml/bloom.sgml
patching file doc/src/sgml/ref/alter_table.sgml
Hunk #1 succeeded at 220 (offset -2 lines).
patching file doc/src/sgml/sources.sgml
patching file src/backend/access/transam/README.parallel
patching file src/backend/storage/buffer/bufmgr.c
Hunk #1 succeeded at 4268 (offset -23 lines).
patching file src/backend/storage/sync/sync.c
patching file src/include/access/tableam.h
Hunk #1 succeeded at 1167 (offset -18 lines).

The bloom patch there works for v11.
Now attached is another version for v10 and earlier.
From 43c6b9ec80d26a858387ac657043988b8b52e812 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 6 Feb 2020 22:14:20 -0600
Subject: [PATCH] Bloom patch for v10 and v9.6

---
 doc/src/sgml/bloom.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml
index 87d0ad7..84da235 100644
--- a/doc/src/sgml/bloom.sgml
+++ b/doc/src/sgml/bloom.sgml
@@ -65,7 +65,7 @@
  
   Number of bits generated for each index column. Each parameter's name
   refers to the number of the index column that it controls.  The default
-  is 2 bits and maximum is 4095.  Parameters for
+  is 2 bits and the maximum is 4095.  Parameters for
   index columns not actually used are ignored.
  
 
-- 
2.7.4



Re: typos in comments and user docs

2020-02-06 Thread Amit Kapila
On Fri, Feb 7, 2020 at 8:41 AM Justin Pryzby  wrote:
>
> On Fri, Feb 07, 2020 at 08:33:40AM +0530, Amit Kapila wrote:
> > On Thu, Feb 6, 2020 at 7:26 PM Justin Pryzby  wrote:
> > >
> > > On Thu, Feb 06, 2020 at 04:43:18PM +0530, Amit Kapila wrote:
> > > > On Thu, Feb 6, 2020 at 10:45 AM Michael Paquier  
> > > > wrote:
> > > > >
> > > > > On Thu, Feb 06, 2020 at 08:47:14AM +0530, Amit Kapila wrote:
> > > > > > Your changes look fine to me on the first read.  I will push this to
> > > > > > HEAD unless there are any objections.   If we want them in
> > > > > > back-branches, we might want to probably segregate the changes based
> > > > > > on the branch until those apply.
> > > > >
> > > > > +1.  It would be nice to back-patch the user-visible changes in the
> > > > > docs.
> > > > >
> > > >
> > > > Fair enough, Justin, is it possible for you to segregate the changes
> > > > that can be backpatched?
> > >
> > > Looks like the whole patch can be applied to master and v12 [0].
> >

I tried your patch on master and it failed to apply.
(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/bloom.sgml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/config.sgml
Hunk #1 FAILED at 4318.
1 out of 1 hunk FAILED -- saving rejects to file doc/src/sgml/config.sgml.rej

> > If we decide to backpatch, then why not try to backpatch as far as
> > possible (till 9.5)?  If so, then it would be better to separate
> > changes which can be backpatched till 9.5, if that is tedious, then
> > maybe we can just back-patch (in 12) bloom.sgml change as a separate
> > commit and rest commit it in HEAD only.  What do you think?
>
> I don't think I was clear.  My original doc review patches were limited to
> this:
>
> On Sat, Mar 30, 2019 at 05:43:33PM -0500, Justin Pryzby wrote:
> > I reviewed docs like this:
> > git log -p remotes/origin/REL_11_STABLE..HEAD -- doc
>
>
> STABLE..REL_12_STABLE.  So after a few minutes earlier today of cherry-pick, I
> concluded that only bloom.sgml is applicable further back than v12.  Probably,
> I either noticed that minor issue at the same time as nearby doc changes in
> v12(?), or maybe noticed that issue later, independently of doc review, but
> then tacked it on to the previous commit, for lack of any better place.
>

I am still not 100% clear; it would be better if you can prepare a separate
patch that can be backpatched, and the rest we can apply to HEAD.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: SyncRepGetSyncStandbysPriority() vs. SIGHUP

2020-02-06 Thread Kyotaro Horiguchi
At Wed, 5 Feb 2020 23:45:52 -0800, Noah Misch  wrote in 
> Buildfarm runs have triggered the assertion at the end of
> SyncRepGetSyncStandbysPriority():
..
> On my development system, this delay injection reproduces the failure:
> 
> --- a/src/backend/replication/syncrep.c
> +++ b/src/backend/replication/syncrep.c
> @@ -399,6 +399,8 @@ SyncRepInitConfig(void)
>  {
> int priority;
>  
> +   pg_usleep(100 * 1000);

Though I couldn't observe it myself, it can actually happen.

That happens if:

  all potentially-sync standbys have a lower priority than the number of
  syncrep list members.

  the number of sync standbys is short.

  the number of potentially-sync standbys is sufficient.

If all of the above are true, the while (priority <= lowest_priority)
loop never executes and execution falls through to the assertion.

> SyncRepInitConfig() is the function responsible for updating, after SIGHUP,
> the sync_standby_priority values that SyncRepGetSyncStandbysPriority()
> consults.  The assertion holds if each walsender's sync_standby_priority (in
> shared memory) accounts for the latest synchronous_standby_names GUC value.
> That ceases to hold for brief moments after a SIGHUP that changes the
> synchronous_standby_names GUC value.

Agreed.

> I think the way to fix this is to nominate one process to update all
> sync_standby_priority values after SIGHUP.  That process should acquire
> SyncRepLock once per ProcessConfigFile(), not once per walsender.  If
> walsender startup occurs at roughly the same time as a SIGHUP, the new
> walsender should avoid computing sync_standby_priority based on a GUC value
> different from the one used for the older walsenders.

If we update the priority of all walsenders at once, the other
walsenders may calculate the required LSN using the old configuration
with the new priority. I'm not sure how likely that is to happen, but
it would cause a similar situation.

The priority calculation happens for every standby reply. So if
some standbys have a wrong priority, they will catch up on receiving
the next standby reply. Thus just treating walsenders with an
out-of-range priority as non-sync standbys can "fix" it, I believe.
pg_stat_get_wal_senders() reveals such an inconsistent state, but I
don't think that is worth addressing.
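As a standalone sketch of that filtering rule (the editor's hypothetical code; the real check lives in SyncRepGetSyncStandbysPriority(), and the function name here is made up):

```c
#include <assert.h>

/* Count candidate sync standbys, skipping both non-sync walsenders
 * (priority 0) and walsenders whose priority is stale, i.e. larger
 * than the lowest priority the current synchronous_standby_names
 * value can possibly produce. */
static int count_sync_candidates(const int *prio, int n, int lowest_priority)
{
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        if (prio[i] == 0)
            continue;               /* not a synchronous candidate */
        if (prio[i] > lowest_priority)
            continue;               /* stale value from before the reload */
        count++;
    }
    return count;
}
```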

> Would anyone like to fix this?  I could add it to my queue, but it would wait
> a year or more.

The attached patch does that. Isn't that enough?

# The more significant problem is that I haven't succeeded in
# reproducing the problem..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
From ac080100b692700688054630d77190d63027798c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 7 Feb 2020 12:40:56 +0900
Subject: [PATCH] Fix handling of tentative state after changing
 synchronous_standby_names

When reloading the configuration after updating synchronous_standby_names,
walsenders don't all recalculate their sync priority at exactly the same
time. SyncRepGetSyncStandbysPriority may fail in such an inconsistent
situation.  Fix that by treating a walsender as non-sync if it has a
priority number that is impossible given the current
synchronous_standby_names.
---
 src/backend/replication/syncrep.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index c284103b54..2689421bcd 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -842,6 +842,16 @@ SyncRepGetSyncStandbysPriority(bool *am_sync)
 		if (this_priority == 0)
 			continue;
 
+		/*
+		 * Walsenders don't all reload configuration at the same time. A
+		 * walsender may have an out-of-range priority until it processes the
+		 * reload. Treat such walsenders the same way as non-synchronous
+		 * standbys.  The priority is going to be fixed soon, and we can work
+		 * with consistent priorities at the next standby-reply message.
+		 */
+		if (this_priority > lowest_priority)
+			continue;
+		
 		/* Must have a valid flush position */
 		if (XLogRecPtrIsInvalid(flush))
 			continue;
-- 
2.18.2



Re: Expose lock group leader pid in pg_stat_activity

2020-02-06 Thread Michael Paquier
On Thu, Feb 06, 2020 at 09:23:33AM +0100, Julien Rouhaud wrote:
> While on the topic, is there any reason why the backend stays a group leader
> for the rest of its lifetime, and should we change that?

Nothing happens without a reason.  a1c1af2 is the original commit, and
the thread is here:
https://www.postgresql.org/message-id/ca+tgmoapgkdy_z0w9mhqzcgso2t_t-4_v36dxakim+x_fyp...@mail.gmail.com

By looking at the surroundings, there are a couple of assumptions
behind the timing of the shutdown for the workers and the leader.
I have not studied the details much, but my guess is that it makes it
easier to handle the leader shutting down before its workers.  Robert
or Amit likely knows all the details here.

> Also, while reading ProcKill, I noticed a typo in a comment:
> 
> /*
>  * Detach from any lock group of which we are a member.  If the leader
> -* exist before all other group members, it's PGPROC will remain allocated
> +* exist before all other group members, its PGPROC will remain allocated
>  * until the last group process exits; that process must return the
>  * leader's PGPROC to the appropriate list.
>  */

Thanks, fixed.
--
Michael




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-06 Thread Kasahara Tatsuhito
Hi,

On Thu, Feb 6, 2020 at 11:01 PM Masahiko Sawada
 wrote:
>
> On Thu, 6 Feb 2020 at 19:12, Kasahara Tatsuhito
>  wrote:
> I've tested predicate locking including promotion cases with v3 patch
> and it works fine.
>
> +table_beginscan_tid(Relation rel, Snapshot snapshot,
> +   int nkeys, struct ScanKeyData *key)
> +{
> +   uint32  flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
>
> IIUC setting SO_ALLOW_STRAT, SO_ALLOW_SYNC and SO_ALLOW_PAGEMODE has
> no meaning during tid scan. I think v11 also should be the same.
Thanks for your check, and the useful advice.
I was wondering if I should keep these flags, but I confirmed that I
can remove them from TidScan's flags.

> Why did you remove SO_TYPE_TIDSCAN from the previous version patch? I
> think it's better to add it and then we can set only SO_TYPE_TIDSCAN
> to the scan option of tid scan.
Because SO_TYPE_TIDSCAN is not currently used anywhere,
I thought it might be better to avoid adding a new flag.
However, as you said, this flag may be useful for a future tid scan
feature (like [1]).

Attached is the v4 patch. I re-added SO_TYPE_TIDSCAN to ScanOptions
and table_beginscan_tid, and removed the unnecessary SO_ALLOW_* flags.

Best regards,

[1]
https://www.postgresql.org/message-id/flat/CAKJS1f-%2BJJpm6B_NThUWzFv4007zAjObBXX1CBHE_bH9nOAvSw%40mail.gmail.com#1ae648acdc2df930b19218b6026135d3

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com


fix_tidscan_issues_v4.patch
Description: Binary data


[PATCH] Support built control file in PGXS VPATH builds

2020-02-06 Thread Craig Ringer
pgxs.mk assumes that if $(EXTENSION) is set, a file
$(EXTENSION).control must exist in the $(srcdir).

Extensions that need to support multiple Pg versions, multiple
variants of the extension, etc. may need to template their extension
control file. PGXS's assumption prevents those extensions from
supporting read-only source trees for true vpath builds.

A workaround is to ignore the EXTENSION field in PGXS and leave it
unset, then set MODULEDIR to the value you would've set EXTENSION to
and install your control file with DATA_built. But that's more than a
tad ugly.
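One concrete reading of that workaround, as an editor's sketch (the file names, the my_extension.control.in template, and the @PGVER@ substitution are all hypothetical):

```make
# Hypothetical workaround for releases without this fix: leave EXTENSION
# unset so pgxs.mk doesn't look for a control file in $(srcdir), and
# install the templated control file as built data into share/extension.
MODULEDIR  = extension
DATA_built = my_extension.control

my_extension.control: my_extension.control.in
	sed 's/@PGVER@/$(MAJORVERSION)/g' $< > $@

PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
```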

The attached patch fixes this by having PGXS resolve
$(EXTENSION).control along the VPATH.

Before:

/usr/bin/install: cannot stat
'/the/srcdir/path/the_extension.control': No such file or directory
make: *** [/the/postgres/path/lib/postgresql/pgxs/src/makefiles/pgxs.mk:229:
install] Error 1

After: no error, extension control file is found in builddir.

There's no reference to $(EXTENSION) outside pgxs.mk so this shouldn't
have any wider consequences.

The extension is responsible for the build rule for the control file,
like in DATA_built etc.

Please backpatch this build fix.

I could supply an alternative patch that follows PGXS's existing
convention of using a _built suffix, allowing the extension to specify
either EXTENSION or EXTENSION_built. For backward compat we'd have to
allow both to be set so long as they have the same value. Personally I
dislike this pattern and prefer to just resolve it in normal Make
fashion without caring if it's a built file or not, especially for the
EXTENSION var, so I'd prefer the first variant.

Frankly I'd rather we got rid of all the $(VAR) and $(VAR_built) stuff
entirely and just let make do proper vpath resolution. But I'm sure
it's that way for a reason...

I have a few other cleanup/fixup patches in the pipe for PGXS and
Makefile.global but I have to tidy them up a bit first. One to
eliminate undefined variables use, another to allow vpath directives
to be used instead of the big VPATH variable hammer. Keep an eye out.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
From b187e31cb4e27892b8ca71e5e3feb3de389ee1e1 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Fri, 7 Feb 2020 11:05:24 +0800
Subject: [PATCH] PGXS support for built control files

Teach PGXS how to find the extension control file via the vpath,
rather than assuming that it's always in the $(srcdir).

This allows extensions with more complex needs to template their extension
files without breaking vpath support.

Extensions can work around this issue on older PostgreSQL releases by:

* Setting MODULEDIR instead of EXTENSION
* Adding their extension control file to DATA_built
---
 src/makefiles/pgxs.mk | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk
index 037307c076..dcd3b3e48c 100644
--- a/src/makefiles/pgxs.mk
+++ b/src/makefiles/pgxs.mk
@@ -207,7 +207,6 @@ endef
 
 # end of HEADERS_* stuff
 
-
 all: $(PROGRAM) $(DATA_built) $(HEADER_allbuilt) $(SCRIPTS_built) $(addsuffix $(DLSUFFIX), $(MODULES)) $(addsuffix .control, $(EXTENSION))
 
 ifeq ($(with_llvm), yes)
@@ -223,11 +222,18 @@ include $(top_srcdir)/src/Makefile.shlib
 all: all-lib
 endif # MODULE_big
 
-
-install: all installdirs
+# Resolve the extension control file along the VPATH, so we can find both built
+# and static file control files.
+#
 ifneq (,$(EXTENSION))
-	$(INSTALL_DATA) $(addprefix $(srcdir)/, $(addsuffix .control, $(EXTENSION))) '$(DESTDIR)$(datadir)/extension/'
-endif # EXTENSION
+install_controlfile: $(addsuffix .control, $(EXTENSION)) | installdirs
+	$(INSTALL_DATA) $< '$(DESTDIR)$(datadir)/extension/'
+else
+install_controlfile: ;
+endif
+
+
+install: all installdirs install_controlfile
 ifneq (,$(DATA)$(DATA_built))
 	$(INSTALL_DATA) $(addprefix $(srcdir)/, $(DATA)) $(DATA_built) '$(DESTDIR)$(datadir)/$(datamoduledir)/'
 endif # DATA
-- 
2.24.1



Re: typos in comments and user docs

2020-02-06 Thread Justin Pryzby
On Fri, Feb 07, 2020 at 08:33:40AM +0530, Amit Kapila wrote:
> On Thu, Feb 6, 2020 at 7:26 PM Justin Pryzby  wrote:
> >
> > On Thu, Feb 06, 2020 at 04:43:18PM +0530, Amit Kapila wrote:
> > > On Thu, Feb 6, 2020 at 10:45 AM Michael Paquier  
> > > wrote:
> > > >
> > > > On Thu, Feb 06, 2020 at 08:47:14AM +0530, Amit Kapila wrote:
> > > > > Your changes look fine to me on the first read.  I will push this to
> > > > > HEAD unless there are any objections.   If we want them in
> > > > > back-branches, we might want to probably segregate the changes based
> > > > > on the branch until those apply.
> > > >
> > > > +1.  It would be nice to back-patch the user-visible changes in the
> > > > docs.
> > > >
> > >
> > > Fair enough, Justin, is it possible for you to segregate the changes
> > > that can be backpatched?
> >
> > Looks like the whole patch can be applied to master and v12 [0].
> 
> If we decide to backpatch, then why not try to backpatch as far as
> possible (till 9.5)?  If so, then it would be better to separate
> changes which can be backpatched till 9.5, if that is tedious, then
> maybe we can just back-patch (in 12) bloom.sgml change as a separate
> commit and rest commit it in HEAD only.  What do you think?

I don't think I was clear.  My original doc review patches were limited to
this:

On Sat, Mar 30, 2019 at 05:43:33PM -0500, Justin Pryzby wrote:
> I reviewed docs like this:
> git log -p remotes/origin/REL_11_STABLE..HEAD -- doc


STABLE..REL_12_STABLE.  So after a few minutes earlier today of cherry-pick, I
concluded that only bloom.sgml is applicable further back than v12.  Probably,
I either noticed that minor issue at the same time as nearby doc changes in
v12(?), or maybe noticed that issue later, independently of doc review, but
then tacked it on to the previous commit, for lack of any better place.

Justin




Re: typos in comments and user docs

2020-02-06 Thread Amit Kapila
On Thu, Feb 6, 2020 at 7:26 PM Justin Pryzby  wrote:
>
> On Thu, Feb 06, 2020 at 04:43:18PM +0530, Amit Kapila wrote:
> > On Thu, Feb 6, 2020 at 10:45 AM Michael Paquier  wrote:
> > >
> > > On Thu, Feb 06, 2020 at 08:47:14AM +0530, Amit Kapila wrote:
> > > > Your changes look fine to me on the first read.  I will push this to
> > > > HEAD unless there are any objections.   If we want them in
> > > > back-branches, we might want to probably segregate the changes based
> > > > on the branch until those apply.
> > >
> > > +1.  It would be nice to back-patch the user-visible changes in the
> > > docs.
> > >
> >
> > Fair enough, Justin, is it possible for you to segregate the changes
> > that can be backpatched?
>
> Looks like the whole patch can be applied to master and v12 [0].
>

If we decide to backpatch, then why not try to backpatch as far as
possible (till 9.5)?  If so, then it would be better to separate
changes which can be backpatched till 9.5, if that is tedious, then
maybe we can just back-patch (in 12) bloom.sgml change as a separate
commit and rest commit it in HEAD only.  What do you think?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Internal key management system

2020-02-06 Thread Andres Freund
Hi,

On 2020-02-07 11:18:29 +0900, Masahiko Sawada wrote:
> Another idea we discussed is to internally integrate pgcrypto with the
> key management system.

Perhaps this has already been discussed (I only briefly looked): I'd
strongly advise against having any new infrastructure depend on
pgcrypto. Its code quality imo is well below our standards and contains
serious red flags like very outdated copies of cryptography algorithm
implementations.  I think we should consider deprecating and removing
it, not expanding its use.  It certainly shouldn't be involved in any
potential disk encryption system at a later stage.

Greetings,

Andres Freund




Re: Internal key management system

2020-02-06 Thread Masahiko Sawada
On Fri, 7 Feb 2020 at 05:30, Robert Haas  wrote:
>
> On Mon, Feb 3, 2020 at 10:18 PM Masahiko Sawada
>  wrote:
> > > I'm lost. If pg_{en,de}crypt and pg_kmgr_unwrap are functions, what
> > > prevent users to:
> > >
> > >SELECT pg_kmgr_unwrap('');
> > >
> > > so as to recover the key, somehow in contradiction to "allows user to
> > > encrypt and decrypt data without knowing the actual key".
> >
> > I might be missing your point but the above '' is the wrapped key
> > wrapped by the master key stored in PostgreSQL server. So user doesn't
> > need to know the raw secret key to encrypt/decrypt the data. Even if a
> > malicious user gets '' they cannot know the actual secret key
> > without the master key. pg_kmgr_wrap and pg_kmgr_unwrap are functions
> > and it's possible for user to know the raw secret key by using
> > pg_kmgr_unwrap(). The master key stored in PostgreSQL server never be
> > revealed.
>
> I think I have the same confusion as Fabien. Isn't it bad if somebody
> just runs pg_kmgr_unwrap() and records the return value? Now they've
> stolen your encryption key, it seems.

This feature protects data from disk theft but cannot protect data
from attackers who are able to access the PostgreSQL server. In this
design the application side is still responsible for managing the
wrapped secret in order to protect it from attackers. This is the same
as when we use pgcrypto now. The difference is that the data is safe
even if attackers steal both the wrapped secret and the disk. The data
cannot be decrypted without the passphrase, which can be stored in
another key management system, or without access to the postgres
server. In other words, attackers can get the data only if they obtain
the wrapped secret managed by the application side, run
pg_kmgr_unwrap() to recover the secret, and also steal the disk.

Another idea we discussed is to internally integrate pgcrypto with the
key management system. That is, the key management system has one
master key and provides a C function to pass the master key to other
postgres modules. pgcrypto uses that function and provides new
encryption and decryption functions, something like
pg_encrypt_with_key() and pg_decrypt_with_key(), which
encrypt/decrypt the given data with the master key stored in the
database cluster. That way the user still doesn't have to know the
encryption key and we can protect data from disk theft. But the
downside would be that we have only one encryption key and that we
might need to change pgcrypto a lot.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Prevent pg_basebackup running as root

2020-02-06 Thread Andres Freund
Hi,

On 2020-02-01 18:34:33 +0900, Michael Paquier wrote:
> Seeing nothing, committed v2.

For reference: As a consequence of the discussion starting at
https://www.postgresql.org/message-id/20200205172259.GW3195%40tamriel.snowman.net
this patch has been reverted, at least for now.

Greetings,

Andres Freund




Re: table partitioning and access privileges

2020-02-06 Thread Amit Langote
On Fri, Feb 7, 2020 at 1:16 AM Fujii Masao  wrote:
> On 2020/02/03 14:26, Amit Langote wrote:
> > On Mon, Feb 3, 2020 at 2:07 PM Fujii Masao  
> > wrote:
> >> On 2020/02/03 11:05, Amit Langote wrote:
> >>> Okay.  How about the attached?
> >>
> >> Thanks for the patches! You added the note just after the description
> >> about row level security on inherited table, but isn't it better to
> >> add it before that? Attached patch does that. Thought?
> >
> > Yeah, that might be a better flow for that paragraph.
>
> Pushed! Thanks!

Thank you.

> >>> Maybe, we should also note the LOCK TABLE exception?
> >>
> >> Yes.
> >
> > Note that, unlike TRUNCATE, LOCK TABLE exception exists in HEAD too,
> > but maybe you're aware of that.
>
> Yes, so I will review your patch getting rid of
> LOCK TABLE exception.

Attached updated patch.

Regards,
Amit
From 77bf8ac02734b75f8a1e7c61bfc84e1b37189893 Mon Sep 17 00:00:00 2001
From: Amit Langote 
Date: Wed, 22 Jan 2020 15:43:41 +0900
Subject: [PATCH v2] Don't check child's LOCK privilege when locked recursively

---
 src/backend/commands/lockcmds.c  | 33 +++-
 src/test/regress/expected/lock.out   |  8 ++--
 src/test/regress/expected/privileges.out |  7 +++
 src/test/regress/sql/lock.sql|  7 ++-
 src/test/regress/sql/privileges.sql  |  6 ++
 5 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 329ab849c0..d8cafc42bb 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -28,7 +28,7 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
-static void LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, Oid 
userid);
+static void LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait);
 static AclResult LockTableAclCheck(Oid relid, LOCKMODE lockmode, Oid userid);
 static void RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid,

 Oid oldrelid, void *arg);
@@ -59,7 +59,7 @@ LockTableCommand(LockStmt *lockstmt)
if (get_rel_relkind(reloid) == RELKIND_VIEW)
LockViewRecurse(reloid, lockstmt->mode, 
lockstmt->nowait, NIL);
else if (recurse)
-   LockTableRecurse(reloid, lockstmt->mode, 
lockstmt->nowait, GetUserId());
+   LockTableRecurse(reloid, lockstmt->mode, 
lockstmt->nowait);
}
 }
 
@@ -108,35 +108,26 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid 
relid, Oid oldrelid,
 /*
  * Apply LOCK TABLE recursively over an inheritance tree
  *
- * We use find_inheritance_children not find_all_inheritors to avoid taking
- * locks far in advance of checking privileges.  This means we'll visit
- * multiply-inheriting children more than once, but that's no problem.
+ * This doesn't check permission to perform LOCK TABLE on the child tables,
+ * because getting here means that the user has permission to lock the
+ * parent which is enough.
  */
 static void
-LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, Oid userid)
+LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
 {
List   *children;
ListCell   *lc;
 
-   children = find_inheritance_children(reloid, NoLock);
+   children = find_all_inheritors(reloid, NoLock, NULL);
 
foreach(lc, children)
{
Oid childreloid = lfirst_oid(lc);
-   AclResult   aclresult;
 
-   /* Check permissions before acquiring the lock. */
-   aclresult = LockTableAclCheck(childreloid, lockmode, userid);
-   if (aclresult != ACLCHECK_OK)
-   {
-   char   *relname = get_rel_name(childreloid);
-
-   if (!relname)
-   continue;   /* child concurrently 
dropped, just skip it */
-   aclcheck_error(aclresult, 
get_relkind_objtype(get_rel_relkind(childreloid)), relname);
-   }
+   /* Parent already locked. */
+   if (childreloid == reloid)
+   continue;
 
-   /* We have enough rights to lock the relation; do so. */
if (!nowait)
LockRelationOid(childreloid, lockmode);
else if (!ConditionalLockRelationOid(childreloid, lockmode))
@@ -162,8 +153,6 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool 
nowait, Oid userid)
UnlockRelationOid(childreloid, lockmode);
continue;
}
-
-   LockTableRecurse(childreloid, lockmode, nowait, userid);
}
 }
 
@@ -241,7 +230,7 @@ LockViewRecurse_walker(Node *node, LockViewRecurse_context 
*context)
if (relkind == RELKIND_VIEW)
   

Re: Getting rid of some more lseek() calls

2020-02-06 Thread Andres Freund
Hi,

On 2020-02-07 12:38:27 +1300, Thomas Munro wrote:
> Continuing the work done in commits 0dc8ead4 and c24dcd0c, here are a
> few more places where we could throw away some code by switching to
> pg_pread() and pg_pwrite().

Nice.



> From 5723976510f30385385628758d7118042c4e4bf6 Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Fri, 7 Feb 2020 12:04:43 +1300
> Subject: [PATCH 1/3] Use pg_pread() and pg_pwrite() in slru.c.
> 
> This avoids lseek() system calls at every SLRU I/O, as was
> done for relation files in commit c24dcd0c.
> ---
>  src/backend/access/transam/slru.c | 25 -
>  1 file changed, 4 insertions(+), 21 deletions(-)
> 
> diff --git a/src/backend/access/transam/slru.c 
> b/src/backend/access/transam/slru.c
> index d5b7a08f73..f9efb22311 100644
> --- a/src/backend/access/transam/slru.c
> +++ b/src/backend/access/transam/slru.c
> @@ -646,7 +646,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
>   SlruShared  shared = ctl->shared;
>   int segno = pageno / SLRU_PAGES_PER_SEGMENT;
>   int rpageno = pageno % SLRU_PAGES_PER_SEGMENT;
> - int offset = rpageno * BLCKSZ;
> + off_t   offset = rpageno * BLCKSZ;
>   charpath[MAXPGPATH];
>   int fd;
>  
> @@ -676,17 +676,9 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
>   return true;
>   }
>  
> - if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
> - {
> - slru_errcause = SLRU_SEEK_FAILED;
> - slru_errno = errno;
> - CloseTransientFile(fd);
> - return false;
> - }
> -
>   errno = 0;
>   pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
> - if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
> + if (pg_pread(fd, shared->page_buffer[slotno], BLCKSZ, offset) != BLCKSZ)
>   {
>   pgstat_report_wait_end();
>   slru_errcause = SLRU_READ_FAILED;
> @@ -726,7 +718,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int 
> slotno, SlruFlush fdata)
>   SlruShared  shared = ctl->shared;
>   int segno = pageno / SLRU_PAGES_PER_SEGMENT;
>   int rpageno = pageno % SLRU_PAGES_PER_SEGMENT;
> - int offset = rpageno * BLCKSZ;
> + off_t   offset = rpageno * BLCKSZ;
>   charpath[MAXPGPATH];
>   int fd = -1;
>  
> @@ -836,18 +828,9 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int 
> slotno, SlruFlush fdata)
>   }
>   }
>  
> - if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
> - {
> - slru_errcause = SLRU_SEEK_FAILED;
> - slru_errno = errno;
> - if (!fdata)
> - CloseTransientFile(fd);
> - return false;
> - }
> -
>   errno = 0;
>   pgstat_report_wait_start(WAIT_EVENT_SLRU_WRITE);
> - if (write(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
> + if (pg_pwrite(fd, shared->page_buffer[slotno], BLCKSZ, offset) != 
> BLCKSZ)
>   {
>   pgstat_report_wait_end();
>   /* if write didn't set errno, assume problem is no disk space */
> -- 
> 2.23.0

Hm. This still leaves us with one source of SLRU_SEEK_FAILED. And that's
really just for getting the file size. Should we rename that?

Perhaps we should just replace lseek(SEEK_END) with fstat()? Or at least
one wrapper function for getting the size? It seems ugly to change fd
positions just for the purpose of getting file sizes (and also implies
more kernel level locking, I believe).


> From 95d7187172f2ac6c08dc92f1043e1662b0dab4ac Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Fri, 7 Feb 2020 12:04:57 +1300
> Subject: [PATCH 2/3] Use pg_pwrite() in rewriteheap.c.
> 
> This removes an lseek() call.
> ---
>  src/backend/access/heap/rewriteheap.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/src/backend/access/heap/rewriteheap.c 
> b/src/backend/access/heap/rewriteheap.c
> index 5869922ff8..9c29bc0e0f 100644
> --- a/src/backend/access/heap/rewriteheap.c
> +++ b/src/backend/access/heap/rewriteheap.c
> @@ -1156,13 +1156,6 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
>   path, (uint32) xlrec->offset)));
>   pgstat_report_wait_end();
>  
> - /* now seek to the position we want to write our data to */
> - if (lseek(fd, xlrec->offset, SEEK_SET) != xlrec->offset)
> - ereport(ERROR,
> - (errcode_for_file_access(),
> -  errmsg("could not seek to end of file \"%s\": 
> %m",
> - path)));
> -
>   data = XLogRecGetData(r) + sizeof(*xlrec);
>  
>   len = xlrec->num_mappings * sizeof(LogicalRewriteMappingData);
> @@ -1170,7 +1163,7 @@ 

Re: Improve errors when setting incorrect bounds for SSL protocols

2020-02-06 Thread Michael Paquier
On Thu, Feb 06, 2020 at 11:30:40PM +0100, Daniel Gustafsson wrote:
> Or change to the v1 patch in this thread, which avoids the problem by doing it
> in the OpenSSL code.  It's a shame to have generic TLS functionality be 
> OpenSSL
> specific when everything else TLS has been abstracted, but not working is
> clearly a worse option.

The v1 would work just fine considering that, as the code would be
invoked in a context where all GUCs are already loaded.  That's too
late before the release though, so I have reverted 41aadee, and
attached is a new patch to consider with improvements compared to v1
mainly in the error messages.
--
Michael
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 18d3fff068..5b772fd58c 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -68,8 +68,7 @@ static bool SSL_initialized = false;
 static bool dummy_ssl_passwd_cb_called = false;
 static bool ssl_is_server_start;
 
-static int	ssl_protocol_version_to_openssl(int v, const char *guc_name,
-			int loglevel);
+static int	ssl_protocol_version_to_openssl(int v);
 
 /*  */
 /*		 Public interface		*/
@@ -80,6 +79,8 @@ be_tls_init(bool isServerStart)
 {
 	STACK_OF(X509_NAME) *root_cert_list = NULL;
 	SSL_CTX*context;
+	int			ssl_ver_min = -1;
+	int			ssl_ver_max = -1;
 
 	/* This stuff need be done only once. */
 	if (!SSL_initialized)
@@ -188,13 +189,19 @@ be_tls_init(bool isServerStart)
 
 	if (ssl_min_protocol_version)
 	{
-		int			ssl_ver = ssl_protocol_version_to_openssl(ssl_min_protocol_version,
-			  "ssl_min_protocol_version",
-			  isServerStart ? FATAL : LOG);
+		ssl_ver_min = ssl_protocol_version_to_openssl(ssl_min_protocol_version);
 
-		if (ssl_ver == -1)
+		if (ssl_ver_min == -1)
+		{
+			ereport(isServerStart ? FATAL : LOG,
+	(errmsg("%s setting %s not supported by this build",
+			"ssl_min_protocol_version",
+			GetConfigOption("ssl_min_protocol_version",
+			false, false;
 			goto error;
-		if (!SSL_CTX_set_min_proto_version(context, ssl_ver))
+		}
+
+		if (!SSL_CTX_set_min_proto_version(context, ssl_ver_min))
 		{
 			ereport(isServerStart ? FATAL : LOG,
 	(errmsg("could not set minimum SSL protocol version")));
@@ -204,13 +211,19 @@ be_tls_init(bool isServerStart)
 
 	if (ssl_max_protocol_version)
 	{
-		int			ssl_ver = ssl_protocol_version_to_openssl(ssl_max_protocol_version,
-			  "ssl_max_protocol_version",
-			  isServerStart ? FATAL : LOG);
+		ssl_ver_max = ssl_protocol_version_to_openssl(ssl_max_protocol_version);
 
-		if (ssl_ver == -1)
+		if (ssl_ver_max == -1)
+		{
+			ereport(isServerStart ? FATAL : LOG,
+	(errmsg("%s setting %s not supported by this build",
+			"ssl_max_protocol_version",
+			GetConfigOption("ssl_max_protocol_version",
+			false, false;
 			goto error;
-		if (!SSL_CTX_set_max_proto_version(context, ssl_ver))
+		}
+
+		if (!SSL_CTX_set_max_proto_version(context, ssl_ver_max))
 		{
 			ereport(isServerStart ? FATAL : LOG,
 	(errmsg("could not set maximum SSL protocol version")));
@@ -218,6 +231,23 @@ be_tls_init(bool isServerStart)
 		}
 	}
 
+	/* Check compatibility of min/max protocols */
+	if (ssl_min_protocol_version &&
+		ssl_max_protocol_version)
+	{
+		/*
+		 * No need to check for invalid values (-1) for each protocol number
+		 * as the code above would have already generated an error.
+		 */
+		if (ssl_ver_min > ssl_ver_max)
+			ereport(isServerStart ? FATAL : LOG,
+	(errmsg("could not set SSL protocol version range"),
+	 errdetail("\"%s\" cannot be higher than \"%s\"",
+			   "ssl_min_protocol_version",
+			   "ssl_max_protocol_version")));
+		goto error;
+	}
+
 	/* disallow SSL session tickets */
 	SSL_CTX_set_options(context, SSL_OP_NO_TICKET);
 
@@ -1271,15 +1301,14 @@ X509_NAME_to_cstring(X509_NAME *name)
  * guc.c independent of OpenSSL availability and version.
  *
  * If a version is passed that is not supported by the current OpenSSL
- * version, then we log with the given loglevel and return (if we return) -1.
- * If a nonnegative value is returned, subsequent code can assume it's working
- * with a supported version.
+ * version, then we return -1.  If a nonnegative value is returned,
+ * subsequent code can assume it's working with a supported version.
  *
  * Note: this is rather similar to libpq's routine in fe-secure-openssl.c,
  * so make sure to update both routines if changing this one.
  */
 static int
-ssl_protocol_version_to_openssl(int v, const char *guc_name, int loglevel)
+ssl_protocol_version_to_openssl(int v)
 {
 	switch (v)
 	{
@@ -1307,9 +1336,5 @@ ssl_protocol_version_to_openssl(int v, const char *guc_name, int loglevel)
 #endif
 	}
 
-	ereport(loglevel,
-			(errmsg("%s setting %s not supported by this build",
-	guc_name,
-	GetConfigOption(guc_name, false, 

Re: typedef SegmentNumber

2020-02-06 Thread Thomas Munro
On Fri, Feb 7, 2020 at 1:00 PM Thomas Munro  wrote:
> [1] https://www.postgresql.org/message-id/flat/

That should be:

[1] 
https://www.postgresql.org/message-id/flat/20190429130321.GA14886%40alvherre.pgsql#7e4ed274b6552d6c5e18a069579321c9




typedef SegmentNumber

2020-02-06 Thread Thomas Munro
Hello,

Here's a rebase of a refactoring patch that got lost behind a filing
cabinet on another thread even though there seemed to be some
agreement that we probably want something like this[1].  It introduces
a new type SegmentNumber, instead of using BlockNumber to represent
segment numbers.

[1] https://www.postgresql.org/message-id/flat/


0001-Introduce-SegmentNumber-typedef-for-relation-segment.patch
Description: Binary data


Getting rid of some more lseek() calls

2020-02-06 Thread Thomas Munro
Hello,

Continuing the work done in commits 0dc8ead4 and c24dcd0c, here are a
few more places where we could throw away some code by switching to
pg_pread() and pg_pwrite().


0001-Use-pg_pread-and-pg_pwrite-in-slru.c.patch
Description: Binary data


0002-Use-pg_pwrite-in-rewriteheap.c.patch
Description: Binary data


0003-Use-pg_pwrite-in-walreceiver.c.patch
Description: Binary data


Re: Improve errors when setting incorrect bounds for SSL protocols

2020-02-06 Thread Daniel Gustafsson
> On 6 Feb 2020, at 20:04, Tom Lane  wrote:

> I think this should be reverted.  Perhaps there's a way to do it without
> these problems, but we failed to find one in the past.

Or change to the v1 patch in this thread, which avoids the problem by doing it
in the OpenSSL code.  It's a shame to have generic TLS functionality be OpenSSL
specific when everything else TLS has been abstracted, but not working is
clearly a worse option.

cheers ./daniel



Re: Memory-Bounded Hash Aggregation

2020-02-06 Thread Peter Geoghegan
On Thu, Feb 6, 2020 at 12:01 AM Heikki Linnakangas  wrote:
> It wasn't strictly required for what I was hacking on then. IIRC it
> would have saved some memory during sorting, but Peter G felt that it
> wasn't worth the trouble, because he made some other changes around the
> same time, which made it less important

FWIW, I am not opposed to the patch at all. I would be quite happy to
get rid of a bunch of code in tuplesort.c that apparently isn't really
necessary anymore (by removing polyphase merge).

All I meant back in 2016 was that "pausing" tapes was orthogonal to my
own idea of capping the number of tapes that could be used by
tuplesort.c. The 500 MAXORDER cap thing hadn't been committed yet when
I explained this in the message you linked to, and it wasn't clear if
it would ever be committed (Robert committed it about a month
afterwards, as it turned out). Capping the size of the merge heap made
marginal sorts faster overall, since a more cache efficient merge heap
more than made up for having more than one merge pass overall (thanks
to numerous optimizations added in 2016, some of which were your
work).

I also said that the absolute overhead of tapes was not that important
back in 2016. Using many tapes within tuplesort.c can never happen
anyway (with the 500 MAXORDER cap). Maybe the use of logtape.c by hash
aggregate changes the picture there now. Even if it doesn't, I still
think that your patch is a good idea.

-- 
Peter Geoghegan




Re: Internal key management system

2020-02-06 Thread Cary Huang
Since the user does not need to know the master secret key used to cipher the 
data, I don't think we should expose the pg_kmgr_unwrap("") SQL function to 
the user at all.

The wrapped key "" is stored in the control data, so a malicious user could 
obtain it and steal the key by running SELECT pg_kmgr_unwrap("").

Even if the user is righteous, it may not be very straightforward for that 
user to obtain the wrapped key "" to use in the unwrap function.



The pg_kmgr_(un)wrap functions are under discussion because the encrypt and 
decrypt functions require the master secret key as an input argument.

I would suggest using the cluster passphrase as input instead of the master 
key, so the user does not have to obtain the master key using 
pg_kmgr_unwrap("") in order to use the encrypt and decrypt functions.

The passphrase is in fact not stored anywhere in the system, and we have to 
be careful that it is not shown in any activity log.



so instead of:

--

INSERT INTO tbl VALUES (pg_encrypt('user data', pg_kmgr_unwrap('x')));

SELECT pg_decrypt(secret_column, pg_kmgr_unwrap('x')) FROM tbl;

it would become:

--

INSERT INTO tbl VALUES (pg_encrypt('user data', 'cluster_pass_phrase'));

SELECT pg_decrypt(secret_column, 'cluster_pass_phrase') FROM tbl;



pg_decrypt will then have to:

1. derive the cluster passphrase into a KEK and an HMAC key
2. verify the passphrase by comparing the MAC
3. unwrap the key - Sehrope suggests a good approach to make the wrap/unwrap 
   functions more secure by adding MAC verification and a random IV instead of 
   the default. I think it is good.
4. decrypt the data
5. return



Using the passphrase instead of the master key in the encrypt and decrypt 
functions will also make front-end tool integration simpler, as the front-end 
tool does not need to know the master key either, so it does not need to 
derive the KEK or unwrap the key, etc.

Not sure if you guys agree?



Thanks!



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca






 On Thu, 06 Feb 2020 12:30:02 -0800 Robert Haas  
wrote 



On Mon, Feb 3, 2020 at 10:18 PM Masahiko Sawada 
 wrote: 
> > I'm lost. If pg_{en,de}crypt and pg_kmgr_unwrap are functions, what 
> > prevent users to: 
> > 
> >SELECT pg_kmgr_unwrap(''); 
> > 
> > so as to recover the key, somehow in contradiction to "allows user to 
> > encrypt and decrypt data without knowing the actual key". 
> 
> I might be missing your point but the above '' is the wrapped key 
> wrapped by the master key stored in PostgreSQL server. So user doesn't 
> need to know the raw secret key to encrypt/decrypt the data. Even if a 
> malicious user gets '' they cannot know the actual secret key 
> without the master key. pg_kmgr_wrap and pg_kmgr_unwrap are functions 
> and it's possible for user to know the raw secret key by using 
> pg_kmgr_unwrap(). The master key stored in PostgreSQL server never be 
> revealed. 
 
I think I have the same confusion as Fabien. Isn't it bad if somebody 
just runs pg_kmgr_unwrap() and records the return value? Now they've 
stolen your encryption key, it seems. 
 
-- 
Robert Haas 
EnterpriseDB: http://www.enterprisedb.com 
The Enterprise PostgreSQL Company

Re: relcache sometimes initially ignores has_generated_stored

2020-02-06 Thread Andres Freund
On 2020-02-06 21:29:58 +0100, Peter Eisentraut wrote:
> On 2020-01-15 19:11, Andres Freund wrote:
> > /*
> >  * Set up constraint/default info
> >  */
> > if (has_not_null || ndef > 0 ||
> > attrmiss || relation->rd_rel->relchecks)
> > test, i.e. there are no defaults.
> 
> > It does still strike me as not great that we can get a different
> > relcache entry, even if transient, depending on whether there are other
> > reasons to create a TupleConstr. Say a NOT NULL column.
> > 
> > I'm inclined to think we should just also check has_generated_stored in
> > the if quoted above?
> 
> Fixed that way.

Thanks.




Re: Internal key management system

2020-02-06 Thread Robert Haas
On Mon, Feb 3, 2020 at 10:18 PM Masahiko Sawada
 wrote:
> > I'm lost. If pg_{en,de}crypt and pg_kmgr_unwrap are functions, what
> > prevent users to:
> >
> >SELECT pg_kmgr_unwrap('');
> >
> > so as to recover the key, somehow in contradiction to "allows user to
> > encrypt and decrypt data without knowing the actual key".
>
> I might be missing your point but the above '' is the wrapped key
> wrapped by the master key stored in PostgreSQL server. So user doesn't
> need to know the raw secret key to encrypt/decrypt the data. Even if a
> malicious user gets '' they cannot know the actual secret key
> without the master key. pg_kmgr_wrap and pg_kmgr_unwrap are functions
> and it's possible for user to know the raw secret key by using
> pg_kmgr_unwrap(). The master key stored in PostgreSQL server never be
> revealed.

I think I have the same confusion as Fabien. Isn't it bad if somebody
just runs pg_kmgr_unwrap() and records the return value? Now they've
stolen your encryption key, it seems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: relcache sometimes initially ignores has_generated_stored

2020-02-06 Thread Peter Eisentraut

On 2020-01-15 19:11, Andres Freund wrote:

> /*
>  * Set up constraint/default info
>  */
> if (has_not_null || ndef > 0 ||
>     attrmiss || relation->rd_rel->relchecks)
> test, i.e. there are no defaults.
>
> It does still strike me as not great that we can get a different
> relcache entry, even if transient, depending on whether there are other
> reasons to create a TupleConstr. Say a NOT NULL column.
>
> I'm inclined to think we should just also check has_generated_stored in
> the if quoted above?


Fixed that way.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-02-06 Thread Robert Haas
On Wed, Feb 5, 2020 at 9:32 AM Julien Rouhaud  wrote:
> There's also the possibility to reserve 1 bit of the hash to know if
> this is a utility command or not, although I don't recall right now
> all the possible issues with utility commands and some special
> handling of them.  I'll work on it before the next commitfest.

FWIW, I don't really see why it would be bad to have 0 mean that
"there's no query ID for some reason" without caring whether that's
because the current statement is a utility statement or because
there's no statement in progress at all or whatever else. The user
probably doesn't need our help to distinguish between "no statement"
and "utility statement", right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Dumping/restoring fails on inherited generated column

2020-02-06 Thread Peter Eisentraut

On 2020-02-03 20:32, Tom Lane wrote:
> Things are evidently also going wrong for "gtest1_1".  In that case
> the generated property is inherited from the parent gtest1, so we
> shouldn't be emitting anything ... how come the patch fails to
> make it do that?

This is fixed by the attached new patch.  It needed an additional check 
in flagInhAttrs().



This is showing us at least two distinct problems.  Now as for
"gtest30_1", what we have is that in the parent table "gtest30", column b
exists but it has no default; the generated property is only added
at the child table gtest30_1.  So we need to emit ALTER COLUMN SET
GENERATED ALWAYS for gtest30_1.b.  HEAD is already doing the wrong
thing there (it's emitting the expression, but as a plain default
not GENERATED).  And this patch makes it emit nothing, even worse.
I think the key point here is that "attislocal" refers to whether the
column itself is locally defined, not to whether its default is.


This is a bit of a mess.  Let me explain my thinking on generated 
columns versus inheritance.


If a parent table has a generated column, then any inherited column must 
also be generated and use the same expression.  (Otherwise querying the 
parent table would produce results that are inconsistent with the 
generation expression if the rows come from the child table.)


If a parent table has a column that is not generated, then I think it 
would be semantically sound if a child table had that same column but 
generated.  However, I think it would be very tricky to support this 
correctly, and it doesn't seem useful enough, so I'd rather not do it.


That's what the gtest30_1 case above shows.  It's not even clear whether 
it's possible to dump this correctly in all cases because the syntax 
that you allude to "turn this existing column into a generated column" 
does not exist.


Note that the gtest30 test case is new in master.  I'm a bit confused 
why things were done that way, and I'll need to revisit this.  I've also 
found a few more issues with how certain combinations of DDL can create 
similar situations that arguably don't make sense, and I'll continue to 
look into those.  Basically, my contention is that gtest30_1 should not 
be allowed to exist like that.


However, the pg_dump issue is separate from those because it affects a 
case that is clearly legitimate.  So assuming that we end up agreeing on 
a version of the attached pg_dump patch, I would like to get that into 
the next minor releases and then investigate the other issues separately.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 76249c9a64caaaf4c3206d5ca4ff72cc186dc416 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 6 Feb 2020 14:50:01 +0100
Subject: [PATCH v3] pg_dump: Fix dumping of inherited generated columns

Generation expressions of generated columns are always inherited, so
there is no need to set them separately in child tables, and there is
no syntax to do so either.  The code previously used the code paths
for the handling of default values, for which different rules apply;
in particular it might want to set a default value explicitly for an
inherited column.  For generated columns, just skip all that.

Discussion: 
https://www.postgresql.org/message-id/flat/15830.1575468847%40sss.pgh.pa.us
---
 src/bin/pg_dump/common.c  | 8 
 src/bin/pg_dump/pg_dump.c | 9 +
 2 files changed, 17 insertions(+)

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 3549f7bc08..170c87823d 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -476,6 +476,14 @@ flagInhAttrs(DumpOptions *dopt, TableInfo *tblinfo, int numTables)
if (tbinfo->attisdropped[j])
continue;
 
+		/*
+		 * Skip generated columns; it is not possible for an inherited
+		 * column to have a different generation expression than the
+		 * parent.
+		 */
+		if (tbinfo->attgenerated[j])
+			continue;
+
foundNotNull = false;
foundDefault = false;
for (k = 0; k < numParents; k++)
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 33e58fa287..5fcb89093e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8492,6 +8492,15 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
if (tbinfo->attisdropped[adnum - 1])
continue;
 
+			/*
+			 * Ignore generated columns on child tables unless they have a
+			 * local definition.  Generation expressions are always
+			 * 

Re: Improve errors when setting incorrect bounds for SSL protocols

2020-02-06 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Jan 16, 2020 at 10:00:52AM +0900, Michael Paquier wrote:
>> Thanks for the review.  Let's wait a couple of days to see if others
>> have objections or more comments about this patch, but I'd like to
>> fix the issue and backpatch down to 12 where the parameters have been
>> introduced.

> And committed.

I just happened to look at this patch while working on the release notes.
I think this is a bad idea and very probably creates worse problems than
it fixes.  As we have learned painfully in the past, you can't have GUC
check or assign hooks that look at other GUC variables, because that
creates order-of-operations problems.  If a postgresql.conf update is
trying to change both values (hardly an unlikely scenario, for this
pair of variables) then the checks are going to be comparing against the
old values of the other variables, leading to either incorrect rejections
of valid states or incorrect acceptances of invalid states.  It's pure
accident that the particular cases tested in the regression tests behave
sanely.
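A sketch of the ordering hazard Tom describes (the protocol values are made up for illustration):

```
# postgresql.conf before the edit:
#   ssl_min_protocol_version = 'TLSv1'
#   ssl_max_protocol_version = 'TLSv1.1'
# postgresql.conf after one edit changing both:
ssl_min_protocol_version = 'TLSv1.2'
ssl_max_protocol_version = 'TLSv1.3'
# If the check hook for the new minimum runs first, it compares 'TLSv1.2'
# against the still-loaded old maximum 'TLSv1.1' and rejects a perfectly
# valid configuration; the opposite ordering can accept an invalid one.
```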

I think this should be reverted.  Perhaps there's a way to do it without
these problems, but we failed to find one in the past.

regards, tom lane




Re: Make ringbuffer threshold and ringbuffer sizes configurable?

2020-02-06 Thread Robert Haas
On Thu, Feb 6, 2020 at 1:52 PM Andres Freund  wrote:
> I admit it's awkward. I think we possibly could still just make the size
> displayed in bytes in either case, reducing that issue a *bit*?

That seems like it makes it even more confusing, honestly.

> > It'd sort of be nicer to have two separate GUCs,
> > one measured as a multiple and the other measured in bytes, but maybe
> > that's just exchanging one form of confusion for another.
>
> We don't really have a good way to deal with GUCs where setting one
> precludes the other, especially when those GUCs should be changable at
> runtime :(.

It can work if one of the GUCs is king, and the other one takes effect
only if the first one is set to some value that means "ignore me". We
have a number of examples of that, e.g. autovacuum_work_mem,
autovacuum_vacuum_cost_limit.
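That pattern can be sketched with postgresql.conf entries (shown as configuration rather than SET, since autovacuum_work_mem cannot be changed per-session):

```
# "King GUC" pattern: -1 means "ignore me", so the other setting governs.
maintenance_work_mem = '256MB'
autovacuum_work_mem = -1          # autovacuum workers fall back to 256MB
#autovacuum_work_mem = '64MB'     # if set, this value wins for autovacuum
```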

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Make ringbuffer threshold and ringbuffer sizes configurable?

2020-02-06 Thread Andres Freund
Hi,

On 2020-02-06 13:15:04 -0500, Robert Haas wrote:
> On Wed, Feb 5, 2020 at 11:00 PM Andres Freund  wrote:
> > I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
> > would be -4.0, but could be also be set to an absolute 4GB (converted to
> > pages). Probably would want a GUC show function that displays
> > proportional values in a nice way.
> 
> I think this is kind of awkward given that our GUC system attributes
> units to everything.

I admit it's awkward. I think we possibly could still just make the size
displayed in bytes in either case, reducing that issue a *bit*?


> It'd sort of be nicer to have two separate GUCs,
> one measured as a multiple and the other measured in bytes, but maybe
> that's just exchanging one form of confusion for another.

We don't really have a good way to deal with GUCs where setting one
precludes the other, especially when those GUCs should be changable at
runtime :(.

Greetings,

Andres Freund




Re: In PG12, query with float calculations is slower than PG11

2020-02-06 Thread Andres Freund
Hi,

On 2020-02-06 11:03:51 -0500, Tom Lane wrote:
> Andres seems to be of the opinion that the compiler should be willing
> to ignore the semantic requirements of the C standard in order
> to rearrange the code back into the cheaper order.  That sounds like
> wishful thinking to me ... even if it actually works on his compiler,
> it certainly isn't going to work for everyone.

Sorry, but, uh, what are you talking about?  Please tell me which single
standards violation I'm advocating for?

I was asking about the inlining bit because the first email of the topic
explained that as the problem, which I don't believe can be the full
explanation - and it turns out it isn't. As Amit Langote's followup
email explained, there's the whole issue of the order of checks being
inverted - which is clearly bad. And wholly unrelated to inlining.

And I asked about __isinf() being used because there are issues with
accidentally ending up with the non-intrinsic version of isinf() when
not using gcc, due to badly written standard library headers.


> The patch looks unduly invasive to me, but I think that it might be
> right that we should go back to a macro-based implementation, because
> otherwise we don't have a good way to be certain that the function
> parameter won't get evaluated first.

I'd first like to see some actual evidence of this being a problem,
rather than just the order of the checks.


> (Another reason to do so is so that the file/line numbers generated
> for the error reports go back to being at least a little bit useful.)
> We could use local variables within the macro to avoid double evals,
> if anyone thinks that's actually important --- I don't.

I don't think that's necessarily a good idea. In fact, I think we should
probably do the exact opposite, and move the error messages further out
of line. All these otherwise very small functions having their own
ereports makes them much bigger. Our low code density, and the resulting
rate of iTLB misses, is a pretty significant cost (cf. [1]).

master:
   text    data     bss     dec     hex filename
  36124      44      65   36233    8d89 float.o
error messages moved out of line:
   text    data     bss     dec     hex filename
  32883      44      65   32992    80e0 float.o

Taking int4pl as an example - solely because it is simpler assembly to
look at - we get:

master:
   0x004ac190 <+0>:	mov    0x30(%rdi),%rax
   0x004ac194 <+4>:	add    0x20(%rdi),%eax
   0x004ac197 <+7>:	jo     0x4ac19c
   0x004ac199 <+9>:	cltq
   0x004ac19b <+11>:	retq
   0x004ac19c <+12>:	push   %rbp
   0x004ac19d <+13>:	lea    0x1a02c4(%rip),%rsi        # 0x64c468
   0x004ac1a4 <+20>:	xor    %r8d,%r8d
   0x004ac1a7 <+23>:	lea    0x265da1(%rip),%rcx        # 0x711f4f <__func__.26823>
   0x004ac1ae <+30>:	mov    $0x30b,%edx
   0x004ac1b3 <+35>:	mov    $0x14,%edi
   0x004ac1b8 <+40>:	callq  0x586060
   0x004ac1bd <+45>:	lea    0x147e0e(%rip),%rdi        # 0x5f3fd2
   0x004ac1c4 <+52>:	xor    %eax,%eax
   0x004ac1c6 <+54>:	callq  0x5896a0
   0x004ac1cb <+59>:	mov    $0x382,%edi
   0x004ac1d0 <+64>:	mov    %eax,%ebp
   0x004ac1d2 <+66>:	callq  0x589540
   0x004ac1d7 <+71>:	mov    %eax,%edi
   0x004ac1d9 <+73>:	mov    %ebp,%esi
   0x004ac1db <+75>:	xor    %eax,%eax
   0x004ac1dd <+77>:	callq  0x588fb0

out-of-line error:
   0x004b04e0 <+0>:	mov    0x30(%rdi),%rax
   0x004b04e4 <+4>:	add    0x20(%rdi),%eax
   0x004b04e7 <+7>:	jo     0x4b04ec
   0x004b04e9 <+9>:	cltq
   0x004b04eb <+11>:	retq
   0x004b04ec <+12>:	push   %rax
   0x004b04ed <+13>:	callq  0x115e17

With the out-of-line error, we can fit multiple of these functions into one
cache line. With the inline error, not even one.

Greetings,

Andres Freund

[1] https://twitter.com/AndresFreundTec/status/1214305610172289024




Re: In PG12, query with float calculations is slower than PG11

2020-02-06 Thread Robert Haas
On Thu, Feb 6, 2020 at 11:04 AM Tom Lane  wrote:
> So it appears to me that what commit 6bf0bc842 did in this area was
> not just wrong, but disastrously so.  Before that, we had a macro that
> evaluated isinf(val) before it evaluated the inf_is_valid condition.
> Now we have check_float[48]_val which do it the other way around.
> That would be okay if the inf_is_valid condition were cheap to
> evaluate, but in common code paths it's actually twice as expensive
> as isinf().

Well, if the previous coding was a deliberate attempt to dodge this
performance issue, the evidence seems to be well-concealed. Neither
the comments for that macro nor the related commit messages make any
mention of it. When subtle things like this are performance-critical,
good comments are pretty critical, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Make ringbuffer threshold and ringbuffer sizes configurable?

2020-02-06 Thread Robert Haas
On Wed, Feb 5, 2020 at 11:00 PM Andres Freund  wrote:
> I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
> would be -4.0, but could be also be set to an absolute 4GB (converted to
> pages). Probably would want a GUC show function that displays
> proportional values in a nice way.

I think this is kind of awkward given that our GUC system attributes
units to everything. It'd sort of be nicer to have two separate GUCs,
one measured as a multiple and the other measured in bytes, but maybe
that's just exchanging one form of confusion for another.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Make ringbuffer threshold and ringbuffer sizes configurable?

2020-02-06 Thread Andres Freund
Hi,

On 2020-02-06 07:18:00 +0100, Laurenz Albe wrote:
> On Wed, 2020-02-05 at 20:00 -0800, Andres Freund wrote:
> > The ringbuffers we use for seqscans, vacuum, copy etc can cause very
> > drastic slowdowns (see e.g. [1]), an can cause some workloads to
> > practically never end up utilizing shared buffers. ETL workloads
> > e.g. regularly fight with that problem.
> > 
> > I think it would make sense to have seqscan_ringbuffer_threshold,
> > {bulkread,bulkwrite,vacuum}_ringbuffer_size. I think they often sensibly
> > are set in proportion of shared_buffers, so I suggest defining them as
> > floats, where negative values divide shared_buffers, whereas positive
> > values are absolute sizes, and 0 disables the use of ringbuffers.
> 
> Sounds reasonable.

> I feel that it should be as few GUCs as possible, so I think that
> having one per type of operation might be too granular.

They already are set to different sizes, so I don't really see how
that's something doable. The relevant bits of code are:

BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
BufferAccessStrategy strategy;
int ring_size;

/*
 * Select ring size to use.  See buffer/README for rationales.
 *
 * Note: if you change the ring size for BAS_BULKREAD, see also
 * SYNC_SCAN_REPORT_INTERVAL in access/heap/syncscan.c.
 */
switch (btype)
{
case BAS_NORMAL:
			/* if someone asks for NORMAL, just give 'em a "default" object */
return NULL;

case BAS_BULKREAD:
ring_size = 256 * 1024 / BLCKSZ;
break;
case BAS_BULKWRITE:
ring_size = 16 * 1024 * 1024 / BLCKSZ;
break;
case BAS_VACUUM:
ring_size = 256 * 1024 / BLCKSZ;
break;

and


/*
 * If the table is large relative to NBuffers, use a bulk-read access
 * strategy and enable synchronized scanning (see syncscan.c).  Although
 * the thresholds for these features could be different, we make them the
 * same so that there are only two behaviors to tune rather than four.
 * (However, some callers need to be able to disable one or both of these
 * behaviors, independently of the size of the table; also there is a GUC
 * variable that can disable synchronized scanning.)
 *
 * Note that table_block_parallelscan_initialize has a very similar test;
 * if you change this, consider changing that one, too.
 */
if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
scan->rs_nblocks > NBuffers / 4)
{
allow_strat = (scan->rs_base.rs_flags & SO_ALLOW_STRAT) != 0;
allow_sync = (scan->rs_base.rs_flags & SO_ALLOW_SYNC) != 0;
}
else
allow_strat = allow_sync = false;


> This should of course also be a storage parameter that can be
> set per table.

I honestly don't quite get that. What precisely is this addressing? I
mean fine, I can add that, but it's sufficiently more complicated than
the GUCs, and I don't really foresee that being particularly useful to
tune on a per-table basis. A lot of the reason behind having the
ringbuffers is about managing the whole-system impact, rather than
making individual tables fast or slow.

- Andres




Re: ALTER tbl rewrite loses CLUSTER ON index

2020-02-06 Thread Alvaro Herrera
On 2020-Feb-06, Justin Pryzby wrote:

> I wondered if it wouldn't be better if CLUSTER ON was stored in pg_class as 
> the
> Oid of a clustered index, rather than a boolean in pg_index.

Maybe.  Do you want to try a patch?

> That likely would've avoided (or at least exposed) this issue.
> And avoids the possibility of having two indices marked as "clustered".
> These would be more trivial:
> mark_index_clustered
> /* We need to find the index that has indisclustered set. */

You need to be careful when dropping the index ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: table partitioning and access privileges

2020-02-06 Thread Fujii Masao




On 2020/02/03 14:26, Amit Langote wrote:

On Mon, Feb 3, 2020 at 2:07 PM Fujii Masao  wrote:

On 2020/02/03 11:05, Amit Langote wrote:

Okay.  How about the attached?


Thanks for the patches! You added the note just after the description
about row level security on inherited table, but isn't it better to
add it before that? Attached patch does that. Thought?


Yeah, that might be a better flow for that paragraph.


Pushed! Thanks!


Maybe, we should also note the LOCK TABLE exception?


Yes.


Note that, unlike the TRUNCATE one, the LOCK TABLE exception exists in
HEAD too, but maybe you're aware of that.


Yes, so I will review your patch getting rid of the
LOCK TABLE exception.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: In PG12, query with float calculations is slower than PG11

2020-02-06 Thread Tom Lane
So it appears to me that what commit 6bf0bc842 did in this area was
not just wrong, but disastrously so.  Before that, we had a macro that
evaluated isinf(val) before it evaluated the inf_is_valid condition.
Now we have check_float[48]_val which do it the other way around.
That would be okay if the inf_is_valid condition were cheap to
evaluate, but in common code paths it's actually twice as expensive
as isinf().

Andres seems to be of the opinion that the compiler should be willing
to ignore the semantic requirements of the C standard in order
to rearrange the code back into the cheaper order.  That sounds like
wishful thinking to me ... even if it actually works on his compiler,
it certainly isn't going to work for everyone.

The patch looks unduly invasive to me, but I think that it might be
right that we should go back to a macro-based implementation, because
otherwise we don't have a good way to be certain that the function
parameter won't get evaluated first.  (Another reason to do so is
so that the file/line numbers generated for the error reports go back
to being at least a little bit useful.)  We could use local variables
within the macro to avoid double evals, if anyone thinks that's
actually important --- I don't.

I think the current code is probably also misusing unlikely(),
and that the right way would be more like

if (unlikely(isinf(val) && !(inf_is_valid)))

regards, tom lane




Re: Do not check unlogged indexes on standby

2020-02-06 Thread Alvaro Herrera
On 2020-Feb-05, Peter Geoghegan wrote:

> The second item generated another thread a little after this thread.
> Everything was handled on this other thread. Ultimately, I rejected
> the enhancement on the grounds that it wasn't safe on standbys in the
> face of concurrent splits and deletions [1].
> 
> I believe that all of the items discussed on this thread have been
> resolved. Did I miss a CF entry or something?

Nah.  I just had one of the messages flagged in my inbox, and I wasn't
sure what had happened since the other thread was not referenced in this
one.  I wasn't looking at any CF entries.

Thanks for the explanation,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: ALTER tbl rewrite loses CLUSTER ON index

2020-02-06 Thread Justin Pryzby
I wondered if it wouldn't be better if CLUSTER ON was stored in pg_class as the
Oid of a clustered index, rather than a boolean in pg_index.

That likely would've avoided (or at least exposed) this issue.
And avoids the possibility of having two indices marked as "clustered".
These would be more trivial:
mark_index_clustered
/* We need to find the index that has indisclustered set. */
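For context, the representation being criticized is a per-index boolean, so finding "the" clustered index of a table today means scanning pg_index (sketch; the table name is made up):

```sql
-- pg_index.indisclustered is a per-index flag, so the clustered index
-- of a table has to be located like this:
SELECT indexrelid::regclass
FROM pg_index
WHERE indrelid = 'mytable'::regclass
  AND indisclustered;
```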




Re: bad logging around broken restore_command

2020-02-06 Thread Fujii Masao



On 2020/02/06 1:10, Jeff Janes wrote:

If the restore command claims to have succeeded, but fails to have created a file with 
the right name (due to typos or escaping or quoting issues, for example), no complaint is 
issued at the time, and it then fails later with a relatively uninformative error message 
like "could not locate required checkpoint record".

     if (rc == 0)
     {
         /*
          * command apparently succeeded, but let's make sure the file is
          * really there now and has the correct size.
          */
         if (stat(xlogpath, &stat_buf) == 0)
         {..
         }
         else
         {
             /* stat failed */
             if (errno != ENOENT)
                 ereport(FATAL,
                         (errcode_for_file_access(),
                          errmsg("could not stat file \"%s\": %m",
                                 xlogpath)));
         }

I don't see why ENOENT is thought to deserve the silent treatment.  It seems weird that success 
gets logged ("restored log file \"%s\" from archive"), but one particular type 
of unexpected failure does not.


Agreed.


I've attached a patch which emits a LOG message for ENOENT.


Isn't it better to use "could not stat file" message even in ENOENT case?
If yes, something like message that you used in the patch should be
logged as DETAIL or HINT message. So, what about the attached patch?

Regards,


--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 188b73e752..0a28b7a562 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -265,11 +265,11 @@ RestoreArchivedFile(char *path, const char *xlogfname,
else
{
/* stat failed */
-		if (errno != ENOENT)
-			ereport(FATAL,
-					(errcode_for_file_access(),
-					 errmsg("could not stat file \"%s\": %m",
-							xlogpath)));
+		int elevel = (errno == ENOENT) ? LOG : FATAL;
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not stat file \"%s\": %m", xlogpath),
+				 errdetail("restore_command returned a zero exit status, but stat() failed.")));
}
}
 


Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-06 Thread Masahiko Sawada
On Thu, 6 Feb 2020 at 19:12, Kasahara Tatsuhito wrote:
>
> HI,
>
> On Thu, Feb 6, 2020 at 3:24 PM Fujii Masao wrote:
> > > I added a new (but minimal)  isolation test for the case of  tid scan.
> > > (v12 and HEAD will be failed this test. v11 and HEAD with my patch
> > > will be passed)
> >
> > Isn't this test scenario a bit overkill? We can simply test that
> > as follows, instead.
> > CREATE TABLE test_tidscan AS SELECT 1 AS id;
> > BEGIN ISOLATION LEVEL SERIALIZABLE;
> > SELECT * FROM test_tidscan WHERE ctid = '(0,1)';
> > SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 
> > 'SIReadLock';
> > COMMIT;
> >
> > In the expected file, the result of query looking at pg_locks
> > should be matched with the following.
> >
> >  locktype |    mode
> > ----------+------------
> >  tuple    | SIReadLock
> Thanks for your reply.
> Hmm, it's simple and might be a better way than adding an isolation test.
>
> I added above test case to regress/sql/tidscan.sql.
> Attach the patch.
>

I've tested predicate locking including promotion cases with v3 patch
and it works fine.

+table_beginscan_tid(Relation rel, Snapshot snapshot,
+   int nkeys, struct ScanKeyData *key)
+{
+   uint32  flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

IIUC setting SO_ALLOW_STRAT, SO_ALLOW_SYNC and SO_ALLOW_PAGEMODE has
no meaning during tid scan. I think v11 also should be the same.

Why did you remove SO_TYPE_TIDSCAN from the previous version patch? I
think it's better to add it and then we can set only SO_TYPE_TIDSCAN
to the scan option of tid scan.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: typos in comments and user docs

2020-02-06 Thread Justin Pryzby
On Thu, Feb 06, 2020 at 04:43:18PM +0530, Amit Kapila wrote:
> On Thu, Feb 6, 2020 at 10:45 AM Michael Paquier  wrote:
> >
> > On Thu, Feb 06, 2020 at 08:47:14AM +0530, Amit Kapila wrote:
> > > Your changes look fine to me on the first read.  I will push this to
> > > HEAD unless there are any objections.   If we want them in
> > > back-branches, we might want to probably segregate the changes based
> > > on the branch until those apply.
> >
> > +1.  It would be nice to back-patch the user-visible changes in the
> > docs.
> >
> 
> Fair enough, Justin, is it possible for you to segregate the changes
> that can be backpatched?

Looks like the whole patch can be applied to master and v12 [0].

My original thread from last year was about docs added in v12, so bloom.sgml is
the only user-facing doc which can be backpatched.  README.parallel and
bufmgr.c changes could be backpatched but I agree it's not necessary.

Note, the bloom typo seems to complete a change that was started here:

|commit 31ff51adc855e3ffe8e3c20e479b8d1a4508feb8
|Author: Alexander Korotkov 
|Date:   Mon Oct 22 00:23:26 2018 +0300
|
|Fix some grammar errors in bloom.sgml
|
|Discussion: 
https://postgr.es/m/CAEepm%3D3sijpGr8tXdyz-7EJJZfhQHABPKEQ29gpnb7-XSy%2B%3D5A%40mail.gmail.com
|Reported-by: Thomas Munro
|Backpatch-through: 9.6

Justin

[0] modulo a fix for a typo which I introduced in another patch in this branch,
which shouldn't have been in this patch; fixed in the attached.
>From a1780229e024e2e4b9a0549bcd516bb80b2d5a8d Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 9 May 2019 21:13:55 -0500
Subject: [PATCH] spelling and typos

---
 doc/src/sgml/bloom.sgml| 2 +-
 doc/src/sgml/ref/alter_table.sgml  | 2 +-
 doc/src/sgml/sources.sgml  | 4 ++--
 src/backend/access/transam/README.parallel | 2 +-
 src/backend/storage/buffer/bufmgr.c| 2 +-
 src/backend/storage/sync/sync.c| 2 +-
 src/include/access/tableam.h   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml
index 6eeadde..c341b65 100644
--- a/doc/src/sgml/bloom.sgml
+++ b/doc/src/sgml/bloom.sgml
@@ -65,7 +65,7 @@
  
   Number of bits generated for each index column. Each parameter's name
   refers to the number of the index column that it controls.  The default
-  is 2 bits and maximum is 4095.  Parameters for
+  is 2 bits and the maximum is 4095.  Parameters for
   index columns not actually used are ignored.
  
 
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 5de3676..a22770c 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -222,7 +222,7 @@ WITH ( MODULUS numeric_literal, REM
 
  
   SET NOT NULL may only be applied to a column
-  providing none of the records in the table contain a
+  provided none of the records in the table contain a
   NULL value for the column.  Ordinarily this is
   checked during the ALTER TABLE by scanning the
   entire table; however, if a valid CHECK constraint is
diff --git a/doc/src/sgml/sources.sgml b/doc/src/sgml/sources.sgml
index 5831ec4..b5d28e7 100644
--- a/doc/src/sgml/sources.sgml
+++ b/doc/src/sgml/sources.sgml
@@ -511,7 +511,7 @@ Hint:   the addendum
 

 There are functions in the backend that will double-quote their own output
-at need (for example, format_type_be()).  Do not put
+as needed (for example, format_type_be()).  Do not put
 additional quotes around the output of such functions.

 
@@ -880,7 +880,7 @@ BETTER: unrecognized node type: 42
  practices.
 
 
- Features from later revision of the C standard or compiler specific
+ Features from later revisions of the C standard or compiler specific
  features can be used, if a fallback is provided.
 
 
diff --git a/src/backend/access/transam/README.parallel b/src/backend/access/transam/README.parallel
index 85e5840..99c588d 100644
--- a/src/backend/access/transam/README.parallel
+++ b/src/backend/access/transam/README.parallel
@@ -169,7 +169,7 @@ differently because of them.  Right now, we don't even allow that.
 At the end of a parallel operation, which can happen either because it
 completed successfully or because it was interrupted by an error, parallel
 workers associated with that operation exit.  In the error case, transaction
-abort processing in the parallel leader kills of any remaining workers, and
+abort processing in the parallel leader kills off any remaining workers, and
 the parallel leader then waits for them to die.  In the case of a successful
 parallel operation, the parallel leader does not send any signals, but must
 wait for workers to complete and exit of their own volition.  In either
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba3960..5880054 100644

Re: Index Skip Scan

2020-02-06 Thread Kyotaro Horiguchi

Sorry, I forgot to write the more significant thing.

On 2020/02/06 21:22, Kyotaro Horiguchi wrote:

At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in

On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote:
At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote 
in
We could add an additional parameter "in_cursor" to
ExecSupportBackwardScan and let skip scan return false if in_cursor is
true, but I'm not sure it's acceptable.

I also was thinking about whether it's possible to use
ExecSupportBackwardScan here, but skip scan is just a mode of an
index/indexonly scan. Which means that ExecSupportBackwardScan also need
to know somehow if this mode is being used, and then, since this
function is called after it's already decided to use skip scan in the
resulting plan, somehow correct the plan (exclude skipping and try to
find next best path?) - do I understand your suggestion correct?


No. I thought of the opposite thing. I meant that
IndexSupportsBackwardScan returns false if Index(Only)Scan is
going to do skip scan. But I found that the function doesn't have
access to plan node nor executor node.  So I wrote as the follows.


I hadn't thought it through that far, but a bit of checking told me that
IndexSupportsBackwardScan returns a fixed flag per AM.  It seems that
things are not that simple.


regards.

--

Kyotaro Horiguchi








Re: Index Skip Scan

2020-02-06 Thread Kyotaro Horiguchi
At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in
> > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> 
> > wrote in
> > We could add an additional parameter "in_cursor" to
> > ExecSupportBackwardScan and let skip scan return false if in_cursor is
> > true, but I'm not sure it's acceptable.
> 
> I also was thinking about whether it's possible to use
> ExecSupportBackwardScan here, but skip scan is just a mode of an
> index/index-only scan. That means ExecSupportBackwardScan would also need
> to know somehow whether this mode is being used, and since this
> function is called after it has already been decided to use skip scan in the
> resulting plan, it would somehow have to correct the plan (exclude skipping
> and try to find the next best path?) - do I understand your suggestion correctly?

I hadn't thought it through that deeply, but a bit of checking told me that
IndexSupportsBackwardScan returns a fixed flag per AM.  It seems that
things are not that simple.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: typo in set_rel_consider_parallel()

2020-02-06 Thread Amit Langote
On Thu, Feb 6, 2020 at 20:11 Amit Kapila  wrote:

> On Thu, Feb 6, 2020 at 12:10 PM Amit Kapila 
> wrote:
> >
> > On Thu, Feb 6, 2020 at 11:41 AM Amit Langote 
> wrote:
> > >
> > > Hi,
> > >
> > > Attached fixes $subject.
> > >
> >
> > LGTM.  I will push this later today.
> >
>
> Pushed.


Thanks Amit.

Regards,
Amit



Re: typos in comments and user docs

2020-02-06 Thread Amit Kapila
On Thu, Feb 6, 2020 at 10:45 AM Michael Paquier  wrote:
>
> On Thu, Feb 06, 2020 at 08:47:14AM +0530, Amit Kapila wrote:
> > Your changes look fine to me on the first read.  I will push this to
> > HEAD unless there are any objections.  If we want them in
> > back-branches, we might want to segregate the changes based
> > on the branches to which they apply.
>
> +1.  It would be nice to back-patch the user-visible changes in the
> docs.
>

Fair enough.  Justin, is it possible for you to segregate the changes
that can be backpatched?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: typo in set_rel_consider_parallel()

2020-02-06 Thread Amit Kapila
On Thu, Feb 6, 2020 at 12:10 PM Amit Kapila  wrote:
>
> On Thu, Feb 6, 2020 at 11:41 AM Amit Langote  wrote:
> >
> > Hi,
> >
> > Attached fixes $subject.
> >
>
> LGTM.  I will push this later today.
>

Pushed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Index Skip Scan

2020-02-06 Thread Dmitry Dolgov
> On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote:
> At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> 
> wrote in
> > > On Tue, Feb 04, 2020 at 08:34:09PM +, Floris Van Nee wrote:
> > >
> > > > this point me and Jesper inclined to go with the second option. But 
> > > > maybe
> > > > I'm missing something, are there any other suggestions?
> > >
> > > Unfortunately I figured this would need a more invasive fix. I tend to 
> > > agree that it'd be better to not skip in situations like this. I think 
> > > it'd make the most sense that any plan for these 'prepare/fetch' queries 
> > > would not use skip, but rather a materialize node, right?
> >
> > Yes, sort of; without a skip scan it would be just an index-only scan
> > with a unique on top. Actually it's not immediately clear how to achieve
> > this, since at the moment, when the planner is deciding whether to consider
> > an index skip scan, there is no information about either the direction or
> > whether we're dealing with a cursor. Maybe we can somehow signal to the
> > decision logic that the root was a DeclareCursorStmt by e.g. introducing a
> > new field in the query structure (or abusing an existing one; since
> > DeclareCursorStmt is processed by standard_ProcessUtility, just
> > for a test I've tried to use utilityStmt of a nested statement, hoping
> > that it's unused, and it hasn't broken any tests yet).
>
> Umm.  I think that's the wrong direction.  While defining a cursor,
> the default scrollability is decided based on whether the query allows
> backward scans or not.  That is, the definition of backward-scan'ability
> is not just whether it can scan from the end toward the beginning, but
> whether it can go back and forth freely or not.  By that definition,
> the *current* skip scan does not support backward scans.  If we want
> to allow descending order-by in a query, we should support scrollable
> cursors, too.
>
> We could add an additional parameter "in_cursor" to
> ExecSupportBackwardScan and let skip scan return false if in_cursor is
> true, but I'm not sure it's acceptable.

I also was thinking about whether it's possible to use
ExecSupportBackwardScan here, but skip scan is just a mode of an
index/index-only scan. That means ExecSupportBackwardScan would also need
to know somehow whether this mode is being used, and since this
function is called after it has already been decided to use skip scan in the
resulting plan, it would somehow have to correct the plan (exclude skipping
and try to find the next best path?) - do I understand your suggestion correctly?
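The skipping step being discussed - jumping directly past all remaining duplicates of the current value instead of visiting them one by one - can be illustrated with a toy model over a sorted Python list (purely a sketch of the idea; the real work happens in the nbtree code, not here):

```python
import bisect

def skip_scan_distinct(index):
    """Return the distinct leading values of a sorted 'index' by repeatedly
    jumping past all duplicates of the current value (the 'skip' step),
    instead of visiting every entry as a plain scan would."""
    out = []
    i = 0
    while i < len(index):
        v = index[i]
        out.append(v)
        # Jump to the first entry greater than v; in a B-tree this would be
        # a fresh descent from the root rather than a binary search.
        i = bisect.bisect_right(index, v, i)
    return out

print(skip_scan_distinct([1, 1, 1, 2, 2, 5, 5, 5, 5, 9]))  # [1, 2, 5, 9]
```

Supporting FETCH BACKWARD would require the inverse jump (to the last entry smaller than the current value), which is part of what makes the scrollable-cursor case in this thread tricky.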




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-06 Thread Kasahara Tatsuhito
Hi,

On Thu, Feb 6, 2020 at 3:24 PM Fujii Masao  wrote:
> > I added a new (but minimal) isolation test for the case of a tid scan.
> > (v12 and HEAD will fail this test; v11 and HEAD with my patch
> > will pass.)
>
> Isn't this test scenario a bit overkill? We can simply test that
> as follows, instead.
> CREATE TABLE test_tidscan AS SELECT 1 AS id;
> BEGIN ISOLATION LEVEL SERIALIZABLE;
> SELECT * FROM test_tidscan WHERE ctid = '(0,1)';
> SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 
> 'SIReadLock';
> COMMIT;
>
> In the expected file, the result of query looking at pg_locks
> should be matched with the following.
>
>  locktype |    mode
> ----------+------------
>  tuple    | SIReadLock
Thanks for your reply.
Hmm, that's simpler and might be a better way than adding an isolation test.

I added above test case to regress/sql/tidscan.sql.
Attach the patch.

Best regards,
-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com


fix_tidscan_issues_v3.patch
Description: Binary data


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-02-06 Thread Amit Kapila
On Wed, Feb 5, 2020 at 9:42 AM Dilip Kumar  wrote:
>
> >
> > I am not able to understand the change in
> > v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.  Do you have
> > any explanation for the same?
>
> It appears that in ReorderBufferCommitChild we are always setting the
> final_lsn of the subxacts, so it should not be invalid.  For testing, I
> have changed this to an assert and checked, but it never hit.  So maybe
> we can remove this change.
>

Tomas, do you remember anything about this change?  We are talking
about below change:

From: Tomas Vondra 
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v8 11/13] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c
b/src/backend/replication/logical/reorderbuffer.c
index fe4e57c..beb6cd2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

  subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);

+ /* make sure subtxn has final_lsn */
+ if (subtxn->final_lsn == InvalidXLogRecPtr)
+ subtxn->final_lsn = txn->final_lsn;
+

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: ALTER tbl rewrite loses CLUSTER ON index

2020-02-06 Thread Amit Langote
On Thu, Feb 6, 2020 at 10:31 AM Amit Langote  wrote:
> On Thu, Feb 6, 2020 at 8:44 AM Justin Pryzby  wrote:
> > On Wed, Feb 05, 2020 at 03:53:45PM +0900, Amit Langote wrote:
> > > diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
> > > +-- alter type shouldn't lose clustered index
> >
> > My only suggestion is to update the comment
> > +-- alter type rewrite/rebuild should preserve cluster marking on index
>
> Sure, done.

Deja vu.  Last two messages weren't sent to the list; updated patch attached.

Thanks,
Amit
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f599393473..bab3ddf67c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -490,6 +490,7 @@ static void ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId,
 static void RebuildConstraintComment(AlteredTableInfo *tab, int pass,
                                      Oid objid, Relation rel, List *domname,
                                      const char *conname);
+static void PreserveClusterOn(AlteredTableInfo *tab, int pass, Oid indoid);
 static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
 static void TryReuseForeignKey(Oid oldId, Constraint *con);
static ObjectAddress ATExecAlterColumnGenericOptions(Relation rel, const char *colName,
@@ -11832,6 +11833,9 @@ ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId, char *cmd,
 newcmd->def = (Node *) stmt;
 tab->subcmds[AT_PASS_OLD_INDEX] =
 lappend(tab->subcmds[AT_PASS_OLD_INDEX], newcmd);
+
+   /* Preserve index's indisclustered property, if set. */
+   PreserveClusterOn(tab, AT_PASS_OLD_INDEX, oldId);
}
else if (IsA(stm, AlterTableStmt))
{
@@ -11868,6 +11872,9 @@ ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId, char *cmd,
 rel,
 NIL,
 indstmt->idxname);
+
+   /* Preserve index's indisclustered property, if set. */
+   PreserveClusterOn(tab, AT_PASS_OLD_INDEX, indoid);
 }
else if (cmd->subtype == AT_AddConstraint)
{
@@ -11990,6 +11997,38 @@ RebuildConstraintComment(AlteredTableInfo *tab, int pass, Oid objid,
tab->subcmds[pass] = lappend(tab->subcmds[pass], newcmd);
 }
 
+/*
+ * For a table's index that is to be recreated due to PostAlterType
+ * processing, preserve its indisclustered property by issuing ALTER TABLE
+ * CLUSTER ON command on the table that will run after the command to recreate
+ * the index.
+ */
+static void
+PreserveClusterOn(AlteredTableInfo *tab, int pass, Oid indoid)
+{
+   HeapTuple   indexTuple;
+   Form_pg_index indexForm;
+
+   Assert(OidIsValid(indoid));
+   Assert(pass == AT_PASS_OLD_INDEX);
+
+   indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indoid));
+   if (!HeapTupleIsValid(indexTuple))
+   elog(ERROR, "cache lookup failed for index %u", indoid);
+   indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
+
+   if (indexForm->indisclustered)
+   {
+   AlterTableCmd *newcmd = makeNode(AlterTableCmd);
+
+   newcmd->subtype = AT_ClusterOn;
+   newcmd->name = get_rel_name(indoid);
+   tab->subcmds[pass] = lappend(tab->subcmds[pass], newcmd);
+   }
+
+   ReleaseSysCache(indexTuple);
+}
+
 /*
  * Subroutine for ATPostAlterTypeParse().  Calls out to CheckIndexCompatible()
  * for the real analysis, then mutates the IndexStmt based on that verdict.
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index e73ce4b6f3..09c00eef05 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -4258,3 +4258,37 @@ alter table at_test_sql_partop attach partition at_test_sql_partop_1 for values
 drop table at_test_sql_partop;
 drop operator class at_test_sql_partop using btree;
 drop function at_test_sql_partop;
+-- alter type rewrite/rebuild should preserve cluster marking on index
+create table alttype_cluster (a int);
+create index alttype_cluster_a on alttype_cluster (a);
+alter table alttype_cluster cluster on alttype_cluster_a;
+select indisclustered from pg_index where indrelid = 'alttype_cluster'::regclass;
+ indisclustered 
+----------------
+ t
+(1 row)
+
+alter table alttype_cluster alter a type bigint;
+select indisclustered from pg_index where 

replication_origin and replication_origin_lsn usage on subscriber

2020-02-06 Thread Amit Kapila
During logical decoding, we send replication_origin and
replication_origin_lsn when we decode a commit.  In pgoutput_begin_txn,
we send values for these two, but they are never used on the subscriber side.
We do provide a function (logicalrep_read_origin) to read
these two values, but it is not used anywhere in the code.

I think this is primarily for external application usage, but it is
not very clear how they would use it.  As far as I understand, the
value of origin can be used to avoid loops in bi-directional
replication, and origin_lsn can be used to track how far the subscriber
has received changes.  I am not sure about this, particularly about how
origin_lsn can be used in external applications.
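The loop-avoidance use of the origin can be sketched with a toy model of two nodes replicating to each other; each applied change carries its origin tag, and a node skips changes that originated locally. (All names and structures here are illustrative, not the pgoutput protocol.)

```python
class Node:
    """Toy node in a bi-directional replication pair."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.outbox = []   # changes waiting to be shipped to the peer

    def local_write(self, key, val):
        self.rows[key] = val
        self.outbox.append({"origin": self.name, "key": key, "val": val})

    def apply_from_peer(self, change):
        if change["origin"] == self.name:
            return          # originated here: skipping breaks the loop
        self.rows[change["key"]] = change["val"]
        # forward with the original origin tag preserved
        self.outbox.append(change)

def ship(src, dst):
    """Deliver all pending changes from src to dst."""
    pending, src.outbox = src.outbox, []
    for ch in pending:
        dst.apply_from_peer(ch)

a, b = Node("A"), Node("B")
a.local_write("k", 1)
ship(a, b)   # B applies the change and queues it for forwarding
ship(b, a)   # A recognizes its own origin and skips: no infinite loop
assert a.rows == b.rows == {"k": 1}
assert a.outbox == []
```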

This has come up in the discussion of the "logical streaming of large
in-progress transactions" patch [1]. Basically, we are not sure when to send
these values during streaming, as their intended usage is not clear.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAFiTN-skHvSWDHV66qpzMfnHH6AvsE2YAjvh4Kt613E8ZD8WoQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP: Aggregation push-down

2020-02-06 Thread Richard Guo
Hi,

I've been looking at the 'make_join_rel()' part of the patch, and I'm
aware that, if we are joining A and B, a 'grouped join rel (AB)' would
be created besides the 'plain join rel (AB)', and paths are added by 1)
applying partial aggregate to the paths of the 'plain join rel (AB)', or
2) joining grouped A to plain B or joining plain A to grouped B.

This is a smart idea. One issue I can see is that some logic would have
to be repeated several times. For example, the validity check for the
same proposed join would be performed at most three times by
join_is_legal().

I'm thinking of another idea that, instead of using a separate
RelOptInfo for the grouped rel, we add in RelOptInfo a
'grouped_pathlist' for the grouped paths, just like how we implement
parallel query via adding 'partial_pathlist'.

For base rel, if we decide it can produce grouped paths, we create the
grouped paths by applying partial aggregation to the paths in 'pathlist'
and add them into 'grouped_pathlist'.

For join rel (AB), we can create the grouped paths for it by:
1) joining a grouped path from 'A->grouped_pathlist' to a plain path
from 'B->pathlist', or vice versa;
2) applying partial aggregation to the paths in '(AB)->pathlist'.

This is basically the same idea, just moves the grouped paths from the
grouped rel to a 'grouped_pathlist'. With it we should not need to make
any code changes to 'make_join_rel()'. Most code changes would happen in
'add_paths_to_joinrel()'.

Will this idea work? Is it better or worse?

Thanks
Richard
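To make the proposal above concrete, here is a toy sketch (plain Python, not planner code) of the two ways to build grouped results for the join (AB): aggregate after the join, or partially aggregate A below the join and combine one partial state per match - both must produce the same answer:

```python
from collections import defaultdict

# Rows of a(key, val) and b(key, payload); the query being modeled is:
#   SELECT a.key, sum(a.val) FROM a JOIN b USING (key) GROUP BY a.key;
a_rows = [(1, 10), (1, 20), (2, 5), (3, 7)]
b_rows = [(1, 'x'), (1, 'y'), (2, 'z')]

def join_then_agg(a, b):
    """Plain plan: form the join result first, then aggregate it."""
    res = defaultdict(int)
    for ak, av in a:
        for bk, _ in b:
            if ak == bk:
                res[ak] += av
    return dict(res)

def agg_then_join(a, b):
    """Pushed-down plan: partially aggregate A below the join, then join
    the partial states to B and combine one partial state per match."""
    partial = defaultdict(int)
    for ak, av in a:
        partial[ak] += av            # partial aggregate below the join
    res = defaultdict(int)
    for bk, _ in b:
        if bk in partial:
            res[bk] += partial[bk]   # finalize: combine states over matches
    return dict(res)

# key 1 joins two b rows, so its partial sum 30 is combined twice -> 60
assert join_then_agg(a_rows, b_rows) == agg_then_join(a_rows, b_rows) == {1: 60, 2: 5}
```

The planner-side question in the thread is only where the paths for the second strategy should live: on a separate grouped RelOptInfo, or on a 'grouped_pathlist' of the existing one.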


Re: Expose lock group leader pid in pg_stat_activity

2020-02-06 Thread Julien Rouhaud
On Thu, Feb 06, 2020 at 09:24:16AM +0900, Michael Paquier wrote:
> On Wed, Feb 05, 2020 at 07:57:20AM +0100, Julien Rouhaud wrote:
> > This looks good, thanks a lot!
> 
> Thanks for double-checking.  And done.

Thanks!

While on the topic, is there any reason why the backend stays a group leader
for the rest of its lifetime, and should we change that?

Also, while reading ProcKill, I noticed a typo in a comment:

/*
 * Detach from any lock group of which we are a member.  If the leader
-* exist before all other group members, it's PGPROC will remain allocated
+* exist before all other group members, its PGPROC will remain allocated
 * until the last group process exits; that process must return the
 * leader's PGPROC to the appropriate list.
 */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 32df8c85a1..eb321f72ea 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -851,7 +851,7 @@ ProcKill(int code, Datum arg)
 
/*
 * Detach from any lock group of which we are a member.  If the leader
-* exist before all other group members, it's PGPROC will remain allocated
+* exist before all other group members, its PGPROC will remain allocated
 * until the last group process exits; that process must return the
 * leader's PGPROC to the appropriate list.
 */


Re: Identifying user-created objects

2020-02-06 Thread Masahiko Sawada
On Thu, 6 Feb 2020 at 16:53, Amit Langote  wrote:
>
> On Thu, Feb 6, 2020 at 4:31 PM Michael Paquier  wrote:
> > On Thu, Feb 06, 2020 at 04:25:47PM +0900, Amit Langote wrote:
> > > About the implementation, how about defining a static inline function,
> > > say is_user_object(), next to FirstNormalObjectId's definition and
> > > make pg_is_user_object() call it?  There are a few placed in the
> > > backend code that perform the same computation as pg_is_user_object(),
> > > which could be changed to use is_user_object() instead.
> >
> > FWIW, if we bother adding SQL functions for that, my first impression
> > was to have three functions, each one of them returning:
> > - FirstNormalObjectId
> > - FirstGenbkiObjectId
> > - FirstNormalObjectId
>
> Did you miss FirstBootstrapObjectId by any chance?
>
> I see the following ranges as defined in transam.h.
>
> 1-(FirstGenbkiObjectId - 1): manually assigned OIDs
> FirstGenbkiObjectId-(FirstBootstrapObjectId - 1): genbki.pl assigned OIDs
> FirstBootstrapObjectId-(FirstNormalObjectId - 1): initdb requested
> FirstNormalObjectId or greater: user-defined objects
>
> Sawada-san's proposal covers #4.  Do we need an SQL function for the
> first three?  IOW, would the distinction between OIDs belonging to the
> first three ranges be of interest to anyone except core PG hackers?

Yeah, I thought of these three values, but I'm also not sure they are worthwhile for users.

If we have these functions returning those values respectively, then when we
want to check whether an oid was assigned during initdb we will end up
doing something like 'WHERE oid >= pg_first_bootstrap_oid() AND oid <
pg_first_normal_oid()', which is not intuitive, I think. Users would have to
remember the order of these values.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
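The OID ranges quoted earlier in this thread can be summarized in a small classification sketch (boundary constants as defined in src/include/access/transam.h around this release; treat the exact values as illustrative, since they are not guaranteed stable across versions):

```python
# OID boundary constants, per src/include/access/transam.h (illustrative).
FIRST_GENBKI_OBJECT_ID = 10000      # below this: manually assigned OIDs
FIRST_BOOTSTRAP_OBJECT_ID = 12000   # genbki.pl-assigned OIDs end here
FIRST_NORMAL_OBJECT_ID = 16384      # user-defined objects start here

def classify_oid(oid):
    """Map an OID to one of the four ranges discussed in the thread."""
    if oid < FIRST_GENBKI_OBJECT_ID:
        return "manually assigned"
    if oid < FIRST_BOOTSTRAP_OBJECT_ID:
        return "genbki.pl assigned"
    if oid < FIRST_NORMAL_OBJECT_ID:
        return "initdb assigned"
    return "user-defined"

def is_user_object(oid):
    """The single check the proposed pg_is_user_object() would perform."""
    return oid >= FIRST_NORMAL_OBJECT_ID

print(classify_oid(1), "/", classify_oid(16384))
```

This also illustrates Sawada-san's point: exposing only the raw boundary values forces users to remember the ordering, whereas a classification (or a single boolean) does not.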




Re: Identifying user-created objects

2020-02-06 Thread Michael Paquier
On Thu, Feb 06, 2020 at 04:52:48PM +0900, Amit Langote wrote:
> On Thu, Feb 6, 2020 at 4:31 PM Michael Paquier  wrote:
>> FWIW, if we bother adding SQL functions for that, my first impression
>> was to have three functions, each one of them returning:
>> - FirstNormalObjectId
>> - FirstGenbkiObjectId
>> - FirstNormalObjectId
> 
> Did you miss FirstBootstrapObjectId by any chance?

Yep, incorrect copy-pasto.
--
Michael


signature.asc
Description: PGP signature


Re: Memory-Bounded Hash Aggregation

2020-02-06 Thread Heikki Linnakangas

On 05/02/2020 21:56, Jeff Davis wrote:
> On Tue, 2020-02-04 at 18:10 +0200, Heikki Linnakangas wrote:
> > I'd love to change the LogicalTape API so that you could allocate and
> > free tapes more freely. I wrote a patch to do that, as part of
> > replacing tuplesort.c's polyphase algorithm with a simpler one (see
> > [1]), but I never got around to committing it. Maybe the time is ripe
> > to do that now?
>
> It's interesting that you wrote a patch to pause the tapes a while ago.
> Did it just fall through the cracks or was there a problem with it?
>
> Is pause/resume functionality required, or is it good enough that
> rewinding a tape frees the buffer, to be lazily allocated later?

It wasn't strictly required for what I was hacking on then. IIRC it
would have saved some memory during sorting, but Peter G felt that it
wasn't worth the trouble, because he made some other changes around the
same time, which made it less important
(https://www.postgresql.org/message-id/CAM3SWZS0nwOPoJQHvxugA9kKPzky2QC2348TTWdSStZOkke5tg%40mail.gmail.com).
I dropped the ball on both patches then, but I still think they would be
worthwhile.


- Heikki
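The "rewind frees the buffer" alternative raised above can be modeled in a few lines: rather than an explicit pause/resume API, the tape simply drops its read buffer on rewind and re-allocates it lazily on the next read. This is a toy sketch only - nothing like the actual logtape.c structures - just the memory-lifetime idea.

```python
class LogicalTapeSketch:
    """Toy tape whose read buffer is released on rewind and lazily
    allocated again on the next read (illustrative names and sizes)."""

    BUFFER_SIZE = 8  # bytes buffered per refill; stand-in for a block

    def __init__(self, data: bytes):
        self._file = data      # stands in for the spill file on disk
        self._pos = 0          # next unread offset in the file
        self._buffer = b""     # in-memory window over the file

    def rewind(self):
        self._pos = 0
        self._buffer = b""     # free buffer memory right away

    def read_byte(self):
        if not self._buffer:   # lazy refill: allocate only when reading
            self._buffer = self._file[self._pos:self._pos + self.BUFFER_SIZE]
            if not self._buffer:
                return None    # end of tape
            self._pos += len(self._buffer)
        b, self._buffer = self._buffer[0], self._buffer[1:]
        return b

tape = LogicalTapeSketch(b"abcdef")
tape.read_byte()   # buffer allocated here, not at tape creation
tape.rewind()      # buffer memory given back immediately
```

Between a rewind and the next read, the tape holds no buffer memory at all, which is the property Jeff suggests might make a separate pause/resume API unnecessary.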