Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Kyotaro Horiguchi
At Tue, 31 Mar 2020 18:01:36 -0300, Alvaro Herrera  
wrote in 
> I noticed some other things:
> 
> 1. KeepLogSeg sends a warning message when slots fall behind.  To do
> this, it searches for "the most affected slot", that is, the slot that
> lost the most data.  But it seems to me that that's a bit pointless; if
> a slot lost data, it's now useless and anything that was using that slot must
> be recreated.  If you only know what's the most affected slot, it's not
> possible to see which *other* slots are affected.  It doesn't matter if
> the slot missed one segment or twenty segments or  segments -- the
> slot is now useless, or it is not useless.  I think we should list the
> slot that was *least* affected, i.e., the slot that lost the minimum
> amount of segments; then the user knows that all slots that are older
> than that one are *also* affected.

Mmm. The v17-0001 patch [1] shows it as the following:

> WARNING:  some replication slots have lost required WAL segments
> DETAIL:  Slot s1 lost 8 segment(s).
> WARNING:  some replication slots have lost required WAL segments
> DETAIL:  Slots s1, s2, s3 lost at most 9 segment(s).

And it was removed following the comment in [2] :p

I restored the feature in a simpler shape in v22.

[1] 
https://www.postgresql.org/message-id/flat/20191224.212614.633369820509385571.horikyota.ntt%40gmail.com#cbc193425b95edd166a5c6d42fd579c6
[2] 
https://www.postgresql.org/message-id/20200123.212854.658794168913258596.horikyota.ntt%40gmail.com
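For illustration, the reporting rule restored above could be sketched like this (a hypothetical helper, not the actual KeepLogSeg code): given the oldest segment each slot still needs and the oldest segment actually remaining on disk, report every affected slot together with the worst loss, matching the v17 message shape "Slots s1, s2, s3 lost at most 9 segment(s)."

```python
def report_lost_slots(slot_restart_seg, oldest_extant_seg):
    """slot_restart_seg: dict of slot name -> oldest segment it still needs."""
    # a slot is affected when the segment it needs was already removed
    lost = {name: oldest_extant_seg - seg
            for name, seg in slot_restart_seg.items()
            if seg < oldest_extant_seg}
    if not lost:
        return None
    names = ", ".join(sorted(lost))
    return "Slots %s lost at most %d segment(s)." % (names, max(lost.values()))

# s1 lost 9 segments, s2 lost 7, s3 lost 5; s4 is unaffected
print(report_lost_slots({"s1": 1, "s2": 3, "s3": 5, "s4": 20}, 10))
# Slots s1, s2, s3 lost at most 9 segment(s).
```

The point of listing every affected slot rather than only the most affected one is exactly Alvaro's: the user can tell at a glance which slots must be recreated.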

> 2. KeepLogSeg ignores slots that are active.  I guess the logic here is
> that if a slot is active, then it'll keep going until it catches up and
> we don't need to do anything about the used disk space.  But that seems
> a false premise, because if a standby is so slow that it cannot keep up,
> it will eventually run the master out of diskspace even if it's active
> all the time.  So I'm not seeing the reasoning that makes it useful to
> skip checking active slots.

Right. I unconsciously assumed synchronous replication. It should be
removed. Fixed.

> (BTW I don't think you need to keep that many static variables in that
> function.  Just the slot name should be sufficient, I think ... or maybe
> even the *pointer* to the slot that was last reported.

Agreed. Fixed.

> I think if a slot is behind and it lost segments, we should kill the
> walsender that's using it, and unreserve the segments.  So maybe
> something like

At Tue, 31 Mar 2020 19:07:49 -0300, Alvaro Herrera  
wrote in 
> > I think we should kill(SIGTERM) the walsender using the slot 
> > (slot->active_pid),
> > then acquire the slot and set it to some state indicating that it is now
> > useless, no longer reserving WAL; so when the walsender is restarted, it
> > will find the slot cannot be used any longer.
> 
> Ah, I see ioguix already pointed this out and the response was that the
> walsender stops by itself.  Hmm.  I suppose this works too ... it seems
> a bit fragile, but maybe I'm too sensitive.  Do we have other opinions
> on this point?

Yes. The check is performed after every block read, so the walsender
shouldn't send a wrong record. Patch 0002 added the same check on a
per-record basis, so it can be considered redundant. But things do get
simpler if we kill such walsenders instead of relying on that subtle
condition, I think.

In the attached, 0002 is removed and walsender-kill code is added.
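The selection rule of the walsender-kill step could be sketched as follows (hypothetical data structures, not the actual patch): any *active* slot whose restart segment has already been removed is selected, and its walsender would receive SIGTERM instead of being trusted to stop by itself.

```python
def slots_to_kill(slots, oldest_kept_seg):
    """slots: list of dicts with 'name', 'restart_seg', 'active_pid' (None if idle)."""
    # active slots that require an already-removed segment: their
    # walsenders would get os.kill(active_pid, signal.SIGTERM)
    return [s["name"] for s in slots
            if s["active_pid"] is not None and s["restart_seg"] < oldest_kept_seg]

slots = [
    {"name": "s1", "restart_seg": 5, "active_pid": 1234},   # lost WAL, active
    {"name": "s2", "restart_seg": 5, "active_pid": None},   # lost WAL, idle
    {"name": "s3", "restart_seg": 20, "active_pid": 5678},  # still safe
]
print(slots_to_kill(slots, 10))  # ['s1']
```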

> I sense some attempt to salvage slots that are reading a segment that is
> "outdated" and removed, but for which the walsender has an open file
> descriptor.  (This appears to be the "losing" state.) This seems
> dangerous, for example the segment might be recycled and is being
> overwritten with different data.  Trying to keep track of that seems
> doomed.  And even if the walsender can still read that data, it's only a
> matter of time before the next segment is also removed.  So keeping the
> walsender alive is futile; it only delays the inevitable.

Agreed.

The attached is v22, only one patch file.

- 0002 is removed

- I didn't add "unknown" status in wal_status, because it is quite
  hard to explain reasonably. Instead, I added the following comment.

+	 * Find the oldest extant segment file. We get 1 until checkpoint removes
+	 * the first WAL segment file since startup, which causes the status being
+	 * wrong under certain abnormal conditions but that doesn't actually harm.

- Changed the message in KeepLogSeg as described above.

- Don't ignore inactive slots in KeepLogSeg.

- Out-of-sync walsenders are killed immediately.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4afea34a5ad748fddb2061ec6ef0f2fc6e66ba6c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 19 Dec 2018 12:43:57 +0900
Subject: [PATCH v22] Add WAL relief vent for replication slots

Replication slots are useful to maintain a replication connection in
configurations where replication is so delayed that the connection would
otherwise be broken. On the other hand, so many WAL files can fill up the
disk that the master

Re: backup manifests

2020-03-31 Thread Noah Misch
On Tue, Mar 31, 2020 at 03:50:34PM -0700, Andres Freund wrote:
> On 2020-03-31 14:10:34 -0400, Robert Haas wrote:
> > +/*
> > + * Attempt to parse the WAL files required to restore from backup using
> > + * pg_waldump.
> > + */
> > +static void
> > +parse_required_wal(validator_context *context, char *pg_waldump_path,
> > +                  char *wal_directory, manifest_wal_range *first_wal_range)
> > +{
> > +   manifest_wal_range *this_wal_range = first_wal_range;
> > +
> > +   while (this_wal_range != NULL)
> > +   {
> > +       char *pg_waldump_cmd;
> > +
> > +       pg_waldump_cmd = psprintf("\"%s\" --quiet --path=\"%s\" --timeline=%u --start=%X/%X --end=%X/%X\n",
> > +                                 pg_waldump_path, wal_directory, this_wal_range->tli,
> > +                                 (uint32) (this_wal_range->start_lsn >> 32),
> > +                                 (uint32) this_wal_range->start_lsn,
> > +                                 (uint32) (this_wal_range->end_lsn >> 32),
> > +                                 (uint32) this_wal_range->end_lsn);
> > +       if (system(pg_waldump_cmd) != 0)
> > +           report_backup_error(context,
> > +                               "WAL parsing failed for timeline %u",
> > +                               this_wal_range->tli);
> > +
> > +       this_wal_range = this_wal_range->next;
> > +   }
> > +}
> 
> Should we have a function to properly escape paths in cases like this?
> Not that it's likely or really problematic, but the quoting for path
> could be "circumvented".

Are you looking for appendShellString(), or something different?
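To illustrate the concern about the quoting being "circumvented", here is a toy sketch (not PostgreSQL code): a crafted directory name escapes naive double-quoting of the path, while a proper quoting routine (here Python's `shlex.quote`, playing the role of appendShellString) keeps the whole name a single inert shell word.

```python
import shlex

def naive(path):
    # mimics psprintf("--path=\"%s\"", path): quotes can be broken out of
    return '--path="%s"' % path

def safe(path):
    # proper shell quoting keeps the argument a single word
    return '--path=' + shlex.quote(path)

evil = 'dir"; rm -rf /; echo "'
print(naive(evil))   # the embedded quotes terminate the argument early
print(safe(evil))    # the whole name stays one shell word
```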




wraparound dangers in snapshot too old

2020-03-31 Thread Andres Freund
Hi,

I am trying to change the snapshot too old infrastructure so it
cooperates with my snapshot scalability patch. While trying to
understand the code sufficiently, I think I found a fairly serious
issue:

To map the time-based old_snapshot_threshold to an xid that can be used
as a cutoff for heap_page_prune(), we maintain a ringbuffer of
old_snapshot_threshold + 10 entries in
oldSnapshotControlData->xid_by_minute[]. TransactionIdLimitedForOldSnapshots()
uses that to (at least that's the goal) increase the horizon used for
pruning.

The problem is that there's no protection against the xids in the
ringbuffer getting old enough to wrap around. Given that practical uses
of old_snapshot_threshold are likely to be several hours to several
days, that's not particularly hard to hit.

That then has the consequence that we can use an xid that's either
"from the future" (if bigger than the current xid), or more recent than
appropriate (if it wrapped far enough to be below nextxid, but not yet
older than OldestXmin) as the OldestXmin argument to heap_page_prune().

Which in turn means that we can end up pruning much more recently
removed rows than intended.


While that'd be bad on its own, the big problem is that we won't detect
that case on access, in contrast to the way old_snapshot_threshold is
intended to work. The reason for that is detecting these errors happens
on the basis of timestamps - which obviously do not wrap around.
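The wraparound hazard can be made concrete with a toy model of 32-bit modular xid comparison (in the spirit of TransactionIdPrecedes, though this is only an illustrative sketch): "older" is only meaningful within a window of 2^31 xids, so a ringbuffer entry that ages past that point silently flips from "very old" to "in the future".

```python
def xid_precedes(a, b):
    """True if xid a is logically older than xid b (32-bit modular compare)."""
    diff = (a - b) & 0xFFFFFFFF
    # interpret diff as signed 32-bit: a precedes b iff a - b < 0
    return diff != 0 and diff >= 0x80000000

old = 1000
print(xid_precedes(old, 5000))                              # True: still older
just_inside = (old + 0x7FFFFFFF) & 0xFFFFFFFF
just_past   = (old + 0x80000001) & 0xFFFFFFFF
print(xid_precedes(old, just_inside))   # True: within the 2^31 window
print(xid_precedes(old, just_past))     # False: old now looks "from the future"
```

This is exactly why a stale `xid_by_minute[]` entry can end up used as a pruning horizon that is either in the future or far too recent.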

Greetings,

Andres Freund




Re: control max length of parameter values logged

2020-03-31 Thread Justin Pryzby
Hi,

On Wed, Apr 01, 2020 at 01:52:48AM +0100, Alexey Bashtanov wrote:
> +++ b/doc/src/sgml/config.sgml
> +  xreflabel="log_parameter_max_length">
> +  log_parameter_max_length 
> (integer)
> +  
> +   log_parameter_max_length configuration 
> parameter
> +  
> +  
> +  
> +   
> +If greater than zero, bind parameter values reported in non-error
> +statement-logging messages are trimmed to no more than this many 
> bytes.

Can I suggest saying:

"Limit bind parameter values reported by non-error statement-logging messages
to this many bytes". Or,

"The maximum length of bind parameter values to log with non-error
statement-logging messages".

> --- a/src/backend/utils/misc/guc.c
> +++ b/src/backend/utils/misc/guc.c
> @@ -2855,6 +2857,28 @@ static struct config_int ConfigureNamesInt[] =
>   NULL, NULL, NULL
>   },
>  
> + {
> + {"log_parameter_max_length", PGC_SUSET, LOGGING_WHAT,
> + gettext_noop("When logging statements, limit logged 
> parameter values to first N bytes."),
> + gettext_noop("Zero to print values in full."),

Could you make zero a normal value and -1 the "special" value to disable
trimming?

Setting it to zero would avoid displaying parameters at all; setting it to -1
would display values in full.

Cheers,
-- 
Justin




Re: A bug when use get_bit() function for a long bytea string

2020-03-31 Thread movead...@highgo.ca

>+ int64 res,resultlen;
>It's better to have them on separate lines.
Sorry for that, done.

>-unsigned
>+int64
> hex_decode(const char *src, unsigned len, char *dst)
>Do we want to explicitly cast the return value to int64? Will builds on some
>platforms crib if not done so?
>I don't know of such a platform but my knowledge in this area is not great.
I think the current change makes sure nothing goes wrong.

>+ byteNo = (int)(n / 8);
>+ bitNo = (int)(n % 8);
>some comment explaining why this downcasting is safe here?
Done.
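The reason the downcast is safe can be sketched as follows (illustrative Python, with a hypothetical helper name): for get_bit(bytea, n) the bit number n may be int64, but n / 8 is bounded by the bytea's length in bytes, which always fits in a 32-bit int (bytea is capped at 1 GB), so byteNo and bitNo can be safely narrowed after the range check.

```python
def locate_bit(n: int, nbytes: int):
    """Map a (possibly 64-bit) bit index to a (byte, bit) pair."""
    if not (0 <= n < nbytes * 8):
        raise IndexError("bit index %d out of valid range 0..%d" % (n, nbytes * 8 - 1))
    byte_no = n // 8   # < nbytes, so it fits in a 32-bit int
    bit_no = n % 8     # always 0..7
    return byte_no, bit_no

# a bit near the end of a 600 MB value needs a 64-bit bit index,
# but the resulting byte number still fits comfortably in int32
print(locate_bit(4_800_000_000 - 1, 600_000_000))  # (599999999, 7)
```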

>-  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bytea int4',
>+  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bytea int8',
>   prosrc => 'byteaGetBit' },
> { oid => '724', descr => 'set bit',
>-  proname => 'set_bit', prorettype => 'bytea', proargtypes => 'bytea int4 
>int4',
>+  proname => 'set_bit', prorettype => 'bytea', proargtypes => 'bytea int8 
>int4',
>   prosrc => 'byteaSetBit' },
>Shouldn't we have similar changes for following entries as well?
>{ oid => '3032', descr => 'get bit',
> proname => 'get_bit', prorettype => 'int4', proargtypes => 'bit int4',
>  prosrc => 'bitgetbit' },
>{ oid => '3033', descr => 'set bit',
>  proname => 'set_bit', prorettype => 'bit', proargtypes => 'bit int4 int4',
 > prosrc => 'bitsetbit' },
Because 'bitsetbit' and 'bitgetbit' do not have to calculate the bit size by
multiplying by 8, I think they need not be changed.

>The tests you have added are for the bytea variants, which ultimately call
>byteaGet/SetBit().
>But I think we also need tests for the bit variants, which will ultimately
>call the bitgetbit and bitsetbit functions.
As above, there is no need to touch 'bitgetbit' and 'bitsetbit'.

>Once you address these comments, I think the patch is good for a committer.
>So please mark the commitfest entry as such when you post the next version of
>the patch.
Thanks a lot for the detailed review again, and changed patch attached.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


long_bytea_string_bug_fix_ver5.patch
Description: Binary data


Re: [Proposal] Global temporary tables

2020-03-31 Thread Prabhat Sahu
On Wed, Apr 1, 2020 at 8:52 AM 曾文旌  wrote:

>
>
> On March 31, 2020 at 9:59 PM, Prabhat Sahu wrote:
>
> Hi Wenjing,
> Thanks for the new patch.
> I saw with the patch(gtt_v23.patch), we are supporting the new concept
> "global temporary sequence"(i.e. session-specific sequence), is this
> intentional?
>
> It was supported in earlier versions,
>
yes.

This causes the sequence built into the GTT to automatically become a
> "global temp sequence",
> Such as create global temp table (a serial);
> Like GTT, the global temp sequence is used individually for each session.
>
> Recently, I added the global temp sequence syntax so that it can be
> created independently.
> The purpose of this is to enable such sequence built into the GTT to
> support pg_dump and pg_restore.
>

Thanks for the explanation.

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-03-31 Thread David Rowley
On Wed, 1 Apr 2020 at 13:45, Andy Fan  wrote:
>> 5. I think you should be performing a bms_del_member during join
>> removal rather than removing this Assert()
>>
>> - Assert(bms_equal(rel->relids, root->all_baserels));
>>
>> FWIW, it's far from perfect that you've needed to delay the left join
>> removal, but I do understand why you've done it. It's also far from
>> perfect that you're including removed relations in the
>> total_table_pages calculation. c6e4133fae1 took some measures to
>> improve this calculation and this is making it worse again.
>>
> Since the removed relation depends on the UniqueKey, which has to be
> calculated after the total_table_pages calculation in the current code,
> that's something I must do.  But if the relation is not removable, there is
> no waste at all.  If it is removable, the gain will be much higher than the
> loss.  I'm not sure this should be a concern.

The reason join removals was done so early in planning before was to
save the planner from having to do additional work for relations which
were going to be removed later.  For example, building path lists.

> Actually it looks like the current remove_useless_join has some limits
> which prevent it from removing a joinrel; I still didn't figure out why.  In
> the past we had only a limited ability to detect uniqueness after a join, so
> that would be OK.  Since we have such ability now, this may be another
> opportunity to improve the join_is_removable function, but I'd rather not
> put such a thing in this patch.

Yeah, there's certainly more left join shapes that we could remove.
e.g when the left join relation is not a singleton rel.  We shouldn't
do anything to purposefully block additional join removals as a result
of adding UniqueKeys, but likely shouldn't go to any trouble to make
additional ones work. That can be done later.

> Since you said "far from perfect" twice for this point and I only got one
> reason (we may plan a node which we remove later), did I miss the other one?

a) additional planning work by not removing the join sooner. b) wrong
total page calculation.

In theory b) could be fixed by subtracting the removed join rels pages
after we remove it, but unfortunately, there's no point since we've
built the paths by that time already and we really only use the value
to determine how much IO is going to be random vs sequential, which is
determined during set_base_rel_pathlists()

>> d. not_null_cols should not be a Relids type, it should be Bitmapset.
>>
> If it is a Bitmapset, we usually have to pass it with "&".  Is that our
> practice?

Well, a Bitmapset pointer.   Relids is saved for range table indexes.
Storing anything else in there is likely to lead to confusion.

>> 12. The comment in the code below is not true. The List contains
>> Lists, of which contain UniqueKeys
>>
>> List*uniquekeys; /* List of UniqueKey */
>>
> It is a list of UniqueKey,  the UniqueKey can have a list of exprs.

Hmm, so this is what I called a UniqueKeySet in the original patch.
I'm a bit divided by that change. With PathKeys, technically you can
make use of a Path with a given set of PathKeys if you only require
some leading subset of those keys.  That's not the case for
UniqueKeys, it's all or nothing, so perhaps having the singular name
is better than the plural name I gave it. However, I'm not certain.

(Really PathKey does not seem like a great name in the first place
since it has nothing to do with keys)

>> 13. I'm having trouble parsing the final sentence in:
>>
>> + * can only guarantee the uniqueness without considering the null values. 
>> This
>> + * field is necessary for remove_useless_join & reduce_unique_semijions 
>> since
>> + * these cases don't care about the null values.
>>
>> Why is the field which stores the nullability of the key required for
>> code that does not care about the nullability of the key?
>>
> The guarantee is introduced to for the following cases:
>
> create table t1 (a int primary key, b int);
> create table t2 (a int primary key, b int);
> select .. from t1,  (select b from t2 group by b) t2 ..;
>
> -- b is nullable.  so t2(b) can't be a normal UniqueKey (which means b may 
> have some
> duplicated rows)
> create unique index t2_uk_b on t2(b);
>
> -- the left join still can be removed since t2.b is a unique index and the 
> nullable
> doesn't matter here.
> select t1.* from t1 left join t2 on (t1.b = t2.b);
>
> do you think we have can do some optimization in this case? I don't understand
> your question well.

OK, so by "don't care", you mean, don't duplicate NULL values.  I
assumed you had meant that it does not matter either way, as in: don't
mind if there are NULL values or not. It might be best to have a go at
changing the wording to be more explicit to what you mean there.




Re: [Proposal] Global temporary tables

2020-03-31 Thread 曾文旌


> On March 31, 2020 at 9:59 PM, Prabhat Sahu wrote:
> 
> Hi Wenjing,
> Thanks for the new patch.
> I saw with the patch(gtt_v23.patch), we are supporting the new concept 
> "global temporary sequence"(i.e. session-specific sequence), is this 
> intentional?
It was supported in earlier versions.
This causes the sequence built into the GTT to automatically become a
"global temp sequence",
Such as create global temp table (a serial);
Like GTT, the global temp sequence is used individually for each session.

Recently, I added the global temp sequence syntax so that it can be created 
independently.
The purpose of this is to enable such sequence built into the GTT to support 
pg_dump and pg_restore.


Wenjing


> 
> postgres=# create global temporary sequence gt_seq;
> CREATE SEQUENCE
> postgres=# create sequence seq;
> CREATE SEQUENCE
> postgres=# \d+
>   List of relations
>  Schema |  Name  |   Type   | Owner | Persistence |Size| Description 
> ++--+---+-++-
>  public | gt_seq | sequence | edb   | session | 8192 bytes | 
>  public | seq| sequence | edb   | permanent   | 8192 bytes | 
> (2 rows)
> 
> postgres=# select nextval('gt_seq'), nextval('seq');
>  nextval | nextval 
> -+-
>1 |   1
> (1 row)
> 
> postgres=# select nextval('gt_seq'), nextval('seq');
>  nextval | nextval 
> -+-
>2 |   2
> (1 row)
> 
> -- Exit and re-connect to psql prompt:
> postgres=# \q
> [edb@localhost bin]$ ./psql postgres 
> psql (13devel)
> Type "help" for help.
> 
> postgres=# select nextval('gt_seq'), nextval('seq');
>  nextval | nextval 
> -+-
>1 |   3
> (1 row)
> 
> postgres=# select nextval('gt_seq'), nextval('seq');
>  nextval | nextval 
> -+-
>2 |   4
> (1 row)
> 
> On Tue, Mar 31, 2020 at 9:46 AM 曾文旌  > wrote:
> 
> 
>> On March 27, 2020 at 5:21 PM, tushar wrote:
>> On 3/27/20 10:55 AM, 曾文旌 wrote:
 Hi Wenjing,
 This patch(gtt_v21_pg13.patch) is not applicable on PG HEAD, I hope you 
 have prepared the patch on top of some previous commit. 
 Could you please rebase the patch which we can apply on HEAD ?
>>> Yes, It looks like the built-in functions are in conflict with new code.
>>> 
>>> 
>> This error message looks wrong to me -
>> 
>> postgres=# reindex table concurrently t ;
>> ERROR:  cannot create indexes on global temporary tables using concurrent 
>> mode
>> postgres=# 
>> 
>> Better message would be-
>> 
>> ERROR:  cannot reindex global temporary tables concurrently
>> 
> I found that the local temp table automatically disables concurrent mode,
> so I made some improvements: REINDEX on a GTT now behaves the same as on a
> local temp table.
> 
> 
> Wenjing
> 
> 
> 
>> 
>> -- 
>> regards,tushar
>> EnterpriseDB  https://www.enterprisedb.com/ 
>> The Enterprise PostgreSQL Company
> 
> 
> 
> -- 
> With Regards,
> Prabhat Kumar Sahu
> EnterpriseDB: http://www.enterprisedb.com 



smime.p7s
Description: S/MIME cryptographic signature


Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-03-31 Thread Dilip Kumar
On Wed, Apr 1, 2020 at 8:26 AM Masahiko Sawada
 wrote:
>
> On Wed, 1 Apr 2020 at 11:46, Amit Kapila  wrote:
> >
> > On Tue, Mar 31, 2020 at 7:32 PM Dilip Kumar  wrote:
> > >
> > > While testing I have found one issue.  Basically, during a parallel
> > > vacuum, it was showing more number of
> > > shared_blk_hits+shared_blks_read.  After, some investigation, I found
> > > that during the cleanup phase nworkers are -1, and because of this we
> > > didn't try to launch worker but "lps->pcxt->nworkers_launched" had the
> > > old launched worker count and shared memory also had old buffer read
> > > data which was never updated as we did not try to launch the worker.
> > >
> > > diff --git a/src/backend/access/heap/vacuumlazy.c
> > > b/src/backend/access/heap/vacuumlazy.c
> > > index b97b678..5dfaf4d 100644
> > > --- a/src/backend/access/heap/vacuumlazy.c
> > > +++ b/src/backend/access/heap/vacuumlazy.c
> > > @@ -2150,7 +2150,8 @@ lazy_parallel_vacuum_indexes(Relation *Irel,
> > > IndexBulkDeleteResult **stats,
> > >  * Next, accumulate buffer usage.  (This must wait for the 
> > > workers to
> > >  * finish, or we might get incomplete data.)
> > >  */
> > > -   for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> > > +   nworkers = Min(nworkers, lps->pcxt->nworkers_launched);
> > > +   for (i = 0; i < nworkers; i++)
> > > InstrAccumParallelQuery(&lps->buffer_usage[i]);
> > >
> > > It worked after the above fix.
> > >
> >
> > Good catch.  I think we should not even call
> > WaitForParallelWorkersToFinish for such a case.  So, I guess the fix
> > could be,
> >
> > if (workers > 0)
> > {
> > WaitForParallelWorkersToFinish();
> > for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> >  InstrAccumParallelQuery(&lps->buffer_usage[i]);
> > }
> >
>
> Agreed. I've attached the updated patch.
>
> Thank you for testing, Dilip!

Thanks!  One hunk is failing on the latest head, and I have rebased the
patch for my testing, so I'm posting the same.  I have done some more
testing to test multi-pass vacuum.

postgres[114321]=# show maintenance_work_mem ;
 maintenance_work_mem
----------------------
 1MB
(1 row)

--Test case
select pg_stat_statements_reset();
drop table test;
CREATE TABLE test (a int, b int);
CREATE INDEX idx1 on test(a);
CREATE INDEX idx2 on test(b);
INSERT INTO test SELECT i, i FROM GENERATE_SERIES(1,200) as i;
DELETE FROM test where a%2=0;
VACUUM (PARALLEL n) test;
select query, total_time, shared_blks_hit, shared_blks_read,
shared_blks_hit + shared_blks_read as total_read_blks,
shared_blks_dirtied, shared_blks_written from pg_stat_statements where
query like 'VACUUM%';

          query           | total_time  | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
--------------------------+-------------+-----------------+------------------+-----------------+---------------------+---------------------
 VACUUM (PARALLEL 0) test | 5964.282408 |           92447 |                6 |           92453 |               19789 |                   0
(1 row)

          query           |   total_time   | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
--------------------------+----------------+-----------------+------------------+-----------------+---------------------+---------------------
 VACUUM (PARALLEL 1) test | 3957.765881003 |           92447 |                6 |           92453 |               19789 |                   0
(1 row)

So I am getting correct results with the multi-pass vacuum.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


bufferusage_vacuum_v5.patch
Description: Binary data


Re: Tab completion for \gx

2020-03-31 Thread Bruce Momjian


Patch applied through PG 10, thanks.

---

On Thu, Mar 26, 2020 at 09:58:50AM -0700, Vik Fearing wrote:
> While reviewing the patch for \gf, I noticed that \gx does not have tab
> completion for its optional filename.  Trivial patch attached.  I would
> also suggest this be backpatched.
> -- 
> Vik Fearing

> diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
> index ae35fa4aa9..7252b6c4e6 100644
> --- a/src/bin/psql/tab-complete.c
> +++ b/src/bin/psql/tab-complete.c
> @@ -3882,7 +3882,7 @@ psql_completion(const char *text, int start, int end)
>   COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_routines, NULL);
>   else if (TailMatchesCS("\\sv*"))
>   COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_views, NULL);
> - else if (TailMatchesCS("\\cd|\\e|\\edit|\\g|\\i|\\include|"
> + else if (TailMatchesCS("\\cd|\\e|\\edit|\\g|\\gx|\\i|\\include|"
>  
> "\\ir|\\include_relative|\\o|\\out|"
>  
> "\\s|\\w|\\write|\\lo_import"))
>   {


-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-03-31 Thread Masahiko Sawada
On Wed, 1 Apr 2020 at 11:46, Amit Kapila  wrote:
>
> On Tue, Mar 31, 2020 at 7:32 PM Dilip Kumar  wrote:
> >
> > While testing I have found one issue.  Basically, during a parallel
> > vacuum, it was showing more number of
> > shared_blk_hits+shared_blks_read.  After, some investigation, I found
> > that during the cleanup phase nworkers are -1, and because of this we
> > didn't try to launch worker but "lps->pcxt->nworkers_launched" had the
> > old launched worker count and shared memory also had old buffer read
> > data which was never updated as we did not try to launch the worker.
> >
> > diff --git a/src/backend/access/heap/vacuumlazy.c
> > b/src/backend/access/heap/vacuumlazy.c
> > index b97b678..5dfaf4d 100644
> > --- a/src/backend/access/heap/vacuumlazy.c
> > +++ b/src/backend/access/heap/vacuumlazy.c
> > @@ -2150,7 +2150,8 @@ lazy_parallel_vacuum_indexes(Relation *Irel,
> > IndexBulkDeleteResult **stats,
> >  * Next, accumulate buffer usage.  (This must wait for the workers 
> > to
> >  * finish, or we might get incomplete data.)
> >  */
> > -   for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> > +   nworkers = Min(nworkers, lps->pcxt->nworkers_launched);
> > +   for (i = 0; i < nworkers; i++)
> > InstrAccumParallelQuery(&lps->buffer_usage[i]);
> >
> > It worked after the above fix.
> >
>
> Good catch.  I think we should not even call
> WaitForParallelWorkersToFinish for such a case.  So, I guess the fix
> could be,
>
> if (workers > 0)
> {
> WaitForParallelWorkersToFinish();
> for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> >  InstrAccumParallelQuery(&lps->buffer_usage[i]);
> }
>

Agreed. I've attached the updated patch.

Thank you for testing, Dilip!

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


bufferusage_vacuum_v4.patch
Description: Binary data


Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-03-31 Thread Dilip Kumar
On Wed, Apr 1, 2020 at 8:16 AM Amit Kapila  wrote:
>
> On Tue, Mar 31, 2020 at 7:32 PM Dilip Kumar  wrote:
> >
> > While testing I have found one issue.  Basically, during a parallel
> > vacuum, it was showing more number of
> > shared_blk_hits+shared_blks_read.  After, some investigation, I found
> > that during the cleanup phase nworkers are -1, and because of this we
> > didn't try to launch worker but "lps->pcxt->nworkers_launched" had the
> > old launched worker count and shared memory also had old buffer read
> > data which was never updated as we did not try to launch the worker.
> >
> > diff --git a/src/backend/access/heap/vacuumlazy.c
> > b/src/backend/access/heap/vacuumlazy.c
> > index b97b678..5dfaf4d 100644
> > --- a/src/backend/access/heap/vacuumlazy.c
> > +++ b/src/backend/access/heap/vacuumlazy.c
> > @@ -2150,7 +2150,8 @@ lazy_parallel_vacuum_indexes(Relation *Irel,
> > IndexBulkDeleteResult **stats,
> >  * Next, accumulate buffer usage.  (This must wait for the workers 
> > to
> >  * finish, or we might get incomplete data.)
> >  */
> > -   for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> > +   nworkers = Min(nworkers, lps->pcxt->nworkers_launched);
> > +   for (i = 0; i < nworkers; i++)
> > InstrAccumParallelQuery(&lps->buffer_usage[i]);
> >
> > It worked after the above fix.
> >
>
> Good catch.  I think we should not even call
> WaitForParallelWorkersToFinish for such a case.  So, I guess the fix
> could be,
>
> if (workers > 0)
> {
> WaitForParallelWorkersToFinish();
> for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> >  InstrAccumParallelQuery(&lps->buffer_usage[i]);
> }
>
> or something along those lines.

Hmm, Right!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-03-31 Thread Amit Kapila
On Tue, Mar 31, 2020 at 7:32 PM Dilip Kumar  wrote:
>
> While testing I have found one issue.  Basically, during a parallel
> vacuum, it was showing more number of
> shared_blk_hits+shared_blks_read.  After, some investigation, I found
> that during the cleanup phase nworkers are -1, and because of this we
> didn't try to launch worker but "lps->pcxt->nworkers_launched" had the
> old launched worker count and shared memory also had old buffer read
> data which was never updated as we did not try to launch the worker.
>
> diff --git a/src/backend/access/heap/vacuumlazy.c
> b/src/backend/access/heap/vacuumlazy.c
> index b97b678..5dfaf4d 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -2150,7 +2150,8 @@ lazy_parallel_vacuum_indexes(Relation *Irel,
> IndexBulkDeleteResult **stats,
>  * Next, accumulate buffer usage.  (This must wait for the workers to
>  * finish, or we might get incomplete data.)
>  */
> -   for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> +   nworkers = Min(nworkers, lps->pcxt->nworkers_launched);
> +   for (i = 0; i < nworkers; i++)
> InstrAccumParallelQuery(&lps->buffer_usage[i]);
>
> It worked after the above fix.
>

Good catch.  I think we should not even call
WaitForParallelWorkersToFinish for such a case.  So, I guess the fix
could be,

if (workers > 0)
{
WaitForParallelWorkersToFinish();
for (i = 0; i < lps->pcxt->nworkers_launched; i++)
 InstrAccumParallelQuery(&lps->buffer_usage[i]);
}

or something along those lines.
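The shape of the proposed fix, as a sketch (names here are illustrative, not the actual executor API): only wait for workers and accumulate per-worker buffer usage when workers were actually requested, so a stale nworkers_launched left over from an earlier phase is never read.

```python
def accumulate_buffer_usage(nworkers, nworkers_launched, buffer_usage):
    """Sum per-worker buffer usage, guarding against the cleanup phase."""
    total = 0
    if nworkers > 0:                     # cleanup phase passes nworkers <= 0
        # WaitForParallelWorkersToFinish() would go here
        for i in range(min(nworkers, nworkers_launched)):
            total += buffer_usage[i]
    return total

stale = [100, 200]   # left over from an earlier phase with 2 workers
print(accumulate_buffer_usage(-1, 2, stale))  # 0: stale data not counted
print(accumulate_buffer_usage(2, 2, stale))   # 300
```

Without the `nworkers > 0` guard, the cleanup phase (nworkers = -1) would walk the stale array and double-count reads, which is exactly the inflated shared_blks_hit + shared_blks_read Dilip observed.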

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tomas Vondra

On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
 wrote:


On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
> wrote:
>>
>> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
>> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
>> > wrote:
>> >>
>> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
>> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
>> >> > wrote:
>> >> >>
>> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>> >> >> >Tomas Vondra  writes:
>> >> >> >> In general, I think it'd be naive to assume we can make the planner
>> >> >> >> smarter with no extra overhead spent on planning, and that we can
>> >> >> >> never accept patches adding even tiny overhead. With that approach
>> >> >> >> we'd probably end up with a trivial planner that generates just a
>> >> >> >> single query plan, because that's going to be the fastest planner. A
>> >> >> >> realistic approach needs to consider both the planning and execution
>> >> >> >> phases, and the benefits of this patch seem to be clear - if you
>> >> >> >> have queries that do benefit from it.
>> >> >> >
>> >> >> >I think that's kind of attacking a straw man, though.  The thing that
>> >> >> >people push back on, or should push back on IMO, is when a proposed
>> >> >> >patch adds significant slowdown to queries that it has no or very little
>> >> >> >hope of improving.  The trick is to do expensive stuff only when
>> >> >> >there's a good chance of getting a better plan out of it.
>> >> >> >
>> >> >>
>> >> >> Yeah, I agree with that. I think the main issue is that we don't really
>> >> >> know what the "expensive stuff" is in this case, so it's not really
>> >> >> clear how to be smarter :-(
>> >> >
>> >> >To add to this: I agree that ideally you'd check cheaply to know
>> >> >you're in a situation that might help, and then do more work. But here
>> >> >the question is always going to be simply "would we benefit from an
>> >> >ordering, and, if so, do we have it already partially sorted". It's
>> >> >hard to imagine that reducing much conceptually, so we're left with
>> >> >optimizations of that check.
>> >> >
>> >>
>> >> I think it depends on what exactly is the expensive part. For example if
>> >> it's the construction of IncrementalSort paths, then maybe we could try
>> >> to do a quick/cheap check if the path can even be useful by estimating the
>> >> cost and only then building the path.
>> >>
>> >> That's what we do for join algorithms, for example - we first compute
>> >> initial_cost_nestloop and only when that seems cheap enough we do the
>> >> more expensive stuff.
>> >>
>> >> But I'm not sure the path construction is the expensive part, as it
>> >> should be disabled by enable_incrementalsort=off. But the regression
>> >> does not seem to disappear, at least not entirely.
>> >>
>> >>
>> >> >> One possibility is that it's just one of those regressions due to 
change
>> >> >> in binary layout, but I'm not sure how to verify that.
>> >> >
>> >> >If we are testing with a case that can't actually add more paths (due
>> >> >to it checking the guc before building them), doesn't that effectively
>> >> >leave one of these two options:
>> >> >
>> >> >1. Binary layout/cache/other untraceable change, or
>> >> >2. Changes due to refactored function calls.
>> >> >
>> >>
>> >> Hmm, but in case of (1) the overhead should be there even with tests
>> >> that don't really have any additional paths to consider, right? I've
>> >> tried with such a test (single table with no indexes) and I don't quite
>> >> see any regression (maybe ~1%).
>> >
>> >Not necessarily, if the cost is in sort costing or useful pathkeys
>> >checking, right? We have run that code even without incremental sort,
>> >but it's changed from master.
>> >
>>
>> Ah, I should have mentioned I've done most of the tests on just the
>> basic incremental sort patch (0001+0002), without the additional useful
>> paths. I initially tested the whole patch series, but after discovering
>> the regression I removed the last part (which I suspected might be the
>> root cause). But the regression is still there, so it's not that.
>>
>> It might be in the reworked costing, yeah. But then I'd expect those
>> functions to show up in the perf profile.
>
>Right. I'm just grasping at straws on that.
>
>> >> (2) might have impact, but I don't see any immediate suspects. Did you
>> >> have some functions in mind?
>> >
>> >I guess this is where the lines blur: I didn't see anything obvious
>> >either, but the changes to sort costing...should probably not have
>> >real impact...but...
>> >
>>
>> :-(
>>
>> >> BTW I see the patch adds pathkeys_common but it's not called from
>> >> anywhere. It's probably leftover from an earlier patch version.
>> >>
>
>BTW, I think I'm going to rename the pathkeys_common_contained_in
>function to something like pathkeys_count_contained_in, unless you
>have an objection to that. The name doesn't seem obvious at all to me.

Re: recovery_target_action=pause with confusing hint

2020-03-31 Thread movead...@highgo.ca

>> When I test the patch, I find an issue: I start a stream with 
>> 'promote_trigger_file'
>> GUC valid, and exec pg_wal_replay_pause() during recovery, and as below it
>> shows success to pause the first time. I think it uses an initialized
>> 'SharedPromoteIsTriggered' value the first time I exec pg_wal_replay_pause().
 
>hm. Are you sure this is related to this patch? Could you explain the exact 
>timing? I mean log_statement = all and relevant logs.
>Most likely this is expected change by 
>https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=496ee647ecd2917369ffcf1eaa0b2cdca07c8730
> 
>My proposal does not change the behavior after this commit, only changing the 
>lines in the logs.
 
I tested it again with (92d31085e926253aa650b9d1e1f2f09934d0ddfc), and the
issue appeared again. Here is my test method, which is quite simple:
1. Setup a base backup by pg_basebackup.
2. Insert lots of data on the master so that I have enough time to exec
   pg_wal_replay_pause() when starting up the replication.
3. Configure the 'promote_trigger_file' GUC and create the trigger file.
4. Start the backup (standby), connect to it immediately, and exec
   pg_wal_replay_pause().
Then the issue appears; a test log is attached.

I mean that when I exec pg_wal_replay_pause() for the first time, nobody has
checked the trigger state via CheckForStandbyTrigger(), so it uses the
initialized 'SharedPromoteIsTriggered' value.
The attached patch solves the issue.






Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca



pg_wal_replay_pause_issue.patch
Description: Binary data


postgresql-2020-04-01_100849.log
Description: Binary data


Re: Issues with building cpp extensions on PostgreSQL 10+

2020-03-31 Thread Bruce Momjian


Patch applied to PG 10, thanks. I don't know about your other questions,
but if you want to propose a patch, we can review it.  Thanks.

---

On Fri, Mar 20, 2020 at 05:02:15PM +0100, Oleksii Kliukin wrote:
> Hello,
> 
> I’ve recently tried to build an extension that employs C++ files and also
> passes them to a linker to make a shared library. I’ve discovered a few
> issues with them:
> 
> - in v10 CFLAGS_SL is not appended to the CXXFLAGS in Makefile.shlib,
> resulting in cpp files compiled without -fPIC, leading to errors when
> creating the shared library out of them. In v11 and above CFLAGS_SL is
> prepended to the PG_CXXFLAGS, but there are no PG_CXXFLAGS on v10, and the
> Makefile does nothing to add them to CXXFLAGS. Patch is attached.
> 
> - not just with v10, when building bc files from cpp, there are no CXXFLAGS
> passed; as a result, when building a source with non-standard flags (i.e
> -std=c++11) one would get an error during building of bc files.
> 
> The rules in the Makefile.global.(in) look like:
> 
> ifndef COMPILE.c.bc
> # -Wno-ignored-attributes added so gnu_printf doesn't trigger
> # warnings, when the main binary is compiled with C.
> COMPILE.c.bc = $(CLANG) -Wno-ignored-attributes $(BITCODE_CFLAGS) $(CPPFLAGS) 
> -flto=thin -emit-llvm -c
> endif
> 
> ifndef COMPILE.cxx.bc
> COMPILE.cxx.bc = $(CLANG) -xc++ -Wno-ignored-attributes $(BITCODE_CXXFLAGS) 
> $(CPPFLAGS) -flto=thin -emit-llvm -c
> endif
> 
> %.bc : %.c
>   $(COMPILE.c.bc) -o $@ $<
> 
> %.bc : %.cpp
>   $(COMPILE.cxx.bc) -o $@ $<
> 
> However, there seems to be no way to override BITCODE_CXXFLAGS to include
> any specific C++ compilation flags that are also required to build object
> files from cpp. Same applies to .bc derived from .c files with
> BITCODE_CFLAGS respectively.
> 
> I am wondering if we could define something like PG_BITCODE_CXXFLAGS and
> PG_BITCODE_CFLAGS in pgxs.mk to be able to override those. If this sounds
> like the right strategy, I’ll prepare a patch.
> 
> Cheers,
> Oleksii “Alex” Kliukin
> 

> diff --git a/src/Makefile.shlib b/src/Makefile.shlib
> index eb45daedc8..342496eecd 100644
> --- a/src/Makefile.shlib
> +++ b/src/Makefile.shlib
> @@ -101,6 +101,7 @@ endif
>  # Try to keep the sections in some kind of order, folks...
>  
>  override CFLAGS += $(CFLAGS_SL)
> +override CXXFLAGS += $(CFLAGS_SL)
>  ifdef SO_MAJOR_VERSION
>  # libraries ought to use this to refer to versioned gettext domain names
>  override CPPFLAGS += -DSO_MAJOR_VERSION=$(SO_MAJOR_VERSION)


-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: error context for vacuum to include block number

2020-03-31 Thread Amit Kapila
On Tue, Mar 31, 2020 at 8:53 AM Justin Pryzby  wrote:
>
> On Tue, Mar 31, 2020 at 07:50:45AM +0530, Amit Kapila wrote:
> > One thing I have noticed is that there is some saving by using
> > vacrelstats->relnamespace as that avoids sys cache lookup.  OTOH,
> > using vacrelstats->relname doesn't save much, but maybe for the sake
> > of consistency, we can use it.
>
> Mostly I wrote that to avoid repeatedly calling functions/macros with long
> names.
> I consider it a minor cleanup.  I think we should put them to use.  The
> LVRelStats describes them as not being specifically for the error context.
>

Pushed. I think we are done here.  The patch is marked as committed in
CF.  Thank you!


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-31 Thread Michael Paquier
On Tue, Mar 31, 2020 at 03:48:21PM +0900, Michael Paquier wrote:
> Thanks, committed 0001 after fixing the order of the headers.  One
> patch left.

And committed now 0002, meaning that we are officially done.  Thanks
Alexey for your patience.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread James Coleman
On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
 wrote:
>
> On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
> > wrote:
> >>
> >> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> >> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> >> > wrote:
> >> >>
> >> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
> >> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> >> >> > wrote:
> >> >> >>
> >> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >> >> >> >Tomas Vondra  writes:
> >> >> >> >> In general, I think it'd be naive to think that we can make the
> >> >> >> >> planner smarter with
> >> >> >> >> no extra overhead spent on planning, and we can never accept 
> >> >> >> >> patches
> >> >> >> >> adding even tiny overhead. With that approach we'd probably end 
> >> >> >> >> up with
> >> >> >> >> a trivial planner that generates just a single query plan, because
> >> >> >> >> that's going to be the fastest planner. A realistic approach 
> >> >> >> >> needs to
> >> >> >> >> consider both the planning and execution phase, and benefits of 
> >> >> >> >> this
> >> >> >> >> patch seem to be clear - if you have queries that do benefit from 
> >> >> >> >> it.
> >> >> >> >
> >> >> >> >I think that's kind of attacking a straw man, though.  The thing 
> >> >> >> >that
> >> >> >> >people push back on, or should push back on IMO, is when a proposed
> >> >> >> >patch adds significant slowdown to queries that it has no or very 
> >> >> >> >little
> >> >> >> >hope of improving.  The trick is to do expensive stuff only when
> >> >> >> >there's a good chance of getting a better plan out of it.
> >> >> >> >
> >> >> >>
> >> >> >> Yeah, I agree with that. I think the main issue is that we don't 
> >> >> >> really
> >> >> >> know what the "expensive stuff" is in this case, so it's not really
> >> >> >> clear how to be smarter :-(
> >> >> >
> >> >> >To add to this: I agree that ideally you'd check cheaply to know
> >> >> >you're in a situation that might help, and then do more work. But here
> >> >> >the question is always going to be simply "would we benefit from an
> >> >> >ordering, and, if so, do we have it already partially sorted". It's
> >> >> >hard to imagine that reducing much conceptually, so we're left with
> >> >> >optimizations of that check.
> >> >> >
> >> >>
> >> >> I think it depends on what exactly is the expensive part. For example if
> >> >> it's the construction of IncrementalSort paths, then maybe we could try
> >> >> to do a quick/cheap check if the path can even be useful by estimating the
> >> >> cost and only then building the path.
> >> >>
> >> >> That's what we do for join algorithms, for example - we first compute
> >> >> initial_cost_nestloop and only when that seems cheap enough we do the
> >> >> more expensive stuff.
> >> >>
> >> >> But I'm not sure the path construction is the expensive part, as it
> >> >> should be disabled by enable_incrementalsort=off. But the regression
> >> >> does not seem to disappear, at least not entirely.
> >> >>
> >> >>
> >> >> >> One possibility is that it's just one of those regressions due to 
> >> >> >> change
> >> >> >> in binary layout, but I'm not sure how to verify that.
> >> >> >
> >> >> >If we are testing with a case that can't actually add more paths (due
> >> >> >to it checking the guc before building them), doesn't that effectively
> >> >> >leave one of these two options:
> >> >> >
> >> >> >1. Binary layout/cache/other untraceable change, or
> >> >> >2. Changes due to refactored function calls.
> >> >> >
> >> >>
> >> >> Hmm, but in case of (1) the overhead should be there even with tests
> >> >> that don't really have any additional paths to consider, right? I've
> >> >> tried with such a test (single table with no indexes) and I don't quite
> >> >> see any regression (maybe ~1%).
> >> >
> >> >Not necessarily, if the cost is in sort costing or useful pathkeys
> >> >checking, right? We have run that code even without incremental sort,
> >> >but it's changed from master.
> >> >
> >>
> >> Ah, I should have mentioned I've done most of the tests on just the
> >> basic incremental sort patch (0001+0002), without the additional useful
> >> paths. I initially tested the whole patch series, but after discovering
> >> the regression I removed the last part (which I suspected might be the
> >> root cause). But the regression is still there, so it's not that.
> >>
> >> It might be in the reworked costing, yeah. But then I'd expect those
> >> functions to show up in the perf profile.
> >
> >Right. I'm just grasping at straws on that.
> >
> >> >> (2) might have impact, but I don't see any immediate suspects. Did you
> >> >> have some functions in mind?
> >> >
> >> >I guess this is where the lines blur: I didn't see anything obvious
> >> >either, but the changes to sort costing...should probably not have
> >> >real impact...but...
> >> >
> >>
> >> :-(
> >>
> >> >> BTW I see the patch adds pathkeys_common but it's not called from
> >> >> anywhere. It's probably leftover from an earlier patch version.

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-03-31 Thread Masahiko Sawada
On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
 wrote:
>
> On Fri, 27 Mar 2020 at 13:54, Fujii Masao  wrote:
> >
> >
> >
> > On 2020/03/27 10:26, Tom Lane wrote:
> > > Twice in the past month [1][2], buildfarm member hoverfly has managed
> > > to reach the "unreachable" Assert(false) at the end of
> > > SyncRepGetSyncStandbysPriority.
> >
> > When I search the past discussions, I found that Noah Misch reported
> > the same issue.
> > https://www.postgresql.org/message-id/20200206074552.gb3326...@rfd.leadboat.com
> >
> > > What seems likely to me, after quickly eyeballing the code, is that
> > > hoverfly is hitting the blatantly-obvious race condition in that function.
> > > Namely, that the second loop supposes that the state of the walsender
> > > array hasn't changed since the first loop.
> > >
> > > The minimum fix for this, I suppose, would have the first loop capture
> > > the sync_standby_priority value for each walsender along with what it's
> > > already capturing.  But I wonder if the whole function shouldn't be
> > > rewritten from scratch, because it seems like the algorithm is both
> > > expensively brute-force and unintelligible, which is a sad combination.
> > > It's likely that the number of walsenders would never be high enough
> > > that efficiency could matter, but then couldn't we use an algorithm
> > > that is less complicated and more obviously correct?
> >
> > +1 to rewrite the function with better algorithm.
> >
> > > (Because the
> > > alternative conclusion, if you reject the theory that a race is happening,
> > > is that the algorithm is just flat out buggy; something that's not too
> > > easy to disprove either.)
> > >
> > > Another fairly dubious thing here is that whether or not *am_sync
> > > gets set depends not only on whether MyWalSnd is claiming to be
> > > synchronous but on how many lower-numbered walsenders are too.
> > > Is that really the right thing?
> > >
> > > But worse than any of that is that the return value seems to be
> > > a list of walsender array indexes, meaning that the callers cannot
> > > use it without making even stronger assumptions about the array
> > > contents not having changed since the start of this function.
> > >
> > > It sort of looks like the design is based on the assumption that
> > > the array contents can't change while SyncRepLock is held ... but
> > > if that's the plan then why bother with the per-walsender spinlocks?
> > > In any case this assumption seems to be failing, suggesting either
> > > that there's a caller that's not holding SyncRepLock when it calls
> > > this function, or that somebody is failing to take that lock while
> > > modifying the array.
> >
> > As far as I read the code, that assumption seems still valid. But the 
> > problem
> > is that each walsender updates MyWalSnd->sync_standby_priority at each
> > convenient timing, when SIGHUP is signaled. That is, at a certain moment,
> > some walsenders (also their WalSnd entries in shmem) work based on
> > the latest configuration but the others (also their WalSnd entries) work 
> > based
> > on the old one.
> >
> > lowest_priority = SyncRepConfig->nmembers;
> > next_highest_priority = lowest_priority + 1;
> >
> > SyncRepGetSyncStandbysPriority() calculates the lowest priority among
> > all running walsenders as the above, by using the configuration info that
> > this walsender is based on. But this calculated lowest priority would be
> > invalid if another walsender is based on a different (e.g., old) configuration.
> > This can cause the (other) walsender to have lower priority than
> > the calculated lowest priority and the second loop in
> > SyncRepGetSyncStandbysPriority() to unexpectedly end.
>
> I have the same understanding. Since sync_standby_priority is
> protected by SyncRepLock, these values for each walsender do not
> change between the two loops in SyncRepGetSyncStandbysPriority().
> However, as Fujii-san already mentioned, the true lowest priority can
> be lower than lowest_priority (nmembers) when only part of the walsenders
> have reloaded the configuration, which in turn can be the cause of
> leaving entries in the pending list at the end of the function.
>
> > Therefore, the band-aid fix seems to be to set the lowest priority to
> > very large number at the beginning of SyncRepGetSyncStandbysPriority().
>
> I think we can use max_wal_senders.

Sorry, that's not true. We need another number large enough.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tomas Vondra

On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
 wrote:


On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> wrote:
>>
>> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
>> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
>> > wrote:
>> >>
>> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>> >> >Tomas Vondra  writes:
>> >> >> In general, I think it'd be naive to think that we can make the planner smarter with
>> >> >> no extra overhead spent on planning, and we can never accept patches
>> >> >> adding even tiny overhead. With that approach we'd probably end up with
>> >> >> a trivial planner that generates just a single query plan, because
>> >> >> that's going to be the fastest planner. A realistic approach needs to
>> >> >> consider both the planning and execution phase, and benefits of this
>> >> >> patch seem to be clear - if you have queries that do benefit from it.
>> >> >
>> >> >I think that's kind of attacking a straw man, though.  The thing that
>> >> >people push back on, or should push back on IMO, is when a proposed
>> >> >patch adds significant slowdown to queries that it has no or very little
>> >> >hope of improving.  The trick is to do expensive stuff only when
>> >> >there's a good chance of getting a better plan out of it.
>> >> >
>> >>
>> >> Yeah, I agree with that. I think the main issue is that we don't really
>> >> know what the "expensive stuff" is in this case, so it's not really
>> >> clear how to be smarter :-(
>> >
>> >To add to this: I agree that ideally you'd check cheaply to know
>> >you're in a situation that might help, and then do more work. But here
>> >the question is always going to be simply "would we benefit from an
>> >ordering, and, if so, do we have it already partially sorted". It's
>> >hard to imagine that reducing much conceptually, so we're left with
>> >optimizations of that check.
>> >
>>
>> I think it depends on what exactly is the expensive part. For example if
>> it's the construction of IncrementalSort paths, then maybe we could try
>> to do a quick/cheap check if the path can even be useful by estimating the
>> cost and only then building the path.
>>
>> That's what we do for join algorithms, for example - we first compute
>> initial_cost_nestloop and only when that seems cheap enough we do the
>> more expensive stuff.
>>
>> But I'm not sure the path construction is the expensive part, as it
>> should be disabled by enable_incrementalsort=off. But the regression
>> does not seem to disappear, at least not entirely.
>>
>>
>> >> One possibility is that it's just one of those regressions due to change
>> >> in binary layout, but I'm not sure how to verify that.
>> >
>> >If we are testing with a case that can't actually add more paths (due
>> >to it checking the guc before building them), doesn't that effectively
>> >leave one of these two options:
>> >
>> >1. Binary layout/cache/other untraceable change, or
>> >2. Changes due to refactored function calls.
>> >
>>
>> Hmm, but in case of (1) the overhead should be there even with tests
>> that don't really have any additional paths to consider, right? I've
>> tried with such a test (single table with no indexes) and I don't quite
>> see any regression (maybe ~1%).
>
>Not necessarily, if the cost is in sort costing or useful pathkeys
>checking, right? We have run that code even without incremental sort,
>but it's changed from master.
>

Ah, I should have mentioned I've done most of the tests on just the
basic incremental sort patch (0001+0002), without the additional useful
paths. I initially tested the whole patch series, but after discovering
the regression I removed the last part (which I suspected might be the
root cause). But the regression is still there, so it's not that.

It might be in the reworked costing, yeah. But then I'd expect those
functions to show up in the perf profile.


Right. I'm just grasping at straws on that.


>> (2) might have impact, but I don't see any immediate suspects. Did you
>> have some functions in mind?
>
>I guess this is where the lines blur: I didn't see anything obvious
>either, but the changes to sort costing...should probably not have
>real impact...but...
>

:-(

>> BTW I see the patch adds pathkeys_common but it's not called from
>> anywhere. It's probably leftover from an earlier patch version.
>>


BTW, I think I'm going to rename the pathkeys_common_contained_in
function to something like pathkeys_count_contained_in, unless you
have an objection to that. The name doesn't seem obvious at all to me.



WFM


>> >There's not anything obvious in point (2) that would be a big cost,
>> >but there are definitely changes there. I was surprised that just
>> >eliminating the loop through the pathkeys on the query and the index
>> >was enough to save us ~4%.
>> >
>> >Tomas: Earlier you'd wondered about if we should try to shortcut the
>> >changes 

Re: backend type in log_line_prefix?

2020-03-31 Thread Bruce Momjian
On Fri, Mar 27, 2020 at 04:30:07PM +0900, Kyotaro Horiguchi wrote:
> Hello.
> 
> At Mon, 23 Mar 2020 18:38:53 -0400, Bruce Momjian  wrote in 
> > Patch applied to master, thanks.
> 
> The patch (8e8a0becb3) named archiver process as just "archiver".  On
> the other hand the discussion in the thread [1] was going to name the
> process as "WAL/wal archiver".  As all other processes related to WAL
> are named as walreceiver, walsender, walwriter, wouldn't we name the
> process like "wal archiver"?
> 
> [1]: 
> https://www.postgresql.org/message-id/20200319195410.icib45bbgjwqb...@alap3.anarazel.de

Agreed.  I ended up making "wal" a separate word, since it looks
cleaner;  patch attached.  Tools that look for the backend type in
pg_stat_activity would need to be adjusted;  it would be an
incompatibility.  Maybe changing it would cause too much disruption.


-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index a7b7b12249..2d625ee01e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -221,16 +221,16 @@ GetBackendTypeDesc(BackendType backendType)
 			backendDesc = "startup";
 			break;
 		case B_WAL_RECEIVER:
-			backendDesc = "walreceiver";
+			backendDesc = "wal receiver";
 			break;
 		case B_WAL_SENDER:
-			backendDesc = "walsender";
+			backendDesc = "wal sender";
 			break;
 		case B_WAL_WRITER:
-			backendDesc = "walwriter";
+			backendDesc = "wal writer";
 			break;
 		case B_ARCHIVER:
-			backendDesc = "archiver";
+			backendDesc = "wal archiver";
 			break;
 		case B_STATS_COLLECTOR:
 			backendDesc = "stats collector";


Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Kyotaro Horiguchi
At Tue, 31 Mar 2020 16:59:05 -0300, Alvaro Herrera  
wrote in 
> On 2020-Mar-31, Alvaro Herrera wrote:
> 
> > On 2020-Mar-31, Alvaro Herrera wrote:
> > 
> > > I'm not sure if I explained my proposal clearly.  What if
> > > XLogGetLastRemovedSegno returning zero means that every segment is
> > > valid?  We don't need to scan pg_xlog at all.
> > 
> > I mean this:
> 
> [v21 does it that way.  Your typo fixes are included, but not the
> LastRemoved stuff being discussed here.  I also edited the shortdesc in
> guc.c to better match {min,max}_wal_size.]
> 
> Hmm ... but if the user runs pg_resetwal to remove WAL segments, then
> this will work badly for a time (until a segment is removed next).  I'm
> not very worried for that scenario, since surely the user will have to
> reclone any standbys anyway.  I think your v20 behaves better in that
> case.  But I'm not sure we should have that code to cater only to that
> case ... seems to me that it will go untested 99.999% of the time.

I feel the same. If we allow a bogus or "unknown" status before
the first checkpoint, we don't need to scan the directory.

> Maybe you're aware of some other cases where lastRemovedSegNo is not
> correct for the purposes of this feature?

The cases of archive failure (false "removed") and a change of
max_slot_wal_keep_size (false "normal/kept"), mentioned in another mail.

> I pushed the silly test_decoding test adjustment to get it out of the
> way.
> 
> /me tries to figure out KeepLogSeg next

Thanks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Kyotaro Horiguchi
At Tue, 31 Mar 2020 14:18:36 -0300, Alvaro Herrera  
wrote in 
> On 2020-Mar-31, Alvaro Herrera wrote:
> 
> > I'm not sure if I explained my proposal clearly.  What if
> > XLogGetLastRemovedSegno returning zero means that every segment is
> > valid?  We don't need to scan pg_xlog at all.
> 
> I mean this:
> 
> XLogSegNo
> FindOldestXLogFileSegNo(void)
> {
>   XLogSegNo segno = XLogGetLastRemovedSegno();
> 
>   /* this is the only special case we need to care about */
>   if (segno == 0)
>   return some-value;
> 
>   return segno + 1;
> }
> 
> ... and that point one can further note that a freshly initdb'd system
> (no file has been removed) has "1" as the first file.  So when segno is
> 0, you can return 1 and all should be well.  That means you can reduce
> the function to this:

If we don't scan the WAL files then, for example (somewhat artificial), if
segments cannot be removed because of a wrong archive_command setting,
GetWalAvailability can return a false "removed (lost)" state.  If
max_slot_wal_keep_size is shrunk and the server is then restarted, the
function can return false "normal" or "keeping" states.

By the way, the oldest segment of an initdb'ed cluster was the (14x)th for
me. So I think we can treat segno == 1 as an "uncertain" or "unknown"
state, but that state lasts until a checkpoint actually removes a
segment.

> XLogSegNo
> FindOldestXLogFileSegNo(void)
> {
>   return XLogGetLastRemovedSegno() + 1;
> }
> 
> 
> The tests still pass with this coding.

Mmm. Yeah, that affects behavior under an abnormal condition.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: control max length of parameter values logged

2020-03-31 Thread Tom Lane
Alexey Bashtanov  writes:
> Sorry for the delay, please could you have a look?

Got it, will look tomorrow.  (I think it's important to get this
done for v13, before we ship this behavior.)

regards, tom lane




Re: control max length of parameter values logged

2020-03-31 Thread Alexey Bashtanov



Also agreed.  It's been like it is for a long time with not that
many complaints, so the case for changing the default behavior
seems a bit weak.

Barring other opinions, I think we have consensus here on what
to do.  Alexey, will you update your patch?


Sorry for the delay, please could you have a look?

Best, Alex
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..35d420e026 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6713,6 +6713,38 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
   
  
 
+ 
+  log_parameter_max_length (integer)
+  
+   log_parameter_max_length configuration parameter
+  
+  
+  
+   
+If greater than zero, bind parameter values reported in non-error
+statement-logging messages are trimmed to no more than this many bytes.
+If this value is specified without units, it is taken as bytes.
+Zero (the default) disables trimming. Only superusers can change this setting.
+   
+  
+ 
+
+ 
+  log_parameter_max_length_on_error (integer)
+  
+   log_parameter_max_length_on_error configuration parameter
+  
+  
+  
+   
+If greater than zero, bind parameter values reported in error messages
+are trimmed to no more than this many bytes.
+If this value is specified without units, it is taken as bytes.
+Zero (the default) disables trimming.
+   
+  
+ 
+
  
   log_statement (enum)
   
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5b677863b9..ea2fe17e0b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1846,13 +1846,24 @@ exec_bind_message(StringInfo input_message)
 		if (knownTextValues == NULL)
 			knownTextValues =
 palloc0(numParams * sizeof(char *));
-		/*
-		 * Note: must copy at least two more full characters
-		 * than BuildParamLogString wants to see; otherwise it
-		 * might fail to include the ellipsis.
-		 */
-		knownTextValues[paramno] =
-			pnstrdup(pstring, 64 + 2 * MAX_MULTIBYTE_CHAR_LEN);
+
+		if (log_parameter_max_length_on_error > 0)
+		{
+			/*
+			 * We can trim the saved string, knowing that we
+			 * won't print all of it.  But we must copy at
+			 * least two more full characters than
+			 * BuildParamLogString wants to use; otherwise it
+			 * might fail to include the trailing ellipsis.
+			 */
+			knownTextValues[paramno] =
+pnstrdup(pstring,
+		 log_parameter_max_length_on_error
+		 + 2 * MAX_MULTIBYTE_CHAR_LEN);
+		}
+		else
+			knownTextValues[paramno] = pstrdup(pstring);
+
 		MemoryContextSwitchTo(oldcxt);
 	}
 	if (pstring != pbuf.data)
@@ -1915,7 +1926,9 @@ exec_bind_message(StringInfo input_message)
 		 */
 		if (log_parameters_on_error)
 			params->paramValuesStr =
-BuildParamLogString(params, knownTextValues, 64);
+BuildParamLogString(params,
+	knownTextValues,
+	log_parameter_max_length_on_error);
 	}
 	else
 		params = NULL;
@@ -2404,7 +2417,7 @@ errdetail_params(ParamListInfo params)
 	{
 		char   *str;
 
-		str = BuildParamLogString(params, NULL, 0);
+		str = BuildParamLogString(params, NULL, log_parameter_max_length);
 		if (str && str[0] != '\0')
 			errdetail("parameters: %s", str);
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..a3903d1b16 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -544,6 +544,8 @@ int			log_min_messages = WARNING;
 int			client_min_messages = NOTICE;
 int			log_min_duration_sample = -1;
 int			log_min_duration_statement = -1;
+int			log_parameter_max_length = 0;
+int			log_parameter_max_length_on_error = 0;
 int			log_temp_files = -1;
 double		log_statement_sample_rate = 1.0;
 double		log_xact_sample_rate = 0;
@@ -2855,6 +2857,28 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"log_parameter_max_length", PGC_SUSET, LOGGING_WHAT,
+			gettext_noop("When logging statements, limit logged parameter values to first N bytes."),
+			gettext_noop("Zero to print values in full."),
+			GUC_UNIT_BYTE
+		},
+		&log_parameter_max_length,
+		0, 0, INT_MAX / 2,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"log_parameter_max_length_on_error", PGC_USERSET, LOGGING_WHAT,
+			gettext_noop("When reporting an error, limit logged parameter values to first N bytes."),
+			gettext_noop("Zero to print values in full."),
+			GUC_UNIT_BYTE
+		},
+		&log_parameter_max_length_on_error,
+		0, 0, INT_MAX / 2,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"bgwriter_delay", PGC_SIGHUP, RESOURCES_BGWRITER,
 			gettext_noop("Background writer sleep time between rounds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..8899f174c7 100644
--- 

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-03-31 Thread Andy Fan
Thanks, David, for your time.  I will address every item you mentioned
in the updated patch.  For now I'll just talk about some of them.


> 1. There seem to be some cases where joins are no longer being
> detected as unique. This is evident in postgres_fdw.out. We shouldn't
> be regressing any of these cases.


You are correct; the issue here is that I didn't distinguish the one-row
case in the UniqueKey struct.  Actually, when an outer relation is joined
with a relation which has only one row, at most one row can match the
outer side.  The only-one-row case in postgres_fdw.out comes from an
aggregation call.  I will add one more field, "bool onerow", to the
UniqueKey struct, and also try your optimization suggestion for the
onerow UniqueKey.


> 2. The following change does not seem like it should be part of this
> patch.  I understand you perhaps have done as you think it will
> improve the performance of checking if an expression is in a list of
> expressions.
>
> - COMPARE_SCALAR_FIELD(varno);
> + /* Compare varattno first since it has higher selectivity than varno */
>   COMPARE_SCALAR_FIELD(varattno);
> + COMPARE_SCALAR_FIELD(varno);
>
I did have some hesitation when I made this change, so I'll roll it back
in the next version of the patch.


> If you think that is true, then please do it as a separate effort and
> provide benchmarks with your findings.
>
> 3. list_all_members_in. I think this would be better named as
> list_is_subset. Please follow the lead of bms_is_subset().
> Additionally, you should Assert that IsPointerList is true as there's
> nothing else to indicate that it can't be used for an int or Oid list.
>
> 4. guarantee is not a very good name for the field in UniqueKey.
> Maybe something like is_not_null?
>
>

> 5. I think you should be performing a bms_del_member during join
> removal rather than removing this Assert()
>
> - Assert(bms_equal(rel->relids, root->all_baserels));
>
> FWIW, it's far from perfect that you've needed to delay the left join
> removal, but I do understand why you've done it. It's also far from
> perfect that you're including removed relations in the
> total_table_pages calculation. c6e4133fae1 took some measures to
> improve this calculation and this is making it worse again.
>
Since the removed relation depends on the UniqueKey, which in the current
code has to be calculated after the total_table_pages calculation, that's
something I must do.  But if the relation is not removable, there is no
waste at all.  If it is removable, the gain will be much higher than the
loss.  I'm not sure this should be a concern.

Actually, it looks like the current remove_useless_joins has some limits
and cannot remove a joinrel in some cases; I still haven't figured out
why.  In the past we had only a limited ability to detect uniqueness
after a join, so that was acceptable.  Since we have that ability now,
this may be another opportunity to improve the join_is_removable
function, but I'd rather not put such a thing in this patch.

Since you said "far from perfect" twice for this point and I only got one
reason (we may plan a node which we remove later), did I miss the other
one?

> 6. Can you explain why you moved the populate_baserel_uniquekeys()
> call out of set_plain_rel_size()?

This is to be consistent with populate_partitionedrel_uniquekeys, which
is called from set_append_rel_pathlist.


> 7. I don't think the removal of rel_supports_distinctness() is
> warranted.  Is it not ok to check if the relation has any uniquekeys?
>

I think this is a good suggestion.  I will follow that.

> 8. Your spelling of unique is incorrect in many places:
>
> src/backend/nodes/makefuncs.c: * makeUnqiueKey
> src/backend/optimizer/path/uniquekeys.c:static List
> *initililze_unqiuecontext_for_joinrel(RelOptInfo *joinrel,
> src/backend/optimizer/path/uniquekeys.c: * check if combination of
> unqiuekeys from both side is still useful for us,
> src/backend/optimizer/path/uniquekeys.c:outerrel_uniquekey_ctx
> = initililze_unqiuecontext_for_joinrel(joinrel, outerrel);
> src/backend/optimizer/path/uniquekeys.c:innerrel_uniquekey_ctx
> = initililze_unqiuecontext_for_joinrel(joinrel, innerrel);
> src/backend/optimizer/path/uniquekeys.c: * we need to convert the
> UnqiueKey from sub_final_rel to currel via the positions info in
> src/backend/optimizer/path/uniquekeys.c:ctx->pos =
> pos; /* the position in current targetlist,  will be used to set
> UnqiueKey */
> src/backend/optimizer/path/uniquekeys.c: * Check if Unqiue key of the
> innerrel is valid after join. innerrel's UniqueKey
> src/backend/optimizer/path/uniquekeys.c: *
> initililze_unqiuecontext_for_joinrel
> src/backend/optimizer/path/uniquekeys.c: * all the unqiuekeys which
> are not possible to use later
>
> src/backend/optimizer/path/uniquekeys.c:initililze_unqiuecontext_for_joinrel(RelOptInfo
> *joinrel,  RelOptInfo *inputrel)
> src/backend/optimizer/plan/analyzejoins.c:  /*
> This UnqiueKey is what we want */
> 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread James Coleman
On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
 wrote:
>
> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> > wrote:
> >>
> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> >> > wrote:
> >> >>
> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >> >> >Tomas Vondra  writes:
> >> >> >> In general, I think it'd be naive that we can make planner smarter 
> >> >> >> with
> >> >> >> no extra overhead spent on planning, and we can never accept patches
> >> >> >> adding even tiny overhead. With that approach we'd probably end up 
> >> >> >> with
> >> >> >> a trivial planner that generates just a single query plan, because
> >> >> >> that's going to be the fastest planner. A realistic approach needs to
> >> >> >> consider both the planning and execution phase, and benefits of this
> >> >> >> patch seem to be clear - if you have queries that do benefit from it.
> >> >> >
> >> >> >I think that's kind of attacking a straw man, though.  The thing that
> >> >> >people push back on, or should push back on IMO, is when a proposed
> >> >> >patch adds significant slowdown to queries that it has no or very 
> >> >> >little
> >> >> >hope of improving.  The trick is to do expensive stuff only when
> >> >> >there's a good chance of getting a better plan out of it.
> >> >> >
> >> >>
> >> >> Yeah, I agree with that. I think the main issue is that we don't really
> >> >> know what the "expensive stuff" is in this case, so it's not really
> >> >> clear how to be smarter :-(
> >> >
> >> >To add to this: I agree that ideally you'd check cheaply to know
> >> >you're in a situation that might help, and then do more work. But here
> >> >the question is always going to be simply "would we benefit from an
> >> >ordering, and, if so, do we have it already partially sorted". It's
> >> >hard to imagine that reducing much conceptually, so we're left with
> >> >optimizations of that check.
> >> >
> >>
> >> I think it depends on what exactly is the expensive part. For example if
> >> it's the construction of IncrementalSort paths, then maybe we could try
> >> do a quick/cheap check if the path can even be useful by estimating the
> >> cost and only then building the path.
> >>
> >> That's what we do for join algorithms, for example - we first compute
> >> initial_cost_nestloop and only when that seems cheap enough we do the
> >> more expensive stuff.
> >>
> >> But I'm not sure the path construction is the expensive part, as it
> >> should be disabled by enable_incrementalsort=off. But the regression
> >> does not seem to disappear, at least not entirely.
> >>
> >>
> >> >> One possibility is that it's just one of those regressions due to change
> >> >> >in binary layout, but I'm not sure I know how to verify that.
> >> >
> >> >If we are testing with a case that can't actually add more paths (due
> >> >to it checking the guc before building them), doesn't that effectively
> >> >leave one of these two options:
> >> >
> >> >1. Binary layout/cache/other untraceable change, or
> >> >2. Changes due to refactored function calls.
> >> >
> >>
> >> Hmm, but in case of (1) the overhead should be there even with tests
> >> that don't really have any additional paths to consider, right? I've
> >> tried with such test (single table with no indexes) and I don't quite
> >> see any regression (maybe ~1%).
> >
> >Not necessarily, if the cost is in sort costing or useful pathkeys
> >checking, right? We have run that code even without incremental sort,
> >but it's changed from master.
> >
>
> Ah, I should have mentioned I've done most of the tests on just the
> basic incremental sort patch (0001+0002), without the additional useful
> paths. I initially tested the whole patch series, but after discovering
> the regression I removed the last part (which I suspected might be the
> root cause). But the regression is still there, so it's not that.
>
> It might be in the reworked costing, yeah. But then I'd expect those
> functions to show up in the perf profile.

Right. I'm just grasping at straws on that.

> >> (2) might have impact, but I don't see any immediate suspects. Did you
> >> have some functions in mind?
> >
> >I guess this is where the lines blur: I didn't see anything obvious
> >either, but the changes to sort costing...should probably not have
> >real impact...but...
> >
>
> :-(
>
> >> BTW I see the patch adds pathkeys_common but it's not called from
> >> anywhere. It's probably leftover from an earlier patch version.
> >>

BTW, I think I'm going to rename the pathkeys_common_contained_in
function to something like pathkeys_count_contained_in, unless you
have an objection to that. The name doesn't seem obvious at all to me.
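The renamed pathkeys_count_contained_in would presumably return whether one pathkey list is a leading prefix of another, while also reporting how many leading members match. A minimal sketch of those semantics, using int arrays as stand-ins for PathKey lists (names and the exact interface are my assumption, not the patch's code):

```c
#include <stdbool.h>

/*
 * Return true if keys1 is entirely contained in (a prefix of) keys2;
 * in any case, report via *n_common how many leading members of keys1
 * match keys2.
 */
static bool count_contained_in(const int *keys1, int n1,
                               const int *keys2, int n2,
                               int *n_common)
{
    int n = 0;

    while (n < n1 && n < n2 && keys1[n] == keys2[n])
        n++;
    *n_common = n;
    return n == n1;
}
```

The count is what makes the function useful for incremental sort: a partial prefix match means the input is already sorted on the first n_common keys and only the remainder needs sorting.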

> >> >There's not anything obvious in point (2) that would be a big cost,
> >> >but there are definitely changes there. I was surprised that just
> >> >eliminating the loop through the pathkeys on 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tomas Vondra

On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
 wrote:


On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> wrote:
>>
>> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>> >Tomas Vondra  writes:
>> >> In general, I think it'd be naive to assume that we can make the planner smarter with
>> >> no extra overhead spent on planning, and we can never accept patches
>> >> adding even tiny overhead. With that approach we'd probably end up with
>> >> a trivial planner that generates just a single query plan, because
>> >> that's going to be the fastest planner. A realistic approach needs to
>> >> consider both the planning and execution phase, and benefits of this
>> >> patch seem to be clear - if you have queries that do benefit from it.
>> >
>> >I think that's kind of attacking a straw man, though.  The thing that
>> >people push back on, or should push back on IMO, is when a proposed
>> >patch adds significant slowdown to queries that it has no or very little
>> >hope of improving.  The trick is to do expensive stuff only when
>> >there's a good chance of getting a better plan out of it.
>> >
>>
>> Yeah, I agree with that. I think the main issue is that we don't really
>> know what the "expensive stuff" is in this case, so it's not really
>> clear how to be smarter :-(
>
>To add to this: I agree that ideally you'd check cheaply to know
>you're in a situation that might help, and then do more work. But here
>the question is always going to be simply "would we benefit from an
>ordering, and, if so, do we have it already partially sorted". It's
>hard to imagine that reducing much conceptually, so we're left with
>optimizations of that check.
>

I think it depends on what exactly is the expensive part. For example if
it's the construction of IncrementalSort paths, then maybe we could try
do a quick/cheap check if the path can even be useful by estimating the
cost and only then building the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.


>> One possibility is that it's just one of those regressions due to change
>in binary layout, but I'm not sure I know how to verify that.
>
>If we are testing with a case that can't actually add more paths (due
>to it checking the guc before building them), doesn't that effectively
>leave one of these two options:
>
>1. Binary layout/cache/other untraceable change, or
>2. Changes due to refactored function calls.
>

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).


Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? We have run that code even without incremental sort,
but it's changed from master.



Ah, I should have mentioned I've done most of the tests on just the
basic incremental sort patch (0001+0002), without the additional useful
paths. I initially tested the whole patch series, but after discovering
the regression I removed the last part (which I suspected might be the
root cause). But the regression is still there, so it's not that.

It might be in the reworked costing, yeah. But then I'd expect those
functions to show up in the perf profile.


(2) might have impact, but I don't see any immediate suspects. Did you
have some functions in mind?


I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing...should probably not have
real impact...but...



:-(


BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.

>There's not anything obvious in point (2) that would be a big cost,
>but there are definitely changes there. I was surprised that just
>eliminating the loop through the pathkeys on the query and the index
>was enough to save us ~4%.
>
>Tomas: Earlier you'd wondered about if we should try to shortcut the
>changes in costing...I was skeptical of that originally, but maybe
>it's worth looking into? I'm going to try backing that out and see
>what the numbers look like.
>


BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
we can't get rid of it all the time, at the very least. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.



Sounds interesting. I actually tried how much the add_partial_path
change accounts for, and you're right it was quite a bit. But 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-31, Tom Lane wrote:

> James Coleman  writes:
> > On Tue, Mar 31, 2020 at 1:04 PM Tom Lane  wrote:
> >> Perhaps the semantics are such that that's actually sensible, but it's
> >> far from a straightforward remapping of the old enum.
> 
> > Right, I didn't see the explicit "= 0" in other enums there, so it
> > made me wonder if it was intentional to designate that one had to be
> > 0, but I guess without a comment that's a lot of inference.
> 
> It's possible that somebody meant that as an indicator that the code
> depends on palloc0() leaving the field with that value.  But if so,
> you'd soon find that out ... and an actual comment would be better,
> anyway.
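The hazard being discussed — code silently relying on palloc0() leaving an enum field at its zero value — can be shown with a small sketch (the names here are invented for illustration):

```c
#include <stdlib.h>

/*
 * The explicit "= 0" documents that the zero value is relied upon:
 * any zero-filled struct containing this enum implicitly starts in
 * STATE_UNKNOWN without ever being assigned.
 */
typedef enum
{
    STATE_UNKNOWN = 0,          /* must be 0: relied on by make_slots() */
    STATE_READY,
    STATE_DONE
} WorkerState;

typedef struct
{
    WorkerState state;
    int         result;
} WorkerSlot;

/*
 * calloc (standing in for palloc0) zeroes the memory, so every slot
 * begins in STATE_UNKNOWN with no explicit initialization loop.
 */
static WorkerSlot *make_slots(int n)
{
    return calloc(n, sizeof(WorkerSlot));
}
```

As Tom says, if that dependency exists, a comment on the enum member is a much better signal than the bare "= 0".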

git blame fingers this:

commit bf11e7ee2e3607bb67d25aec73aa53b2d7e9961b
Author: Robert Haas 
AuthorDate: Tue Aug 29 13:22:49 2017 -0400
CommitDate: Tue Aug 29 13:26:33 2017 -0400

Propagate sort instrumentation from workers back to leader.

Up until now, when parallel query was used, no details about the
sort method or space used by the workers were available; details
were shown only for any sorting done by the leader.  Fix that.

Commit 1177ab1dabf72bafee8f19d904cee3a299f25892 forced the test case
added by commit 1f6d515a67ec98194c23a5db25660856c9aab944 to run
without parallelism; now that we have this infrastructure, allow
that again, with a little tweaking to make it pass with and without
force_parallel_mode.

Robert Haas and Tom Lane

Discussion: 
http://postgr.es/m/CA+Tgmoa2VBZW6S8AAXfhpHczb=rf6rqq2br+zjvegwj0uod...@mail.gmail.com

I looked at the discussion thread.  That patch was first posted by
Robert at
https://postgr.es/m/CA+Tgmoa2VBZW6S8AAXfhpHczb=rf6rqq2br+zjvegwj0uod...@mail.gmail.com
without the "= 0" part; later Tom posted v2 here
https://postgr.es/m/11223.1503695...@sss.pgh.pa.us
containing the "= 0", but I see no actual discussion of that point.

I suppose it could also be important to clarify that it's 0 if it were
used as an array index of some sort, but I don't see that in 2017's
commit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread James Coleman
On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
 wrote:
>
> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> > wrote:
> >>
> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >> >Tomas Vondra  writes:
> >> >> In general, I think it'd be naive to assume that we can make the planner smarter with
> >> >> no extra overhead spent on planning, and we can never accept patches
> >> >> adding even tiny overhead. With that approach we'd probably end up with
> >> >> a trivial planner that generates just a single query plan, because
> >> >> that's going to be the fastest planner. A realistic approach needs to
> >> >> consider both the planning and execution phase, and benefits of this
> >> >> patch seem to be clear - if you have queries that do benefit from it.
> >> >
> >> >I think that's kind of attacking a straw man, though.  The thing that
> >> >people push back on, or should push back on IMO, is when a proposed
> >> >patch adds significant slowdown to queries that it has no or very little
> >> >hope of improving.  The trick is to do expensive stuff only when
> >> >there's a good chance of getting a better plan out of it.
> >> >
> >>
> >> Yeah, I agree with that. I think the main issue is that we don't really
> >> know what the "expensive stuff" is in this case, so it's not really
> >> clear how to be smarter :-(
> >
> >To add to this: I agree that ideally you'd check cheaply to know
> >you're in a situation that might help, and then do more work. But here
> >the question is always going to be simply "would we benefit from an
> >ordering, and, if so, do we have it already partially sorted". It's
> >hard to imagine that reducing much conceptually, so we're left with
> >optimizations of that check.
> >
>
> I think it depends on what exactly is the expensive part. For example if
> it's the construction of IncrementalSort paths, then maybe we could try
> do a quick/cheap check if the path can even be useful by estimating the
> cost and only then building the path.
>
> That's what we do for join algorithms, for example - we first compute
> initial_cost_nestloop and only when that seems cheap enough we do the
> more expensive stuff.
>
> But I'm not sure the path construction is the expensive part, as it
> should be disabled by enable_incrementalsort=off. But the regression
> does not seem to disappear, at least not entirely.
>
>
> >> One possibility is that it's just one of those regressions due to change
> >in binary layout, but I'm not sure I know how to verify that.
> >
> >If we are testing with a case that can't actually add more paths (due
> >to it checking the guc before building them), doesn't that effectively
> >leave one of these two options:
> >
> >1. Binary layout/cache/other untraceable change, or
> >2. Changes due to refactored function calls.
> >
>
> Hmm, but in case of (1) the overhead should be there even with tests
> that don't really have any additional paths to consider, right? I've
> tried with such test (single table with no indexes) and I don't quite
> see any regression (maybe ~1%).

Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? We have run that code even without incremental sort,
but it's changed from master.

> (2) might have impact, but I don't see any immediate suspects. Did you
> have some functions in mind?

I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing...should probably not have
real impact...but...

> BTW I see the patch adds pathkeys_common but it's not called from
> anywhere. It's probably leftover from an earlier patch version.
>
> >There's not anything obvious in point (2) that would be a big cost,
> >but there are definitely changes there. I was surprised that just
> >eliminating the loop through the pathkeys on the query and the index
> >was enough to save us ~4%.
> >
> >Tomas: Earlier you'd wondered about if we should try to shortcut the
> >changes in costing...I was skeptical of that originally, but maybe
> >it's worth looking into? I'm going to try backing that out and see
> >what the numbers look like.
> >

BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
we can't get rid of it all the time, at the very least. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.

> I've described the idea about something like initial_cost_nestloop and
> so on. But I'm a bit skeptical about it, considering that the GUC only
> has limited effect.
>
>
> A somewhat related note is that the number of indexes has a pretty significant
> impact on planning time, even on master. This is timing of the same
> explain script (similar to the one shown before) with different number
> of indexes on master:
>
>  0 indexes    7 indexes   49 indexes
>  ---
>   

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tomas Vondra

On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
 wrote:


On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>Tomas Vondra  writes:
>> In general, I think it'd be naive to assume that we can make the planner smarter with
>> no extra overhead spent on planning, and we can never accept patches
>> adding even tiny overhead. With that approach we'd probably end up with
>> a trivial planner that generates just a single query plan, because
>> that's going to be the fastest planner. A realistic approach needs to
>> consider both the planning and execution phase, and benefits of this
>> patch seem to be clear - if you have queries that do benefit from it.
>
>I think that's kind of attacking a straw man, though.  The thing that
>people push back on, or should push back on IMO, is when a proposed
>patch adds significant slowdown to queries that it has no or very little
>hope of improving.  The trick is to do expensive stuff only when
>there's a good chance of getting a better plan out of it.
>

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(


To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.



I think it depends on what exactly is the expensive part. For example if
it's the construction of IncrementalSort paths, then maybe we could try
do a quick/cheap check if the path can even be useful by estimating the
cost and only then building the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.
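The initial_cost_nestloop pattern mentioned above is a two-phase gate: compute a cheap, optimistic lower bound on the cost first, and only build (and fully cost) the path if that bound could beat the cheapest path found so far. A toy sketch of the idea, with made-up cost bookkeeping rather than PostgreSQL's actual Cost machinery:

```c
#include <stdbool.h>

/* Toy stand-in for the planner's cost bookkeeping. */
typedef struct
{
    double startup_cost;
    double total_cost;
} CostEstimate;

/* Phase 1: cheap, optimistic lower bound for a candidate sort path. */
static CostEstimate initial_cost_sketch(double input_cost, double rows)
{
    CostEstimate c;

    c.startup_cost = input_cost;        /* must at least consume the input */
    c.total_cost = input_cost + rows;   /* optimistic per-row cost of 1.0 */
    return c;
}

/*
 * Gate: worth constructing the full path only if the lower bound could
 * beat the cheapest existing path (with a small fuzz factor, in the
 * spirit of add_path's cost comparisons).
 */
static bool worth_building(CostEstimate lower_bound, double cheapest_total)
{
    return lower_bound.total_cost < cheapest_total * 1.01;
}
```

Whether this would help here depends, as discussed, on whether path construction is actually the expensive part — if the overhead is in the cheap estimate itself, the gate buys nothing.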



One possibility is that it's just one of those regressions due to change
in binary layout, but I'm not sure I know how to verify that.


If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:

1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.



Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).

(2) might have impact, but I don't see any immediate suspects. Did you
have some functions in mind?

BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.


There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered about if we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.



I've described the idea about something like initial_cost_nestloop and
so on. But I'm a bit skeptical about it, considering that the GUC only
has limited effect.


A somewhat related note is that the number of indexes has a pretty significant
impact on planning time, even on master. This is timing of the same
explain script (similar to the one shown before) with different number
of indexes on master:

0 indexes    7 indexes   49 indexes
-----------------------------------
    10.85        12.56        27.83

The way I look at incremental sort is that it allows using indexes for
queries that couldn't use it before, possibly requiring a separate
index. So incremental sort might easily reduce the number of indexes
needed, compensating for the overhead we're discussing here. Of course,
that may or may not be true.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: proposal \gcsv

2020-03-31 Thread Tom Lane
I took a look at this proposal, and while I see the value of being
able to do something like this, it seems pretty short-sighted and
non-orthogonal as it stands.  We've already got \gx, which is a wart,
and now this patch wants to add \gfmt which is a different wart of the
same ilk.  What happens if you want to combine them?  Plus we already
had David complaining upthread that he'd like to be able to select
CSV-format suboptions; and now here comes Erik wondering about the
same thing.

It seems to me that this line of development is going to end in a whole
flotilla of \g-something commands that aren't composable and never quite
satisfy the next use-case to come down the pike, so we keep on needing
even more of them.

So I think we really need a way to be able to specify multiple different
\pset subcommands that apply just for the duration of one \g command.
Pavel dismissed that upthread as being too hard, but I think we'd better
try harder.

Plan A:

Consider some syntax along the lines of

\gpset (pset-option-name [pset-option-value]) ... filename

or if you don't like parentheses, choose some other punctuation to wrap
the \pset options in.  I initially thought of square brackets, but I'm
afraid they might be just too darn confusing to document --- how could
you make them distinct from metasyntax square brackets, especially in
plain-ASCII docs?  Also it'd have to be punctuation that's unlikely to
start a file name --- but parens are already reserved in most shells.

Plan B:

Another idea is to break the operation into multiple backslash commands,
where the initial ones set up state that doesn't do anything until the
output command comes along:

\tpset [ pset-option-name [ pset-option-value ] ]

Sets a "temporary" pset option, which will have effect in the
next \gpset command; or with no parameters, shows the current set
of temporary options

\gpset filename

Execute SQL command and output to filename (or pipe), using the
pset option set defined by preceding \tpset commands, and reverting
that option set to all-defaults afterward.

Probably we could think of better terminology than "temporary"
and a better command name than "\tpset", but you get the gist.

Admittedly, "\tpset format csv \gpset filename" is a bit more
verbose than the current proposal of "\gfmt csv filename"
... but we'd have solved the problem once and for all, even
for pset options we've not invented yet.
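Plan B is essentially a small override table: \tpset accumulates temporary option settings, and \gpset consumes them for exactly one command and then reverts to defaults. A toy model of that state machine, tracking just the "format" option (names invented, not psql's real internals):

```c
#include <stdio.h>
#include <string.h>

/* Toy model of the persistent and temporary pset option sets. */
static char current_format[16] = "aligned";   /* set by \pset format */
static char temp_format[16] = "";             /* set by \tpset; "" = unset */

/* \tpset format <fmt>: stash a temporary override. */
static void tpset_format(const char *fmt)
{
    snprintf(temp_format, sizeof(temp_format), "%s", fmt);
}

/*
 * \gpset: "execute" the query using any temporary overrides, then
 * revert the temporary set to all-defaults afterward.
 */
static void gpset(char *used_format, size_t sz)
{
    const char *fmt = temp_format[0] ? temp_format : current_format;

    snprintf(used_format, sz, "%s", fmt);   /* output with this format */
    temp_format[0] = '\0';                  /* revert after one command */
}
```

The nice property Tom points out falls out directly: because the override table is generic, any pset option — including ones not invented yet — composes without new backslash commands.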

Plan C:

Probably there are other ways to get there; these are just the
first ideas that came to me.

regards, tom lane




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-03-31 Thread Anna Akenteva

On 2020-03-27 04:15, Kartyshov Ivan wrote:

Anna, feel free to work on this patch.


Ivan and I worked on this patch a bit more. We fixed the bugs that we 
could find and cleaned up the code. For now, we've kept both options: 
WAIT as a standalone statement and WAIT as a part of BEGIN. The new 
patch is attached.


The syntax looks a bit different now:

- WAIT FOR [ANY | ALL] event [, ...]
- BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ WAIT FOR 
[ANY | ALL] event [, ...]]

where event is one of:
LSN value
TIMEOUT number_of_milliseconds
timestamp

Now, one event cannot contain both an LSN and a TIMEOUT. With such 
syntax, the behaviour seems to make sense. For the (default) WAIT FOR 
ALL strategy, we pick the maximum LSN and maximum allowed timeout, and 
wait for the LSN till the timeout is over. If no timeout is specified, 
we wait forever. If no LSN is specified, we just wait for the time to 
pass. For the WAIT FOR ANY strategy, it's the same but we pick minimum 
LSN and timeout.
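As a sketch of that reduction (illustrative Python, not the patch's C code; the event representation and function name are invented, and LSNs are modeled as plain integers): ALL takes the maximum LSN and maximum timeout, ANY takes the minimums, and a missing LSN or timeout stays unset.

```python
# Toy model of the WAIT FOR [ANY | ALL] event reduction described above.

def reduce_events(strategy, events):
    """events: list of dicts with optional 'lsn' and 'timeout_ms' keys.

    Returns (lsn, timeout_ms); None means "no LSN to wait for" or
    "no timeout, wait forever", respectively.
    """
    lsns = [e["lsn"] for e in events if "lsn" in e]
    timeouts = [e["timeout_ms"] for e in events if "timeout_ms" in e]
    pick = max if strategy == "ALL" else min
    return (
        pick(lsns) if lsns else None,
        pick(timeouts) if timeouts else None,
    )
```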


There are still some questions left:
1) Should we only keep the BEGIN option, or does the standalone command 
have potential after all?
2) Should we change the grammar so that WAIT can be in any position of 
the BEGIN statement, not necessarily in the end? Ivan and I haven't come 
to a consensus about this, so more opinions would be helpful.
3) Since we added the "wait" attribute to TransactionStmt struct, do we 
need to add something to _copyTransactionStmt() / 
_equalTransactionStmt()?


--
Anna Akenteva
Postgres Professional:
The Russian Postgres Company
http://www.postgrespro.com

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 8d91f3529e6..8697f9807ff 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -187,6 +187,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index c23bbfb4e71..558ff668dbd 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_for_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+and wait_for_event is:
+WAIT FOR [ANY | ALL] event [, event ...]
+
+where event is one of:
+LSN lsn_value
+TIMEOUT number_of_milliseconds
+timestamp
 
  
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index d6cd1d41779..e5e2e15cdf0 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] wait_for_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+and wait_for_event is:
+WAIT FOR [ANY | ALL] event [, event ...]
+
+where event is one of:
+LSN lsn_value
+TIMEOUT number_of_milliseconds
+timestamp
 
  
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 000..0e5bd2523d6
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,146 @@
+
+
+
+ 
+  WAIT FOR
+ 
+
+ 
+  WAIT FOR
+  7
+  SQL - Language Statements
+ 
+
+ 
+  WAIT FOR
+  wait for the target LSN to be replayed
+ 
+
+ 
+
+WAIT FOR [ANY | ALL] event [, event ...]
+
+where event is one of:
+LSN value
+TIMEOUT number_of_milliseconds
+timestamp
+
+WAIT FOR LSN 'lsn_number'
+WAIT FOR LSN 'lsn_number' TIMEOUT wait_timeout
+WAIT FOR LSN 'lsn_number', TIMESTAMP wait_time
+WAIT FOR TIMESTAMP wait_time
+WAIT FOR ALL LSN 'lsn_number' TIMEOUT wait_timeout, TIMESTAMP wait_time
+WAIT FOR ANY LSN 'lsn_number', TIMESTAMP wait_time
+
+ 
+
+ 
+  Description
+
+  
+   WAIT FOR provides a simple interprocess communication
+   mechanism to wait for timestamp, timeout or the target log sequence number
+   (LSN) on standby in PostgreSQL
+   databases with master-standby asynchronous replication. When run with the
+   LSN option, the WAIT FOR
+   command waits for the specified LSN to be replayed.
+   If no timestamp or timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using Ctrl+C, or
+   by shutting down the postgres server.
+  
+
+ 
+
+ 
+  Parameters
+
+  
+   
+LSN
+
+ 
+  Specify the target log sequence number to wait for.
+ 
+
+   
+
+   
+TIMEOUT wait_timeout
+
+ 
+  Limit the time interval to wait for the LSN to be replayed.
+  The specified wait_timeout must be an integer
+  and is measured in milliseconds.
+ 
+
+  

Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Alexey Bashtanov

On 31/03/2020 18:53, Tom Lane wrote:

Renamed "matchsel" to "matchingsel" etc, added DEFAULT_MATCHING_SEL,
rebased over commit 911e70207.  Since that commit already created
new versions of the relevant contrib modules, I think we can just
redefine what those versions contain, rather than making yet-newer
versions.  (Of course, that assumes we're going to include this in
v13.)


Looks good to me.

Best, Alex




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread James Coleman
On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
 wrote:
>
> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >Tomas Vondra  writes:
> >> In general, I think it'd be naive to think that we can make planner smarter with
> >> no extra overhead spent on planning, and we can never accept patches
> >> adding even tiny overhead. With that approach we'd probably end up with
> >> a trivial planner that generates just a single query plan, because
> >> that's going to be the fastest planner. A realistic approach needs to
> >> consider both the planning and execution phase, and benefits of this
> >> patch seem to be clear - if you have queries that do benefit from it.
> >
> >I think that's kind of attacking a straw man, though.  The thing that
> >people push back on, or should push back on IMO, is when a proposed
> >patch adds significant slowdown to queries that it has no or very little
> >hope of improving.  The trick is to do expensive stuff only when
> >there's a good chance of getting a better plan out of it.
> >
>
> Yeah, I agree with that. I think the main issue is that we don't really
> know what the "expensive stuff" is in this case, so it's not really
> clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

> One possibility is that it's just one of those regressions due to change
> in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:
1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered about if we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

James




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tomas Vondra

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

In general, I think it'd be naive to think that we can make planner smarter with
no extra overhead spent on planning, and we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.


I think that's kind of attacking a straw man, though.  The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving.  The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.



Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

One possibility is that it's just one of those regressions due to change
in binary layout, but I'm not sure how to verify that.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: backup manifests

2020-03-31 Thread Andres Freund
Hi,

On 2020-03-31 14:10:34 -0400, Robert Haas wrote:
> I made an attempt to implement this.

Awesome!


> In the attached patch set, 0001 I'm going to work on those things. I
> would appreciate *very timely* feedback on anything people do or do
> not like about this, because I want to commit this patch set by the
> end of the work week and that isn't very far away. I would also
> appreciate if people would bear in mind the principle that half a loaf
> is better than none, and further improvements can be made in future
> releases.
> 
> As part of my light testing, I tried promoting a standby that was
> running pg_basebackup, and found that pg_basebackup failed like this:
> 
> pg_basebackup: error: could not get COPY data stream: ERROR:  the
> standby was promoted during online backup
> HINT:  This means that the backup being taken is corrupt and should
> not be used. Try taking another online backup.
> pg_basebackup: removing data directory "/Users/rhaas/pgslave2"
> 
> My first thought was that this error message is hard to reconcile with
> this comment:
> 
> /*
>  * Send timeline history files too. Only the latest timeline history
>  * file is required for recovery, and even that only if there happens
>  * to be a timeline switch in the first WAL segment that contains the
>  * checkpoint record, or if we're taking a base backup from a standby
>  * server and the target timeline changes while the backup is taken.
>  * But they are small and highly useful for debugging purposes, so
>  * better include them all, always.
>  */
> 
> But then it occurred to me that this might be a cascading standby.

Yea. The check just prevents the walsender's database from being
promoted:

	/*
	 * Check if the postmaster has signaled us to exit, and abort with an
	 * error in that case. The error handler further up will call
	 * do_pg_abort_backup() for us. Also check that if the backup was
	 * started while still in recovery, the server wasn't promoted.
	 * do_pg_stop_backup() will check that too, but it's better to stop
	 * the backup early than continue to the end and fail there.
	 */
	CHECK_FOR_INTERRUPTS();
	if (RecoveryInProgress() != backup_started_in_recovery)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("the standby was promoted during online backup"),
				 errhint("This means that the backup being taken is corrupt "
						 "and should not be used. "
						 "Try taking another online backup.")));
and

	if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("the standby was promoted during online backup"),
				 errhint("This means that the backup being taken is corrupt "
						 "and should not be used. "
						 "Try taking another online backup.")));

So that just prevents promotions of the current node, afaict.



> Regardless, it seems like a really good idea to store a list of WAL
> ranges rather than a single start/end/timeline, because even if it's
> impossible today it might become possible in the future.

Indeed.


> Still, unless there's an easy way to set up a test scenario where
> multiple WAL ranges need to be verified, it may be hard to test that
> this code actually behaves properly.

I think it'd be possible to test without a fully cascading setup, by
creating an initial base backup, then do some work to create a bunch of
new timelines, and then start the initial base backup. That'd have to
follow all those timelines.  Not sure that's better than a cascading
setup though.


> +/*
> + * Add information about the WAL that will need to be replayed when restoring
> + * this backup to the manifest.
> + */
> +static void
> +AddWALInfoToManifest(manifest_info *manifest, XLogRecPtr startptr,
> +  TimeLineID starttli, XLogRecPtr 
> endptr, TimeLineID endtli)
> +{
> + List *timelines = readTimeLineHistory(endtli);

should probably happen after the manifest->buffile check.


> + ListCell *lc;
> + boolfirst_wal_range = true;
> + boolfound_ending_tli = false;
> +
> + /* If there is no buffile, then the user doesn't want a manifest. */
> + if (manifest->buffile == NULL)
> + return;

Not really about this patch/function specifically: I wonder if this'd
look 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tomas Vondra

On Tue, Mar 31, 2020 at 06:00:00PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 5:53 PM Tomas Vondra
 wrote:


On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:
>On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
> wrote:
>>
>> The main thing I've been working on today is benchmarking how this
>> affects planning. And I'm seeing a regression that worries me a bit,
>> unfortunately.
>>
>> The test I'm doing is pretty simple - build a small table with a bunch
>> of columns:
>>
>>create table t (a int, b int, c int, d int, e int, f int, g int);
>>
>>insert into t select 100*random(), 100*random(), 100*random(),
>>  100*random(), 100*random(), 100*random(), 100*random()
>>from generate_series(1,10) s(i);
>>
>> and then a number of indexes on subsets of up to 3 columns, as generated
>> using the attached build-indexes.py script. And then run a bunch of
>> explains (so no actual execution) sorting the data by at least 4 columns
>> (to trigger incremental sort paths), measuring timing of the script.
>>
>> I did a bunch of runs on current master and v46 with incremental sort
>> disabled and enabled, and the results look like this:
>>
>>   master       off        on
>>  ------------------------------
>>   34.609     37.463     37.729
>>
>> which means about 8-9% regression with incremental sort. Of course, this
>> is only for planning time, for execution the impact is going to be much
>> smaller. But it's still a bit annoying.
>>
>> I've suspected this might be either due to the add_partial_path changes
>> or the patch adding incremental sort to additional places, so I tested
>> those parts individually and the answer is no - add_partial_path changes
>> have very small impact (~1%, which might be noise). The regression comes
>> mostly from the 0002 part that adds incremental sort. At least in this
>> particular test - different tests might behave differently, of course.
>>
>> The annoying bit is that the overhead does not disappear after disabling
>> incremental sort. That suggests this is not merely due to considering
>> and creating higher number of paths, but due to something that happens
>> before we even look at the GUC ...
>>
>> I think I've found one such place - if you look at compare_pathkeys, it
>> has this check right before the forboth() loop:
>>
>>  if (keys1 == keys2)
>>  return PATHKEYS_EQUAL;
>>
>> But with incremental sort we don't call pathkeys_contained_in, we call
>> pathkeys_common_contained_in instead. And that does not call
>> compare_pathkeys and does not have the simple equality check. Adding
>> the following check seems to cut the overhead in half, which is nice:
>>
>>  if (keys1 == keys2)
>>  {
>>  *n_common = list_length(keys1);
>> return true;
>>  }
>>
>> Not sure where the rest of the regression comes from yet.
>
>I noticed in the other function we also optimize by checking if either
>keys list is NIL, so I tried adding that, and it might have made a
>minor difference, but it's hard to tell as it was under 1%.
>

Which other function? I don't see this optimization in compare_pathkeys,
that only checks key1/key2 after the forboth loop, but that's something
different.


pathkeys_useful_for_ordering checks both inputs.


I may be wrong, but my guess would be that this won't save much, because
the loop should terminate immediately. The whole point is not to loop
over possibly many pathkeys when we know it's exactly the same pathkeys
list (same pointer). When one of the lists is NIL the loop should
terminate right away ...
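A toy Python model of that shortcut (names and the list representation are illustrative; the real code is C operating on planner pathkey lists): identical pointers short-circuit before any per-element comparison, and otherwise the common prefix length is tracked element by element.

```python
# Illustrative sketch of pathkeys_common_contained_in with the proposed
# same-pointer fast path. Pathkeys are modeled as plain Python lists.

def pathkeys_common_contained_in(keys1, keys2):
    """Return (contained, n_common): whether keys1 is a prefix of keys2,
    and the length of the shared prefix."""
    if keys1 is keys2:
        # Same pointer: trivially equal, skip the loop entirely.
        return True, len(keys1)
    n_common = 0
    for k1, k2 in zip(keys1, keys2):
        if k1 != k2:
            return False, n_common
        n_common += 1
    # keys1 is contained iff the loop consumed all of it
    # (the C code's "key1 == NULL" condition).
    return n_common == len(keys1), n_common
```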

>I also ran perf with a slightly modified version of your test that
>uses psql, and after the above changes was seeing something like a
>3.5% delta between master and this patch series. Nothing obvious in
>the perf report though.
>

Yeah, I don't think there's anything obviously more expensive.

>This test is intended to be somewhat worst case, no? At what point do
>we consider the trade-off worth it (given that it's not plausible to
>have zero impact)?
>

Yes, more or less. It was definitely designed to do that - it merely
plans the query (no execution), with many applicable indexes etc. It's
definitely true that once the execution starts to take more time, the
overhead will get negligible. Same for reducing number of indexes.

And of course, for queries that can benefit from incremental sort, the
speedup is likely way larger than this.

In general, I think it'd be naive to think that we can make planner smarter with
no extra overhead spent on planning, and we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

Let's try to minimize the overhead a bit more, and then we'll see.


Any 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tom Lane
Tomas Vondra  writes:
> In general, I think it'd be naive to think that we can make planner smarter with
> no extra overhead spent on planning, and we can never accept patches
> adding even tiny overhead. With that approach we'd probably end up with
> a trivial planner that generates just a single query plan, because
> that's going to be the fastest planner. A realistic approach needs to
> consider both the planning and execution phase, and benefits of this
> patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though.  The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving.  The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

regards, tom lane




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-31, Alvaro Herrera wrote:

> I think we should kill(SIGTERM) the walsender using the slot 
> (slot->active_pid),
> then acquire the slot and set it to some state indicating that it is now
> useless, no longer reserving WAL; so when the walsender is restarted, it
> will find the slot cannot be used any longer.

Ah, I see ioguix already pointed this out and the response was that the
walsender stops by itself.  Hmm.  I suppose this works too ... it seems
a bit fragile, but maybe I'm too sensitive.  Do we have other opinions
on this point?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-31, Alvaro Herrera wrote:

>   /* release lock before syscalls */
>   foreach(l, pids_to_kill)
>   {
>   kill(lfirst_int(l), SIGTERM);
>   }
> 
> I sense some attempt to salvage slots that are reading a segment that is
> "outdated" and removed, but for which the walsender has an open file
> descriptor.  (This appears to be the "losing" state.) This seems
> dangerous, for example the segment might be recycled and is being
> overwritten with different data.  Trying to keep track of that seems
> doomed.  And even if the walsender can still read that data, it's only a
> matter of time before the next segment is also removed.  So keeping the
> walsender alive is futile; it only delays the inevitable.

I think we should kill(SIGTERM) the walsender using the slot (slot->active_pid),
then acquire the slot and set it to some state indicating that it is now
useless, no longer reserving WAL; so when the walsender is restarted, it
will find the slot cannot be used any longer.  Two ideas come to mind
about doing this:

1. set the LSNs and Xmins to Invalid; keep only the slot name, database,
plug_in, etc.  This makes monitoring harder, I think, because as soon as
the slot is gone you know nothing at all about it.

2. add a new flag to ReplicationSlotPersistentData to indicate that the
slot is dead.  This preserves the LSN info for forensics, and might even
be easier to code.
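A minimal sketch of idea 2 in Python (the field and function names are invented for illustration; the real structure is ReplicationSlotPersistentData in C): the slot keeps its LSN info for forensics, but a new flag stops it from reserving WAL.

```python
# Toy model of marking a slot "dead" via a flag rather than wiping its
# fields, so monitoring can still see where the slot had gotten to.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SlotPersistentData:
    name: str
    restart_lsn: Optional[int]   # LSN modeled as a plain integer here
    invalidated: bool = False    # idea 2: new flag in the persistent data

def invalidate_slot(slot: SlotPersistentData) -> None:
    # Stop reserving WAL, but preserve restart_lsn for forensics.
    slot.invalidated = True

def reserved_lsn(slot: SlotPersistentData) -> Optional[int]:
    # An invalidated slot no longer holds back WAL removal.
    return None if slot.invalidated else slot.restart_lsn
```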

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread James Coleman
On Tue, Mar 31, 2020 at 5:53 PM Tomas Vondra
 wrote:
>
> On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:
> >On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
> > wrote:
> >>
> >> The main thing I've been working on today is benchmarking how this
> >> affects planning. And I'm seeing a regression that worries me a bit,
> >> unfortunately.
> >>
> >> The test I'm doing is pretty simple - build a small table with a bunch
> >> of columns:
> >>
> >>create table t (a int, b int, c int, d int, e int, f int, g int);
> >>
> >>insert into t select 100*random(), 100*random(), 100*random(),
> >>  100*random(), 100*random(), 100*random(), 100*random()
> >>from generate_series(1,10) s(i);
> >>
> >> and then a number of indexes on subsets of up to 3 columns, as generated
> >> using the attached build-indexes.py script. And then run a bunch of
> >> explains (so no actual execution) sorting the data by at least 4 columns
> >> (to trigger incremental sort paths), measuring timing of the script.
> >>
> >> I did a bunch of runs on current master and v46 with incremental sort
> >> disabled and enabled, and the results look like this:
> >>
> >>   master       off        on
> >>  ------------------------------
> >>   34.609     37.463     37.729
> >>
> >> which means about 8-9% regression with incremental sort. Of course, this
> >> is only for planning time, for execution the impact is going to be much
> >> smaller. But it's still a bit annoying.
> >>
> >> I've suspected this might be either due to the add_partial_path changes
> >> or the patch adding incremental sort to additional places, so I tested
> >> those parts individually and the answer is no - add_partial_path changes
> >> have very small impact (~1%, which might be noise). The regression comes
> >> mostly from the 0002 part that adds incremental sort. At least in this
> >> particular test - different tests might behave differently, of course.
> >>
> >> The annoying bit is that the overhead does not disappear after disabling
> >> incremental sort. That suggests this is not merely due to considering
> >> and creating higher number of paths, but due to something that happens
> >> before we even look at the GUC ...
> >>
> >> I think I've found one such place - if you look at compare_pathkeys, it
> >> has this check right before the forboth() loop:
> >>
> >>  if (keys1 == keys2)
> >>  return PATHKEYS_EQUAL;
> >>
> >> But with incremental sort we don't call pathkeys_contained_in, we call
> >> pathkeys_common_contained_in instead. And that does not call
> >> compare_pathkeys and does not have the simple equality check. Adding
> >> the following check seems to cut the overhead in half, which is nice:
> >>
> >>  if (keys1 == keys2)
> >>  {
> >>  *n_common = list_length(keys1);
> >> return true;
> >>  }
> >>
> >> Not sure where the rest of the regression comes from yet.
> >
> >I noticed in the other function we also optimize by checking if either
> >keys list is NIL, so I tried adding that, and it might have made a
> >minor difference, but it's hard to tell as it was under 1%.
> >
>
> Which other function? I don't see this optimization in compare_pathkeys,
> that only checks key1/key2 after the forboth loop, but that's something
> different.

pathkeys_useful_for_ordering checks both inputs.

> I may be wrong, but my guess would be that this won't save much, because
> the loop should terminate immediately. The whole point is not to loop
> over possibly many pathkeys when we know it's exactly the same pathkeys
> list (same pointer). When one of the lists is NIL the loop should
> terminate right away ...
>
> >I also ran perf with a slightly modified version of your test that
> >uses psql, and after the above changes was seeing something like a
> >3.5% delta between master and this patch series. Nothing obvious in
> >the perf report though.
> >
>
> Yeah, I don't think there's anything obviously more expensive.
>
> >This test is intended to be somewhat worst case, no? At what point do
> >we consider the trade-off worth it (given that it's not plausible to
> >have zero impact)?
> >
>
> Yes, more or less. It was definitely designed to do that - it merely
> plans the query (no execution), with many applicable indexes etc. It's
> definitely true that once the execution starts to take more time, the
> overhead will get negligible. Same for reducing number of indexes.
>
> And of course, for queries that can benefit from incremental sort, the
> speedup is likely way larger than this.
>
> In general, I think it'd be naive to think that we can make planner smarter with
> no extra overhead spent on planning, and we can never accept patches
> adding even tiny overhead. With that approach we'd probably end up with
> a trivial planner that generates just a single query plan, because
> that's going to be the fastest planner. A realistic approach needs to
> consider both the planning and execution phase, and benefits of this
> patch 

Re: Improving connection scalability: GetSnapshotData()

2020-03-31 Thread Andres Freund
Hi,

On 2020-03-31 13:04:38 -0700, Andres Freund wrote:
> I'm still fighting with snapshot_too_old. The feature is just badly
> undertested, underdocumented, and there's lots of other oddities. I've
> now spent about as much time on that feature than on the whole rest of
> the patchset.

To expand on this being under-tested: The whole time mapping
infrastructure is not tested, because all of that is bypassed when
old_snapshot_threshold = 0.  And old_snapshot_threshold = 0 basically
only exists for testing.  The largest part of the complexity of this
feature are TransactionIdLimitedForOldSnapshots() and
MaintainOldSnapshotTimeMapping() - and none of the complexity is tested
due to the tests running with old_snapshot_threshold = 0.

So we have test-only infrastructure that doesn't allow us to actually
test the feature.


And the tests that we do have don't have a single comment explaining
what the expected results are. Except for the newer
sto_using_hash_index.spec, they just run all permutations. I don't know
how those tests actually help, since it's not clear why any of the
results are the way they are, which of them are just the results of
bugs, or which are not affected by s_t_o.

Greetings,

Andres Freund




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tomas Vondra

On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:

On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
 wrote:


The main thing I've been working on today is benchmarking how this
affects planning. And I'm seeing a regression that worries me a bit,
unfortunately.

The test I'm doing is pretty simple - build a small table with a bunch
of columns:

   create table t (a int, b int, c int, d int, e int, f int, g int);

   insert into t select 100*random(), 100*random(), 100*random(),
 100*random(), 100*random(), 100*random(), 100*random()
   from generate_series(1,10) s(i);

and then a number of indexes on subsets of up to 3 columns, as generated
using the attached build-indexes.py script. And then run a bunch of
explains (so no actual execution) sorting the data by at least 4 columns
(to trigger incremental sort paths), measuring timing of the script.

I did a bunch of runs on current master and v46 with incremental sort
disabled and enabled, and the results look like this:

  master       off        on
 ------------------------------
  34.609     37.463     37.729

which means about 8-9% regression with incremental sort. Of course, this
is only for planning time, for execution the impact is going to be much
smaller. But it's still a bit annoying.

I've suspected this might be either due to the add_partial_path changes
or the patch adding incremental sort to additional places, so I tested
those parts individually and the answer is no - add_partial_path changes
have very small impact (~1%, which might be noise). The regression comes
mostly from the 0002 part that adds incremental sort. At least in this
particular test - different tests might behave differently, of course.

The annoying bit is that the overhead does not disappear after disabling
incremental sort. That suggests this is not merely due to considering
and creating higher number of paths, but due to something that happens
before we even look at the GUC ...

I think I've found one such place - if you look at compare_pathkeys, it
has this check right before the forboth() loop:

 if (keys1 == keys2)
 return PATHKEYS_EQUAL;

But with incremental sort we don't call pathkeys_contained_in, we call
pathkeys_common_contained_in instead. And that does not call
compare_pathkeys and does not have the simple equality check. Adding
the following check seems to cut the overhead in half, which is nice:

 if (keys1 == keys2)
 {
 *n_common = list_length(keys1);
return true;
 }

Not sure where the rest of the regression comes from yet.


I noticed in the other function we also optimize by checking if either
keys list is NIL, so I tried adding that, and it might have made a
minor difference, but it's hard to tell as it was under 1%.



Which other function? I don't see this optimization in compare_pathkeys,
that only checks key1/key2 after the forboth loop, but that's something
different.

I may be wrong, but my guess would be that this won't save much, because
the loop should terminate immediately. The whole point is not to loop
over possibly many pathkeys when we know it's exactly the same pathkeys
list (same pointer). When one of the lists is NIL the loop should
terminate right away ...


I also ran perf with a slightly modified version of your test that
uses psql, and after the above changes was seeing something like a
3.5% delta between master and this patch series. Nothing obvious in
the perf report though.



Yeah, I don't think there's anything obviously more expensive.


This test is intended to be somewhat worst case, no? At what point do
we consider the trade-off worth it (given that it's not plausible to
have zero impact)?



Yes, more or less. It was definitely designed to do that - it merely
plans the query (no execution), with many applicable indexes etc. It's
definitely true that once the execution starts to take more time, the
overhead will get negligible. Same for reducing the number of indexes.

And of course, for queries that can benefit from incremental sort, the
speedup is likely way larger than this.

In general, I think it'd be naive to expect we can make the planner smarter
with no extra overhead spent on planning, and that we can never accept
patches adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

Let's try to minimize the overhead a bit more, and then we'll see.



Also, while looking at pathkeys_common_contained_in(), I've been a bit
puzzled why this is correct:

 return (key1 == NULL);

It's easy to not notice it's key1 and not keys1, so I suggest we add a
comment noting that 'key1 == NULL' means we've processed the whole keys1 list.


Done.

I've included fixes for Alvaro's comments in this patch series also.



Re: Rethinking opclass member checks and dependency strength

2020-03-31 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Aug-18, Tom Lane wrote:
>> * I'm not at all impressed with the name, location, or concept of
>> opfam_internal.h.  I think we should get rid of that header and put
>> the OpFamilyMember struct somewhere else.  Given that this patch
>> makes it part of the AM API, it wouldn't be unreasonable to move it
>> to amapi.h.  But I've not done that here.

> I created that file so that it'd be possible to interpret the struct
> when dealing with DDL commands in event triggers (commit b488c580aef4).
> The struct was previously in a .c file, and we didn't have an
> appropriate .h file to put it in.  I think amapi.h is a great place for
> it.

Yeah, later versions of the patch put it there.

regards, tom lane




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-31 Thread Tom Lane
I wrote:
> Dean Rasheed  writes:
>> I had a go at reproducing this. I wasn't able to produce the reported
>> failure, but I can reliably produce an Assert failure that may be
>> related by doing a VACUUM FULL simultaneously with an ANALYZE that is
>> generating extended stats, which produces:
>> ...
>> It looks to me as though the problem is that statext_store() needs to
>> take its lock on pg_statistic_ext_data *before* searching for the
>> stats tuple to update.

> Hmm, yeah, that seems like clearly a bad idea.

I pushed a fix for that, but I think it must be unrelated to the
buildfarm failures we're seeing.  For that coding to be a problem,
it would have to run concurrently with a VACUUM FULL or CLUSTER
on pg_statistic_ext_data (which would give all the tuples new TIDs).
AFAICS that won't happen with the tests that are giving trouble.

regards, tom lane




Re: Rethinking opclass member checks and dependency strength

2020-03-31 Thread Alvaro Herrera
On 2019-Aug-18, Tom Lane wrote:

> * I'm not at all impressed with the name, location, or concept of
> opfam_internal.h.  I think we should get rid of that header and put
> the OpFamilyMember struct somewhere else.  Given that this patch
> makes it part of the AM API, it wouldn't be unreasonable to move it
> to amapi.h.  But I've not done that here.

I created that file so that it'd be possible to interpret the struct
when dealing with DDL commands in event triggers (commit b488c580aef4).
The struct was previously in a .c file, and we didn't have an
appropriate .h file to put it in.  I think amapi.h is a great place for
it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Alvaro Herrera
I noticed some other things:

1. KeepLogSeg sends a warning message when slots fall behind.  To do
this, it searches for "the most affected slot", that is, the slot that
lost the most data.  But it seems to me that that's a bit pointless; if
a slot lost data, it's now useless and anything that was using that slot must
be recreated.  If you only know what's the most affected slot, it's not
possible to see which *other* slots are affected.  It doesn't matter if
the slot missed one segment or twenty segments or  segments -- the
slot is now useless, or it is not useless.  I think we should list the
slot that was *least* affected, i.e., the slot that lost the minimum
amount of segments; then the user knows that all slots that are older
than that one are *also* affected.
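A sketch of the reporting rule proposed here (an illustration, not the actual patch): instead of naming the slot that lost the most segments, name the affected slot with the minimum loss, since every slot older than that one is necessarily affected too.

```python
def least_affected_slot(slots):
    """slots: dict mapping slot name -> number of lost WAL segments.
    Return the name of the affected slot with the minimum loss,
    or None if no slot lost anything."""
    affected = {name: lost for name, lost in slots.items() if lost > 0}
    if not affected:
        return None
    # The least-affected slot; all slots that lost >= this many segments
    # (i.e., all older slots) are implied to be broken as well.
    return min(affected, key=affected.get)
```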

2. KeepLogSeg ignores slots that are active.  I guess the logic here is
that if a slot is active, then it'll keep going until it catches up and
we don't need to do anything about the used disk space.  But that seems
a false premise, because if a standby is so slow that it cannot keep up,
it will eventually run the master out of diskspace even if it's active
all the time.  So I'm not seeing the reasoning that makes it useful to
skip checking active slots.

(BTW I don't think you need to keep that many static variables in that
function.  Just the slot name should be sufficient, I think ... or maybe
even the *pointer* to the slot that was last reported.)

I think if a slot is behind and it lost segments, we should kill the
walsender that's using it, and unreserve the segments.  So maybe
something like

	LWLockAcquire( ... );
	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
		XLogSegNo	slotSegNo;

		XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size);

		if (s->in_use)
		{
			if (s->active_pid)
				pids_to_kill = lappend_int(pids_to_kill, s->active_pid);

			nslots_affected++;
			... ;		/* other stuff */
		}
	}
	LWLockRelease( ... );

	/* release lock before syscalls */
	foreach(l, pids_to_kill)
	{
		kill(lfirst_int(l), SIGTERM);
	}

I sense some attempt to salvage slots that are reading a segment that is
"outdated" and removed, but for which the walsender has an open file
descriptor.  (This appears to be the "losing" state.) This seems
dangerous, for example the segment might be recycled and is being
overwritten with different data.  Trying to keep track of that seems
doomed.  And even if the walsender can still read that data, it's only a
matter of time before the next segment is also removed.  So keeping the
walsender alive is futile; it only delays the inevitable.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-31 Thread Tom Lane
Dean Rasheed  writes:
> I had a go at reproducing this. I wasn't able to produce the reported
> failure, but I can reliably produce an Assert failure that may be
> related by doing a VACUUM FULL simultaneously with an ANALYZE that is
> generating extended stats, which produces:
> ...
> It looks to me as though the problem is that statext_store() needs to
> take its lock on pg_statistic_ext_data *before* searching for the
> stats tuple to update.

Hmm, yeah, that seems like clearly a bad idea.

> It's late here, so I haven't worked up a patch yet, but it looks
> pretty straightforward.

I can take care of it.

regards, tom lane




Re: Rethinking opclass member checks and dependency strength

2020-03-31 Thread Tom Lane
I wrote:
> Still haven't got a better naming idea, but in the meantime here's
> a rebase to fix a conflict with 612a1ab76.

I see from the cfbot that this needs another rebase, so here 'tis.
No changes in the patch itself.

regards, tom lane

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d3bf866..663ff7e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -139,6 +139,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amproperty = NULL;
 	amroutine->ambuildphasename = NULL;
 	amroutine->amvalidate = blvalidate;
+	amroutine->amcheckmembers = NULL;
 	amroutine->ambeginscan = blbeginscan;
 	amroutine->amrescan = blrescan;
 	amroutine->amgettuple = NULL;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 37f8d87..d76d17d 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -141,6 +141,7 @@ typedef struct IndexAmRoutine
 amproperty_function amproperty; /* can be NULL */
 ambuildphasename_function ambuildphasename;   /* can be NULL */
 amvalidate_function amvalidate;
+amcheckmembers_function amcheckmembers; /* can be NULL */
 ambeginscan_function ambeginscan;
 amrescan_function amrescan;
 amgettuple_function amgettuple; /* can be NULL */
@@ -500,7 +501,48 @@ amvalidate (Oid opclassoid);
the access method can reasonably do that.  For example, this might include
testing that all required support functions are provided.
The amvalidate function must return false if the opclass is
-   invalid.  Problems should be reported with ereport messages.
+   invalid.  Problems should be reported with ereport
+   messages, typically at INFO level.
+  
+
+  
+
+void
+amcheckmembers (Oid opfamilyoid,
+Oid opclassoid,
+List *operators,
+List *functions);
+
+   Validate proposed operator and function members of an operator family,
+   so far as the access method can reasonably do that, and set their
+   dependency types if the default is not satisfactory.  This is called
+   during CREATE OPERATOR CLASS and during
+   ALTER OPERATOR FAMILY ADD; in the latter
+   case opclassoid is InvalidOid.
+   The List arguments are lists
+   of OpFamilyMember structs, as defined
+   in amapi.h.
+
+   Tests done by this function will typically be a subset of those
+   performed by amvalidate,
+   since amcheckmembers cannot assume that it is
+   seeing a complete set of members.  For example, it would be reasonable
+   to check the signature of a support function, but not to check whether
+   all required support functions are provided.  Any problems can be
+   reported by throwing an error.
+
+   The dependency-related fields of
+   the OpFamilyMember structs are initialized by
+   the core code to create hard dependencies on the opclass if this
+   is CREATE OPERATOR CLASS, or soft dependencies on the
+   opfamily if this is ALTER OPERATOR FAMILY ADD.
+   amcheckmembers can adjust these fields if some other
+   behavior is more appropriate.  For example, GIN, GiST, and SP-GiST
+   always set operator members to have soft dependencies on the opfamily,
+   since the connection between an operator and an opclass is relatively
+   weak in these index types; so it is reasonable to allow operator members
+   to be added and removed freely.  Optional support functions are typically
+   also given soft dependencies, so that they can be removed if necessary.
   
 
 
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7db3ae5..b911029 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -120,6 +120,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amproperty = NULL;
 	amroutine->ambuildphasename = NULL;
 	amroutine->amvalidate = brinvalidate;
+	amroutine->amcheckmembers = NULL;
 	amroutine->ambeginscan = brinbeginscan;
 	amroutine->amrescan = brinrescan;
 	amroutine->amgettuple = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a400f1f..26aa152 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -71,6 +71,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amproperty = NULL;
 	amroutine->ambuildphasename = NULL;
 	amroutine->amvalidate = ginvalidate;
+	amroutine->amcheckmembers = gincheckmembers;
 	amroutine->ambeginscan = ginbeginscan;
 	amroutine->amrescan = ginrescan;
 	amroutine->amgettuple = NULL;
diff --git a/src/backend/access/gin/ginvalidate.c b/src/backend/access/gin/ginvalidate.c
index 1e3046f..739006a 100644
--- a/src/backend/access/gin/ginvalidate.c
+++ b/src/backend/access/gin/ginvalidate.c
@@ -271,3 +271,64 @@ ginvalidate(Oid opclassoid)
 
 	return result;
 }
+
+/*
+ * Prechecking function for adding operators/functions to a GIN opfamily.
+ */
+void
+gincheckmembers(Oid opfamilyoid,
+Oid opclassoid,
+List *operators,
+List *functions)
+{
+	ListCell   *lc;
+
+	/*
+	 * Operator members of a GIN 

Re: Small docs bugfix: make it clear what can be used in UPDATE FROM and DELETE USING

2020-03-31 Thread Bruce Momjian
On Wed, Mar 18, 2020 at 12:24:45PM -0400, Bruce Momjian wrote:
> On Tue, Mar 17, 2020 at 10:58:54PM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > I have implemented the ideas above in the attached patch.  I have
> > > synchronized the syntax to match SELECT, and synchronized the paragraphs
> > > describing the item.
> > 
> > I think that the DELETE synopsis should look like
> > 
> > [ USING from_item [, ...] ]
> > 
> > so that there's not any question which part of the SELECT syntax we're
> > talking about.  I also think that the running text in both cases should
> > say in exactly these words "from_item means the same thing as it does
> > in SELECT"; the wording you propose still seems to be dancing around
> > the point, leaving readers perhaps not quite sure about what is meant.
> > 
> > In the DELETE case you could alternatively say "using_item means the same
> > thing as from_item does in SELECT", but that doesn't really seem like an
> > improvement to me.
> 
> OK, updated patch attached.

Patch appied through 9.5.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-31 Thread Dean Rasheed
On Tue, 31 Mar 2020 at 04:39, David Rowley  wrote:
>
> On Sat, 28 Mar 2020 at 22:22, David Rowley  wrote:
> > I'm unsure yet if this has caused an instability on lousyjack's run in
> > [1].
>
> pogona has just joined in on the fun [1], so, we're not out the woods
> on this yet. I'll start having a look at this in more detail.
>
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2023%3A10%3A03
>

I had a go at reproducing this. I wasn't able to produce the reported
failure, but I can reliably produce an Assert failure that may be
related by doing a VACUUM FULL simultaneously with an ANALYZE that is
generating extended stats, which produces:

#0  0x7f28081c9520 in raise () from /lib64/libc.so.6
#1  0x7f28081cab01 in abort () from /lib64/libc.so.6
#2  0x00aad1ad in ExceptionalCondition (conditionName=0xb2f1a1
"ItemIdIsNormal(lp)", errorType=0xb2e7c9 "FailedAssertion",
fileName=0xb2e848 "heapam.c", lineNumber=3016) at assert.c:67
#3  0x004fb79e in heap_update (relation=0x7f27feebeda8,
otid=0x2d881fc, newtup=0x2d881f8, cid=0, crosscheck=0x0, wait=true,
tmfd=0x7ffc568a5900, lockmode=0x7ffc568a58fc) at heapam.c:3016
#4  0x004fdead in simple_heap_update (relation=0x7f27feebeda8,
otid=0x2d881fc, tup=0x2d881f8) at heapam.c:3902
#5  0x005be860 in CatalogTupleUpdate (heapRel=0x7f27feebeda8,
otid=0x2d881fc, tup=0x2d881f8) at indexing.c:230
#6  0x008df898 in statext_store (statOid=18964, ndistinct=0x0,
dependencies=0x2a85fe0, mcv=0x0, stats=0x2a86570) at
extended_stats.c:553
#7  0x008deec0 in BuildRelationExtStatistics
(onerel=0x7f27feed9008, totalrows=5000, numrows=5000, rows=0x2ad5a30,
natts=7, vacattrstats=0x2a75f40) at extended_stats.c:187
#8  0x0065c1b7 in do_analyze_rel (onerel=0x7f27feed9008,
params=0x7ffc568a5fc0, va_cols=0x0, acquirefunc=0x65ce37
, relpages=31, inh=false, in_outer_xact=false,
elevel=13) at analyze.c:606
#9  0x0065b532 in analyze_rel (relid=18956,
relation=0x29b0bc0, params=0x7ffc568a5fc0, va_cols=0x0,
in_outer_xact=false, bstrategy=0x2a7dfa0) at analyze.c:263
#10 0x006fd768 in vacuum (relations=0x2a7e148,
params=0x7ffc568a5fc0, bstrategy=0x2a7dfa0, isTopLevel=true) at
vacuum.c:468
#11 0x006fd22c in ExecVacuum (pstate=0x2a57a00,
vacstmt=0x29b0ca8, isTopLevel=true) at vacuum.c:251

It looks to me as though the problem is that statext_store() needs to
take its lock on pg_statistic_ext_data *before* searching for the
stats tuple to update.

It's late here, so I haven't worked up a patch yet, but it looks
pretty straightforward.

Regards,
Dean




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-03-31 Thread Justin Pryzby
On Tue, Mar 17, 2020 at 02:04:01PM -0500, Justin Pryzby wrote:
> > The example in the documentation could be better indented. Also, ISTM that
> > there are two implicit laterals (format & pg_ls_dir_recurse) that I would
> > make explicit. I'd use the pcs alias explicitly. I'd use meaningful aliases
> > (eg ts instead of b, …).
> 
> > On reflection, I think that a boolean "isdir" column is a bad idea because
> > it is not extensible. I'd propose to switch to the standard "ls" approach of
> > providing the type as one character: '-' for regular, 'd' for directory, 'l'
> > for link, 's' for socket, 'c' for character special…
> 
> I think that's outside the scope of the patch, since I'd want to change
> pg_stat_file; that's where I borrowed "isdir" from, for consistency.
> 
> Note that both LS_DIR_HISTORIC and LS_DIR_MODERN include LS_DIR_SKIP_SPECIAL,
> so only pg_ls_dir itself shows specials, so the way to do it would be to 1)
> change pg_stat_file to expose the file's "type", 2) use pg_ls_dir() AS a,
> lateral pg_stat_file(a) AS b, 3) then consider also changing LS_DIR_MODERN and
> all the existing pg_ls_*.

The patch intends to fix the issue of "failing to show failed filesets"
(because dirs are skipped) while also generalizing existing functions (to show
directories and "isdir" column) and providing some more flexible ones (to list
file and metadata of a dir, which is currently possible [only] for "special"
directories, or by recursively calling pg_stat_file).

I'm still of the opinion that supporting arbitrary file types is out of scope,
but I changed the "isdir" to show "type".  I'm only supporting '[-dl]'.  I
don't want to have to check #ifdef S_ISDOOR or whatever other vendors have.  I
insist that it is a separate patch, since it depends on everything else, and I
have no feedback from anybody else as to whether any of that is desired.
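A hedged sketch of the "type" column semantics described above (an illustration of the '[-dl]' decision, not the patch's C code): map an lstat() mode to the ls-style character, supporting only '-', 'd', and 'l', and falling back to '?' for anything else (sockets, doors, and other vendor-specific types).

```python
import stat


def type_char(mode):
    """Return an ls-style type character for an lstat() st_mode,
    supporting only regular files, directories, and symlinks."""
    if stat.S_ISREG(mode):
        return '-'
    if stat.S_ISDIR(mode):
        return 'd'
    if stat.S_ISLNK(mode):
        return 'l'
    # Anything else (socket, fifo, device, ...) is shown as unknown,
    # avoiding #ifdef checks for non-portable file types.
    return '?'
```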

template1=# SELECT * FROM pg_ls_waldir();
       name       |   size   |         access         |      modification      |         change         | creation | type
------------------+----------+------------------------+------------------------+------------------------+----------+------
 barr             |        0 | 2020-03-31 14:43:11-05 | 2020-03-31 14:43:11-05 | 2020-03-31 14:43:11-05 |          | ?
 baz              |     4096 | 2020-03-31 14:39:18-05 | 2020-03-31 14:39:18-05 | 2020-03-31 14:39:18-05 |          | d
 foo              |        0 | 2020-03-31 14:39:37-05 | 2020-03-31 14:39:37-05 | 2020-03-31 14:39:37-05 |          | -
 archive_status   |     4096 | 2020-03-31 14:38:20-05 | 2020-03-31 14:38:18-05 | 2020-03-31 14:38:18-05 |          | d
 00010001 | 16777216 | 2020-03-31 14:42:53-05 | 2020-03-31 14:43:08-05 | 2020-03-31 14:43:08-05 |          | -
 bar              |        3 | 2020-03-31 14:39:16-05 | 2020-03-31 14:39:01-05 | 2020-03-31 14:39:01-05 |          | l

From f6b54482cc1d1b9595f49211b14807a20f994a3f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 16 Mar 2020 14:12:55 -0500
Subject: [PATCH v15 01/10] Document historic behavior of links to
 directories..

Backpatch to 9.5: pg_stat_file
---
 doc/src/sgml/func.sgml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7a0bb0c70a..25b1278459 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21558,7 +21558,8 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 size, last accessed time stamp, last modified time stamp,
 last file status change time stamp (Unix platforms only),
 file creation time stamp (Windows only), and a boolean
-indicating if it is a directory.  Typical usages include:
+indicating if it is a directory (or a symbolic link to a directory).
+Typical usages include:
 
 SELECT * FROM pg_stat_file('filename');
 SELECT (pg_stat_file('filename')).modification;
-- 
2.17.0

From e1c8750fb777df4bf41f36df7ae6cde39b907ec2 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 30 Mar 2020 18:59:16 -0500
Subject: [PATCH v15 02/10] pg_stat_file and pg_ls_dir_* to use lstat()..

pg_ls_dir_* will now skip (no longer show) symbolic links, same as other
non-regular file types, as we advertise we do since 8b6d94cf6.  That seems to
be the intended behavior, since irregular file types are 1) less portable; and,
2) we don't currently show a file's type except for "bool is_dir".

pg_stat_file will now 1) show metadata of links themselves, rather than their
target; and, 2) specifically, show links to directories with "is_dir=false";
and, 3) not error on broken symlinks.
---
 doc/src/sgml/func.sgml  | 2 +-
 src/backend/utils/adt/genfile.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 25b1278459..630f03d7dd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21558,7 +21558,7 @@ 

Re: Improving connection scalability: GetSnapshotData()

2020-03-31 Thread Andres Freund
Hi,

I'm still fighting with snapshot_too_old. The feature is just badly
undertested, underdocumented, and there's lots of other oddities. I've
now spent about as much time on that feature than on the whole rest of
the patchset.

As an example for under-documented, here's a definitely non-trivial
block of code without a single comment explaining what it's doing.

	if (oldSnapshotControl->count_used > 0 &&
		ts >= oldSnapshotControl->head_timestamp)
	{
		int			offset;

		offset = ((ts - oldSnapshotControl->head_timestamp)
				  / USECS_PER_MINUTE);
		if (offset > oldSnapshotControl->count_used - 1)
			offset = oldSnapshotControl->count_used - 1;
		offset = (oldSnapshotControl->head_offset + offset)
			% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
		xlimit = oldSnapshotControl->xid_by_minute[offset];

		if (NormalTransactionIdFollows(xlimit, recentXmin))
			SetOldSnapshotThresholdTimestamp(ts, xlimit);
	}

LWLockRelease(OldSnapshotTimeMapLock);
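To illustrate what the uncommented block appears to be doing (this reading is an assumption, not authoritative), here is a Python model: the old-snapshot time map is a circular per-minute array, and the code computes which bucket a timestamp falls in, clamping to the newest entry actually in use.

```python
USECS_PER_MINUTE = 60 * 1_000_000


def time_map_offset(ts, head_timestamp, head_offset, count_used, nentries):
    """Index into a circular per-minute array (modeling xid_by_minute[])
    for timestamp ts, clamped to the last used entry."""
    # Whole minutes elapsed since the head of the map.
    offset = (ts - head_timestamp) // USECS_PER_MINUTE
    # Clamp to the newest entry actually in use.
    if offset > count_used - 1:
        offset = count_used - 1
    # Wrap around the circular buffer.
    return (head_offset + offset) % nentries
```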

Also, SetOldSnapshotThresholdTimestamp() acquires a separate spinlock -
not great to call that with the lwlock held.


Then there's this comment:

		/*
		 * Failsafe protection against vacuuming work of active transaction.
		 *
		 * This is not an assertion because we avoid the spinlock for
		 * performance, leaving open the possibility that xlimit could advance
		 * and be more current; but it seems prudent to apply this limit.  It
		 * might make pruning a tiny bit less aggressive than it could be, but
		 * protects against data loss bugs.
		 */
if (TransactionIdIsNormal(latest_xmin)
&& TransactionIdPrecedes(latest_xmin, xlimit))
xlimit = latest_xmin;

if (NormalTransactionIdFollows(xlimit, recentXmin))
return xlimit;

So this is not using a lock, so the values aren't accurate, but it avoids
data loss bugs? I also don't know which spinlock is avoided on the path
here as mentioned - the acquisition is unconditional.

But more importantly - if this is about avoiding data loss bugs, how on
earth is it ok that we don't go through these checks in the
old_snapshot_threshold == 0 path?

		/*
		 * Zero threshold always overrides to latest xmin, if valid.  Without
		 * some heuristic it will find its own snapshot too old on, for
		 * example, a simple UPDATE -- which would make it useless for most
		 * testing, but there is no principled way to ensure that it doesn't
		 * fail in this way.  Use a five-second delay to try to get useful
		 * testing behavior, but this may need adjustment.
		 */
if (old_snapshot_threshold == 0)
{
if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
&& TransactionIdFollows(latest_xmin, xlimit))
xlimit = latest_xmin;

ts -= 5 * USECS_PER_SEC;
SetOldSnapshotThresholdTimestamp(ts, xlimit);

return xlimit;
}


This feature looks like it was put together by applying force until
something gave, and then stopping just there.


Greetings,

Andres Freund




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-31, Alvaro Herrera wrote:

> On 2020-Mar-31, Alvaro Herrera wrote:
> 
> > I'm not sure if I explained my proposal clearly.  What if
> > XLogGetLastRemovedSegno returning zero means that every segment is
> > valid?  We don't need to scan pg_xlog at all.
> 
> I mean this:

[v21 does it that way.  Your typo fixes are included, but not the
LastRemoved stuff being discussed here.  I also edited the shortdesc in
guc.c to better match {min,max}_wal_size.]

Hmm ... but if the user runs pg_resetwal to remove WAL segments, then
this will work badly for a time (until a segment is removed next).  I'm
not very worried for that scenario, since surely the user will have to
reclone any standbys anyway.  I think your v20 behaves better in that
case.  But I'm not sure we should have that code to cater only to that
case ... seems to me that it will go untested 99.999% of the time.

Maybe you're aware of some other cases where lastRemovedSegNo is not
correct for the purposes of this feature?

I pushed the silly test_decoding test adjustment to get it out of the
way.

/me tries to figure out KeepLogSeg next

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From a37c380e9808c315ba7e1240d16c7d583e03ff39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 19 Dec 2018 12:43:57 +0900
Subject: [PATCH v21 1/2] Add WAL relief vent for replication slots

Replication slots are useful to maintain a replication connection in
configurations where replication is so delayed that the connection would
otherwise be broken. On the other hand, so many WAL files can pile up that
they fill the disk and take the master down. This feature, which is
activated by a GUC "max_slot_wal_keep_size", protects master servers from
suffering disk full by limiting the number of WAL files reserved by
replication slots.
---
 doc/src/sgml/catalogs.sgml|  48 +++
 doc/src/sgml/config.sgml  |  23 ++
 doc/src/sgml/high-availability.sgml   |   8 +-
 src/backend/access/transam/xlog.c | 322 --
 src/backend/catalog/system_views.sql  |   4 +-
 src/backend/replication/slot.c|   1 +
 src/backend/replication/slotfuncs.c   |  39 ++-
 src/backend/utils/misc/guc.c  |  13 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h |  19 ++
 src/include/catalog/pg_proc.dat   |   6 +-
 src/test/recovery/t/018_replslot_limit.pl | 203 +++
 src/test/regress/expected/rules.out   |   6 +-
 13 files changed, 657 insertions(+), 36 deletions(-)
 create mode 100644 src/test/recovery/t/018_replslot_limit.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 64614b569c..01a7802ed4 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9907,6 +9907,54 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
  
 
+ 
+  wal_status
+  text
+  
+
+  Availability of WAL files claimed by this slot.
+  Valid values are:
+   
+
+ normal means that the claimed files
+ are within max_wal_size
+
+
+ keeping means that max_wal_size
+ is exceeded but still held by replication slots or
+ wal_keep_segments
+
+
+ losing means that some of the files are on the verge
+ of deletion, but can still be accessed by a session that's currently
+ reading it
+
+
+ lost means that some of them are definitely lost
+ and the session using this slot cannot continue replication.
+ This state also implies that the session using this slot has been
+ stopped.
+
+   
+  The last two states are seen only when
+   is
+  non-negative. If restart_lsn is NULL, this
+  field is null.
+  
+ 
+
+ 
+  remain
+  bigint
+  
+  The amount in bytes of WAL that can be written before this slot
+loses required WAL files.
+If restart_lsn is null or
+wal_status is losing
+or lost, this field is null.
+  
+ 
+
 

   
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..dc99c6868a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3758,6 +3758,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
   
  
 
+  
+   max_slot_wal_keep_size (integer)
+   
+max_slot_wal_keep_size configuration parameter
+   
+   
+   
+   
+Specify the maximum size of WAL files
+that replication
+slots are allowed to retain in the pg_wal
+directory at checkpoint time.
+If max_slot_wal_keep_size is -1 (the default),
+replication slots retain unlimited amount of WAL files.  If
+

Re: Add A Glossary

2020-03-31 Thread Corey Huinker
On Tue, Mar 31, 2020 at 2:09 PM Justin Pryzby  wrote:

> On Sun, Oct 13, 2019 at 04:52:05PM -0400, Corey Huinker wrote:
> > 1. It's obviously incomplete. There are more terms, a lot more, to add.
>
> How did you come up with the initial list of terms ?
>

1. I asked some newer database people to come up with a list of terms that
they used.
2. I then added some more terms that seemed obvious given that first list.
3. That combined list was long on general database concepts and theory, and
short on administration concepts
4. Then Jürgen suggested that we integrate his working list of terms, very
much focused on internals, so I did that.
5. Everything after that was applying suggested edits and new terms.


> Here's some ideas; I'm *not* suggesting to include all of everything, but
> hopefully start with a coherent, self-contained list.
>

I don't think this list will ever be complete. It will always be a work in
progress. I'd prefer to get the general structure of a glossary committed
in the short term, and we're free to follow up with edits that focus on the
wording.


>
> grep -roh '[^<]*' doc/src/ |sed 's/.*/\L&/' |sort |uniq -c
> |sort -nr |less
>
> Maybe also:
> object identifier
> operator classes
> operator family
> visibility map
>

Just so I can prioritize my work, which of these things, along with your
suggestions in previous emails, would you say is a barrier to considering
this ready for a committer?


Re: WIP: System Versioned Temporal Table

2020-03-31 Thread Eli Marmor
Hi Surafel and the rest,

I'm the owner of the Israeli meetup group of PostgreSQL, and I'm interested
in Temporality and have been trying for several years a few ways to add it
to PostgreSQL
(all of them through extensions and external ways).
I'm happy that this is done by you internally (and a little bit
disappointed that it's delayed again and again, but that's life...).
I'll be happy to join this effort.
I can't promise that I'll succeed to contribute anything, but first I want
to play with it a little.
To save me several hours, can you advise me what is the best way to install
it?
Which exact version of PG should I apply this patch to?

Thanks in advance, and thanks for your great work!
Eli

On Tue, Mar 31, 2020 at 10:04 PM Surafel Temesgen 
wrote:

> Hi,
>
> On Tue, Mar 3, 2020 at 9:33 PM David Steele  wrote:
>
>> Hi Surafel,
>>
>> On 1/3/20 5:57 AM, Surafel Temesgen wrote:
>> > Rebased and conflicts resolved; I hope it builds cleanly this time
>>
>> This patch no longer applies according to cfbot and there are a number
>> of review comments that don't seem to have been addressed yet.
>>
>> The patch is not exactly new for this CF but since the first version was
>> posted 2020-01-01 and there have been no updates (except a rebase) since
>> then it comes pretty close.
>>
>> Were you planning to work on this for PG13?  If so we'll need to see a
>> rebased and updated patch pretty soon.  My recommendation is that we
>> move this patch to PG14.
>>
>>
> I agree with moving to PG14. It's hard to make it to PG13.
>
> regards
> Surafel
>


Re: recovery_target_action=pause with confusing hint

2020-03-31 Thread Sergei Kornilov
>>>> My proposal does not change the behavior after this commit, only
>>>> changing the lines in the logs.
>>>
>>> Yes. What's your opinion about the latest patch based on Robert's idea?
>>> Barring any objection, I'd like to commit that.
>>
>> Oh, sorry. Looks good and solves my issue
>
> Thanks for reviewing the patch! Pushed!

Thank you!

regards, Sergei




Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-31 Thread Julien Rouhaud
On Wed, Apr 01, 2020 at 02:43:10AM +0900, Fujii Masao wrote:
>
>
> On 2020/03/31 16:33, Julien Rouhaud wrote:
> >
> > v12 attached!
>
> Thanks for updating the patch! The patch looks good to me.
>
> I applied minor and cosmetic changes into the patch. Attached is
> the updated version of the patch. Barring any objection, I'd like to
> commit this version.
>
> BTW, the minor and cosmetic changes that I applied are, for example,
>
> - Rename pgss_planner_hook to pgss_planner for the sake of consistency.
>Other functions using hooks in pgss don't use "_hook" in their names, either.
> - Make pgss_planner use PG_FINALLY() instead of PG_CATCH().
> - Make PGSS_NUMKIND as the last value in enum pgssStoreKind.


+1, and the PGSS_INVALID is also way better.


> - Update the sample output in the document.
> etc


Thanks a lot.  It all looks good to me!




Re: recovery_target_action=pause with confusing hint

2020-03-31 Thread Fujii Masao




On 2020/03/31 19:33, Sergei Kornilov wrote:

Hello


  My proposal does not change the behavior after this commit, only changing the 
lines in the logs.


Yes. What's your opinion about the latest patch based on Robert's idea?
Barring any objection, I'd like to commit that.


Oh, sorry. Looks good and solves my issue


Thanks for reviewing the patch! Pushed!

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Ecpg dependency

2020-03-31 Thread Bruce Momjian
On Sat, Mar 21, 2020 at 06:13:03PM -0400, Bruce Momjian wrote:
> On Sat, Mar 21, 2020 at 07:30:48PM +, Dagfinn Ilmari Mannsåker wrote:
> > Bruce Momjian  writes:
> > > On Sat, Mar 21, 2020 at 02:14:44PM -0400, Bruce Momjian wrote:
> > > Oh, I forgot to mention I got this line from
> > > src/interfaces/libpq/Makefile:
> > 
> > And that line is wrong, but my patch to fix it¹ seems to have fallen
> > between the cracks.
> > 
> > [1] 
> > https://www.postgresql.org/message-id/flat/871rsa13ae.fsf%40wibble.ilmari.org
> >  
> > 
> > Adding the dependency to `all-lib` only fixes it for serial builds. To
> > fix it properly, so it works with parallel builds (e.g. 'make -j4 -C
> > src/interfaces/ecpg'), the dependency needs to be declared via
> > SHLIB_PREREQS, as attached
> 
> Oh, good catch.  I did not notice that patch before.  Adding that change
> to src/interfaces/ecpg/pgtypeslib/Makefile fixes the stand-alone
> compile.

Patch applied and backpatched to PG 12.  Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Add A Glossary

2020-03-31 Thread Justin Pryzby
On Sun, Oct 13, 2019 at 04:52:05PM -0400, Corey Huinker wrote:
> 1. It's obviously incomplete. There are more terms, a lot more, to add.

How did you come up with the initial list of terms?

Here's some ideas; I'm *not* suggesting to include all of everything, but
hopefully start with a coherent, self-contained list.

grep -roh '[^<]*' doc/src/ |sed 's/.*/\L&/' |sort |uniq -c |sort -nr 
|less

Maybe also:
object identifier
operator classes
operator family
visibility map

-- 
Justin




Re: Add A Glossary

2020-03-31 Thread Justin Pryzby
On Mon, Mar 30, 2020 at 01:10:19PM -0400, Corey Huinker wrote:
> +   
> +Aggregating
> +
> + 
> +  The act of combining a collection of data (input) values into
> +  a single output value, which may not be of the same type as the
> +  input values.

I think we maybe already tried to address this; but could we define a noun
form?  But not "aggregate", since it's the same word as the verb form.  I think
it would maybe be best to merge with "aggregate function", below.

> +   
> +Consistency
> +
> + 
> +  One of the ACID properties. This means that the 
> database
> +  is always in compliance with its own rules such as 
> Table
> +  structure, Constraints,

I don't think the definition of "compliance" is good.  The state of being
consistent means an absence of corruption more than an absence of data
integrity issues (which could be caused by corruption).

> +   
> +Datum
> +
> + 
> +  The internal representation of a SQL data type.

Could you say "..used by PostgreSQL"?

> +File Segment
> +
> + 
> +   A physical file which stores data for a given
> +   Heap or Index object.
> +   File Segments are limited in size by a
> +   configuration value and if that size is exceeded, it will be split
> +   into multiple physical files.

Say "if an object exceeds that size, then it will be stored across multiple
physical files".

> +  which handles parts of an SQL query to take
...
> +  A SQL command used to add new data into a

I mentioned before, please be consistent: "A SQL or An SQL".

> + 
> + 
> +  Many Instances can run on the same server as

Say "multiple" not many.

> +   
> +Join
> +
> + 
> +  A SQL keyword used in SELECT 
> statements for
> +  combining data from multiple Relations.

Could you add a link to the docs?

> +   
> +Log Writer
> +
> + 
> +  If activated and parameterized, the

I still don't know what parameterized means here.

> +   
> +System Catalog
> +
> + 
> +  A collection of Tables and
> +  Views which describe the structure of all
> +  SQL objects of the Database

I would say "... a PostgreSQL >Database<"

> +  and the Global SQL Objects of the
> +  Cluster. The System
> +  Catalog resides in the schema
> +  pg_catalog. Main parts are mirrored as
> +  Views in the Schema
> +  information_schema.

I wouldn't say "mirror":  Some information is also exposed as >Views< in the
>information_schema< >Schema<.

> +   
> +Tablespace
> +
> + 
> +  A named location on the server filesystem. All SQL 
> Objects
> +  which require storage beyond their definition in the
> +  System Catalog
> +  must belong to a single tablespace.

Remove "single" as it sounds like we only support one.

> +Transaction
> +
> + 
> +  A combination of commands that must act as a single
> +  Atomic command: they all succeed or all fail
> +  as a single unit, and their effects are not visible to other
> +  Sessions until
> +  the Transaction is complete.

s/complete/committed/ ?


> +   
> +Unique
> +
> + 
> +  The condition of having no duplicate values in the same
> +  Relation. Often used in the concept of

s/concept/context/

> +Vacuum
> +
> + 
> +  The process of removing outdated MVCC

Maybe say "tuples which were deleted or obsoleted by an UPDATE".
But maybe you're trying to use generic language.

-- 
Justin




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tom Lane
James Coleman  writes:
> On Tue, Mar 31, 2020 at 1:04 PM Tom Lane  wrote:
>> Perhaps the semantics are such that that's actually sensible, but it's
>> far from a straightforward remapping of the old enum.

> Right, I didn't see the explicit "= 0" in other enums there, so it
> made me wonder if it was intentional to designate that one had to be
> 0, but I guess without a comment that's a lot of inference.

It's possible that somebody meant that as an indicator that the code
depends on palloc0() leaving the field with that value.  But if so,
you'd soon find that out ... and an actual comment would be better,
anyway.

regards, tom lane




Re: Add A Glossary

2020-03-31 Thread Justin Pryzby
On Tue, Mar 31, 2020 at 04:13:00PM +0200, Jürgen Purtz wrote:
> Please find some minor suggestions in the attachment. They are based on
> Corey's last patch 0001-glossary-v4.patch.

> @@ -220,7 +220,7 @@
>Records to the file system and creates a special
>checkpoint record. This process is initiated when predefined
>conditions are met, such as a specified amount of time has passed, or
> -  a certain volume of records have been collected.
> +  a certain volume of records has been collected.

I think you're correct in that "volume" is singular.  But I think "collected"
is the wrong word.  I suggested "written".

>   
> -  One of the ACID properties. This means that 
> concurrently running 
> +  One of the ACID properties. This means that 
> concurrently running

These could maybe say "required" or "essential" >ACID< properties

>   
> +  In reference to a Table:
>A Table that can be queried directly,

Maybe: "In reference to a >Relation<: A table which can be queried directly,"

>table in the collection.
>   
>   
> -  When referring to an Analytic
> -  Function: a partition is a definition
> -  that identifies which neighboring
> +  In reference to a Analytic Function:
s/a/an/

> @@ -1333,7 +1334,8 @@
>  
>   
>The condition of having no duplicate values in the same
> -  Relation. Often used in the concept of
> +  Column of a Relation.
> +  Often used in the concept of

s/concept/context/, but I said that before, so maybe it was rejected.

-- 
Justin




Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Tom Lane
Renamed "matchsel" to "matchingsel" etc, added DEFAULT_MATCHING_SEL,
rebased over commit 911e70207.  Since that commit already created
new versions of the relevant contrib modules, I think we can just
redefine what those versions contain, rather than making yet-newer
versions.  (Of course, that assumes we're going to include this in
v13.)

regards, tom lane

diff --git a/contrib/ltree/ltree_op.c b/contrib/ltree/ltree_op.c
index 34e6e4b..f606761 100644
--- a/contrib/ltree/ltree_op.c
+++ b/contrib/ltree/ltree_op.c
@@ -566,8 +566,6 @@ ltree2text(PG_FUNCTION_ARGS)
 }
 
 
-#define DEFAULT_PARENT_SEL 0.001
-
 /*
  *	ltreeparentsel - Selectivity of parent relationship for ltree data types.
  */
@@ -578,101 +576,12 @@ ltreeparentsel(PG_FUNCTION_ARGS)
 	Oid			operator = PG_GETARG_OID(1);
 	List	   *args = (List *) PG_GETARG_POINTER(2);
 	int			varRelid = PG_GETARG_INT32(3);
-	VariableStatData vardata;
-	Node	   *other;
-	bool		varonleft;
 	double		selec;
 
-	/*
-	 * If expression is not variable <@ something or something <@ variable,
-	 * then punt and return a default estimate.
-	 */
-	if (!get_restriction_variable(root, args, varRelid,
-								  &vardata, &other, &varonleft))
-		PG_RETURN_FLOAT8(DEFAULT_PARENT_SEL);
-
-	/*
-	 * If the something is a NULL constant, assume operator is strict and
-	 * return zero, ie, operator will never return TRUE.
-	 */
-	if (IsA(other, Const) &&
-		((Const *) other)->constisnull)
-	{
-		ReleaseVariableStats(vardata);
-		PG_RETURN_FLOAT8(0.0);
-	}
-
-	if (IsA(other, Const))
-	{
-		/* Variable is being compared to a known non-null constant */
-		Datum		constval = ((Const *) other)->constvalue;
-		FmgrInfo	contproc;
-		double		mcvsum;
-		double		mcvsel;
-		double		nullfrac;
-		int			hist_size;
-
-		fmgr_info(get_opcode(operator), &contproc);
-
-		/*
-		 * Is the constant "<@" to any of the column's most common values?
-		 */
-		mcvsel = mcv_selectivity(&vardata, &contproc, constval, varonleft,
-								 &mcvsum);
-
-		/*
-		 * If the histogram is large enough, see what fraction of it the
-		 * constant is "<@" to, and assume that's representative of the
-		 * non-MCV population.  Otherwise use the default selectivity for the
-		 * non-MCV population.
-		 */
-		selec = histogram_selectivity(&vardata, &contproc,
-									  constval, varonleft,
-									  10, 1, &hist_size);
-		if (selec < 0)
-		{
-			/* Nope, fall back on default */
-			selec = DEFAULT_PARENT_SEL;
-		}
-		else if (hist_size < 100)
-		{
-			/*
-			 * For histogram sizes from 10 to 100, we combine the histogram
-			 * and default selectivities, putting increasingly more trust in
-			 * the histogram for larger sizes.
-			 */
-			double		hist_weight = hist_size / 100.0;
-
-			selec = selec * hist_weight +
-				DEFAULT_PARENT_SEL * (1.0 - hist_weight);
-		}
-
-		/* In any case, don't believe extremely small or large estimates. */
-		if (selec < 0.0001)
-			selec = 0.0001;
-		else if (selec > 0.9999)
-			selec = 0.9999;
-
-		if (HeapTupleIsValid(vardata.statsTuple))
-			nullfrac = ((Form_pg_statistic) GETSTRUCT(vardata.statsTuple))->stanullfrac;
-		else
-			nullfrac = 0.0;
-
-		/*
-		 * Now merge the results from the MCV and histogram calculations,
-		 * realizing that the histogram covers only the non-null values that
-		 * are not listed in MCV.
-		 */
-		selec *= 1.0 - nullfrac - mcvsum;
-		selec += mcvsel;
-	}
-	else
-		selec = DEFAULT_PARENT_SEL;
-
-	ReleaseVariableStats(vardata);
-
-	/* result should be in range, but make sure... */
-	CLAMP_PROBABILITY(selec);
+	/* Use generic restriction selectivity logic, with default 0.001. */
+	selec = generic_restriction_selectivity(root, operator,
+			args, varRelid,
+			0.001);
 
 	PG_RETURN_FLOAT8((float8) selec);
 }
diff --git a/doc/src/sgml/xoper.sgml b/doc/src/sgml/xoper.sgml
index 132056f..56b0849 100644
--- a/doc/src/sgml/xoper.sgml
+++ b/doc/src/sgml/xoper.sgml
@@ -283,6 +283,18 @@ column OP constant

 

+Another useful built-in selectivity estimation function
+is matchingsel, which will work for almost any
+binary operator, if standard MCV and/or histogram statistics are
+collected for the input data type(s).  Its default estimate is set to
+twice the default estimate used in eqsel, making
+it most suitable for comparison operators that are somewhat less
+strict than equality.  (Or you could call the
+underlying generic_restriction_selectivity
+function, providing a different default estimate.)
+   
+
+   
 There are additional selectivity estimation functions designed for geometric
 operators in src/backend/utils/adt/geo_selfuncs.c: areasel, positionsel,
 and contsel.  At this writing these are just stubs, but you might want
@@ -319,6 +331,7 @@ table1.column1 OP table2.column2
  scalarlejoinsel for <=
  scalargtjoinsel for >
  scalargejoinsel for >=
+  matchingjoinsel for generic matching operators
  areajoinsel for 2D area-based comparisons
  positionjoinsel for 2D position-based comparisons
  contjoinsel for 2D containment-based comparisons

Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-31 Thread Fujii Masao



On 2020/03/31 16:33, Julien Rouhaud wrote:

On Tue, Mar 31, 2020 at 04:10:47PM +0900, Fujii Masao wrote:



On 2020/03/31 15:03, Julien Rouhaud wrote:

On Tue, Mar 31, 2020 at 12:21:43PM +0900, Fujii Masao wrote:


On 2020/03/31 3:16, Julien Rouhaud wrote:

On Mon, Mar 30, 2020 at 6:36 PM Fujii Masao  wrote:


While testing the patched pgss, I found that the patched version
may track the statements that the original version doesn't.
Please imagine the case where the following queries are executed,
with pgss.track = top.

PREPARE hoge AS SELECT * FROM t;
EXPLAIN EXECUTE hoge;

The pgss view returned "PREPARE hoge AS SELECT * FROM t"
in the patched version, but not in the orignal version.

Is this problematic?


Oh indeed. That's a side effect of the executed query and the planned query
being different.

I guess the question is, for the top-level execution of a utility
statement containing an optimisable query, whether the top-level
planner call of that optimisable statement should be considered top level
or not.  I tend to think that's the correct behavior here, as this is
also what would happen if a regular DML statement was provided.  What do you
think?


TBH, not sure if that's ok yet...

I'm now just wondering if both plan_nested_level and
exec_nested_level should be incremented in pgss_ProcessUtility().
This is just a guess, so I need more investigation about this.


Yeah, after a second thought I realize that my comparison was wrong.  Allowing
*any* top-level planner call when pgss.track = top would mean that we should
also consider all planner calls from queries executed for FK checks and such,
which is definitely not the intended behavior.


Yes. So, basically any planner activity that happens during
the execution phase of the statement is not tracked.


FTR with this patch such calls still don't get tracked, but only because those
queries don't get a queryid assigned, not because the nesting level says so.

How about simply passing (plan_nested_level + exec_nested_level) for
pgss_enabled call in pgss_planner_hook?


Looks good to me! The comment about why this treatment is necessary only in
pgss_planner() should be added.



Yes of course.  It also requires some changes in the macro to make it safe when
called with an expression.

v12 attached!


Thanks for updating the patch! The patch looks good to me.

I applied minor and cosmetic changes into the patch. Attached is
the updated version of the patch. Barring any objection, I'd like to
commit this version.

BTW, the minor and cosmetic changes that I applied are, for example,

- Rename pgss_planner_hook to pgss_planner for the sake of consistency.
   Other functions using hooks in pgss don't use "_hook" in their names, either.
- Make pgss_planner use PG_FINALLY() instead of PG_CATCH().
- Make PGSS_NUMKIND as the last value in enum pgssStoreKind.
- Update the sample output in the document.
etc

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
diff --git a/contrib/pg_stat_statements/Makefile 
b/contrib/pg_stat_statements/Makefile
index 80042a0905..081f997d70 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,8 @@ OBJS = \
pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.6--1.7.sql \
+DATA = pg_stat_statements--1.4.sql \
+   pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
pg_stat_statements--1.1--1.2.sql pg_stat_statements--1.0--1.1.sql
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out 
b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 6787ec1efd..45dbe9e677 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -96,25 +96,26 @@ EXECUTE pgss_test(1);
 
 DEALLOCATE pgss_test;
 SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
-   query   | calls | rows 
+---+--
- PREPARE pgss_test (int) AS SELECT $1, $2 LIMIT $3 | 1 |1
- SELECT $1+| 4 |4
-  +|   | 
-   AS "text"   |   | 
- SELECT $1 + $2| 2 |2
- SELECT $1 + $2 + $3 AS "add"  | 3 |3
- SELECT $1 AS "float"  | 1 |1
- SELECT $1 AS "int"| 2 |2
- SELECT $1 AS i UNION SELECT $2 ORDER BY i | 1 |2
- SELECT $1 || $2   | 1 |

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread James Coleman
On Tue, Mar 31, 2020 at 1:04 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > + * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
> > + * instrumentation so needs to have each value be a separate bit.
>
> >> I don't quite understand why you skipped "1".  (Also, is the use of zero
> >> a wise choice?)
>
> > The assignment of 0 was already there, and there wasn't a comment to
> > indicate why. That ends up meaning we wouldn't display "still in
> > progress" as a type here, which is maybe desirable, but I'm honestly
> > not sure why it was that way originally. I'm curious if you have any
> > thoughts on it.
>
> As things stood, the "= 0" was a no-op, since the first enum value
> would've been that anyway.  But if you're converting this set of symbols
> to bits that can be OR'd together, it seems pretty strange to use zero,
> because that can't be distinguished from "absence of any entry".
>
> Perhaps the semantics are such that that's actually sensible, but it's
> far from a straightforward remapping of the old enum.

Right, I didn't see the explicit "= 0" in other enums there, so it
made me wonder if it was intentional to designate that one had to be
0, but I guess without a comment that's a lot of inference.

The semantics seemed somewhat useful here in theory, but since I'm not
hearing a "yeah, that was intentional but not commented", I'm just
going to change it to what you'd naturally expect.

James




Re: Explain: Duplicate key "Workers" in JSON format

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-31, Tom Lane wrote:

> Alvaro Herrera  writes:
> > We have not fixed this, have we?
> 
> Isn't this 10013684970453a0ddc86050bba813c64321 ?

Ah, right, another thread reporting the same thing, two months after
this one.

Thanks both :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-31, Alvaro Herrera wrote:

> I'm not sure if I explained my proposal clearly.  What if
> XLogGetLastRemovedSegno returning zero means that every segment is
> valid?  We don't need to scan pg_xlog at all.

I mean this:

XLogSegNo
FindOldestXLogFileSegNo(void)
{
XLogSegNo segno = XLogGetLastRemovedSegno();

/* this is the only special case we need to care about */
if (segno == 0)
return some-value;

return segno + 1;
}

... and that point one can further note that a freshly initdb'd system
(no file has been removed) has "1" as the first file.  So when segno is
0, you can return 1 and all should be well.  That means you can reduce
the function to this:

XLogSegNo
FindOldestXLogFileSegNo(void)
{
return XLogGetLastRemovedSegno() + 1;
}


The tests still pass with this coding.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[PATCH] Fix type var declaration (src/backend/access/brin/brin.c)

2020-03-31 Thread Ranier Vilela
Hi,
The bringetbitmap function returns an int64 value, but internally uses an int.
To prevent future bugs, fix it to use the right type.

best regards,

Ranier Vilela


fix_int64_brin.patch
Description: Binary data


Re: Explain: Duplicate key "Workers" in JSON format

2020-03-31 Thread Tom Lane
Alvaro Herrera  writes:
> We have not fixed this, have we?

Isn't this 10013684970453a0ddc86050bba813c64321 ?

regards, tom lane




Re: Explain: Duplicate key "Workers" in JSON format

2020-03-31 Thread Andres Freund
On 2020-03-31 14:06:50 -0300, Alvaro Herrera wrote:
> We have not fixed this, have we?

I thought we did:

commit 10013684970453a0ddc86050bba813c64321
Author: Tom Lane 
Date:   2020-01-25 18:16:42 -0500

Clean up EXPLAIN's handling of per-worker details.

Previously, it was possible for EXPLAIN ANALYZE of a parallel query
to produce several different "Workers" fields for a single plan node,
because different portions of explain.c independently generated
per-worker data and wrapped that output in separate fields.  This
is pretty bogus, especially for the structured output formats: even
if it's not technically illegal, most programs would have a hard time
dealing with such data.




Re: Explain: Duplicate key "Workers" in JSON format

2020-03-31 Thread Alvaro Herrera
We have not fixed this, have we?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Tom Lane
James Coleman  writes:
> + * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
> + * instrumentation so needs to have each value be a separate bit.

>> I don't quite understand why you skipped "1".  (Also, is the use of zero
>> a wise choice?)

> The assignment of 0 was already there, and there wasn't a comment to
> indicate why. That ends up meaning we wouldn't display "still in
> progress" as a type here, which is maybe desirable, but I'm honestly
> not sure why it was that way originally. I'm curious if you have any
> thoughts on it.

As things stood, the "= 0" was a no-op, since the first enum value
would've been that anyway.  But if you're converting this set of symbols
to bits that can be OR'd together, it seems pretty strange to use zero,
because that can't be distinguished from "absence of any entry".

Perhaps the semantics are such that that's actually sensible, but it's
far from a straightforward remapping of the old enum.

regards, tom lane




Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Tom Lane
Alexey Bashtanov  writes:
>> I was wondering about DEFAULT_MATCHING_SEL.  The difference in purpose
>> from DEFAULT_MATCH_SEL wouldn't be too obvious, but then it probably
>> wouldn't be anyway.

> Fine with me, especially if both new functions are renamed accordingly.

Yup, that would make sense, will do it like that.

regards, tom lane




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread James Coleman
On Tue, Mar 31, 2020 at 12:31 PM Alvaro Herrera
 wrote:
>
> On 2020-Mar-30, James Coleman wrote:
>
> > +/* 
> > + *Instruementation information for IncrementalSort
> > + * 
> > + */
> > +typedef struct IncrementalSortGroupInfo
> > +{
> > + int64   groupCount;
> > + longmaxDiskSpaceUsed;
> > + longtotalDiskSpaceUsed;
> > + longmaxMemorySpaceUsed;
> > + longtotalMemorySpaceUsed;
> > + SizesortMethods; /* bitmask of TuplesortMethod */
> > +} IncrementalSortGroupInfo;
>
> There's a typo "Instruementation" in the comment, but I'm more surprised
> that type Size is being used to store a bitmask.  It looks weird to me.
> Wouldn't it be more reasonable to use bits32 or some such?  (I first
> noticed this in the "sizeof(Size)" code that appears in the explain
> code.)

I just didn't know about bits32; I'll change.

> OTOH, aesthetically it would seem to be better to define these values
> using ones and increasing shifts (1 << 1 and so on), rather than powers
> of two:
>
> > + * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
> > + * instrumentation so needs to have each value be a separate bit.
> >   */
> >  typedef enum
> >  {
> >   SORT_TYPE_STILL_IN_PROGRESS = 0,
> > - SORT_TYPE_TOP_N_HEAPSORT,
> > - SORT_TYPE_QUICKSORT,
> > - SORT_TYPE_EXTERNAL_SORT,
> > - SORT_TYPE_EXTERNAL_MERGE
> > + SORT_TYPE_TOP_N_HEAPSORT = 2,
> > + SORT_TYPE_QUICKSORT = 4,
> > + SORT_TYPE_EXTERNAL_SORT = 8,
> > + SORT_TYPE_EXTERNAL_MERGE = 16
> >  } TuplesortMethod;
>
> I don't quite understand why you skipped "1".  (Also, is the use of zero
> a wise choice?)

The assignment of 0 was already there, and there wasn't a comment to
indicate why. That ends up meaning we wouldn't display "still in
progress" as a type here, which is maybe desirable, but I'm honestly
not sure why it was that way originally. I'm curious if you have any
thoughts on it.

I knew some projects used increasing shifts, but I wasn't sure what
the preference was here. I'll correct.

James




Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

2020-03-31 Thread Tom Lane
Justin Pryzby  writes:
> I suggest to leave stat() alone in your patch for stable releases.  I think
> it's okay if we change behavior so that a broken symlink is skipped instead of
> erroring (as a side effect of skipping ENOENT with stat()).  But not okay if 
> we
> change pg_ls_logdir() to hide symlinks in back branches.

Meh.  I'm not really convinced, but in the absence of anyone expressing
support for my position, I'll do it that way.  I don't think it's worth
doing both a stat and lstat to tell the difference between file-is-gone
and file-is-a-broken-symlink.

regards, tom lane




Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Alexey Bashtanov




I was wondering about DEFAULT_MATCHING_SEL.  The difference in purpose
from DEFAULT_MATCH_SEL wouldn't be too obvious, but then it probably
wouldn't be anyway.

Fine with me, especially if both new functions are renamed accordingly.

Best, Alex




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-30, James Coleman wrote:

> +/* 
> + *Instruementation information for IncrementalSort
> + * 
> + */
> +typedef struct IncrementalSortGroupInfo
> +{
> + int64   groupCount;
> + longmaxDiskSpaceUsed;
> + longtotalDiskSpaceUsed;
> + longmaxMemorySpaceUsed;
> + longtotalMemorySpaceUsed;
> + SizesortMethods; /* bitmask of TuplesortMethod */
> +} IncrementalSortGroupInfo;

There's a typo "Instruementation" in the comment, but I'm more surprised
that type Size is being used to store a bitmask.  It looks weird to me.
Wouldn't it be more reasonable to use bits32 or some such?  (I first
noticed this in the "sizeof(Size)" code that appears in the explain
code.)


OTOH, aesthetically it would seem to be better to define these values
using ones and increasing shifts (1 << 1 and so on), rather than powers
of two:

> + * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
> + * instrumentation so needs to have each value be a separate bit.
>   */
>  typedef enum
>  {
>   SORT_TYPE_STILL_IN_PROGRESS = 0,
> - SORT_TYPE_TOP_N_HEAPSORT,
> - SORT_TYPE_QUICKSORT,
> - SORT_TYPE_EXTERNAL_SORT,
> - SORT_TYPE_EXTERNAL_MERGE
> + SORT_TYPE_TOP_N_HEAPSORT = 2,
> + SORT_TYPE_QUICKSORT = 4,
> + SORT_TYPE_EXTERNAL_SORT = 8,
> + SORT_TYPE_EXTERNAL_MERGE = 16
>  } TuplesortMethod;

I don't quite understand why you skipped "1".  (Also, is the use of zero
a wise choice?)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Tom Lane
Alexey Bashtanov  writes:
>> So that leaves us with needing
>> to find a better name for this new one.  Any ideas?

> I'm thinking of something wide like
> opersel, operjoinsel, DEFAULT_OPER_SEL
> or maybe even
> genericsel, genericjoinsel, DEFAULT_GENERIC_SEL

Seems a little *too* generic :-(

I was wondering about DEFAULT_MATCHING_SEL.  The difference in purpose
from DEFAULT_MATCH_SEL wouldn't be too obvious, but then it probably
wouldn't be anyway.

regards, tom lane




Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Alexey Bashtanov




So that leaves us with needing
to find a better name for this new one.  Any ideas?

I'm thinking of something wide like
opersel, operjoinsel, DEFAULT_OPER_SEL
or maybe even
genericsel, genericjoinsel, DEFAULT_GENERIC_SEL

Best, Alex




Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Tom Lane
Alexey Bashtanov  writes:
> The only little thing I can think of is hardcoding it as 2 * DEFAULT_EQ_SEL.
> While I don't have any arguments against the value itself I think it 
> should be configurable independently.
> Sadly DEFAULT_MATCH_SEL name is already taken for text patterns.
> Not sure if it's a reason to rename all the stuff.

Yeah, I was going to invent a symbol till I noticed that DEFAULT_MATCH_SEL
was already taken :-(.

There are only about half a dozen uses of that in-core, so maybe we could
get away with renaming that one, but on the whole I'd rather leave it
alone in case some extension is using it.  So that leaves us with needing
to find a better name for this new one.  Any ideas?

regards, tom lane




Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Alexey Bashtanov

Quickly tested like this:

create table t(a jsonb);
insert into t select jsonb_object( array[(random() * 10)::int::text], 
'{" "}') from generate_series(1, 10);
insert into t select jsonb_object( array[(random() * 10)::int::text], 
array[(random() * 1000)::int::text]) from generate_series(1, 10);

explain analyze select * from t where a ? '1';
analyze t;
explain analyze select * from t where a ? '1';

Best, Alex




Re: Less-silly selectivity for JSONB matching operators

2020-03-31 Thread Alexey Bashtanov

Hi Tom,

The patches look entirely reasonable to me.
The second one needs to be rebased.

I like the idea of stubbing matchjoinsel for now,
as well as being careful with operators that may correlate with sort 
orderings.


The only little thing I can think of is hardcoding it as 2 * DEFAULT_EQ_SEL.
While I don't have any arguments against the value itself, I think it
should be configurable independently.

Sadly DEFAULT_MATCH_SEL name is already taken for text patterns.
Not sure if it's a reason to rename all the stuff.

Best, Alex




Re: fix for BUG #3720: wrong results at using ltree

2020-03-31 Thread Tom Lane
I wrote:
> I've marked this RFC, and will push tomorrow unless somebody wants
> to object to the loss of backwards compatibility.

And done.  I noticed in some final testing that it's possible to
make this code take a long time by forcing it to backtrack a lot:

regression=# SELECT (('1' || repeat('.1', 65534))::ltree) ~ '*.*.x';
 ?column? 
----------
 f
(1 row)

Time: 54015.421 ms (00:54.015)

so I threw in a CHECK_FOR_INTERRUPTS().  Maybe it'd be worth trying
to optimize such cases, but I'm not sure that it'd ever matter for
real-world cases with reasonable-size label strings.

The old implementation seems to handle that particular case well,
evidently because it more-or-less folds adjacent stars together.
However, before anyone starts complaining about regressions, they
should note that it's really easy to get the old code to fail
via stack overflow:

regression=# SELECT (('1' || repeat('.1', 65534))::ltree) ~ '*.!1.*';
ERROR:  stack depth limit exceeded

(That's as of five minutes ago, before that it dumped core.)
So I don't feel bad about the tradeoff.  At least now we have
simple, visibly correct code that could serve as a starting
point for optimization if anyone feels the need to.

regards, tom lane




Re: A rather hackish POC for alternative implementation of WITH TIES

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-26, Surafel Temesgen wrote:

> On Wed, Jan 22, 2020 at 3:35 PM Andrew Gierth 
> wrote:
> 
> > > "Alvaro" == Alvaro Herrera  writes:
> >
> > I was largely holding off on doing further work hoping for some
> > discussion of which way we should go. If you think my approach is worth
> > pursuing (I haven't seriously tested the performance, but I'd expect it
> > to be slower than Surafel's - the price you pay for flexibility) then I
> > can look at it further, but figuring out the planner stuff will take
> > some time.
>
> Other alternative can be pushing the existing implementation
> which will be open to change in case of better-finished
> implementation.

At this point, I think that's what we should do.

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-31 Thread Alvaro Herrera
On 2020-Mar-31, Kyotaro Horiguchi wrote:

> Thank you for looking this and trouble rebasing!
> 
> At Mon, 30 Mar 2020 20:03:27 -0300, Alvaro Herrera  
> wrote in 
> > I rebased this patch; it's failing to apply due to minor concurrent
> > changes in PostgresNode.pm.  I squashed the patches in a series that
> > made the most sense to me.
> > 
> > I have a question about static variable lastFoundOldestSeg in
> > FindOldestXLogFileSegNo.  It may be set the first time the function
> > runs; if it is, the function never again does anything, it just returns
> > that value.  In other words, the static value is never reset; it never
> > advances either.  Isn't that strange?  I think the coding is to assume
> > that XLogCtl->lastRemovedSegNo will always be set, so its code will
> > almost never run ... except when the very first wal file has not been
> > removed yet.  This seems weird and pointless.  Maybe we should think
> > about this differently -- example: if XLogGetLastRemovedSegno returns
> > zero, then the oldest file is the zeroth one.  In what cases this is
> > wrong?  Maybe we should fix those.
> 
> That's right, but without the static variable, every call to the
> pg_replication_slots view before the first checkpoint causes scanning
> pg_xlog. XLogCtl->lastRemovedSegNo advances only at a checkpoint, so
> it is actually right that the return value from
> FindOldestXLogFileSegNo doesn't change until the first checkpoint.
> 
> Also we could set XLogCtl->lastRemovedSegNo at startup, but the
> scanning on pg_xlog is useless in most cases.
> 
> I avoided updating XLogCtl->lastRemovedSegNo directly, but the
> third way would be if XLogGetLastRemovedSegno() returned 0, then set
> XLogCtl->lastRemovedSegNo by scanning the WAL directory. The attached
> takes this way.

I'm not sure if I explained my proposal clearly.  What if
XLogGetLastRemovedSegno returning zero means that every segment is
valid?  We don't need to scan pg_xlog at all.

> > Regarding the PostgresNode change in 0001, I think adding a special
> > parameter for primary_slot_name is limited.  I'd like to change the
> > definition so that anything that you give as a parameter that's not one
> > of the recognized keywords (has_streaming, etc) is tested to see if it's
> > a GUC; and if it is, then put it in postgresql.conf.  This would have to
> > apply both to PostgresNode::init() as well as
> > PostgresNode::init_from_backup(), obviously, since it would make no
> > sense for the APIs to diverge on this point.  So you'd be able to do
> >   $node->init_from_backup(allow_streaming => 1, work_mem => "4MB");
> > without having to add code to init_from_backup to handle work_mem
> > specifically.  This could be done by having a Perl hash with all the GUC
> > names, that we could read lazily from "postmaster --describe-config" the
> > first time we see an unrecognized keyword as an option to init() /
> > init_from_backup().
> 
> Done that way. We could exclude "known" parameters by explicitly
> deleting the key when reading it, but I chose to enumerate the known
> keywords.  Although it could be used widely, I actually changed only
> 018_repslot_limit.pl to use the feature.

Hmm.  I like this idea in general, but I'm not sure I want to introduce
it in this form right away.  For the time being (I realized while waking
up this morning) we can just use $node->append_conf() in the
018_replslot_limit.pl file, like every other place that needs a special
config.  There's no need to change the test infrastructure for this.

I'll go through this again.  Many thanks,

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] remove condition always true (/src/backend/access/heap/vacuumlazy.c)

2020-03-31 Thread Ranier Vilela
Hi,
Thanks for the commit.

Ranier Vilela


Re: WAL usage calculation patch

2020-03-31 Thread Kuntal Ghosh
On Tue, Mar 31, 2020 at 7:39 PM Julien Rouhaud  wrote:
>
> On Tue, Mar 31, 2020 at 12:21 PM Kuntal Ghosh
>  wrote:
> >
> > On Mon, Mar 30, 2020 at 6:14 PM Julien Rouhaud  wrote:
> > >
> > @@ -448,6 +449,7 @@ XLogInsert(RmgrId rmid, uint8 info)
> >   bool doPageWrites;
> >   XLogRecPtr fpw_lsn;
> >   XLogRecData *rdt;
> > + int num_fpw = 0;
> >
> >   /*
> >   * Get values needed to decide whether to do full-page writes. Since
> > @@ -457,9 +459,9 @@ XLogInsert(RmgrId rmid, uint8 info)
> >   GetFullPageWriteInfo(, );
> >
> >   rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
> > - _lsn);
> > + _lsn, _fpw);
> >
> > - EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
> > + EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpw);
> >   } while (EndPos == InvalidXLogRecPtr);
> >
> > I think there are some issues in the num_fpw calculation. For some
> > cases, we have to return from XLogInsert without inserting a record.
> > Basically, we've to recompute/reassemble the same record. In those
> > cases, num_fpw should be reset. Thoughts?
>
> Mmm, yes, but since the same record is being recomputed from the
> same RedoRecPtr, doesn't it mean that we need to reset the counter?
> Otherwise we would count the same FPW multiple times.

Yes. That was my point as well. I missed the part that you're already
resetting the same inside the do-while loop before calling
XLogRecordAssemble. Sorry for the noise.




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-03-31 Thread Masahiko Sawada
On Fri, 27 Mar 2020 at 13:54, Fujii Masao  wrote:
>
>
>
> On 2020/03/27 10:26, Tom Lane wrote:
> > Twice in the past month [1][2], buildfarm member hoverfly has managed
> > to reach the "unreachable" Assert(false) at the end of
> > SyncRepGetSyncStandbysPriority.
>
> When I search the past discussions, I found that Noah Misch reported
> the same issue.
> https://www.postgresql.org/message-id/20200206074552.gb3326...@rfd.leadboat.com
>
> > What seems likely to me, after quickly eyeballing the code, is that
> > hoverfly is hitting the blatantly-obvious race condition in that function.
> > Namely, that the second loop supposes that the state of the walsender
> > array hasn't changed since the first loop.
> >
> > The minimum fix for this, I suppose, would have the first loop capture
> > the sync_standby_priority value for each walsender along with what it's
> > already capturing.  But I wonder if the whole function shouldn't be
> > rewritten from scratch, because it seems like the algorithm is both
> > expensively brute-force and unintelligible, which is a sad combination.
> > It's likely that the number of walsenders would never be high enough
> > that efficiency could matter, but then couldn't we use an algorithm
> > that is less complicated and more obviously correct?
>
> +1 to rewrite the function with better algorithm.
>
> > (Because the
> > alternative conclusion, if you reject the theory that a race is happening,
> > is that the algorithm is just flat out buggy; something that's not too
> > easy to disprove either.)
> >
> > Another fairly dubious thing here is that whether or not *am_sync
> > gets set depends not only on whether MyWalSnd is claiming to be
> > synchronous but on how many lower-numbered walsenders are too.
> > Is that really the right thing?
> >
> > But worse than any of that is that the return value seems to be
> > a list of walsender array indexes, meaning that the callers cannot
> > use it without making even stronger assumptions about the array
> > contents not having changed since the start of this function.
> >
> > It sort of looks like the design is based on the assumption that
> > the array contents can't change while SyncRepLock is held ... but
> > if that's the plan then why bother with the per-walsender spinlocks?
> > In any case this assumption seems to be failing, suggesting either
> > that there's a caller that's not holding SyncRepLock when it calls
> > this function, or that somebody is failing to take that lock while
> > modifying the array.
>
> As far as I read the code, that assumption seems still valid. But the problem
> is that each walsender updates MyWalSnd->sync_standby_priority at its own
> convenient timing, when SIGHUP is signaled. That is, at a certain moment,
> some walsenders (also their WalSnd entries in shmem) work based on
> the latest configuration but the others (also their WalSnd entries) work based
> on the old one.
>
> lowest_priority = SyncRepConfig->nmembers;
> next_highest_priority = lowest_priority + 1;
>
> SyncRepGetSyncStandbysPriority() calculates the lowest priority among
> all running walsenders as the above, by using the configuration info that
> this walsender is based on. But this calculated lowest priority would be
> invalid if another walsender is based on a different (e.g., old) configuration.
> This can cause the (other) walsender to have lower priority than
> the calculated lowest priority and the second loop in
> SyncRepGetSyncStandbysPriority() to unexpectedly end.

I have the same understanding. Since sync_standby_priority is
protected by SyncRepLock, these values for each walsender do not
change across the two loops in SyncRepGetSyncStandbysPriority().
However, as Fujii-san already mentioned, the true lowest priority can
be lower than lowest_priority (i.e., nmembers) when only some of the
walsenders have reloaded the configuration, which in turn could cause
entries to be left in the pending list at the end of the function.

> Therefore, the band-aid fix seems to be to set the lowest priority to
> very large number at the beginning of SyncRepGetSyncStandbysPriority().

I think we can use max_wal_senders. And we can change the second loop
so that we exit from the function if the pending list is empty.

Regards,

-- 
Masahiko Sawada           http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add A Glossary

2020-03-31 Thread Jürgen Purtz

On 30.03.20 19:10, Corey Huinker wrote:



On Sun, Mar 29, 2020 at 5:29 AM Jürgen Purtz wrote:


On 27.03.20 21:12, Justin Pryzby wrote:
> On Fri, Mar 20, 2020 at 11:32:25PM +0100, Jürgen Purtz wrote:
>>>> + Archiver
>>> Can you change that to archiver process ?
>> I prefer the short term without the addition of 'process' - concerning
>> 'Archiver' as well as the other cases. But I'm not a native English
>> speaker.
> I didn't like it due to lack of context.
>
> What about "wal archiver" ?
>
> It occurred to me when I read this.
>

https://www.postgresql.org/message-id/20200327.163007.128069746774242774.horikyota.ntt%40gmail.com
>
"WAL archiver" is ok for me. In the current documentation we have 2
places with "WAL archiver" and 4 with "archiver"-only
(high-availability.sgml, monitoring.sgml).

"backend process" is an exception to the other terms because the
standalone term "backend" is sensibly used in diverse situations.

Kind regards, Jürgen


I've taken Alvaro's fixes and done my best to incorporate the 
feedback into a new patch, which Roger (tech writer) reviewed yesterday.


The changes are too numerous to list, but the highlights are:

New definitions:
* All four ACID terms
* Vacuum (split off from Autovacuum)
* Tablespace
* WAL Archiver (replaces Archiver)

Changes to existing terms:
* Implemented most wording changes recommended by Justin
* all remaining links were either made into xrefs or edited out of
existence

* de-tagged most second uses of a term within a definition


Did not do
* Addressed the " Process" suffix suggested by Justin. There isn't
consensus on these changes, and I'm neutral on the matter
* change the Cast definition. I think it's important to express
that a cast has a FROM datatype as well as a TO
* anything host/server related as I couldn't see a consensus reached

Other thoughts:
* Trivial definitions that are just see-other-definition are ok
with me, as the goal of this glossary is to aid in discovery of
term meanings, so knowing that two terms are interchangeable is
itself helpful


It is my hope that this revision represents the final _structural_ 
change to the glossary. New definitions and edits to existing 
definitions will, of course, go on forever.


Please find some minor suggestions in the attachment. They are based on 
Corey's last patch 0001-glossary-v4.patch.


Kind regards, Jürgen


diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index eab14f3c9b..623922a4c3 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -36,10 +36,10 @@

 

-Analytic
+Analytic Function
 
  
-  A Function whose computed value can reference
+  A type of Functions whose result may be based on
   values found in nearby Rows of the same
   Result Set.
  
@@ -59,12 +59,12 @@
   into smaller components.
  
  
+  
   In reference to an operation: an event that cannot be completed in
   part; it must either entirely succeed or entirely fail. For
   example, a series of SQL statements can be
   combined into a Transaction, and that
-  transaction is said to be atomic.
-  Atomic.
+  transaction is said to be Atomic.
  
 

@@ -73,7 +73,7 @@
 Atomicity
 
  
-  One of the ACID properties. This is the state of 
+  One of the ACID properties. This is the state of
   being Atomic in the operational/transactional sense.
  
 
@@ -152,7 +152,7 @@
   A process that continuously writes dirty pages from
   Shared Memory to the file system.
   It wakes up periodically, but
-  works only for a short period in order to distribute expensive
+  works only for a short period in order to distribute its expensive
   I/O activity over time, instead of generating fewer
   larger I/O peaks which could block other processes.
  
@@ -220,7 +220,7 @@
   Records to the file system and creates a special
   checkpoint record. This process is initiated when predefined
   conditions are met, such as a specified amount of time has passed, or
-  a certain volume of records have been collected.
+  a certain volume of records has been collected.
  
 

@@ -303,7 +303,7 @@
 
  
   An established line of communication between a client process
-  and a server process. If the two involved processes reside on the
+  and a Backend Process. If the two involved processes reside on the
   same Server, then the connection can either use
   TCP/IP or Unix-domain sockets. Otherwise,
   only TCP/IP can be used.
@@ -470,7 +470,7 @@
   A type of Constraint defined on one or more
   Columns in a Table which
   requires the value(s) in those Columns to
-  

Re: WAL usage calculation patch

2020-03-31 Thread Julien Rouhaud
On Tue, Mar 31, 2020 at 12:21 PM Kuntal Ghosh
 wrote:
>
> On Mon, Mar 30, 2020 at 6:14 PM Julien Rouhaud  wrote:
> >
> @@ -448,6 +449,7 @@ XLogInsert(RmgrId rmid, uint8 info)
>   bool doPageWrites;
>   XLogRecPtr fpw_lsn;
>   XLogRecData *rdt;
> + int num_fpw = 0;
>
>   /*
>   * Get values needed to decide whether to do full-page writes. Since
> @@ -457,9 +459,9 @@ XLogInsert(RmgrId rmid, uint8 info)
>   GetFullPageWriteInfo(, );
>
>   rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
> - _lsn);
> + _lsn, _fpw);
>
> - EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
> + EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpw);
>   } while (EndPos == InvalidXLogRecPtr);
>
> I think there are some issues in the num_fpw calculation. For some
> cases, we have to return from XLogInsert without inserting a record.
> Basically, we've to recompute/reassemble the same record. In those
> cases, num_fpw should be reset. Thoughts?

Mmm, yes but since that's the same record is being recomputed from the
same RedoRecPtr, doesn't it mean that we need to reset the counter?
Otherwise we would count the same FPW multiple times.




Re: A bug when use get_bit() function for a long bytea string

2020-03-31 Thread Ashutosh Bapat
Thanks for the changes,
+ int64 res,resultlen;

It's better to have them on separate lines.

-unsigned
+int64
 hex_decode(const char *src, unsigned len, char *dst)

Do we want to explicitly cast the return value to int64? Will the build on
some platforms crib if it's not done? I don't know of such a platform, but my
knowledge in this area is not great.

+ byteNo = (int)(n / 8);
+ bitNo = (int)(n % 8);
some comment explaining why this downcasting is safe here?

-  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bytea int4',
+  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bytea int8',
   prosrc => 'byteaGetBit' },
 { oid => '724', descr => 'set bit',
-  proname => 'set_bit', prorettype => 'bytea', proargtypes => 'bytea int4
int4',
+  proname => 'set_bit', prorettype => 'bytea', proargtypes => 'bytea int8
int4',
   prosrc => 'byteaSetBit' },

Shouldn't we have similar changes for following entries as well?
{ oid => '3032', descr => 'get bit',
  proname => 'get_bit', prorettype => 'int4', proargtypes => 'bit int4',
  prosrc => 'bitgetbit' },
{ oid => '3033', descr => 'set bit',
  proname => 'set_bit', prorettype => 'bit', proargtypes => 'bit int4 int4',
  prosrc => 'bitsetbit' },

The tests you have added are for bytea variant which ultimately calles
byteaGet/SetBit(). But I think we also need tests for bit variants which
will ultimately call bitgetbit and bitsetbit functions.

Once you address these comments, I think the patch is good for a committer.
So please mark the commitfest entry as such when you post the next version
of patch.

On Sat, 28 Mar 2020 at 14:40, movead...@highgo.ca 
wrote:

> I want to resend the mail, because the last one is in the wrong format and
>> hardly readable.
>>
>> In addition, I think 'Size' or 'size_t' rely on the platform; they may
>> not work on 32-bit
>> systems. So I chose 'int64' after Ashutosh's review.
>>
>> >> I think the second argument indicates the bit position, which would be
>> >> max bytea length * 8. If max bytea length covers the whole int32, the
>> >> second argument needs to be wider, i.e. int64.
>> > Yes, it makes sense, and I followed that.
>>
>>
> >>I think we need a similar change in byteaGetBit() and byteaSetBit() as
> well.
> Sorry, I think it's my mistake; it is the two functions above that should
> be changed.
>
>
>> >>Some more comments on the patch
>> >> struct pg_encoding
>> >> {
>> >>- unsigned (*encode_len) (const char *data, unsigned dlen);
>> >>+ int64 (*encode_len) (const char *data, unsigned dlen);
>> >>  unsigned (*decode_len) (const char *data, unsigned dlen);
>> >>  unsigned (*encode) (const char *data, unsigned dlen, char *res);
>> >>  unsigned (*decode) (const char *data, unsigned dlen, char *res);
>> >> Why not use return type of int64 for rest of the functions here as
>> well?
>> >>  res = enc->encode(VARDATA_ANY(data), datalen, VARDATA(result));
>> >>  /* Make this FATAL 'cause we've trodden on memory ... */
>> >>- if (res > resultlen)
>> >>+ if ((int64)res > resultlen)
>> >>
>> >>if we change return type of all those functions to int64, we won't
>> need this cast.
>> >I change the 'encode' function, it needs an int64 return type, but keep
>> other
>>
>> >functions in 'pg_encoding', because I think it of no necessary reason.
>>
>>
> >>Ok, let's leave it for a committer to decide.
> Well, I changed all of them this time, because Tom Lane supported that in
> the next mail.
>
>
> >Some more review comments.
> >+   int64   res,resultlen;
> >We need those on separate lines, possibly.
> Done
>
> >+   byteNo = (int32)(n / BITS_PER_BYTE);
> >Does it hurt to have byteNo as int64 so as to avoid a cast? Otherwise, please
> please
> >add a comment explaining the reason for the cast. The comment applies at
> other
> >places where this change appears.
> >-   int len;
> >+   int64   len;
> >Why do we need this change?
> >int i;
> It is my mistake, as described above; it's not 'bitgetbit()/
> bitsetbit()' that should be changed.
>
>
>>
>> >>It might help to add a test where we could pass the second argument
>> something
>> >>greater than 1G. But it may be difficult to write such a test case.
>> >Add two test cases.
>>
>>
> >+
> >+select get_bit(
> >+set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 1024
> * 1024 * 1024 + 1, 0)
> >+   ,1024 * 1024 * 1024 + 1);
>
> >This bit position is still within int4.
> >postgres=# select pg_column_size(1024 * 1024 * 1024 + 1);
> > pg_column_size
> >
> >  4
> >(1 row)
>
> >You want something like
> >postgres=# select pg_column_size(512::bigint * 1024 * 1024 * 8);
> > pg_column_size
> >
> >  8
> >(1 row)
> I intended to test a size larger than 1GB, and I think you gave a better
> idea, which I followed.
>
>
>
> --
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
> EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
>
>

-- 
Best Wishes,
Ashutosh


Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-03-31 Thread Dilip Kumar
On Tue, Mar 31, 2020 at 12:20 PM Dilip Kumar  wrote:
>
> On Tue, Mar 31, 2020 at 10:44 AM Masahiko Sawada
>  wrote:
> >
> > On Tue, 31 Mar 2020 at 12:58, Amit Kapila  wrote:
> > >
> > > On Mon, Mar 30, 2020 at 12:31 PM Masahiko Sawada
> > >  wrote:
> > > >
> > > > The patch for vacuum conflicts with recent changes in vacuum. So I've
> > > > attached rebased one.
> > > >
> > >
> > > + /*
> > > + * Next, accumulate buffer usage.  (This must wait for the workers to
> > > + * finish, or we might get incomplete data.)
> > > + */
> > > + for (i = 0; i < nworkers; i++)
> > > + InstrAccumParallelQuery(>buffer_usage[i]);
> > > +
> > >
> > > This should be done for launched workers aka
> > > lps->pcxt->nworkers_launched.  I think a similar problem exists in
> > > create index related patch.
> >
> > You're right. Fixed in the new patches.
> >
> > On Mon, 30 Mar 2020 at 17:00, Julien Rouhaud  wrote:
> > >
> > > Just minor nitpicking:
> > >
> > > +   int i;
> > >
> > > Assert(!IsParallelWorker());
> > > Assert(ParallelVacuumIsActive(lps));
> > > @@ -2166,6 +2172,13 @@ lazy_parallel_vacuum_indexes(Relation *Irel, 
> > > IndexBulkDeleteResult **stats,
> > > /* Wait for all vacuum workers to finish */
> > > WaitForParallelWorkersToFinish(lps->pcxt);
> > >
> > > +   /*
> > > +* Next, accumulate buffer usage.  (This must wait for the workers to
> > > +* finish, or we might get incomplete data.)
> > > +*/
> > > +   for (i = 0; i < nworkers; i++)
> > > +   InstrAccumParallelQuery(>buffer_usage[i]);
> > >
> > > We now allow declaring a variable in those loops, so it may be better to 
> > > avoid
> > > declaring i outside the for scope?
> >
> > We can do that, but I was not sure if it's good since other code
> > around there doesn't use that. So I'd like to leave it for committers.
> > It's a trivial change.
>
> I have reviewed the patch and the patch looks fine to me.
>
> One minor comment
> + /* Points to buffer usage are in DSM */
> + BufferUsage *buffer_usage;
> +
> /buffer usage are in DSM / buffer usage area in DSM
>

While testing I have found one issue.  Basically, during a parallel
vacuum, it was showing a higher number of
shared_blks_hit + shared_blks_read.  After some investigation, I found
that during the cleanup phase nworkers is -1, and because of this we
didn't try to launch workers, but "lps->pcxt->nworkers_launched" had the
old launched-worker count and shared memory also had old buffer-read
data which was never updated, as we did not try to launch the workers.

diff --git a/src/backend/access/heap/vacuumlazy.c
b/src/backend/access/heap/vacuumlazy.c
index b97b678..5dfaf4d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2150,7 +2150,8 @@ lazy_parallel_vacuum_indexes(Relation *Irel,
IndexBulkDeleteResult **stats,
 * Next, accumulate buffer usage.  (This must wait for the workers to
 * finish, or we might get incomplete data.)
 */
-   for (i = 0; i < lps->pcxt->nworkers_launched; i++)
+   nworkers = Min(nworkers, lps->pcxt->nworkers_launched);
+   for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(>buffer_usage[i]);

It worked after the above fix.




Re: WAL usage calculation patch

2020-03-31 Thread Julien Rouhaud
On Tue, Mar 31, 2020 at 12:17 PM Amit Kapila  wrote:
>
> It is possible that both of us are having different meanings for below
> two variables:
> +typedef struct WalUsage
> +{
> + long wal_records; /* # of WAL records produced */
> + long wal_fpw_records; /* # of full page write WAL records
> + * produced */
>
>
> Let me clarify my understanding.  Say if the record is just an FPI
> (ex. XLOG_FPI) and doesn't contain any data then do we want to add one
> to each of wal_fpw_records and wal_records?  My understanding was in
> such a case we will just increment wal_fpw_records.

Yes, as Dilip just pointed out the misunderstanding is due to this
poor name.  Indeed, in such case what I want is both counters to be
incremented.  What I want is wal_records to reflect the total number
of records generated regardless of any content, and wal_num_fpw the
number of full page images, as it seems to make the most sense, and
the easiest way to estimate the ratio of data due to FPW.

> > > 3.  We need to enhance the patch to cover WAL usage for parallel
> > > vacuum and parallel create index based on Sawada-San's latest patch[1]
> > > which fixed the case for buffer usage.
> >
> > I'm sorry but I'm not following.  Do you mean adding regression tests
> > for that case?
> >
>
> No.  I mean to say we should implement WAL usage calculation for those
> two parallel commands.  AFAICS, your patch doesn't cover those two
> commands.

Oh I see.  I just assumed that Sawada-san's patch would be committed
first and I'd then rebase the patchset on top of the newly added
infrastructure to also handle WAL counters, to avoid any conflict on
that bugfix while this new feature is being discussed.  I'll rebase
the patchset against those patches then.




Re: Random set of typos spotted

2020-03-31 Thread Magnus Hagander
On Tue, Mar 31, 2020 at 3:37 PM Daniel Gustafsson  wrote:
>
> A collection of random typos in docs and comments spotted while working around
> the tree.


Thanks, pushed!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: [Proposal] Global temporary tables

2020-03-31 Thread Prabhat Sahu
Hi Wenjing,
Thanks for the new patch.
I see that with the patch (gtt_v23.patch) we are supporting the new concept
of a "global temporary sequence" (i.e., a session-specific sequence); is this
intentional?

postgres=# create global temporary sequence gt_seq;
CREATE SEQUENCE
postgres=# create sequence seq;
CREATE SEQUENCE
postgres=# \d+
                              List of relations
 Schema |  Name  |   Type   | Owner | Persistence |    Size    | Description
--------+--------+----------+-------+-------------+------------+-------------
 public | gt_seq | sequence | edb   | session     | 8192 bytes |
 public | seq    | sequence | edb   | permanent   | 8192 bytes |
(2 rows)

postgres=# select nextval('gt_seq'), nextval('seq');
 nextval | nextval
---------+---------
       1 |       1
(1 row)

postgres=# select nextval('gt_seq'), nextval('seq');
 nextval | nextval
---------+---------
       2 |       2
(1 row)

-- Exit and re-connect to psql prompt:
postgres=# \q
[edb@localhost bin]$ ./psql postgres
psql (13devel)
Type "help" for help.

postgres=# select nextval('gt_seq'), nextval('seq');
 nextval | nextval
---------+---------
       1 |       3
(1 row)

postgres=# select nextval('gt_seq'), nextval('seq');
 nextval | nextval
---------+---------
       2 |       4
(1 row)

On Tue, Mar 31, 2020 at 9:46 AM 曾文旌  wrote:

>
>
> On March 27, 2020, at 5:21 PM, tushar wrote:
>
> On 3/27/20 10:55 AM, 曾文旌 wrote:
>
> Hi Wenjing,
> This patch(gtt_v21_pg13.patch) is not applicable on PG HEAD, I hope you
> have prepared the patch on top of some previous commit.
> Could you please rebase the patch which we can apply on HEAD ?
>
> Yes, It looks like the built-in functions are in conflict with new code.
>
>
> This error message looks wrong  to me-
>
> postgres=# reindex table concurrently t ;
> ERROR:  cannot create indexes on global temporary tables using concurrent
> mode
> postgres=#
>
> Better message would be-
>
> ERROR:  cannot reindex global temporary tables concurrently
>
> I found that the local temp table automatically disables concurrent mode,
> so I made some improvements: reindexing a GTT now behaves the same as
> reindexing a local temp table.
>
>
> Wenjing
>
>
>
>
> --
> regards, tushar
> EnterpriseDB  https://www.enterprisedb.com/
> The Enterprise PostgreSQL Company
>
>
>

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: WAL usage calculation patch

2020-03-31 Thread Julien Rouhaud
On Tue, Mar 31, 2020 at 11:21 AM Dilip Kumar  wrote:
>
> I have started reviewing this patch and I have some comments/questions.

Thanks a lot!

>
> 1.
> @@ -22,6 +22,10 @@ static BufferUsage save_pgBufferUsage;
>
>  static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
>
> +WalUsage pgWalUsage;
> +static WalUsage save_pgWalUsage;
> +
> +static void WalUsageAdd(WalUsage *dst, WalUsage *add);
>
> Better we move all variable declaration first along with other
> variables and then function declaration along with other function
> declaration.  That is the convention we follow.

Agreed, fixed.

> 2.
>   {
>   bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> + bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
>
> I think you need to run pgindent,  we should give only one space
> between the variable name and '='.
> so we need to change like below
>
> bool        need_wal = (instrument_options & INSTRUMENT_WAL) != 0;

Done.

> 3.
> +typedef struct WalUsage
> +{
> + long wal_records; /* # of WAL records produced */
> + long wal_fpw_records; /* # of full page write WAL records
> + * produced */
>
> IMHO, the name wal_fpw_records is bit confusing,  First I thought it
> is counting the number of wal records which actually has FPW, then
> after seeing code, I realized that it is actually counting total FPW.
> Shouldn't we rename it to just wal_fpw? or wal_num_fpw or
> wal_fpw_count?

Yes I agree, the name was too confusing.  I went with wal_num_fpw.  I
also used the same for pg_stat_statements.  Other fields are usually
named with a trailing "s" but wal_fpws just seems too weird.  I can
change it if consistency is preferred here.

> 4.  Currently, we are combining all full-page write
> force/normal/consistency checks in one category.  I am not sure
> whether it will be good information to know how many are force_fpw and
> how many are normal_fpw?

I agree with Amit's POV.  For now a single counter seems like enough
to diagnose many behaviors.

I'll keep answering the following mails before sending an updated patchset.



