Re: Unqualified pg_catalog casts in pg_dump

2020-03-23 Thread Michael Paquier
On Mon, Mar 23, 2020 at 05:57:37PM +0100, Daniel Gustafsson wrote:
> Correct, there is no functional importance with this.  IMO the value is in
> readability and grep-ability.

This may cause extra conflicts when back-patching.
--
Michael




Re: Internal key management system

2020-03-23 Thread Masahiko Sawada
On Tue, 24 Mar 2020 at 07:15, Bruce Momjian  wrote:
>
> On Mon, Mar 23, 2020 at 03:55:34PM +0900, Masahiko Sawada wrote:
> > On Sat, 21 Mar 2020 at 23:50, Bruce Momjian  wrote:
> > > Actually, I think we need three files:
> > >
> > > *  TDE WAL key file
> > > *  TDE block key file
> > > *  SQL-level file
> > >
> > > Primaries and standbys have to use the same TDE WAL key file, but can
> > > use different TDE block key files to allow for key rotation, so having
> > > separate files makes sense --- maybe they need to be in their own
> > > directory.
> >
> > I've considered having separate key files once, but it would make
> > things complex to update multiple files atomically. The Postgres
> > server will never start if it crashes in the middle of a cluster
> > passphrase rotation. Can we consider having the TDE-related keys
> > after we introduce the basic key management system? Probably having
> > the keys in a separate file rather than in the pg_control file would
> > be better, but we don't need these keys so far.
>
> Well, we need to be able to upgrade this so we have to set it up now in
> a way that allows that.
>
> I am not sure we have ever had a case where we needed to update multiple
> files atomically at the same time, without the help of WAL.
>
> Perhaps we should put the three keys in separate files in a directory
> called 'cryptokeys', and when we change the pass phrase, we create a new
> directory called 'cryptokeys.new'.  Then once we have created the files
> in there with the new pass phrase, we remove cryptokeys and rename
> directory cryptokeys.new to cryptokeys.  On boot, if cryptokeys exists
> and cryptokeys.new does too, remove cryptokeys.new because we crashed
> during key rotation.  If cryptokeys.new exists and cryptokeys doesn't,
> we rename cryptokeys.new to cryptokeys because we crashed before the
> rename.

That seems to work fine.

So we will have pg_cryptokeys within PGDATA, and each key will be stored
in a separate file named after its key id, such as "sql", "tde-wal" and
"tde-block". I'll update the patch and post it.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: improve transparency of bitmap-only heap scans

2020-03-23 Thread Amit Kapila
On Fri, Mar 20, 2020 at 7:09 AM James Coleman  wrote:
>
> Awesome, thanks for confirming with an actual plan.
>
> > I don't think it matters in nontext mode, but at least in text mode, I think
> > maybe the Unfetched blocks should be output after the exact and lossy blocks,
> > in case someone is parsing it, and because bitmap-only is a relatively new
> > feature.  Its output is probably less common than exact/lossy.
>
> I tweaked that (and a comment that didn't reference the change); see attached.
>

Few comments:
1.
-
- if (tbmres->ntuples >= 0)
+ else if (tbmres->ntuples >= 0)
  node->exact_pages++;

How is this change related to this patch?

2.
+ * unfetched_pages	total number of pages not retrieved due to vm
  * prefetch_iterator	iterator for prefetching ahead of current page
  * prefetch_pages	# pages prefetch iterator is ahead of current
  * prefetch_target	current target prefetch distance
@@ -1591,6 +1592,7 @@ typedef struct BitmapHeapScanState
  Buffer pvmbuffer;
  long exact_pages;
  long lossy_pages;
+ long unfetched_pages;

Can we name it as skipped_pages?

3. Can we add a test or two for this functionality?



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: error context for vacuum to include block number

2020-03-23 Thread Amit Kapila
On Tue, Mar 24, 2020 at 9:46 AM Justin Pryzby  wrote:
>
> On Mon, Mar 23, 2020 at 02:25:14PM +0530, Amit Kapila wrote:
> > > Yea, and it would be misleading if we reported "while scanning block..of
> > > relation" if we actually failed while writing its FSM.
> > >
> > > My previous patches did this:
> > >
> > > +   case VACUUM_ERRCB_PHASE_VACUUM_FSM:
> > > +   errcontext("while vacuuming free space map of relation \"%s.%s\"",
> > > +              cbarg->relnamespace, cbarg->relname);
> > > +   break;
> > >
> >
> > In what kind of errors will this help?
>
> If there's an I/O error on an _fsm file, for one.
>

If there is a read or write failure, then we give an error like the one
below, which already has the required information.
ereport(ERROR,
        (errcode_for_file_access(),
         errmsg("could not read block %u in file \"%s\": %m",
                blocknum, FilePathName(v->mdfd_vfd))));

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2020-03-23 Thread Michael Paquier
On Mon, Mar 23, 2020 at 06:41:50PM -0700, Andres Freund wrote:
> Which valid scenario can lead to this? Neither the comment, nor commit
> message explain it.

The commit message mentions that concurrent autovacuum jobs can lead
to the creation of non-aggressive anti-wraparound jobs, which make no
sense because an aggressive anti-wraparound job was already done in
parallel by a different worker, and that this was possible because of
inconsistent relcache lookups across concurrent jobs.  This was
mentioned upthread.

> Unless you're thinking of scenarios where autovacuum
> and manual vacuum are mixed, I don't really see valid reasons? Normally
> autovacuum's locking + the table_recheck_autovac() check should prevent
> problematic scenarios.
>
> I do see a few scenarios that can trigger this - but they all more or
> less are bugs.

Hmm.  OK.

> It doesn't strike me as a good idea to work around such bugs by silently
> neutering heap_vacuum_rel(). The likelihood of that temporarily covering
> up more severe problems seems significant - they're likely to then later
> bite you with a cluster shutdown.

That said, I have been thinking about this one for a couple of days
now and it seems to me that this is a factor contributing to what we
are seeing in [1], and I agree that this is just an incorrect approach
that makes it easier to trigger the real underlying issues, while
table_recheck_autovac() ought to be the only code path doing the skip
job.  Note that I have failed to reproduce the behavior of the other
thread though, always finishing with a non-aggressive anti-wraparound
vacuum skipped because an aggressive anti-wraparound job happened just
before in parallel, and autovacuum was always able to continue
triggering new jobs, keeping the relfrozenxid age at bay.

So I would like to first revert that part, to have a cleaner state to
work on the underlying issues.  A pure revert means also adding back
the log message for non-aggressive and anti-wraparound jobs that
should never exist, which should be replaced by an assertion once all
the holes are fixed.  What do you think?

[1]: 
https://www.postgresql.org/message-id/cae39h23rtx1jkyjwc5tccv34hwwraizacuxomdqdpm+zt5-...@mail.gmail.com
--
Michael




Re: range_agg

2020-03-23 Thread Paul Jungwirth

Thanks Alvaro!

On Mon, Mar 23, 2020 at 4:33 PM Alvaro Herrera wrote:

>
> Thinking about the on-disk representation, can we do better than putting
> the contained ranges in long-varlena format, including padding; also we
> include the type OID with each element.  Sounds wasteful.  A more
> compact representation might be to allow short varlenas and doing away
> with the alignment padding, and put the type OID just once.  This is
> important because we cannot change it later.

Can you give me some guidance on this? I don't know how to make the 
on-disk format different from the in-memory format. (And for the 
in-memory format, I think it's important to have actual RangeTypes 
inside the multirange.) Is there something in the documentation, or a 
README in the repo, or even another type I can follow?


> I'm also wondering if multirange_in() is the right strategy.  Would it
> be sensible to give each range to range_parse or range_parse_bound, so
> that it determines where each range starts and ends?  Then that function
> doesn't have to worry about each quote and escape, duplicating range
> parsing code.  (This will probably require changing signature of the
> rangetypes.c function, and exporting it; for example have
> range_parse_bound allow bound_str to be NULL and in that case don't mess
> with the StringInfo and just return the end position of the parsed
> bound.)

Yeah, I really wanted to do it that way originally too. As you say it 
would require passing back more information from the range-parsing code. 
I can take a stab at making the necessary changes. I'm a bit more 
confident now than I was then in changing the range code we have already.


Regards,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Negative cost is seen for plan node

2020-03-23 Thread Kyotaro Horiguchi
At Mon, 23 Mar 2020 21:13:48 +0800, Richard Guo wrote:
> Hi all,
> 
> With the following statements on latest master (c81bd3b9), I find
> negative cost for plan nodes.
> 
> create table a (i int, j int);
> insert into a select i%10, i from generate_series(1,100)i;
> analyze a;
> 
> # explain select i from a group by i;
>QUERY PLAN
> -
>  HashAggregate  (cost=1300.00..-1585.82 rows=102043 width=4)

Good catch!

>Group Key: i
>Planned Partitions: 4
>->  Seq Scan on a  (cost=0.00..14425.00 rows=100 width=4)
> (4 rows)
> 
> In function cost_agg, when we add the disk costs of hash aggregation
> that spills to disk, nbatches is calculated as 1.18 in this case. It is
> greater than 1, so there will be spill. And the depth is calculated as
> -1 in this case, with num_partitions being 4. I think this is where
> things go wrong.

The depth is the expected number of iterations of reading the relation.

>   depth = ceil( log(nbatches - 1) / log(num_partitions) );

I'm not sure what the expression is based on, but apparently it is
wrong for nbatches <= 2.0. It looks like a thinko of something like this.

depth = ceil( log(nbatches) / log(num_partitions + 1) );
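
For instance, plugging in the numbers above: with nbatches = 1.18 and
num_partitions = 4, the original expression gives
ceil(log(0.18) / log(4)) = ceil(-1.71 / 1.39) = ceil(-1.23) = -1,
matching the negative depth, whereas the corrected one gives
ceil(log(1.18) / log(5)) = ceil(0.17 / 1.61) = 1.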

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: replay pause vs. standby promotion

2020-03-23 Thread Fujii Masao




On 2020/03/24 0:57, Fujii Masao wrote:



On 2020/03/24 0:17, Sergei Kornilov wrote:

Hello


You meant that the promotion request should cause the recovery
to finish immediately even if there are still outstanding WAL records,
and cause the standby to become the master?


Oh, I get your point. But yes, I expect that in case of a promotion request
during a pause, the user (me too) will want to have exactly the current state,
not the latest available in the WAL.


Basically I'd like the promotion to make the standby replay all the WAL
even if it's requested during pause state. OTOH I understand there
are use cases where immediate promotion is useful, as you explained.
So, +1 to add something like "pg_ctl promote -m immediate".

But I'm afraid that now it's too late to add such a feature into v13.
Probably it's an item for v14.


I pushed the latest version of the patch. If you have further opinion
about immediate promotion, let's keep discussing that!

Also we need to go back to the original patch posted at [1].

[1]
https://www.postgresql.org/message-id/flat/19168211580382...@myt5-b646bde4b8f3.qloud-c.yandex.net

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-23 Thread Alvaro Herrera
On 2020-Mar-23, James Coleman wrote:

> 4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
> is suspect. I've mentioned previously I don't have a great mental
> model of how rescan works and its invariants (IIRC someone said it was
> about moving around a result set in a cursor). Regardless I'm pretty
> sure this code just doesn't work correctly.

I don't think that's the whole of it.  My own vague understanding of
ReScan is that it's there to support running a node again, possibly with
different parameters.  For example if you have a join of an indexscan
on the outer side and an incremental sort on the inner side, and the
values from the index are used as parameters to the incremental sort,
then the incremental sort is going to receive ReScan calls for each of
the values that the index returns.  Sometimes the index could give you
the same values as before (because there's a dupe in the index), so you
can just return the same values from the incremental sort; but other
times it's going to return different values so you need to reset the
incremental sort to "start from scratch" using the new values as
parameters.
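
(For example, a hypothetical plan shape like

   Nested Loop
     ->  Index Scan on a
     ->  Incremental Sort
           Sort Key: b.x, b.y
           Presorted Key: b.x
           ->  Index Scan on b
                 Index Cond: (b.x = a.x)

would drive one ReScan of the incremental sort for every outer row.)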

Now, if you have a cursor reading from the incremental sort and fetch
all tuples, then rewind completely and fetch all again, then that's
going to be a rescan as well.

I agree with you that the code doesn't seem to implement that.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-23 Thread Michael Paquier
On Mon, Mar 23, 2020 at 07:17:13PM +0300, Alexander Korotkov wrote:
> On Mon, Mar 23, 2020 at 6:16 PM Alvaro Herrera wrote:
>> Hi, I didn't realize that this was waiting on me.  It looks good to me
>> -- I'd just remove the comment on the function prototype in archiver.h,
>> which is not customary (we only put these comments in the .c file).
>> Let's get it pushed.

Thanks.  I have removed the comment in archive.h.

>> Thanks for everybody's work on this.
> 
> Great, thank you!

Thanks Alvaro and Alexander.  0001 has been applied as of e09ad07.
Now for 0002, let's see about it later.  Attached is a rebased
version, with no actual changes.
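
For reference, with 0002 applied the invocation would look roughly like
this (the paths and connection string are made up), assuming
restore_command is set in the target's postgresql.conf:

    pg_rewind --target-pgdata=/path/to/target \
        --source-server='host=primary dbname=postgres' \
        --restore-target-wal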
--
Michael
From ecb2b3d62cce7ef56aa4fdb5343c54d9ee94100b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 13 Mar 2020 09:17:21 +0900
Subject: [PATCH v4] Add -c/--restore-target-wal to pg_rewind

pg_rewind needs to copy from the source server to the target a set of
relation blocks changed from the previous checkpoint where WAL forked
between both up to the end of WAL on the target.  This may require WAL
segments that are not present anymore on the target's pg_wal, causing
pg_rewind to fail.  It is possible to fix that by manually copying the
needed WAL segments, but this may lead to extra and useless work.

Instead, this commit introduces a new option allowing pg_rewind to use a
restore_command while running, grabbing it from the target
configuration.  This makes the rewind operation more reliable,
and only the necessary segments are fetched from the archives in this
case.

In order to do that, a new routine is added to src/common/ to allow
frontend tools to retrieve a WAL segment using an already-built restore
command.  This version is much simpler than the backend equivalent.

Author: Alexey Kondratov
Reviewed-by: Andrey Borodin, Andres Freund, Alvaro Herrera, Alexander
Korotkov, Michael Paquier
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee3616895...@postgrespro.ru
---
 src/include/common/fe_archive.h   |  21 +
 src/include/port.h|   3 +-
 src/common/Makefile   |   1 +
 src/common/exec.c |   3 +-
 src/common/fe_archive.c   | 129 ++
 src/bin/pg_rewind/parsexlog.c |  33 +++-
 src/bin/pg_rewind/pg_rewind.c |  87 ++--
 src/bin/pg_rewind/pg_rewind.h |   6 +-
 src/bin/pg_rewind/t/001_basic.pl  |   3 +-
 src/bin/pg_rewind/t/RewindTest.pm |  64 ++-
 doc/src/sgml/ref/pg_rewind.sgml   |  28 +--
 src/tools/msvc/Mkvcbuild.pm   |   4 +-
 12 files changed, 355 insertions(+), 27 deletions(-)
 create mode 100644 src/include/common/fe_archive.h
 create mode 100644 src/common/fe_archive.c

diff --git a/src/include/common/fe_archive.h b/src/include/common/fe_archive.h
new file mode 100644
index 00..13d19d11e7
--- /dev/null
+++ b/src/include/common/fe_archive.h
@@ -0,0 +1,21 @@
+/*-
+ *
+ * fe_archive.h
+ *	  Routines to access WAL archives from frontend
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/common/fe_archive.h
+ *
+ *-
+ */
+#ifndef FE_ARCHIVE_H
+#define FE_ARCHIVE_H
+
+extern int	FrontendRestoreArchivedFile(const char *path,
+		const char *xlogfname,
+		off_t expectedSize,
+		const char *restoreCommand);
+
+#endif			/* FE_ARCHIVE_H */
diff --git a/src/include/port.h b/src/include/port.h
index 29f3e39e5b..20a5de1b7a 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -102,10 +102,11 @@ extern void pgfnames_cleanup(char **filenames);
 /* Portable locale initialization (in exec.c) */
 extern void set_pglocale_pgservice(const char *argv0, const char *app);
 
-/* Portable way to find binaries (in exec.c) */
+/* Portable way to find and execute binaries (in exec.c) */
 extern int	find_my_exec(const char *argv0, char *retpath);
 extern int	find_other_exec(const char *argv0, const char *target,
 			const char *versionstr, char *retpath);
+extern char *pipe_read_line(char *cmd, char *line, int maxsize);
 
 /* Doesn't belong here, but this is used with find_other_exec(), so... */
 #define PG_BACKEND_VERSIONSTR "postgres (PostgreSQL) " PG_VERSION "\n"
diff --git a/src/common/Makefile b/src/common/Makefile
index 6939b9d087..a97c723fbd 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -88,6 +88,7 @@ endif
 # (Mkvcbuild.pm has a copy of this list, too)
 OBJS_FRONTEND = \
 	$(OBJS_COMMON) \
+	fe_archive.o \
 	fe_memutils.o \
 	file_utils.o \
 	logging.o \
diff --git a/src/common/exec.c b/src/common/exec.c
index 88400daa47..f39b0a294b 100644
--- a/src/common/exec.c
+++ b/src/common/exec.c
@@ -51,7 +51,6 @@
 
 static int	validate_exec(const char *path);
 static int	resolve_symlinks(char *path);
-static 

Re: Parallel grouping sets

2020-03-23 Thread Tomas Vondra

On Fri, Mar 20, 2020 at 07:57:02PM +0800, Pengzhou Tang wrote:

Hi Tomas,

I rebased the code and resolved the comments you attached; some unresolved
comments are explained in 0002-fixes.patch, please take a look.

I also made hash spilling work for parallel grouping sets; the plan
looks like:

gpadmin=# explain select g100, g10, sum(g::numeric), count(*), max(g::text)
from gstest_p group by cube (g100,g10);
   QUERY PLAN
---
Finalize MixedAggregate  (cost=1000.00..7639.95 rows= width=80)
  Filtered by: (GROUPINGSETID())
  Group Key: ()
  Hash Key: g100, g10
  Hash Key: g100
  Hash Key: g10
  Planned Partitions: 4
  ->  Gather  (cost=1000.00..6554.34 rows= width=84)
Workers Planned: 7
->  Partial MixedAggregate  (cost=0.00..4776.64 rows= width=84)
  Group Key: ()
  Hash Key: g100, g10
  Hash Key: g100
  Hash Key: g10
  Planned Partitions: 4
          ->  Parallel Seq Scan on gstest_p  (cost=0.00..1367.71 rows=28571 width=12)
(16 rows)



Hmmm, OK. I think there's some sort of memory leak, though. I've tried
running a simple grouping set query on catalog_sales table from TPC-DS
scale 100GB test. The query is pretty simple:

  select count(*) from catalog_sales
  group by cube (cs_warehouse_sk, cs_ship_mode_sk, cs_call_center_sk);

with a partial MixedAggregate plan (attached). When executed, it however
allocates more and more memory, and eventually gets killed by an OOM
killer. This is on a machine with 8GB of RAM, work_mem=4MB (and 4
parallel workers).

The memory context stats from a running process before it gets killed by
OOM look like this

  TopMemoryContext: 101560 total in 6 blocks; 7336 free (6 chunks); 94224 used
    TopTransactionContext: 73816 total in 4 blocks; 11624 free (0 chunks); 62192 used
    ExecutorState: 1375731712 total in 174 blocks; 5391392 free (382 chunks); 1370340320 used
      HashAgg meta context: 315784 total in 10 blocks; 15400 free (2 chunks); 300384 used
        ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        ...

That's 1.3GB allocated in ExecutorState - that doesn't seem right.

FWIW there are only very few groups (each attribute has fewer than 30
distinct values, so there are only about ~1000 groups).  On master it
works just fine, of course.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
  QUERY PLAN
  
--
 Finalize MixedAggregate  (cost=1000.00..5901854.45 rows=10416 width=20)
   Filtered by: (GROUPINGSETID())
   Group Key: ()
   Hash Key: cs_warehouse_sk, cs_ship_mode_sk, cs_call_center_sk
   Hash Key: cs_warehouse_sk, cs_ship_mode_sk
   Hash Key: cs_warehouse_sk
   Hash Key: cs_ship_mode_sk, cs_call_center_sk
   Hash Key: cs_ship_mode_sk
   Hash Key: cs_call_center_sk, cs_warehouse_sk
   Hash Key: cs_call_center_sk
   ->  Gather  (cost=1000.00..5901348.48 rows=41664 width=24)
 Workers Planned: 4
 ->  Partial MixedAggregate  (cost=0.00..5896182.08 rows=10416 width=24)
   Group Key: ()
   Hash Key: cs_warehouse_sk, cs_ship_mode_sk, cs_call_center_sk
   Hash Key: cs_warehouse_sk, cs_ship_mode_sk
   Hash Key: cs_warehouse_sk
   Hash Key: cs_ship_mode_sk, cs_call_center_sk
   Hash Key: cs_ship_mode_sk
   Hash Key: cs_call_center_sk, cs_warehouse_sk
   Hash Key: cs_call_center_sk
    ->  Parallel Seq Scan on catalog_sales  (cost=0.00..4096256.32 rows=35996432 width=12)
(22 rows)
TopMemoryContext: 101560 total in 6 blocks; 7336 free (6 chunks); 94224 used
  TopTransactionContext: 73816 total in 4 blocks; 11624 free (0 chunks); 62192 
used
ExecutorState: 1375731712 total in 174 blocks; 5391392 free (382 chunks); 
1370340320 used
  HashAgg meta context: 315784 total in 10 blocks; 15400 free (2 chunks); 
300384 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  ExprContext: 8192 

Re: weird hash plan cost, starting with pg10

2020-03-23 Thread Thomas Munro
On Tue, Mar 24, 2020 at 9:55 AM Thomas Munro  wrote:
> On Tue, Mar 24, 2020 at 6:01 AM Tom Lane  wrote:
> > Alvaro Herrera  writes:
> > > While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
> > > current Postgres 10 emits weird bucket/batch/memory values for certain
> > > hash nodes:
> >
> > >  ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual time=0.002..0.002 rows=1 loops=8)
> > >        Buckets: 2139062143  Batches: 2139062143  Memory Usage: 8971876904722400kB
> > >        ->  Function Scan on unnest init_1  (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)
> >
> > Looks suspiciously like uninitialized memory ...
>
> I think "hashtable" might have been pfree'd before
> ExecHashGetInstrumentation() ran, because those numbers look like
> CLOBBER_FREED_MEMORY's pattern:
>
> >>> hex(2139062143)
> '0x7f7f7f7f'
> >>> hex(8971876904722400 / 1024)
> '0x7f7f7f7f7f7'
>
> Maybe there is something wrong with the shutdown order of nested subplans.

I think there might be a case like this:

* ExecRescanHashJoin() decides it can't reuse the hash table for a
rescan, so it calls ExecHashTableDestroy(), clears HashJoinState's
hj_HashTable and sets hj_JoinState to HJ_BUILD_HASHTABLE
* the HashState node still has a reference to the pfree'd HashJoinTable!
* HJ_BUILD_HASHTABLE case reaches the empty-outer optimisation case so
it doesn't bother to build a new hash table
* EXPLAIN examines the HashState's pointer to a freed HashJoinTable struct

You could fix the dangling pointer problem by clearing it, but then
you'd have no data for EXPLAIN to show in this case.  Some other
solution is probably needed, but I didn't have time to dig further
today.
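
One conceivable direction (only a sketch, not a worked-out fix; the
saved_* fields are illustrative and don't exist today) would be to have
the node snapshot the counters before the hash table is destroyed, so
EXPLAIN never has to chase a dangling pointer:

	/* hypothetical: capture stats before ExecHashTableDestroy() */
	if (hashNode->hashtable != NULL)
	{
		hashNode->saved_nbuckets = hashNode->hashtable->nbuckets;
		hashNode->saved_nbatch = hashNode->hashtable->nbatch;
		hashNode->saved_spacePeak = hashNode->hashtable->spacePeak;
		ExecHashTableDestroy(hashNode->hashtable);
		hashNode->hashtable = NULL;
	}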




Re: Define variables in the approprieate scope

2020-03-23 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Mar 23, 2020 at 08:50:55PM -0400, Bruce Momjian wrote:
>> I am fine with either usage, frankly.  I was just pointing out what
>> might be the benefit of the current coding.

> Personal opinion here.  I tend to prefer putting variable declarations
> into the inner portions because it makes it easier to reason about the
> code, though I agree that this concept does not need to be applied all
> the time.

My vote is to not make this sort of change until there's another
reason to touch the code in question.  All changes create hazards for
back-patching, and I don't think this change is worth it on its own.
But if there are going to be diffs in the immediate vicinity anyway,
then sure.

(I'm feeling a bit sensitized to this, perhaps, because of recent
unpleasant experience with back-patching b4570d33a.  That didn't touch
very much code, and the functions in question seemed like fairly stagnant
backwaters of the code base, so it should not have been painful to
back-patch ... but it was, because of assorted often-cosmetic changes
in said code.)

regards, tom lane




Re: Wait event that should be reported while waiting for WAL archiving to finish

2020-03-23 Thread Fujii Masao




On 2020/03/23 15:56, Fujii Masao wrote:



On 2020/03/19 19:39, Atsushi Torikoshi wrote:


On Wed, Feb 26, 2020 at 9:19 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:

    I have no idea about this. But I wonder how much that change
    helps to reduce the power consumption, because waiting for
    WAL archiving during the backup basically does not happen so
    frequently.


+1.
And as far as I've reviewed the patch, I didn't find any problems.


Thanks for the review!
Barring any objection, I will commit this patch.


Pushed! Thanks!

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Make message at end-of-recovery less scary.

2020-03-23 Thread Kyotaro Horiguchi
At Mon, 23 Mar 2020 12:47:36 -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote:
> > On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> > > | [20866] LOG:  replication terminated by primary server
> > > | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> > > | [20866] FATAL:  could not send end-of-streaming message to primary: no 
> > > COPY in progress
> 
> IMO it's a bug that we see this FATAL. I seem to recall that we didn't
> use to get that?

I thought it was a convention that an auxiliary process uses ERROR
(which is turned into FATAL in ereport) to exit, which I didn't like
so much, but it was out of scope for this patch.

As for the message above, the FATAL is preceded by the "LOG:
replication terminated by" message, which means walreceiver tries to
send new data after disconnection only to fail, which is
unreasonable. I think we should exit immediately after detecting
disconnection. The FATAL is gone with the attached.

> > > | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive 
> > > during standby mode
> > > | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0
> > 
> > Is this the before and after?  That doesn't seem like a substantial
> > improvement to me.  You still get the "scary" message at the end.
> 
> It seems like a minor improvement - folding the DETAIL into the message
> makes sense to me here. But it indeed doesn't really address the main
> issue.
> 
> I think we don't want to elide the information about how the end of the
> WAL was detected - there are some issues where I found that quite
> helpful. But we could reformulate it to be clearer that it's informative
> output, not a bug.  E.g. something roughly like
> 
> LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby 
> mode
> DETAIL: End detected due to invalid record length at 0/3000240: expected 24, 
> got 0
> (I first elided the position in the DETAIL, but it could differ from the
> one in LOG)
> 
> I don't find that very satisfying, but I can't come up with something
> that provides the current information, while being less scary than my
> suggestion?

A zero-length record is not an "invalid" state during recovery, so we
can report that state as "record length is 0 at %X/%X". I think that
if any other state is found there, it implies something is wrong.

The LSN is shown redundantly, but I'm not sure whether it would be
better to remove it from either of the two lines.

| LOG:  reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash 
recovery
| DETAIL:  record length is 0 at 0/3000850

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 47511afed5f8acf92abaf1cd6fcfecc1faea9c87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v3] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening.  Make this message less scary, as "reached end of WAL".
---
 src/backend/access/transam/xlog.c   | 69 ++---
 src/backend/access/transam/xlogreader.c |  9 
 src/backend/replication/walreceiver.c   | 11 ++--
 3 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 793c076da6..6c2924dfb7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4283,12 +4283,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, );
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
+			ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
 			if (readFile >= 0)
 			{
 close(readFile);
@@ -4296,13 +4299,13 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we are fetching the checkpoint record, we emit the error
+			 * message right now.  Otherwise the error is regarded as "end of
+			 * WAL" and the message, if any, is shown as part of the
+			 * end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, EndRecPtr),
+			if (fetching_ckpt && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4333,11 +4336,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
-		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
 
+		/* No valid record available from this source */

Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-03-23 Thread Fujii Masao




On 2020/03/19 17:22, Fujii Masao wrote:



On 2020/03/19 12:02, Amit Langote wrote:

On Thu, Mar 19, 2020 at 11:45 AM Fujii Masao wrote:

On 2020/03/19 11:32, Amit Langote wrote:

On Thu, Mar 19, 2020 at 11:24 AM Alvaro Herrera wrote:

On 2020-Mar-19, Amit Langote wrote:


Magnus' idea of checking the values in pg_stat_get_progress_info() to
determine whether to return NULL seems fine to me.


So you think that the latest patch is good enough?


I see that the latest patch modifies pg_stat_progress_basebackup view
to return NULL, so not exactly.  IIUC, Magnus seems to be advocating
to *centralize* this in pg_stat_get_progress_info(), which all views
are based on, which means we need to globally define a NULL param
value, as Alvaro also pointed out.

But...


  We will need to
update the documentation of st_progress_param, because it currently
says:

   *  ...but the meaning of each element in the
   * st_progress_param array is command-specific.
   */
  ProgressCommandType st_progress_command;
  Oid st_progress_command_target;
  int64   st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
} PgBackendStatus;

If we are to define -1 in st_progress_param[] as NULL to the users,
that must be mentioned here.


Hmm, why -1?  It seems like a value that we might want to use for other
purposes in other params.  Maybe INT64_MIN is a better choice?


Yes, maybe.


I don't think that we need to define a specific value like -1 as NULL
globally.  Which value should be used for that purpose may vary by
command.  For pg_stat_progress_basebackup.backup_total specifically,
IMO using -1 as a special value for NULL is not such a bad idea.


This is the first instance of needing to display NULL in a progress
view, so a non-general solution may be enough for now.  IOW, your
latest patch is good enough for that. :)


Ok, so barring any objection, I will commit the latest patch.


Pushed! Thanks all!

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2020-03-23 Thread Andres Freund
Hi,

On 2019-03-29 20:51:38 +0900, Michael Paquier wrote:
> So, coming back to this thread, and studying the problem again, it
> looks like the diagnosis is that a non-aggressive, anti-wraparound
> vacuum could be triggered because the worker sees trouble in the
> force because of some activity happening in parallel.  Hence, if we
> face this case, it looks right to skip the vacuum for this relation.
> 
> Attached is an updated patch with a better error message, more
> comments, and the removal of the anti-wraparound non-aggressive log
> which was added in 28a8fa9.  The error message could be better I
> guess.  Suggestions are welcome.

> diff --git a/src/backend/access/heap/vacuumlazy.c 
> b/src/backend/access/heap/vacuumlazy.c
> index 5c554f9465..82be8c81f3 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -248,6 +248,25 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
>   if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
>   aggressive = true;
>  
> + /*
> +  * When running an anti-wraparound vacuum, we expect relfrozenxid to be
> +  * old enough so as aggressive is always set.  If this is not the case,
> +  * it could be possible that another concurrent vacuum process has done
> +  * the work for this relation so that relfrozenxid in relcache has
> +  * already been moved forward enough, causing this vacuum run to be
> +  * non-aggressive.  If that happens, note that this relation no longer
> +  * needs to be vacuumed, so just skip it.
> +  */
> + if (params->is_wraparound && !aggressive)
> + {
> +		ereport(LOG,
> +				(errmsg_internal("found vacuum to prevent wraparound of table \"%s.%s.%s\" to be not aggressive, so skipping",
> +								 get_database_name(MyDatabaseId),
> +								 get_namespace_name(RelationGetNamespace(onerel)),
> +								 RelationGetRelationName(onerel))));
> + return;
> + }
> +

Which valid scenario can lead to this? Neither the comment, nor commit
message explain it. Unless you're thinking of scenarios where autovacuum
and manual vacuum are mixed, I don't really see valid reasons? Normally
autovacuum's locking + the table_recheck_autovac() check should prevent
problematic scenarios.

I do see a few scenarios that can trigger this - but they all more or
less are bugs.

It doesn't strike me as a good idea to work around such bugs by silently
neutering heap_vacuum_rel(). The likelihood of that temporarily covering
up more severe problems seems significant - they're likely to then later
bite you with a cluster shutdown.

Greetings,

Andres Freund




Re: Additional improvements to extended statistics

2020-03-23 Thread Tomas Vondra

On Mon, Mar 23, 2020 at 08:21:42AM +, Dean Rasheed wrote:

On Sat, 21 Mar 2020 at 21:59, Tomas Vondra  wrote:


Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.



Looking through those new tests, another issue occurred to me -- under
some circumstances this patch can lead to extended stats being applied
twice to the same clause, which is not great, because it involves
quite a lot of extra work, and also because it can lead to
overestimates because of the way that MCV stats are applied as a delta
correction to simple_sel.

The way this comes about is as follows -- if we start with an OR
clause, that will be passed as a single-item implicitly AND'ed list to
clauselist_selectivity(), and from there to
statext_clauselist_selectivity() and then to
statext_mcv_clauselist_selectivity(). This will call
clauselist_selectivity_simple() to get simple_sel, before calling
mcv_clauselist_selectivity(), which will recursively compute all the
MCV corrections. However, the call to clauselist_selectivity_simple()
will call clause_selectivity() for each clause (just a single OR
clause in this example), which will now call
clauselist_selectivity_or(), which will go back into
statext_clauselist_selectivity() with "is_or = true", which will apply
the MCV corrections to the same set of clauses that the outer call was
about to process.



Hmmm. So let's consider a simple OR clause with two arguments, both
covered by a single statistics object. Something like this:

  CREATE TABLE t (a int, b int);
  INSERT INTO t SELECT mod(i, 10), mod(i, 10)
FROM generate_series(1,10);
  CREATE STATISTICS s (mcv) ON a,b FROM t;

  SELECT * FROM t WHERE a = 0 OR b = 0;

Which is estimated correctly, but the calls graph looks like this:

  clauselist_selectivity
statext_clauselist_selectivity
  statext_mcv_clauselist_selectivity  <---
clauselist_selectivity_simple
  clause_selectivity
clauselist_selectivity_or
  statext_clauselist_selectivity
statext_mcv_clauselist_selectivity  <---
  clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
  mcv_clauselist_selectivity
  clauselist_selectivity_simple_or
mcv_clauselist_selectivity
  clauselist_selectivity_simple
(already estimated)

IIUC the problem you have in mind is that we end up calling
statext_mcv_clauselist_selectivity twice, once for the top-level AND
clause with a single argument, and then recursively for the expanded OR
clause. Indeed, that seems to be applying the correction twice.



I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.


I don't know. I feel uneasy about just ignoring some of the clauses,
because what happens for complex clauses, where the OR is not directly
in the AND clause, but is negated or something like that?

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need to
add a flag (or _simple variant) to clause_selectivity, so that it calls
clauselist_selectivity_simple_or. So the calls would look like this:

  clauselist_selectivity
statext_clauselist_selectivity
  statext_mcv_clauselist_selectivity
clauselist_selectivity_simple  <--- disable extended stats
  clause_selectivity
clauselist_selectivity_simple_or
  clause_selectivity
  clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple
  already estimated

I've only quickly hacked clause_selectivity, but it does not seem very
invasive (of course, it means disruption to clause_selectivity callers,
but I suppose most call clauselist_selectivity).
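
A minimal sketch of the interface change I have in mind (hypothetical
signature, not a finished patch):

	/*
	 * Variant of clause_selectivity() that lets the caller suppress the
	 * use of extended statistics while recursing, so that the MCV
	 * correction is applied only once, at the top level.
	 */
	Selectivity
	clause_selectivity_ext(PlannerInfo *root, Node *clause, int varRelid,
						   JoinType jointype, SpecialJoinInfo *sjinfo,
						   bool use_extended_stats);

	/* the existing function would become a thin wrapper */
	Selectivity
	clause_selectivity(PlannerInfo *root, Node *clause, int varRelid,
					   JoinType jointype, SpecialJoinInfo *sjinfo)
	{
		return clause_selectivity_ext(root, clause, varRelid,
									  jointype, sjinfo, true);
	}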

BTW do you have an example where this would actually cause an issue?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Define variables in the approprieate scope

2020-03-23 Thread Michael Paquier
On Mon, Mar 23, 2020 at 08:50:55PM -0400, Bruce Momjian wrote:
> On Mon, Mar 23, 2020 at 01:00:24PM -0300, Alvaro Herrera wrote:
>> If we're talking about personal preference, my own is what Antonin
>> shows.  However, since disagreement has been expressed, I think we
>> should only change it if the generated code turns out better.
> 
> I am fine with either usage, frankly.  I was just pointing out what
> might be the benefit of the current coding.

Personal opinion here.  I tend to prefer putting variable declarations
into the inner portions because it makes it easier to reason about the
code, though I agree that this concept does not need to be applied all
the time.
--
Michael




Re: Define variables in the approprieate scope

2020-03-23 Thread Bruce Momjian
On Mon, Mar 23, 2020 at 01:00:24PM -0300, Alvaro Herrera wrote:
> On 2020-Mar-18, Bruce Momjian wrote:
> 
> > On Tue, Feb 25, 2020 at 09:35:52AM +0100, Antonin Houska wrote:
> > > I've noticed that two variables in RelationCopyStorage() are defined in a
> > > scope higher than necessary. Please see the patch.
> > 
> > It seems cleaner to me to allocate the variables once before the loop
> > starts, rather than for each loop iteration.
> 
> If we're talking about personal preference, my own is what Antonin
> shows.  However, since disagreement has been expressed, I think we
> should only change it if the generated code turns out better.

I am fine with either usage, frankly.  I was just pointing out what
might be the benefit of the current coding.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Autovacuum vs vac_update_datfrozenxid() vs ?

2020-03-23 Thread Andres Freund
Hi,

While looking to understand what could be going on with [1], I think I
might have stumbled on a few issues that could at least explain parts of
the problem.

First, it seems to me that vac_update_datfrozenxid() can spuriously
(but temporarily) fail due to the 'bogus'
logic. vac_update_datfrozenxid() first determines the 'newest' xid it
accepts:
/*
 * Identify the latest relfrozenxid and relminmxid values that we could
 * validly see during the scan.  These are conservative values, but it's
 * not really worth trying to be more exact.
 */
lastSaneFrozenXid = ReadNewTransactionId();
lastSaneMinMulti = ReadNextMultiXactId();

and then starts a full table scan of pg_class to determine the database
wide relfrozenxid:
scan = systable_beginscan(relation, InvalidOid, false,
  NULL, 0, NULL);

Whenever vac_update_datfrozenxid() encounters a relfrozenxid that's "too
new", it'll bail out:
/*
 * If things are working properly, no relation should have a
 * relfrozenxid or relminmxid that is "in the future".  However, such
 * cases have been known to arise due to bugs in pg_upgrade.  If we
 * see any entries that are "in the future", chicken out and don't do
 * anything.  This ensures we won't truncate clog before those
 * relations have been scanned and cleaned up.
 */
if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
{
bogus = true;
break;
}

Which all in theory makes sense - but there's two fairly sizable races
here:
1) Because lastSaneFrozenXid is determined *before* the snapshot for
   the pg_class scan, it's entirely possible that concurrently another
   vacuum on another table in the same database starts & finishes. And
   that very well can end up using a relfrozenxid that's newer than
   lastSaneFrozenXid.
2) Much worse, because vac_update_relstats() uses heap_inplace_update(),
   there's actually no snapshot semantics here anyway! Which means that
   the window for a race isn't just between the ReadNewTransactionId()
   and the systable_beginscan(), but instead lasts for the duration of
   the entire scan.

Either of these can then triggers vac_update_datfrozenxid() to
*silently* bail out without touching anything.



I think there's also another (even larger?) race in
vac_update_datfrozenxid(): Unless I miss something, two backends can
concurrently run through the scan in vac_update_datfrozenxid() for two
different tables in the same database, both can check that they need to
update the pg_database row, and then one of them can overwrite the
results of the other. And the one that updates last might actually be
the one with an older horizon.  This is possible since there is no 'per
database' locking in heap_vacuum_rel().
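
For concreteness, a hypothetical interleaving: worker A finishes its
pg_class scan and computes a new datfrozenxid X1; worker B, whose scan
completed earlier, computed X2 with X2 preceding X1; A updates
pg_database to X1, and then B overwrites it with the older X2, so the
stored horizon moves backwards even though both workers "succeeded".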



The way I suspect that interacts with the issue in [1] is that once
that has happened, do_start_worker() figures out it's doing a xid
wraparound vacuum based on the "wrong" datfrozenxid:
/* Check to see if this one is at risk of wraparound */
if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
{
	if (avdb == NULL ||
		TransactionIdPrecedes(tmp->adw_frozenxid,
							  avdb->adw_frozenxid))
		avdb = tmp;
	for_xid_wrap = true;
	continue;
}
else if (for_xid_wrap)
	continue;			/* ignore not-at-risk DBs */

which means the autovac launcher will only start workers for that database -
but there might actually not be any work for it.


I have some theories as to why this is more common in 12, but I've not
fully nailed them down.


Greetings,

Andres Freund

[1] https://postgr.es/m/20200323152247.GB52612%40nol




Re: range_agg

2020-03-23 Thread Alvaro Herrera
On 2020-Mar-19, Paul A Jungwirth wrote:

> On Thu, Mar 19, 2020 at 1:43 PM Paul A Jungwirth wrote:
> > On Thu, Mar 19, 2020 at 1:42 PM Alvaro Herrera wrote:
> > > There's been another flurry of commits in the polymorphic types area.
> > > Can you please rebase again?
> >
> > I noticed that too. :-) I'm about halfway through a rebase right now.
> > I can probably finish it up tonight.
> 
> Here is that patch. I should probably add an anycompatiblemultirange
> type now too? I'll get started on that tomorrow.

Thanks for the new version.  Here's a few minor adjustments while I
continue to read through it.

Thinking about the on-disk representation, can we do better than putting
the contained ranges in long-varlena format, including padding; also we
include the type OID with each element.  Sounds wasteful.  A more
compact representation might be to allow short varlenas and doing away
with the alignment padding, and put the type OID just once.  This is
important because we cannot change it later.

I'm also wondering if multirange_in() is the right strategy.  Would it
be sensible to give each range to range_parse or range_parse_bound, so
that it determines where each range starts and ends?  Then that function
doesn't have to worry about each quote and escape, duplicating range
parsing code.  (This will probably require changing signature of the
rangetypes.c function, and exporting it; for example have
range_parse_bound allow bound_str to be NULL and in that case don't mess
with the StringInfo and just return the end position of the parsed
bound.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From bd8b814426bef6f5a620351557cdc953ba0ae22e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Mar 2020 18:50:02 -0300
Subject: [PATCH 1/5] Fix typo

---
 src/backend/utils/adt/multirangetypes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index bd3cf3dc93..6b3cde8eca 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -79,7 +79,7 @@ static int32 multirange_canonicalize(TypeCacheEntry *rangetyp, int32 input_range
  * to range_in, but we have to detect quoting and backslash-escaping
  * which can happen for range bounds.
  * Backslashes can escape something inside or outside a quoted string,
- * and a quoted string can escape quote marks either either backslashes
+ * and a quoted string can escape quote marks with either backslashes
  * or double double-quotes.
  */
 Datum
-- 
2.20.1

>From afd017faab3d6aff3ca8618d95a6f9c918ab8cca Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Mar 2020 18:50:19 -0300
Subject: [PATCH 2/5] Remove trailing useless ;

---
 src/backend/utils/adt/multirangetypes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index 6b3cde8eca..6e9cf77651 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -131,7 +131,7 @@ multirange_in(PG_FUNCTION_ARGS)
 			input_str),
 	 errdetail("Unexpected end of input.")));
 
-		 /* skip whitespace */ ;
+		/* skip whitespace */
 		if (isspace((unsigned char) ch))
 			continue;
 
-- 
2.20.1

>From d7c1ea171de018d025ab5de57f572a9353b02fdf Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Mar 2020 18:50:40 -0300
Subject: [PATCH 3/5] reduce palloc+strlcpy to pnstrdup

---
 src/backend/utils/adt/multirangetypes.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index 6e9cf77651..5b57416cfd 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -167,9 +167,8 @@ multirange_in(PG_FUNCTION_ARGS)
 	parse_state = MULTIRANGE_IN_RANGE_ESCAPED;
 else if (ch == ']' || ch == ')')
 {
-	range_str_len = ptr - range_str + 2;
-	range_str_copy = palloc0(range_str_len);
-	strlcpy(range_str_copy, range_str, range_str_len);
+	range_str_len = ptr - range_str + 1;
+	range_str_copy = pnstrdup(range_str, range_str_len);
 	if (range_capacity == range_count)
 	{
 		range_capacity *= 2;
-- 
2.20.1

>From 7698f57d1f5ab870c89a84da7bea24bef241a458 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Mar 2020 18:50:59 -0300
Subject: [PATCH 4/5] silence compiler warning

---
 src/backend/utils/fmgr/funcapi.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/fmgr/funcapi.c b/src/backend/utils/fmgr/funcapi.c
index ab24359981..3bfefcf48a 100644
--- a/src/backend/utils/fmgr/funcapi.c
+++ b/src/backend/utils/fmgr/funcapi.c
@@ -508,10 +508,13 @@ resolve_anyelement_from_others(polymorphic_actuals *actuals)
 	else if 

Re: Option to dump foreign data in pg_dump

2020-03-23 Thread Alvaro Herrera
On 2020-Jan-29, Peter Eisentraut wrote:

> On 2020-01-21 10:36, Luis Carril wrote:
> > > Yes we can support --include-foreign-data without parallel option and
> > > later add support for parallel option as a different patch.
> > 
> >      I've attached a new version of the patch in which an error is
> > emitted if the parallel backup is used with the --include-foreign-data
> > option.
> 
> This seems like an overreaction.  The whole point of lockTableForWorker() is
> to avoid deadlocks, but foreign tables don't have locks, so it's not a
> problem.  I think you can just skip foreign tables in lockTableForWorker()
> using the same logic that getTables() uses.
> 
> I think parallel data dump would be an especially interesting option when
> using foreign tables, so it's worth figuring this out.

I agree it would be nice to implement this, so I tried to implement it.

I found it's not currently workable, because parallel.c only has a tocEntry
to work with, not a DumpableObject, so it doesn't know that the table is
foreign; to find that out, parallel.c could use findObjectByDumpId, but
parallel.c is used by both pg_dump and pg_restore, and findObjectByDumpId is
in common.c which cannot be linked in pg_restore because of numerous
incompatibilities.

One way to make this work would be to put lockTableForWorker somewhere other
than parallel.c.  For example, maybe have CreateArchive() set up a new "lock
table" ArchiveHandle function ptr that parallel.c can call;
lockTableForWorker() becomes the pg_dump implementation of that, while
pg_restore uses NULL.
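
To sketch that idea (the member name is hypothetical):

	/* in ArchiveHandle (pg_backup_archiver.h), hypothetical addition */
	void		(*LockTableForWorkerPtr) (ArchiveHandle *AH, TocEntry *te);

	/* parallel.c would then call through the pointer: */
	if (AH->LockTableForWorkerPtr != NULL)
		AH->LockTableForWorkerPtr(AH, te);

pg_dump would install its lockTableForWorker() implementation at
CreateArchive() time, while pg_restore would leave the pointer NULL.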

Anyway, I think Luis has it right that this should not be a blocker for
this feature.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Option to dump foreign data in pg_dump

2020-03-23 Thread Alvaro Herrera
v8 attached.

I modified Luis' v7 a little bit by putting the ftserver acquisition in
the main pg_class query instead of adding one separate query for each
foreign table.  That seems better overall.

I don't understand why this code specifically disallows the empty string
as an option to --dump-foreign-data.  The other pattern-matching options
don't do that.  This seems to have been added in response to Daniel's
review[1], but I don't quite understand the rationale.  No other option
behaves that way.  I'm inclined to remove that, and I have done so in
this version.

I removed DumpOptions new bool flag.  Seems pointless; we can just check
that the list is not null, as we do for other such lists.

I split out the proposed test in a different commit; there's no
consensus that this test is acceptable as-is.  Tom proposed a different
strategy[2]; if you try to dump a table with a dummy handler, you'll get
this:

COPY public.ft1 (c1, c2, c3) FROM stdin;
pg_dump: error: query failed: ERROR:  foreign-data wrapper "dummy" has no 
handler
pg_dump: error: query was: COPY (SELECT c1, c2, c3 FROM public.ft1 ) TO stdout;

Maybe what we should do is just verify that you do get that error (and no
other errors).

[1] https://postgr.es/m/e9c5b25c-52e4-49ec-9958-69cd5bd14...@yesql.se
[2] https://postgr.es/m/8001.1573759...@sss.pgh.pa.us

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 7e484cb8b4fa9421d2a64fe98fd1c5058c5a5fd0 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 20 Mar 2020 18:42:03 -0300
Subject: [PATCH v8 1/2] pg_dump: Allow dumping data of specific foreign
 servers

Author: Luis Carril
Discussion: https://postgr.es/m/lejpr01mb0185483c0079d2f651b16231e7...@lejpr01mb0185.deuprd01.prod.outlook.de
---
 doc/src/sgml/ref/pg_dump.sgml |  30 ++
 src/bin/pg_dump/pg_dump.c | 110 --
 src/bin/pg_dump/pg_dump.h |   1 +
 3 files changed, 136 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 13bd320b31..a9bc397165 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -767,6 +767,36 @@ PostgreSQL documentation
   
  
 
+ 
+  --include-foreign-data=foreignserver
+  
+   
+Dump the data for any foreign table with a foreign server
+matching foreignserver
+pattern. Multiple foreign servers can be selected by writing multiple
+--include-foreign-data switches.
+Also, the foreignserver parameter is
+interpreted as a pattern according to the same rules used by
+psql's \d commands (see ),
+so multiple foreign servers can also be selected by writing wildcard characters
+in the pattern.  When using wildcards, be careful to quote the pattern
+if needed to prevent the shell from expanding the wildcards; see
+.
+The only exception is that an empty pattern is disallowed.
+   
+
+   
+
+ When --include-foreign-data is specified,
+ pg_dump does not check that the foreign
+ table is writeable.  Therefore, there is no guarantee that the
+ results of a foreign table dump can be successfully restored.
+
+   
+  
+ 
+
  
   --inserts
   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 959b36a95c..1849dfe3d7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -119,6 +119,8 @@ static SimpleStringList table_exclude_patterns = {NULL, NULL};
 static SimpleOidList table_exclude_oids = {NULL, NULL};
 static SimpleStringList tabledata_exclude_patterns = {NULL, NULL};
 static SimpleOidList tabledata_exclude_oids = {NULL, NULL};
+static SimpleStringList foreign_servers_include_patterns = {NULL, NULL};
+static SimpleOidList foreign_servers_include_oids = {NULL, NULL};
 
 
 /* placeholders for the delimiters for comments */
@@ -153,6 +155,9 @@ static void expand_schema_name_patterns(Archive *fout,
 		SimpleStringList *patterns,
 		SimpleOidList *oids,
 		bool strict_names);
+static void expand_foreign_server_name_patterns(Archive *fout,
+SimpleStringList *patterns,
+SimpleOidList *oids);
 static void expand_table_name_patterns(Archive *fout,
 	   SimpleStringList *patterns,
 	   SimpleOidList *oids,
@@ -385,6 +390,7 @@ main(int argc, char **argv)
 		{"no-sync", no_argument, NULL, 7},
 		{"on-conflict-do-nothing", no_argument, _nothing, 1},
 		{"rows-per-insert", required_argument, NULL, 10},
+		{"include-foreign-data", required_argument, NULL, 11},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -600,6 +606,11 @@ main(int argc, char **argv)
 dopt.dump_inserts = (int) rowsPerInsert;
 break;
 
+			case 11:			/* include foreign data */
+				simple_string_list_append(&foreign_servers_include_patterns,
+										  optarg);
+break;
+
 			

Re: Option to dump foreign data in pg_dump

2020-03-23 Thread Alvaro Herrera
On 2020-Mar-23, Alvaro Herrera wrote:

> > This seems like an overreaction.  The whole point of lockTableForWorker() is
> > to avoid deadlocks, but foreign tables don't have locks, so it's not a
> > problem.  I think you can just skip foreign tables in lockTableForWorker()
> > using the same logic that getTables() uses.
> > 
> > I think parallel data dump would be an especially interesting option when
> > using foreign tables, so it's worth figuring this out.
> 
> I agree it would be nice to implement this, so I tried to implement it.

(Here's patch for this, which of course doesn't compile)

diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c25e3f7a88..b3000da409 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -1316,17 +1316,33 @@ IsEveryWorkerIdle(ParallelState *pstate)
  * then we know that somebody else has requested an ACCESS EXCLUSIVE lock and
  * so we have a deadlock.  We must fail the backup in that case.
  */
+#include "pg_dump.h"
+#include "catalog/pg_class_d.h"
 static void
 lockTableForWorker(ArchiveHandle *AH, TocEntry *te)
 {
const char *qualId;
PQExpBuffer query;
PGresult   *res;
+   DumpableObject *obj;
 
/* Nothing to do for BLOBS */
if (strcmp(te->desc, "BLOBS") == 0)
return;
 
+	/*
+	 * Nothing to do for foreign tables either, since they don't support LOCK
+	 * TABLE.
+	 */
+   obj = findObjectByDumpId(te->dumpId);
+   if (obj->objType == DO_TABLE_DATA)
+   {
+   TableInfo *tabinfo = (TableInfo *) obj;
+
+   if (tabinfo->relkind == RELKIND_FOREIGN_TABLE)
+   return;
+   }
+
query = createPQExpBuffer();
 
qualId = fmtQualifiedId(te->namespace, te->tag);

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: GSoC chat link not working

2020-03-23 Thread Stephen Frost
Greetings,

* Ananya Srivastava (ananyavsrivat...@gmail.com) wrote:
> irc://irc.freenode.net/postgresql  link is not working and I am not able to
> use the chat option to clear some doubts. Here is a screenshot if you require it.

You simply need a client that works with irc links to utilize that link.

For a web-based interface, you could go here:

https://webchat.freenode.net/

And then, after logging in, join the #postgresql channel.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: backup manifests

2020-03-23 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> I think I forgot an initializer. Try this version.

Just took a quick look through this.  I'm pretty sure David wants to
look at it too.  Anyway, some comments below.

> diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
> index f139ba0231..d1ff53e8e8 100644
> --- a/doc/src/sgml/protocol.sgml
> +++ b/doc/src/sgml/protocol.sgml
> @@ -2466,7 +2466,7 @@ The commands accepted in replication mode are:
>
>  
> xreflabel="BASE_BACKUP">
> -BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ]
> +BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ] [ MANIFEST manifest_option ] [ MANIFEST_CHECKSUMS checksum_algorithm ]
>   BASE_BACKUP
>  
>  
> @@ -2576,6 +2576,37 @@ The commands accepted in replication mode are:
>   
>  
> 
> +
> +   
> +MANIFEST
> +
> + 
> +  When this option is specified with a value of yes
> +  or force-escape, a backup manifest is created
> +  and sent along with the backup. The latter value forces all 
> filenames
> +  to be hex-encoded; otherwise, this type of encoding is performed 
> only
> +  for files whose names are non-UTF8 octet sequences.
> +  force-escape is intended primarily for testing
> +  purposes, to be sure that clients which read the backup manifest
> +  can handle this case. For compatibility with previous releases,
> +  the default is MANIFEST 'no'.
> + 
> +
> +   
> +
> +   
> +MANIFEST_CHECKSUMS
> +
> + 
> +  Specifies the algorithm that should be used to checksum each file
> +  for purposes of the backup manifest. Currently, the available
> +  algorithms are NONE, CRC32C,
> +  SHA224, SHA256,
> +  SHA384, and SHA512.
> +  The default is CRC32C.
> + 
> +
> +   
>
>   
>   

While I get the desire to have a default here that includes checksums,
the way the command is structured, it strikes me as odd that the lack of
MANIFEST_CHECKSUMS in the command actually results in checksums being
included.  I would think that we'd either:

- have the lack of MANIFEST_CHECKSUMS mean 'No checksums'

or

- Require MANIFEST_CHECKSUMS to be specified and not have it be optional

We aren't expecting people to actually be typing these commands out and
so I don't think it's a *huge* deal to have it the way you've written
it, but it still strikes me as odd.  I don't think I have a real
preference between the two options that I suggest above, maybe very
slightly in favor of the first.

> diff --git a/doc/src/sgml/ref/pg_basebackup.sgml 
> b/doc/src/sgml/ref/pg_basebackup.sgml
> index 90638aad0e..bf6963a595 100644
> --- a/doc/src/sgml/ref/pg_basebackup.sgml
> +++ b/doc/src/sgml/ref/pg_basebackup.sgml
> @@ -561,6 +561,69 @@ PostgreSQL documentation
> 
>
>   
> +
> + 
> +  --no-manifest
> +  
> +   
> +Disables generation of a backup manifest. If this option is not
> +specified, the server will and send generate a backup manifest
> +which can be verified using .
> +   
> +  
> + 

How about "If this option is not specified, the server will generate and
send a backup manifest which can be verified using ..."

> + 
> +  --manifest-checksums= class="parameter">algorithm
> +  
> +   
> +Specifies the algorithm that should be used to checksum each file
> +for purposes of the backup manifest. Currently, the available
> +algorithms are NONE, CRC32C,
> +SHA224, SHA256,
> +SHA384, and SHA512.
> +The default is CRC32C.
> +   

As I recall, there was an invitation to argue about the defaults at one
point, and so I'm going to say here that I would advocate for a
different default than 'crc32c'.  Specifically, I would think sha256 or
512 would be better.  I don't recall seeing a debate about this that
conclusively found crc32c to be better, but I'm happy to go back and
reread anything someone wants to point me at.

> +   
> +If NONE is selected, the backup manifest will
> +not contain any checksums. Otherwise, it will contain a checksum
> +of each file in the backup using the specified algorithm. In 
> addition,
> +the manifest itself will always contain a SHA256
> +checksum of its own contents. The SHA algorithms
> +are significantly more CPU-intensive than CRC32C,
> +so selecting one of them may increase the time required to complete
> +the backup.
> +   

It also seems a bit silly to me that using the defaults means having to
deal with two different 

Re: backend type in log_line_prefix?

2020-03-23 Thread Bruce Momjian
On Thu, Mar 19, 2020 at 01:37:17PM -0300, Fabrízio de Royes Mello wrote:
> 
> On Sun, Mar 15, 2020 at 7:32 AM Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
> >
> >
> > I have committed that last one also, after some corrections.
> >
> 
> IMHO we should also update file_fdw documentation. See attached!
> 
> Regards,
> 
> --
>    Fabrízio de Royes Mello         Timbira - http://www.timbira.com.br/
>    PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

> diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml
> index 28b61c8f2d..ed028e4ec9 100644
> --- a/doc/src/sgml/file-fdw.sgml
> +++ b/doc/src/sgml/file-fdw.sgml
> @@ -261,7 +261,8 @@ CREATE FOREIGN TABLE pglog (
>query text,
>query_pos integer,
>location text,
> -  application_name text
> +  application_name text,
> +  backend_type text
>  ) SERVER pglog
>  OPTIONS ( filename '/home/josh/data/log/pglog.csv', format 'csv' );
>  

Patch applied to master, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Improve heavyweight locks instead of building new lock managers?

2020-03-23 Thread Andres Freund
Hi,

On 2020-02-21 12:40:06 +1300, Thomas Munro wrote:
> On Thu, Feb 20, 2020 at 5:14 PM Andres Freund  wrote:
> >  16 files changed, 569 insertions(+), 1053 deletions(-)
> 
> Nice!

Thanks!


> Some comments on 0001, 0003, 0004:
> 
> > Subject: [PATCH v1 1/6] Add additional prev/next & detached node functions 
> > to
> 
> +extern void dlist_check(const dlist_head *head);
> +extern void slist_check(const slist_head *head);
> 
> I approve of the incidental constification in this patch.

It was just necessary fallout :)


> +/*
> + * Like dlist_delete(), but also sets next/prev to NULL to signal not being 
> in
> + * list.
> + */
> +static inline void
> +dlist_delete_thoroughly(dlist_node *node)
> +{
> +   node->prev->next = node->next;
> +   node->next->prev = node->prev;
> +   node->next = NULL;
> +   node->prev = NULL;
> +}
> 
> Instead of introducing this strange terminology, why not just have the
> callers do ...
> 
> dlist_delete(node);
> dlist_node_init(node);

There's quite a few callers in predicate.c - I actually did that first.


> ..., or perhaps supply dlist_delete_and_reinit(node) that does exactly
> that?  That is, reuse the code and terminology.

Yea, that might be better, but see paragraph below.  I quite dislike
adding any "empty node" state.
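
For reference, the suggested helper would just compose the two existing
calls; a minimal sketch, assuming the patch's dlist_init_node() resets
both links to NULL, which makes it equivalent to dlist_delete_thoroughly():

static inline void
dlist_delete_and_reinit(dlist_node *node)
{
	dlist_delete(node);		/* unlink from its neighbors */
	dlist_init_node(node);	/* next = prev = NULL, i.e. "detached" */
}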


> +/*
> + * Check if node is detached. A node is only detached if it either has been
> + * initialized with dlist_init_node(), or deleted with
> + * dlist_delete_thoroughly().
> + */
> +static inline bool
> +dlist_node_is_detached(const dlist_node *node)
> +{
> +Assert((node->next == NULL && node->prev == NULL) ||
> +   (node->next != NULL && node->prev != NULL));
> +
> +return node->next == NULL;
> +}
> 
> How about dlist_node_is_linked()?  I don't like introducing random new
> verbs when we already have 'linked' in various places, and these
> things are, y'know, linked lists.

Well, but that doesn't signal that you can't just delete and have
dlist_node_is_linked() work. I *want* it to sound "different".  We could
of course make delete always do this, but I don't want to add that
overhead unnecessarily.


> > Subject: [PATCH v1 4/6] Use dlists for predicate locking.
> 
> +dlist_foreach(iter, (SERIALIZABLEXACT *, 
> reader)->outConflicts)
> 
> Yuck...

It doesn't seem *that* bad to me; at least it signals properly what we're
doing, and only does so in one place.


> I suppose you could do this:
> 
> -dlist_foreach(iter, (SERIALIZABLEXACT *, 
> reader)->outConflicts)
> +dlist_foreach_const(iter, >outConflicts)

We'd need a different iterator type too, I think? Because the iterator
itself can't be constant, but we'd want the elements themselves to be
pointers to constants.

const just isn't a very granular thing in C :(.


> ... given:
> 
> +/* Variant for when you have a pointer to const dlist_head. */
> +#define dlist_foreach_const(iter, lhead)\
> +for (AssertVariableIsOfTypeMacro(iter, dlist_iter),\
> + AssertVariableIsOfTypeMacro(lhead, const dlist_head *),\
> + (iter).end = (dlist_node *) &(lhead)->head,\
> + (iter).cur = (iter).end->next ? (iter).end->next : (iter).end;\
> + (iter).cur != (iter).end;\
> + (iter).cur = (iter).cur->next)
> +
> 
> ... or find a way to make dlist_foreach() handle that itself, which
> seems pretty reasonable given its remit to traverse lists without
> modifying them, though perhaps that would require a different iterator
> type...

I was trying that first, but I don't easily see how we can do
so. Iterating over a non-constant list with dlist_foreach obviously
still allows one to manipulate the list members. Thus dlist_iter.cur
can't be a 'pointer to const'.  Whereas that's arguably what'd be needed
for a correct dlist_foreach() of a constant list?

We could just accept const pointers for dlist_foreach(), but then we'd
*always* accept them, and we'd thus unconditionally have iter.cur as
non-const.  Would that be better?


> Otherwise looks OK to me and passes various tests I threw at it

Thanks!

Greetings,

Andres Freund




Re: Internal key management system

2020-03-23 Thread Bruce Momjian
On Mon, Mar 23, 2020 at 03:55:34PM +0900, Masahiko Sawada wrote:
> On Sat, 21 Mar 2020 at 23:50, Bruce Momjian  wrote:
> > Actually, I think we need three files:
> >
> > *  TDE WAL key file
> > *  TDE block key file
> > *  SQL-level file
> >
> > Primaries and standbys have to use the same TDE WAL key file, but can
> > use different TDE block key files to allow for key rotation, so having
> > separate files makes sense --- maybe they need to be in their own
> > directory.
> 
> I've considered to have separate key files once but it would make
> things complex to update multiple files atomically. Postgres server
> will never start if it crashes in the middle of cluster passphrase
> rotation. Can we consider to have keys related to TDE after we
> introduce the basic key management system? Probably having keys in a
> separate file rather than in pg_control file would be better but we
> don't need these keys so far.

Well, we need to be able to upgrade this so we have to set it up now in
a way that allows that.

I am not sure we have ever had a case where we needed to update multiple
files atomically at the same time, without the help of WAL.

Perhaps we should put the three keys in separate files in a directory
called 'cryptokeys', and when we change the pass phrase, we create a new
directory called 'cryptokeys.new'.  Then once we have created the files
in there with the new pass phrase, we remove cryptokeys and rename
directory cryptokeys.new to cryptokeys.  On boot, if cryptokeys exists
and cryptokeys.new does too, remove cryptokeys.new because we crashed
during key rotation.  If cryptokeys.new exists and cryptokeys doesn't,
we rename cryptokeys.new to cryptokeys because we crashed before the
rename.
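
A minimal sketch of that boot-time repair rule; dir_exists() and
remove_tree() are hypothetical helpers here, and real code would also
need to fsync the parent directory around the rename for durability:

#define KEYDIR		"cryptokeys"
#define KEYDIR_NEW	"cryptokeys.new"

static void
recover_cryptokeys_dir(void)
{
	bool		have_old = dir_exists(KEYDIR);
	bool		have_new = dir_exists(KEYDIR_NEW);

	if (have_old && have_new)
		remove_tree(KEYDIR_NEW);	/* crashed mid-rotation; the new set may
									 * be incomplete, the old one is valid */
	else if (!have_old && have_new)
		rename(KEYDIR_NEW, KEYDIR);	/* crashed after removing the old set,
									 * before the rename */
	/* otherwise nothing to repair */
}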

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Missing errcode() in ereport

2020-03-23 Thread Tom Lane
Andres Freund  writes:
> I wondered before whether there's a way we could move the elevel check
> in errstart to the macro. For it to be a win we'd presumably have to
> have a "synthesized" log_level variable, basically
> min(log_min_messages, client_min_messages, ERROR).
> Probably not worth it.

Yeah, I don't really agree with that idea.  The whole business of which
elevels will trigger output is fundamentally policy, not mechanism,
and you do not want policy decisions embedded into ABI so that there
are hundreds of copies of them.  It's a loss for code size and a worse
loss if you ever want to change the behavior.

> On 2020-03-23 17:24:49 -0400, Tom Lane wrote:
>> One thing I'm not totally sure about is whether we can rely on
>> this change:
>> 
>> -extern void errfinish(int dummy,...);
>> +extern void errfinish(void);
>> 
>> being fully ABI-transparent.  One would think that as long as errfinish
>> doesn't inspect its argument(s) it doesn't matter whether any are passed,
>> but maybe somewhere there's an architecture where the possible presence
>> of varargs arguments makes for a significant difference in the calling
>> convention?  We could leave that change out of the v12 patch if we're
>> worried about it.

> I would vote for leaving that out of the v12 patch. I'm less worried
> about odd architectures, and more about sanitizers and/or compilers
> emitting "security enhanced" code checking signatures match etc
> ("control flow integrity").

Yeah, good point.  Let's skinny the v12 patch down to just macro
and docs changes, then, not touching elog.c at all.

I made a commitfest entry for this just to see what the cfbot
thinks of it, but if there are not objections we should go ahead
and push in a day or so, IMO.

regards, tom lane

BTW, the CF app is showing me as the first author, even though
I definitely put you in first.  Guess it doesn't care about
author ordering ... is that a bug to be fixed?




Re: ALTER INDEX fails on partitioned index

2020-03-23 Thread Justin Pryzby
On Thu, Feb 27, 2020 at 09:11:14PM -0300, Alvaro Herrera wrote:
> On 2020-Feb-27, Justin Pryzby wrote:
> > The attached allows CREATE/ALTER to specify reloptions on a partitioned 
> > table
> > which are used as defaults for future children.
> > 
> > I think that's a desirable behavior, same as for tablespaces.  Michael
> > mentioned that ALTER INDEX ONLY doesn't exist, but that's only an issue if
> > ALTER acts recursively, which isn't the case here.
> 
> I think ALTER not acting recursively is a bug that we would do well not
> to propagate any further.  Enough effort we have to spend trying to fix
> that already.  Let's add ALTER .. ONLY if needed.

I was modeling after the behavior for tablespaces, and didn't realize that
non-recursive ALTER was discouraged.

On Thu, Feb 27, 2020 at 05:25:13PM -0600, Justin Pryzby wrote:
> The first patch makes a prettier message, per Robert's suggestion.

Is there any interest in this one ?

> From e5bb363f514d768a4f540d9c82ad5745944b1486 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Mon, 30 Dec 2019 09:31:03 -0600
> Subject: [PATCH v2 1/2] More specific error message when failing to alter a
>  partitioned index..
> 
> "..is not an index" is deemed to be unfriendly
> 
> https://www.postgresql.org/message-id/CA%2BTgmobq8_-DS7qDEmMi-4ARP1_0bkgFEjYfiK97L2eXq%2BQ%2Bnw%40mail.gmail.com
> ---
>  src/backend/commands/tablecmds.c | 23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/src/backend/commands/tablecmds.c 
> b/src/backend/commands/tablecmds.c
> index b7c8d66..1b271af 100644
> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -366,7 +366,7 @@ static void ATRewriteTables(AlterTableStmt *parsetree,
>  static void ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE 
> lockmode);
>  static AlteredTableInfo *ATGetQueueEntry(List **wqueue, Relation rel);
>  static void ATSimplePermissions(Relation rel, int allowed_targets);
> -static void ATWrongRelkindError(Relation rel, int allowed_targets);
> +static void ATWrongRelkindError(Relation rel, int allowed_targets, int actual_target);
>  static void ATSimpleRecursion(List **wqueue, Relation rel,
> 							  AlterTableCmd *cmd, bool recurse, LOCKMODE lockmode,
> 							  AlterTableUtilityContext *context);
> @@ -5458,7 +5458,7 @@ ATSimplePermissions(Relation rel, int allowed_targets)
>  
>   /* Wrong target type? */
>   if ((actual_target & allowed_targets) == 0)
> - ATWrongRelkindError(rel, allowed_targets);
> + ATWrongRelkindError(rel, allowed_targets, actual_target);
>  
>   /* Permissions checks */
>   if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
> @@ -5479,7 +5479,7 @@ ATSimplePermissions(Relation rel, int allowed_targets)
>   * type.
>   */
>  static void
> -ATWrongRelkindError(Relation rel, int allowed_targets)
> +ATWrongRelkindError(Relation rel, int allowed_targets, int actual_target)
>  {
>   char   *msg;
>  
> @@ -5527,9 +5527,20 @@ ATWrongRelkindError(Relation rel, int allowed_targets)
>   break;
>   }
>  
> -	ereport(ERROR,
> -			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
> -			 errmsg(msg, RelationGetRelationName(rel))));
> +	if (actual_target == ATT_PARTITIONED_INDEX &&
> +		(allowed_targets & ATT_INDEX) &&
> +		!(allowed_targets & ATT_PARTITIONED_INDEX))
> +		/* Add a special errhint for this case, since "is not an index" message is unfriendly */
> +		ereport(ERROR,
> +				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +				 errmsg(msg, RelationGetRelationName(rel)),
> +				 // errhint("\"%s\" is a partitioned index", RelationGetRelationName(rel))),
> +				 errhint("operation is not supported on partitioned indexes")));
> +	else
> +		ereport(ERROR,
> +				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +				 errmsg(msg, RelationGetRelationName(rel))));
> +
>  }
>  
>  /*
> -- 
> 2.7.4
> 

-- 
Justin




Re: Missing errcode() in ereport

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-23 17:24:49 -0400, Tom Lane wrote:
> I wrote:
> > On balance I'm leaning towards keeping the parens as preferred style
> > for now, adjusting v12 so that the macro will allow paren omission
> > but we don't break ABI, and not touching the older branches.
> 
> Hearing no objections, I started to review Andres' patchset with
> that plan in mind.  I noted two things that I don't agree with:
> 
> 1. I think we should write the ereport macro as
> 
> if (errstart(...)) \
> __VA_ARGS__, errfinish(...); \
> 
> as I had it, not
> 
> if (errstart(...)) \
> { \
> __VA_ARGS__; \
> errfinish(...); \
> } \
> 
> as per Andres.  The reason is that I don't have as much faith in the
> latter construct producing warnings for no-op expressions.

Hm. I don't think it'll be better, but I don't have a problem going with
yours either.


> 2. We cannot postpone the passing of the "domain" argument as Andres'
> 0003 patch proposes, because that has to be available to the auxiliary
> error functions so they do message translation properly.

Ah, good point.


I wondered before whether there's a way we could move the elevel check
in errstart to the macro. For it to be a win we'd presumably have to
have a "synthesized" log_level variable, basically
min(log_min_messages, client_min_messages, ERROR).

Probably not worth it.
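
For illustration, the synthesized variable and a macro-level check might
look roughly like this (invented names, not proposed for commit; the
check would then be inlined at every call site):

extern int	effective_min_elevel;	/* Min(Min(log_min_messages,
									 * client_min_messages), ERROR); kept in
									 * sync whenever either GUC changes */

#define ereport_sketch(elevel, ...) \
	do { \
		/* ERROR and above always pass, since the min is capped at ERROR */ \
		if ((elevel) >= effective_min_elevel && errstart(elevel, TEXTDOMAIN)) \
		{ \
			__VA_ARGS__; \
			errfinish(); \
		} \
	} while (0)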


> Maybe it'd be possible to finagle things to postpone translation to
> the very end, but the provisions for errcontext messages being
> translated in a different domain would make that pretty ticklish.
> Frankly I don't think it'd be worth the complication.  There is a
> clear benefit in delaying the passing of filename (since we can skip
> that strchr() call) but beyond that it seems pretty marginal.

Fair enough.


> Other than that it looks pretty good.  I wrote some documentation
> adjustments and re-split the patch into 0001, which is proposed for
> back-patch into v12, and 0002 which would have to be HEAD only.

Cool.


> One thing I'm not totally sure about is whether we can rely on
> this change:
> 
> -extern void errfinish(int dummy,...);
> +extern void errfinish(void);
> 
> being fully ABI-transparent.  One would think that as long as errfinish
> doesn't inspect its argument(s) it doesn't matter whether any are passed,
> but maybe somewhere there's an architecture where the possible presence
> of varargs arguments makes for a significant difference in the calling
> convention?  We could leave that change out of the v12 patch if we're
> worried about it.

I would vote for leaving that out of the v12 patch. I'm less worried
about odd architectures, and more about sanitizers and/or compilers
emitting "security enhanced" code checking signatures match etc
("control flow integrity").

Greetings,

Andres Freund




Re: Missing errcode() in ereport

2020-03-23 Thread Tom Lane
I wrote:
> On balance I'm leaning towards keeping the parens as preferred style
> for now, adjusting v12 so that the macro will allow paren omission
> but we don't break ABI, and not touching the older branches.

Hearing no objections, I started to review Andres' patchset with
that plan in mind.  I noted two things that I don't agree with:

1. I think we should write the ereport macro as

if (errstart(...)) \
__VA_ARGS__, errfinish(...); \

as I had it, not

if (errstart(...)) \
{ \
__VA_ARGS__; \
errfinish(...); \
} \

as per Andres.  The reason is that I don't have as much faith in the
latter construct producing warnings for no-op expressions.

2. We cannot postpone the passing of the "domain" argument as Andres'
0003 patch proposes, because that has to be available to the auxiliary
error functions so they do message translation properly.  Maybe it'd
be possible to finagle things to postpone translation to the very end,
but the provisions for errcontext messages being translated in a different
domain would make that pretty ticklish.  Frankly I don't think it'd be
worth the complication.  There is a clear benefit in delaying the passing
of filename (since we can skip that strchr() call) but beyond that it
seems pretty marginal.

Other than that it looks pretty good.  I wrote some documentation
adjustments and re-split the patch into 0001, which is proposed for
back-patch into v12, and 0002 which would have to be HEAD only.

One thing I'm not totally sure about is whether we can rely on
this change:

-extern void errfinish(int dummy,...);
+extern void errfinish(void);

being fully ABI-transparent.  One would think that as long as errfinish
doesn't inspect its argument(s) it doesn't matter whether any are passed,
but maybe somewhere there's an architecture where the possible presence
of varargs arguments makes for a significant difference in the calling
convention?  We could leave that change out of the v12 patch if we're
worried about it.

regards, tom lane

diff --git a/doc/src/sgml/sources.sgml b/doc/src/sgml/sources.sgml
index 32ca220..283c3e0 100644
--- a/doc/src/sgml/sources.sgml
+++ b/doc/src/sgml/sources.sgml
@@ -103,9 +103,9 @@ less -x4
 message text.  In addition there are optional elements, the most
 common of which is an error identifier code that follows the SQL spec's
 SQLSTATE conventions.
-ereport itself is just a shell function, that exists
+ereport itself is just a shell macro, that exists
 mainly for the syntactic convenience of making message generation
-look like a function call in the C source code.  The only parameter
+look like a single function call in the C source code.  The only parameter
 accepted directly by ereport is the severity level.
 The primary message text and any optional message elements are
 generated by calling auxiliary functions, such as errmsg,
@@ -116,36 +116,50 @@ less -x4
 A typical call to ereport might look like this:
 
 ereport(ERROR,
-(errcode(ERRCODE_DIVISION_BY_ZERO),
- errmsg("division by zero")));
+errcode(ERRCODE_DIVISION_BY_ZERO),
+errmsg("division by zero"));
 
 This specifies error severity level ERROR (a run-of-the-mill
 error).  The errcode call specifies the SQLSTATE error code
 using a macro defined in src/include/utils/errcodes.h.  The
-errmsg call provides the primary message text.  Notice the
-extra set of parentheses surrounding the auxiliary function calls &mdash;
-these are annoying but syntactically necessary.
+errmsg call provides the primary message text.
+   
+
+   
+You will also frequently see this older style, with an extra set of
+parentheses surrounding the auxiliary function calls:
+
+ereport(ERROR,
+(errcode(ERRCODE_DIVISION_BY_ZERO),
+ errmsg("division by zero")));
+
+The extra parentheses were required
+before PostgreSQL version 12, but are now
+optional.

 

 Here is a more complex example:
 
 ereport(ERROR,
-(errcode(ERRCODE_AMBIGUOUS_FUNCTION),
- errmsg("function %s is not unique",
-func_signature_string(funcname, nargs,
-  NIL, actual_arg_types)),
- errhint("Unable to choose a best candidate function. "
- "You might need to add explicit typecasts.")));
+errcode(ERRCODE_AMBIGUOUS_FUNCTION),
+errmsg("function %s is not unique",
+   func_signature_string(funcname, nargs,
+ NIL, actual_arg_types)),
+errhint("Unable to choose a best candidate function. "
+"You might need to add explicit typecasts."));
 
 This illustrates the use of format codes to embed run-time values into
 a message text.  Also, an optional hint message is provided.
+The auxiliary function calls can be written in any order, but
+

Re: Additional size of hash table is alway zero for hash aggregates

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-23 13:29:02 -0700, Jeff Davis wrote:
> On Sat, 2020-03-21 at 18:26 -0700, Andres Freund wrote:
> > I don't see how? That'd require making the hash bucket addressing
> > deal
> > with variable sizes, which'd be bad for performance reasons. Since
> > there
> > can be a aggstate->numtrans AggStatePerGroupDatas for each hash table
> > entry, I don't see how to avoid a variable size?
> 
> It would not vary for a given hash table. Do you mean the compile-time
> specialization (of simplehash.h) would not work any more?

Yes.


> If we aren't storing the "additional" inline in the hash entry, I don't
> see any purpose for the argument to BuildTupleHashTableExt(), nor the
> purpose of the "entrysize" field of TupleHashTableData.

Yea, that was my conclusion too. It looked like Andrew was going to
commit a fix, but that hasn't happened yet.

Greetings,

Andres Freund




Corruption during WAL replay

2020-03-23 Thread Teja Mupparti
This is my *first* attempt to submit a Postgres patch; please let me know if I
missed any part of the process or patch format (I used
https://wiki.postgresql.org/wiki/Working_with_Git as a reference).



The original bug-report email and the relevant discussion are here:



https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

https://www.postgresql.org/message-id/822113470.250068.1573246011818%40connect.xfinity.com

https://www.postgresql.org/message-id/20191206230640.2dvdjpcgn46q3ks2%40alap3.anarazel.de

https://www.postgresql.org/message-id/1880.1281020...@sss.pgh.pa.us



The crux of the fix: in the current code, the engine drops the buffers and then
truncates the file, but a crash after the buffer-drop and before the truncate
causes the corruption. The patch reverses the order, i.e. it truncates the file
first and drops the buffers later.
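
A rough sketch of the reordering (placeholder names, not the actual
smgr/bufmgr API):

static void
truncate_relation(Relation rel, BlockNumber nblocks)
{
	/*
	 * Old order: drop_relation_buffers() first, then truncate_file().  A
	 * crash after the buffer-drop but before the truncate leaves the
	 * to-be-truncated blocks on disk while their buffered state is gone.
	 */
	truncate_file(rel, nblocks);			/* 1. shrink the file on disk */
	drop_relation_buffers(rel, nblocks);	/* 2. only then discard buffers */
}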



Warm regards,

Teja



bug-fix-patch
Description: bug-fix-patch


Re: weird hash plan cost, starting with pg10

2020-03-23 Thread Thomas Munro
On Tue, Mar 24, 2020 at 6:01 AM Tom Lane  wrote:
> Alvaro Herrera  writes:
> > While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
> > current Postgres 10 emits weird bucket/batch/memory values for certain
> > hash nodes:
>
> >  ->  Hash  (cost=0.11..0.11 rows=10 width=12) 
> > (actual time=0.002..0.002 rows=1 loops=8)
> >Buckets: 2139062143  Batches: 2139062143  
> > Memory Usage: 8971876904722400kB
> >->  Function Scan on unnest init_1  
> > (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)
>
> Looks suspiciously like uninitialized memory ...

I think "hashtable" might have been pfree'd before
ExecHashGetInstrumentation() ran, because those numbers look like
CLOBBER_FREED_MEMORY's pattern:

>>> hex(2139062143)
'0x7f7f7f7f'
>>> hex(8971876904722400 / 1024)
'0x7f7f7f7f7f7'
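
(For anyone not recognizing the pattern: with CLOBBER_FREED_MEMORY
defined, the memory-context code wipes each freed chunk, roughly

memset(pointer, 0x7F, size);	/* see wipe_mem() in the memory-context code */

so use-after-free reads surface as repeated 0x7f bytes, as above.)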

Maybe there is something wrong with the shutdown order of nested subplans.




Re: Additional size of hash table is alway zero for hash aggregates

2020-03-23 Thread Jeff Davis
On Sat, 2020-03-21 at 18:26 -0700, Andres Freund wrote:
> I don't see how? That'd require making the hash bucket addressing
> deal
> with variable sizes, which'd be bad for performance reasons. Since
> there
> can be a aggstate->numtrans AggStatePerGroupDatas for each hash table
> entry, I don't see how to avoid a variable size?

It would not vary for a given hash table. Do you mean the compile-time
specialization (of simplehash.h) would not work any more?

If we aren't storing the "additional" inline in the hash entry, I don't
see any purpose for the argument to BuildTupleHashTableExt(), nor the
purpose of the "entrysize" field of TupleHashTableData.

> If we want to optimize memory usage, I think I'd first go for
> allocating
> the group's "firstTuple" together with all the AggStatePerGroupDatas.

That's a good idea.
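
To sketch it (variable names invented; real code would MAXALIGN the
tuple offset): a single allocation holding the transition states
followed by the group's tuple.

Size	states_len = aggstate->numtrans * sizeof(AggStatePerGroupData);
char   *chunk = palloc(states_len + tuple_len);
AggStatePerGroupData *pergroup = (AggStatePerGroupData *) chunk;
MinimalTuple firstTuple = (MinimalTuple) (chunk + states_len);

memcpy(firstTuple, srctuple, tuple_len);	/* one palloc instead of two */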

Regards,
Jeff Davis






Re: Make mesage at end-of-recovery less scary.

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-23 10:43:09 -0700, Ashwin Agrawal wrote:
> Plus, I am hoping the message will improve for pg_waldump as well, since it
> reads as confusing, and having to explain to every new developer that it's
> expected behavior is annoying.
> 
> pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length
> at 0/1553FA8: wanted 24, got 0

What would you like to see here? There's inherently a lot less
information about the context in waldump. We can't know whether it's to
be expected that the WAL ends at that point, or whether there was
corruption.

Greetings,

Andres Freund




Re: Make mesage at end-of-recovery less scary.

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote:
> On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> > | [20866] LOG:  replication terminated by primary server
> > | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> > | [20866] FATAL:  could not send end-of-streaming message to primary: no 
> > COPY in progress

IMO it's a bug that we see this FATAL. I seem to recall that we didn't
use to get that?


> > | [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive 
> > during standby mode
> > | [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0
> > 
> > I changed the above to the below, which looks more adequate.
> > 
> > | [24271]  LOG:  replication terminated by primary server on timeline 1 at 
> > 0/3000240.
> > | [24271]  FATAL:  could not send end-of-streaming message to primary: no 
> > COPY in progress
> > | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive 
> > during standby mode
> > | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0
> 
> Is this the before and after?  That doesn't seem like a substantial
> improvement to me.  You still get the "scary" message at the end.

It seems like a minor improvement - folding the DETAIL into the message
makes sense to me here. But it indeed doesn't really address the main
issue.

I think we don't want to elide the information about how the end of the
WAL was detected - there are some issues where I found that quite
helpful. But we could reformulate it to be clearer that it's informative
output, not a bug.  E.g. something roughly like

LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby 
mode
DETAIL: End detected due to invalid record length at 0/3000240: expected 24, 
got 0
(I first elided the position in the DETAIL, but it could differ from the
one in LOG)

I don't find that very satisfying, but I can't come up with something
that provides the current information, while being less scary than my
suggestion?

Greetings,

Andres Freund




Re: Assert() failures during RI checks

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-23 13:54:31 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-03-22 18:30:04 -0400, Tom Lane wrote:
> >> Yeah, I was wondering about giving that a new result code, too.
> >> It would be a little bit invasive and not at all back-patchable,
> >> but (say) TM_SerializationViolation seems like a cleaner output ---
> >> and we could define it to not return any TID info, as TM_Delete
> >> doesn't.  But we also need a fix that *can* be back-patched.
> 
> > I wonder if returning TM_Invisible in the backbranches would be the
> > right answer. "The affected tuple wasn't visible to the relevant
> > snapshot" is not the worst description for a crosscheck snapshot
> > violation.
> 
> I'm not for that.  Abusing an existing result code is what got us
> into this mess in the first place; abusing a different result code
> than what we did in prior minor releases can only make things worse
> not better.

Well, I think TM_Invisible is much less error prone - callers wouldn't
try to chase ctid chains or such. There's really not much they can do
except to error out. It also seems semantically be a better match than
TM_Updated.


> Of course, if there's no external callers of these
> functions then we could get away with changing it ... but I'm not
> prepared to assume that, are you?)

I'm certain we can't assume that.


> I think the right thing in the back branches is to keep on returning
> the "Updated" result code, but adjust the crosscheck code path so that
> the TID that's sent back is always the tuple's own TID.

If we do that, and set xmax to InvalidTransactionId, I think we'd likely
prevent any "wrong" chaining from outside heapam, since that already
needed to check xmax to be correct.
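
Putting those two points together, the back-branch fix might look roughly
like this (crosscheck_violated is a stand-in for the real condition in
heap_delete()/heap_update()):

if (crosscheck_violated)
{
	result = TM_Updated;				/* keep the historical result code */
	tmfd->ctid = tp.t_self;				/* tuple's own TID: no chain to chase */
	tmfd->xmax = InvalidTransactionId;	/* no updater, so callers that check
										 * xmax won't follow a ctid chain */
	tmfd->cmax = InvalidCommandId;
}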


> >> The stanza right after the crosscheck one has always assumed that
> >> it is dealing with an outdated-by-update tuple, so that chasing up
> >> to the next TID is a sane thing to do.> 
> > Do you just mean the
> > tmfd->ctid = tp.t_data->t_ctid;
> > and
> > tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
> > ? I would not have described that as chasing, which would explain my
> > difficulty following.
> 
> The bit about returning t_ctid rather than the tuple's own TID seems
> like chasing a next-tid link to me.  The Assert about xmax being valid
> is certainly intended to verify that that's what is happening.  I can't
> speak to what the code thinks it's returning in tmfd->xmax, because
> that functionality wasn't there the last time I studied this.

We could probably get rid of tmfd->xmax. It kind of was introduced as
part of

commit 6868ed7491b7ea7f0af6133bb66566a2f5fe5a75
Author: Kevin Grittner 
Date:   2012-10-26 14:55:36 -0500

Throw error if expiring tuple is again updated or deleted.

It did previously exist as an explicit heap_delete/heap_update
parameter, however - and has for a long time:

commit f57e3f4cf36f3fdd89cae8d566479ad747809b2f
Author: Tom Lane 
Date:   2005-08-20 00:40:32 +

Repair problems with VACUUM destroying t_ctid chains too soon, and with
insufficient paranoia in code that follows t_ctid links.  (We must do both


Since Kevin's commit, the relevant logic has been pushed down into
the AM layer as part of tableam. It is still used in one place
internally inside heap AM (basically where the check from Kevin's
commit has been moved), but that looks trivial to solve differently.


> > I am wondering if the correct HEAD answer would be to somehow lift the
> > crosscheck logic out of heapam.c. ISTM that heapam.c is kind of too low
> > level for that type of concern. But it seems hard to do that without
> > performing unnecessary updates that immediately are thrown away by
> > throwing an error at a higher level.
> 
> Yeah, I'd be the first to agree that the crosscheck stuff is messy.
> But it's hard to see how to do better.

Obviously not a short-term solution at all: I wonder if some of this
stems from implementing RI triggers partially in SQL, even though we
require semantics that aren't efficiently doable in SQL, with the
consequence of having RI-specific code in various parts of the executor
and AMs, and of needing a lot more memory etc.

Greetings,

Andres Freund




Re: Collation versions on Windows (help wanted, apply within)

2020-03-23 Thread Juan José Santamaría Flecha
On Mon, Mar 23, 2020 at 5:59 AM Thomas Munro  wrote:

>
> Done in this new 0002 patch (untested).  0001 removes the comment that
> individual collations can't have a NULL version, reports NULL for
> Linux/glibc collations like C.UTF-8 by stripping the suffix and
> comparing with C and POSIX as suggested by Peter E.
>

It applies and passes tests without a problem on Windows, and works as
expected.

Regards


Re: Making psql error out on output failures

2020-03-23 Thread Daniel Verite
Peter Eisentraut wrote:

> > If there's no more review to do, would you consider moving it to
> > Ready for Committer?
> 
> committed

Thanks!

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: RFC: Add 'taint' field to pg_control.

2020-03-23 Thread Justin Pryzby
On Wed, Feb 28, 2018 at 04:16:53PM -0600, Justin Pryzby wrote:
> On Wed, Feb 28, 2018 at 01:43:11PM -0800, Andres Freund wrote:
> > a significant number of times during investigations of bugs I wondered
> > whether running the cluster with various settings, or various tools
> > could've caused the issue at hand.  Therefore I'd like to propose adding
> > a 'tainted' field to pg_control, that contains some of the "history" of
> > the cluster. Individual bits inside that field that I can think of right
> > now are:
> > - pg_resetxlog was used non-passively on cluster
> > - ran with fsync=off
> > - ran with full_page_writes=off
> > - pg_upgrade was used

+ cluster was downgraded (between minor releases)
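
For concreteness, the history bits above could be encoded along these
lines (hypothetical names, were the field added as a bitmask):

#define PG_TAINT_RESETXLOG			(1 << 0)	/* pg_resetxlog used non-passively */
#define PG_TAINT_FSYNC_OFF			(1 << 1)	/* ran with fsync=off */
#define PG_TAINT_FPW_OFF			(1 << 2)	/* ran with full_page_writes=off */
#define PG_TAINT_PG_UPGRADE			(1 << 3)	/* pg_upgrade was used */
#define PG_TAINT_MINOR_DOWNGRADE	(1 << 4)	/* downgraded between minor releases */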

Dredging up a years-old thread since I thought that was worth adding..

-- 
Justin




Re: Assert() failures during RI checks

2020-03-23 Thread Tom Lane
Andres Freund  writes:
> On 2020-03-22 18:30:04 -0400, Tom Lane wrote:
>> Yeah, I was wondering about giving that a new result code, too.
>> It would be a little bit invasive and not at all back-patchable,
>> but (say) TM_SerializationViolation seems like a cleaner output ---
>> and we could define it to not return any TID info, as TM_Delete
>> doesn't.  But we also need a fix that *can* be back-patched.

> I wonder if returning TM_Invisible in the backbranches would be the
> right answer. "The affected tuple wasn't visible to the relevant
> snapshot" is not the worst description for a crosscheck snapshot
> violation.

I'm not for that.  Abusing an existing result code is what got us
into this mess in the first place; abusing a different result code
than what we did in prior minor releases can only make things worse
not better.  (Of course, if there's no external callers of these
functions then we could get away with changing it ... but I'm not
prepared to assume that, are you?)

Also it'd mean something quite different in the direct output of
HeapTupleSatisfiesUpdate than what it would mean one level up,
which is certain to cause confusion.

I think the right thing in the back branches is to keep on returning
the "Updated" result code, but adjust the crosscheck code path so that
the TID that's sent back is always the tuple's own TID.

>> The stanza right after the crosscheck one has always assumed that
>> it is dealing with an outdated-by-update tuple, so that chasing up
>> to the next TID is a sane thing to do.

> Do you just mean the
>   tmfd->ctid = tp.t_data->t_ctid;
> and
>   tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
> ? I would not have described that as chasing, which would explain my
> difficulty following.

The bit about returning t_ctid rather than the tuple's own TID seems
like chasing a next-tid link to me.  The Assert about xmax being valid
is certainly intended to verify that that's what is happening.  I can't
speak to what the code thinks it's returning in tmfd->xmax, because
that functionality wasn't there the last time I studied this.

> I am wondering if the correct HEAD answer would be to somehow lift the
> crosscheck logic out of heapam.c. ISTM that heapam.c is kind of too low
> level for that type of concern. But it seems hard to do that without
> performing unnecessary updates that immediately are thrown away by
> throwing an error at a higher level.

Yeah, I'd be the first to agree that the crosscheck stuff is messy.
But it's hard to see how to do better.

regards, tom lane




Re: Make mesage at end-of-recovery less scary.

2020-03-23 Thread Ashwin Agrawal
On Mon, Mar 23, 2020 at 2:37 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> > | [20866] LOG:  replication terminated by primary server
> > | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> > | [20866] FATAL:  could not send end-of-streaming message to primary: no
> COPY in progress
> > | [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive
> during standby mode
> > | [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0
> >
> > I changed the above to the below, which looks more adequate.
> >
> > | [24271]  LOG:  replication terminated by primary server on timeline 1
> at 0/3000240.
> > | [24271]  FATAL:  could not send end-of-streaming message to primary:
> no COPY in progress
> > | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in
> archive during standby mode
> > | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0
>
> Is this the before and after?  That doesn't seem like a substantial
> improvement to me.  You still get the "scary" message at the end.
>

+1. I agree it still reads scary and doesn't seem like an improvement.

Plus, I am hoping the message will improve for pg_waldump as well, since it
reads as confusing, and having to explain to every new developer that it's
expected behavior is annoying.

pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length
at 0/1553FA8: wanted 24, got 0


Re: SQL/JSON: functions

2020-03-23 Thread Nikita Glukhov

Attached 47th version of the patches.

On 21.03.2020 22:38, Pavel Stehule wrote:


On 21. 3. 2020 v 11:07 Nikita Glukhov wrote:


Attached 46th version of the patches.

On 20.03.2020 22:34, Pavel Stehule wrote:


On 19.03.2020 23:57 Nikita Glukhov wrote:

Attached 45th version of the patches.

Nodes JsonFormat, JsonReturning, JsonPassing, JsonBehavior were fixed.

On 17.03.2020 21:35, Pavel Stehule wrote:


User functions json[b]_build_object_ext() and json[b]_build_array_ext()
also can be easily removed.  But it seems harder to remove new aggregate
functions json[b]_objectagg() and json[b]_agg_strict(), because they
can't be called directly from JsonCtorExpr node.


I don't see reasons for another reduction now. It would be great
if you can finalize the work you planned for pg13.


I have removed json[b]_build_object_ext() and json[b]_build_array_ext().

But json[b]_objectagg() and json[b]_agg_strict() are still present.
It seems that removing them requires major refactoring of the execution
of Aggref and WindowFunc nodes.


I have replaced aggregate function

json[b]_objectagg(key any, val any, absent_on_null boolean, unique_keys boolean)

with three separate functions:

json[b]_object_agg_strict(any, any)
json[b]_object_agg_unique(any, any)
json[b]_object_agg_unique_strict(any, any)


This should be more correct than a single aggregate with additional parameters.


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-Add-common-SQL_JSON-clauses-v47.patch.gz
Description: application/gzip


0002-SQLJSON-constructors-v47.patch.gz
Description: application/gzip


0003-IS-JSON-predicate-v47.patch.gz
Description: application/gzip


0004-SQLJSON-query-functions-v47.patch.gz
Description: application/gzip


Re: Assert() failures during RI checks

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-22 18:30:04 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I wonder if we shouldn't just change the crosscheck case to set
> > something other than TM_Updated, as it's not really accurate to say the
> > tuple was updated.
>
> Yeah, I was wondering about giving that a new result code, too.
> It would be a little bit invasive and not at all back-patchable,
> but (say) TM_SerializationViolation seems like a cleaner output ---
> and we could define it to not return any TID info, as TM_Delete
> doesn't.  But we also need a fix that *can* be back-patched.

I wonder if returning TM_Invisible in the backbranches would be the
right answer. "The affected tuple wasn't visible to the relevant
snapshot" is not the worst description for a crosscheck snapshot
violation.

We would have to start issuing a better error message for it, as it's
unexpected in most places right now. It'd be kind of odd for
heap_delete/update() to error out with "attempted to delete invisible
tuple" but still return it in some cases, though.


> >> 2. Does the following stanza actually need to allow for the case
> >> of a deleted tuple as well as an updated one?  If so, blindly
> >> trying to follow the ctid link is not so cool.
>
> > Which ctid link following are you referring to? I'm not sure I quite
> > follow.
>
> The stanza right after the crosscheck one has always assumed that
> it is dealing with an outdated-by-update tuple, so that chasing up
> to the next TID is a sane thing to do.

Do you just mean the
tmfd->ctid = tp.t_data->t_ctid;
and
tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
? I would not have described that as chasing, which would explain my
difficulty following.


> Maybe that's been wrong
> since day one, or maybe we made it wrong somewhere along the line,
> but it seems clearly wrong now (even without accounting for the
> serialization crosscheck case).  I think it just accidentally
> doesn't cause problems in non-assert builds, as long as nobody
> is assuming too much about which TID or XID gets returned.

It'd probably be worthwhile to rejigger those branches to something
roughly like:

if (result != TM_Ok)
{
	switch (result)
	{
		case TM_SelfModified:
			Assert(!ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid));
			tmfd->ctid = tp.t_data->t_ctid;
			tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
			tmfd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
			break;
		case TM_Updated:
			Assert(!ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid));
			tmfd->ctid = tp.t_data->t_ctid;
			tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
			tmfd->cmax = InvalidCommandId;
			break;
		case TM_Deleted:
			tmfd->ctid = tp.t_self;
			tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
			tmfd->cmax = InvalidCommandId;
			break;
		case TM_BeingModified:
			Assert(!ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid));
			tmfd->ctid = tp.t_data->t_ctid;
			tmfd->xmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
			tmfd->cmax = InvalidCommandId;
			break;
		default:
			elog(ERROR, "unexpected visibility %d", result);
	}

	UnlockReleaseBuffer(buffer);
	if (have_tuple_lock)
		UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
	if (vmbuffer != InvalidBuffer)
		ReleaseBuffer(vmbuffer);
	return result;
}


> > But for the crosscheck case (most of?)
> > those wouldn't ever reach the ctid equivalency check, because of
> > +if (IsolationUsesXactSnapshot())
> > +ereport(ERROR,
> > +
> > (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
> > + errmsg("could not serialize access due to 
> > concurrent update")));
> > like checks beforehand.
>
> Yeah.  For HEAD I'm imagining that callers would just throw
> ERRCODE_T_R_SERIALIZATION_FAILURE unconditionally if they
> get back TM_SerializationViolation.

I am wondering if the correct HEAD answer would be to somehow lift the
crosscheck logic out of heapam.c. ISTM that heapam.c is kind of too low
level for that type of concern. But it seems hard to do that without
performing unnecessary updates that 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-23 Thread James Coleman
On Mon, Mar 23, 2020 at 1:05 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > ... all plan types that use only one child use the outer one.  They
> > could use either, as long as it does that consistently, I think.
>
> Yeah, exactly.  The outer/inner terminology is really only sensible
> for join nodes, but there isn't a third child-plan pointer reserved
> for single-child node types, so you gotta use one of those.  And
> conventionally we use the "outer" one.
>
> regards, tom lane

Great, thanks for the explanation Alvaro and Tom; I'll fix that up in
my next patch series.

I idly wonder if a macro childPlanState() defined exactly the same as
outerPlanState() might _kinda_ make sense here, but I'm also content
to follow convention.
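
For the record, the alias would be pure sugar over the existing macro in
src/include/nodes/execnodes.h:

#define childPlanState(node)  outerPlanState(node)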

I might submit a small patch to the comment on those macros though to
expand on the explanation.

James




RE: ASLR support for Postgres12

2020-03-23 Thread Joel Mariadasan (jomariad)
Thanks, Tom, for the quick update.

Can you please point me to a link or give the list of OSes where ASLR is 
officially supported by Postgres?

Regards,
Joel

-Original Message-
From: Tom Lane  
Sent: Monday, March 23, 2020 8:16 PM
To: Joel Mariadasan (jomariad) 
Cc: pgsql-hack...@postgresql.org
Subject: Re: ASLR support for Postgres12

"Joel Mariadasan (jomariad)"  writes:
> We would like to know if there is a roadmap to enable ASLR support for
> Postgres.

Not on Windows --- since that OS doesn't support fork(), it's too difficult to 
get different child processes to map shared memory at the same address if ASLR 
is active.

If that feature is important to you, use a different operating system.

regards, tom lane




Re: weird hash plan cost, starting with pg10

2020-03-23 Thread Tom Lane
Alvaro Herrera  writes:
> While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
> current Postgres 10 emits weird bucket/batch/memory values for certain
> hash nodes:

>  ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual 
> time=0.002..0.002 rows=1 loops=8)
>Buckets: 2139062143  Batches: 2139062143  
> Memory Usage: 8971876904722400kB
>->  Function Scan on unnest init_1  
> (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)

Looks suspiciously like uninitialized memory ...

> The complete query is:

Reproduces here, though oddly only a couple of the several hash subplans
are doing that.

I'm not planning to dig into it right this second either.

regards, tom lane




Re: Unqualified pg_catalog casts in pg_dump

2020-03-23 Thread Tom Lane
Daniel Gustafsson  writes:
> When looking at something different, I happened to notice that pg_dump is a 
> bit
> inconsistent in how it qualifies casts to pg_catalog entities like regclass 
> and
> oid.  Most casts are qualified, but not all.  Even though it functionally is
> the same, being consistent is a good thing IMO and I can't see a reason not 
> to,
> so the attached patch adds qualifications (the unqualified regclass cast in 
> the
> TAP test left on purpose).

While this used to be important before we made pg_dump force a minimal
search_path, I'm not sure that there's any point in being picky about
it anymore.  (psql's describe.c is a different story though.)

regards, tom lane




weird hash plan cost, starting with pg10

2020-03-23 Thread Alvaro Herrera
Hello

While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
current Postgres 10 emits weird bucket/batch/memory values for certain
hash nodes:

 ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual 
time=0.002..0.002 rows=1 loops=8)
   Buckets: 2139062143  Batches: 2139062143  Memory 
Usage: 8971876904722400kB
   ->  Function Scan on unnest init_1  
(cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)

It shows normal values in 9.6.

The complete query is:

SELECT c.tableoid, c.oid, c.relname, (SELECT pg_catalog.array_agg(acl ORDER BY 
row_n) FROM (SELECT acl, row_n FROM 
pg_catalog.unnest(coalesce(c.relacl,pg_catalog.acldefault(CASE WHEN c.relkind = 
'S' THEN 's' ELSE 'r' END::"char",c.relowner))) WITH ORDINALITY AS 
perm(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM 
pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault(CASE WHEN 
c.relkind = 'S' THEN 's' ELSE 'r' END::"char",c.relowner))) AS init(init_acl) 
WHERE acl = init_acl)) as foo) AS relacl, (SELECT pg_catalog.array_agg(acl 
ORDER BY row_n) FROM (SELECT acl, row_n FROM 
pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault(CASE WHEN 
c.relkind = 'S' THEN 's' ELSE 'r' END::"char",c.relowner))) WITH ORDINALITY AS 
initp(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM 
pg_catalog.unnest(coalesce(c.relacl,pg_catalog.acldefault(CASE WHEN c.relkind = 
'S' THEN 's' ELSE 'r' END::"char",c.relowner))) AS permp(orig_acl) WHERE acl = 
orig_acl)) as foo) as rrelacl, NULL AS initrelacl, NULL as initrrelacl, 
c.relkind, c.relnamespace, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = 
c.relowner) AS rolname, c.relchecks, c.relhastriggers, c.relhasindex, 
c.relhasrules, 'f'::bool AS relhasoids, c.relrowsecurity, 
c.relforcerowsecurity, c.relfrozenxid, c.relminmxid, tc.oid AS toid, 
tc.relfrozenxid AS tfrozenxid, tc.relminmxid AS tminmxid, c.relpersistence, 
c.relispopulated, c.relreplident, c.relpages, am.amname, CASE WHEN c.reloftype 
<> 0 THEN c.reloftype::pg_catalog.regtype ELSE NULL END AS reloftype, 
d.refobjid AS owning_tab, d.refobjsubid AS owning_col, (SELECT spcname FROM 
pg_tablespace t WHERE t.oid = c.reltablespace) AS reltablespace, 
array_remove(array_remove(c.reloptions,'check_option=local'),'check_option=cascaded')
 AS reloptions, CASE WHEN 'check_option=local' = ANY (c.reloptions) THEN 
'LOCAL'::text WHEN 'check_option=cascaded' = ANY (c.reloptions) THEN 
'CASCADED'::text ELSE NULL END AS checkoption, tc.reloptions AS 
toast_reloptions, c.relkind = 'S' AND EXISTS (SELECT 1 FROM pg_depend WHERE 
classid = 'pg_class'::regclass AND objid = c.oid AND objsubid = 0 AND 
refclassid = 'pg_class'::regclass AND deptype = 'i') AS is_identity_sequence, 
EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON (c.oid = 
pip.objoid AND pip.classoid = 'pg_class'::regclass AND pip.objsubid = 
at.attnum)WHERE at.attrelid = c.oid AND ((SELECT pg_catalog.array_agg(acl ORDER 
BY row_n) FROM (SELECT acl, row_n FROM 
pg_catalog.unnest(coalesce(at.attacl,pg_catalog.acldefault('c',c.relowner))) 
WITH ORDINALITY AS perm(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM 
pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('c',c.relowner)))
 AS init(init_acl) WHERE acl = init_acl)) as foo) IS NOT NULL OR (SELECT 
pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM 
pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('c',c.relowner)))
 WITH ORDINALITY AS initp(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM 
pg_catalog.unnest(coalesce(at.attacl,pg_catalog.acldefault('c',c.relowner))) AS 
permp(orig_acl) WHERE acl = orig_acl)) as foo) IS NOT NULL OR NULL IS NOT NULL 
OR NULL IS NOT NULL))AS changed_acl, pg_get_partkeydef(c.oid) AS partkeydef, 
c.relispartition AS ispartition, pg_get_expr(c.relpartbound, c.oid) AS 
partbound FROM pg_class c LEFT JOIN pg_depend d ON (c.relkind = 'S' AND 
d.classid = c.tableoid AND d.objid = c.oid AND d.objsubid = 0 AND d.refclassid 
= c.tableoid AND d.deptype IN ('a', 'i')) LEFT JOIN pg_class tc ON 
(c.reltoastrelid = tc.oid AND c.relkind <> 'p') LEFT JOIN pg_am am ON (c.relam 
= am.oid) LEFT JOIN pg_init_privs pip ON (c.oid = pip.objoid AND pip.classoid = 
'pg_class'::regclass AND pip.objsubid = 0) WHERE c.relkind in ('r', 'S', 'v', 
'c', 'm', 'f', 'p') ORDER BY c.oid

I'm not looking into this right now.  If somebody is bored in
quarantine, they might have a good time bisecting this.

-- 
Álvaro Herrera




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-23 Thread Alvaro Herrera
On 2020-Mar-22, James Coleman wrote:

> One question I have while I work on that: I've noticed some confusion
> in the patch as to whether we should refer to the node below the
> incremental sort node in the plan tree (i.e., the node we get tuples
> from) as the inner node or the outer node. Intuitively I'd expect to
> call it the inner node, but the original patch referred to it
> frequently as the outer node. The outerPlanState/innerPlanState macro
> comments don't offer a lot of clarification though they're "to avoid
> confusion" about right/left inner/outer. I suppose if the
> outerPlanState macro is working here the correct term should be outer?

I think the inner/outer distinction comes from join nodes wanting to
distinguish which child drives the scan of the other.  If there's a
single child, there's no need to make such a distinction: it's just "the
child".  And if it's the only child, conventionally we use the first
one, which conventionally is (for us westerners) the one on the left.
This view is supported by the fact that outerPlanState() appears 113
times in the code whereas innerPlanState() appears only 27 times --
that is, all plan types that use only one child use the outer one.  They
could use either, as long as it does that consistently, I think.

Therefore the term should be "outer".  It's not "outer" to the parent
incremental sort; it's just the "outer" of its two possible children.

I think.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unicode normalization SQL functions

2020-03-23 Thread Daniel Verite
Peter Eisentraut wrote:

> What is that status of this patch set?  I think we have nailed down the 
> behavior, but there were some concerns about certain performance 
> characteristics.  Do people feel that those are required to be addressed 
> in this cycle?

Not finding any other issue with v3 or objections in the thread,
I've set the status to Ready For Committer in the CF.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: SQL/JSON: JSON_TABLE

2020-03-23 Thread Pavel Stehule
Hi

This patch needs rebase

Regards

Pavel


Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-23 Thread Alexander Korotkov
On Mon, Mar 23, 2020 at 6:16 PM Alvaro Herrera  wrote:
> On 2020-Mar-23, Alexander Korotkov wrote:
> > On Fri, Mar 13, 2020 at 3:18 AM Michael Paquier  wrote:
> > > No issues with that either.  Are you fine with the updated version
> > > attached for 0001?
> >
> > I think patchset is almost ready for commit.  Michael is waiting for
> > your answer on whether you're fine with current shape of 0001.  Could
> > you please provide a feedback?
>
> Hi, I didn't realize that this was waiting on me.  It looks good to me
> -- I'd just remove the comment on the function prototype in archiver.h,
> which is not customary (we only put these comments in the .c file).
> Let's get it pushed.
>
> Thanks for everybody's work on this.

Great, thank you!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




some AppVeyor files

2020-03-23 Thread Peter Eisentraut
Many people enjoy the Windows testing the cfbot runs on the AppVeyor 
service.


You can also run this yourself without the detour through the commit 
fest app.  Attached are three patches that add .appveyor.yml files, for 
MSVC, MinGW, and Cygwin respectively.  (An open problem is to combine 
them all into one.)  I have been using these regularly over the last few 
months to test code on these Windows variants.


To use them, first you need to set up your AppVeyor account and link it 
to a github (or gitlab or ...) repository.  Then git am the patch on top 
of some branch, push to github (or ...) and watch it build.


This is just for individual enjoyment; I don't mean to commit them at 
this time.


(Some of this has been cribbed from the cfbot work and many other places 
all over the web.)


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 732e7ac33eb1bf5a3d42d20307747089d6f54fa5 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 10 Mar 2020 19:22:22 +0100
Subject: [PATCH] AppVeyor configuration for Cygwin

---
 .appveyor.yml | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 .appveyor.yml

diff --git a/.appveyor.yml b/.appveyor.yml
new file mode 100644
index 0000000000..1d271dbaaa
--- /dev/null
+++ b/.appveyor.yml
@@ -0,0 +1,32 @@
+environment:
+  global:
+CYG_MIRROR: http://cygwin.mirror.rafal.ca/
+CYG_CACHE: C:/cygwin/var/cache/setup
+  matrix:
+- CYG_ARCH: x86
+  CYG_ROOT: C:/cygwin
+- CYG_ARCH: x86_64
+  CYG_ROOT: C:/cygwin64
+
+install:
+  - set PATH=%CYG_ROOT%/bin;%PATH%
+  - 'appveyor DownloadFile http://cygwin.com/setup-%CYG_ARCH%.exe -FileName setup.exe'
+  - 'setup.exe -qnNdO -R "%CYG_ROOT%" -s "%CYG_MIRROR%" -l "%CYG_CACHE%" -P flex -P libreadline-devel -P libssl-devel -P libxml2-devel -P libxslt-devel -P openldap-devel -P zlib-devel'
+  - sh -l -c "cygserver-config --yes"
+  - sh -l -c "cygrunsrv -S cygserver"
+
+build_script:
+  - set HOME=.
+  - set PATH=%CYG_ROOT%/bin;%PATH%
+  - sh -l -c "./configure --enable-cassert --enable-debug --enable-nls --with-ldap --with-libxml --with-libxslt --with-openssl"
+  - sh -l -c "make world COPT=-Werror"
+
+test_script:
+  - set HOME=.
+  - set PATH=%CYG_ROOT%/bin;%PATH%
+  - sh -l -c "make check-world COPT=-Werror"
+
+on_failure:
+  - set HOME=.
+  - sh -l -c "test -f config.status || cat config.log"
+  - sh -l -c "find . -name regression.diffs | xargs cat"
-- 
2.25.0

From a04c2140482645a810bfd682b204bac5194eea8d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 10 Mar 2020 19:23:46 +0100
Subject: [PATCH] AppVeyor configuration for MinGW

---
 .appveyor.yml | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 .appveyor.yml

diff --git a/.appveyor.yml b/.appveyor.yml
new file mode 100644
index 0000000000..fffa351e90
--- /dev/null
+++ b/.appveyor.yml
@@ -0,0 +1,31 @@
+environment:
+  matrix:
+- MSYSTEM: MINGW32
+- MSYSTEM: MINGW64
+
+# https://www.postgresql.org/message-id/flat/f1caef93-9640-022e-9211-bbe8755a56b0%402ndQuadrant.com
+matrix:
+  allow_failures:
+- MSYSTEM: MINGW64
+
+install:
+  - set PATH=C:/msys64/usr/bin;%PATH%
+  - sh -l -c "pacman --noconfirm -S --needed ${MINGW_PACKAGE_PREFIX}-gettext ${MINGW_PACKAGE_PREFIX}-icu ${MINGW_PACKAGE_PREFIX}-libxml2 ${MINGW_PACKAGE_PREFIX}-libxslt ${MINGW_PACKAGE_PREFIX}-openssl"
+  - sh -l -c "curl -L http://cpanmin.us | perl - --self-upgrade"
+  - sh -l -c "cpanm --notest IPC::Run"
+
+build_script:
+  - set HOME=.
+  - set PATH=C:/msys64/usr/bin;%PATH%
+  - sh -l -c "./configure --enable-cassert --enable-debug --enable-nls --enable-tap-tests --with-icu --with-ldap --with-libxml --with-libxslt --with-openssl"
+  - sh -l -c "make world COPT=-Werror"
+
+test_script:
+  - set HOME=.
+  - set PATH=C:/msys64/usr/bin;%PATH%
+  - sh -l -c "make check-world COPT=-Werror"
+
+on_failure:
+  - set HOME=.
+  - sh -l -c "test -f config.status || cat config.log"
+  - sh -l -c "find . -name regression.diffs | xargs cat"
-- 
2.25.0

From ed6610fb02d8c63d66315a83b752de5c6b9b84eb Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 10 Mar 2020 19:25:54 +0100
Subject: [PATCH] AppVeyor configuration for MSVC

---
 .appveyor.yml | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
 create mode 100644 .appveyor.yml

diff --git a/.appveyor.yml b/.appveyor.yml
new file mode 100644
index 0000000000..124736ce3a
--- /dev/null
+++ b/.appveyor.yml
@@ -0,0 +1,19 @@
+install:
+  - appveyor-retry cinst winflexbison
+  - rename c:\ProgramData\chocolatey\bin\win_flex.exe flex.exe
+  - rename c:\ProgramData\chocolatey\bin\win_bison.exe bison.exe
+  - '"C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\SetEnv.cmd" /x64'
+  - '"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x86_amd64'
+
+configuration:
+  - Release
+
+before_build:
+  - perl src/tools/msvc/mkvcbuild.pl
+

Re: ASLR support for Postgres12

2020-03-23 Thread Tom Lane
"Joel Mariadasan (jomariad)"  writes:
> Can you please point me to a link or give the list of OSes where ASLR is 
> officially supported by Postgres?

Everything except Windows.

regards, tom lane




Re: Define variables in the appropriate scope

2020-03-23 Thread Alvaro Herrera
On 2020-Mar-18, Bruce Momjian wrote:

> On Tue, Feb 25, 2020 at 09:35:52AM +0100, Antonin Houska wrote:
> > I've noticed that two variables in RelationCopyStorage() are defined in a
> > scope higher than necessary. Please see the patch.
> 
> It seems cleaner to me to allocate the variables once before the loop
> starts, rather than for each loop iteration.

If we're talking about personal preference, my own is what Antonin
shows.  However, since disagreement has been expressed, I think we
should only change it if the generated code turns out better.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: replay pause vs. standby promotion

2020-03-23 Thread Fujii Masao




On 2020/03/24 0:17, Sergei Kornilov wrote:

Hello


You meant that the promotion request should cause the recovery
to finish immediately even if there are still outstanding WAL records,
and cause the standby to become the master?


Oh, I get your point. But yes, I expect that in case of a promotion request
during a pause, the user (me too) will want exactly the current state, not the
latest state available in the WAL.


Basically I'd like the promotion to make the standby replay all the WAL
even if it's requested during pause state. OTOH I understand there
are use cases where immediate promotion is useful, as you explained.
So, +1 to add something like "pg_ctl promote -m immediate".

But I'm afraid that now it's too late to add such a feature into v13.
Probably it's an item for v14.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: replay pause vs. standby promotion

2020-03-23 Thread Fujii Masao




On 2020/03/23 23:55, Robert Haas wrote:

On Mon, Mar 23, 2020 at 10:36 AM Fujii Masao
 wrote:

If we would like a promotion method that finishes recovery
immediately, IMO we should implement something like
"pg_ctl promote -m fast". That is, we need to add a new method to
promotion.


I think 'immediate' would be a better choice. One reason is that we've
used the term 'fast promotion' in the past for a different feature.
Another is that 'immediate' might sound slightly scary to people who
are familiar with what 'pg_ctl stop -mimmediate' does. And you want
people doing this to be just a little bit scared: not too scared, but
a little scared.


+1

When I proposed the feature five years ago, I used "immediate"
as the option value.
https://postgr.es/m/cahgqgwhtvydqkzaywya9zyylecakif5p0kpcpune_tsrgtf...@mail.gmail.com

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations

2020-03-23 Thread Maxim Ivanov


> Perhaps this is what you mean by "deterministic", but isn't it
> possible for some collations to treat multiple byte sequences as equal
> values? And those multiple byte sequences wouldn't necessarily occur
> sequentially in C collation, so it wouldn't be possible to work around
> that by having the grouping node use one collation but the sorting
> node use the C one.
>
> If my memory is incorrect, then this sounds like an intriguing idea.


Yes, as per the docs
(https://www.postgresql.org/docs/12/collation.html#COLLATION-NONDETERMINISTIC)
some collations can treat symbols (chars? codes? runes?) as equal even though
their byte representations are not. This optimisation should check the source
table's collation and leave the sort collation unchanged if the columns being
sorted use a non-deterministic collation.

Luckily, in practice this is probably very rare: all built-in collations are
deterministic.
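
For concreteness, a sketch based on the nondeterministic-collation example
from the docs page cited above (assumes an ICU-enabled build and a UTF8
database):

CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
CREATE TABLE t_nd (x text COLLATE ndcoll);
-- 'á' precomposed vs. 'a' plus combining acute: equal under ndcoll,
-- but different byte sequences, so a "C" sort would not keep them adjacent.
INSERT INTO t_nd VALUES (U&'\00E1'), (U&'\0061\0301');
SELECT x, count(*) FROM t_nd GROUP BY x;   -- one group containing both rows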






Re: replay pause vs. standby promotion

2020-03-23 Thread Sergei Kornilov
Hello

> You meant that the promotion request should cause the recovery
> to finish immediately even if there are still outstanding WAL records,
> and cause the standby to become the master?

Oh, I get your point. But yes, I expect that in case of a promotion request
during a pause, the user (me too) will want exactly the current state, not the
latest state available in the WAL.

A real use case from my experience:
The user wants to update a third-party application. In case of problems, he
wants to return to the old version of the application and an unchanged
replica. Thus, he pauses replay on the standby and performs the update. If all
is ok, he resumes replay. In case of problems he plans to promote the standby.
But oops, the standby will ignore promote signals during a pause, and we need
to get the current LSN from the standby and restart it with
recovery_target_lsn = ? and recovery_target_action = promote to achieve this
state.

regards, Sergei




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-23 Thread Alvaro Herrera
On 2020-Mar-23, Alexander Korotkov wrote:

> Dear Alvaro,
> 
> On Fri, Mar 13, 2020 at 3:18 AM Michael Paquier  wrote:
> > No issues with that either.  Are you fine with the updated version
> > attached for 0001?
> 
> I think patchset is almost ready for commit.  Michael is waiting for
> your answer on whether you're fine with current shape of 0001.  Could
> you please provide a feedback?

Hi, I didn't realize that this was waiting on me.  It looks good to me
-- I'd just remove the comment on the function prototype in archiver.h,
which is not customary (we only put these comments in the .c file).
Let's get it pushed.

Thanks for everybody's work on this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: replay pause vs. standby promotion

2020-03-23 Thread Robert Haas
On Mon, Mar 23, 2020 at 10:36 AM Fujii Masao
 wrote:
> If we would like a promotion method that finishes recovery
> immediately, IMO we should implement something like
> "pg_ctl promote -m fast". That is, we need to add a new method to
> promotion.

I think 'immediate' would be a better choice. One reason is that we've
used the term 'fast promotion' in the past for a different feature.
Another is that 'immediate' might sound slightly scary to people who
are familiar with what 'pg_ctl stop -mimmediate' does. And you want
people doing this to be just a little bit scared: not too scared, but
a little scared.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: ASLR support for Postgres12

2020-03-23 Thread Tom Lane
"Joel Mariadasan (jomariad)"  writes:
> We would like to know if there is a roadmap to enable ASLR support for 
> postgre.

Not on Windows --- since that OS doesn't support fork(), it's too
difficult to get different child processes to map shared memory
at the same address if ASLR is active.

If that feature is important to you, use a different operating
system.

regards, tom lane




Re: is somewhere documented x LIKE ANY(ARRAY)?

2020-03-23 Thread Tom Lane
Pavel Stehule  writes:
> On Mon, 23 Mar 2020 at 13:54, Dagfinn Ilmari Mannsåker <
> ilm...@ilmari.org> wrote:
>> It's documented in
>> https://www.postgresql.org/docs/current/functions-comparisons.html, and
>> has been around since at least 7.4.

Well, to be fair, we don't really say anywhere that LIKE acts enough
like a plain operator to be used in this syntax.  And the underlying
code is the subquery_Op production in gram.y, which is specific to
this syntax, so I'm not sure offhand to what extent LIKE acts like
an operator for other corner cases.

> My customer reports some issues on Postgres 9.3.

Doesn't look to me like subquery_Op has changed much since 2004,
so you'd really need to be more specific.

regards, tom lane




Re: replay pause vs. standby promotion

2020-03-23 Thread Fujii Masao




On 2020/03/23 22:46, Sergei Kornilov wrote:

Hello

(I am trying to find an opportunity to review this patch...)


Thanks for the review! It's really helpful!



Consider this test case with streaming replication:

on primary: create table foo (i int);
on standby:

postgres=# select pg_wal_replay_pause();
 pg_wal_replay_pause 
---------------------
 
(1 row)


postgres=# select pg_is_wal_replay_paused();
 pg_is_wal_replay_paused 
-------------------------
  t
(1 row)

postgres=# table foo;
  i
---
(0 rows)

Execute "insert into foo values (1);" on primary

postgres=# select pg_promote ();
 pg_promote 
------------
  t
(1 row)

postgres=# table foo;
  i
---
  1

And we replayed one additional change during promotion. I think this is wrong
behavior. It could possibly be fixed by

+if (PromoteIsTriggered()) break;
 /* Setup error traceback support for ereport() */
 errcallback.callback = rm_redo_error_callback;


You meant that the promotion request should cause the recovery
to finish immediately even if there are still outstanding WAL records,
and cause the standby to become the master? I don't think that
it's the expected (also existing) behavior of the promotion. That is,
the promotion request should cause the recovery to replay as many
WAL records as possible, to the end, in order to avoid data loss. No?

If we would like a promotion method that finishes recovery
immediately, IMO we should implement something like
"pg_ctl promote -m fast". That is, we need to add a new method to
promotion.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: WAL usage calculation patch

2020-03-23 Thread Fujii Masao




On 2020/03/23 21:01, Fujii Masao wrote:



On 2020/03/23 7:32, Kirill Bychik wrote:

I'm attaching a v5 with fp records only for temp tables, so there's no risk of
instability.  As I previously said I'm fine with your two patches, so unless
you have objections on the fpi test for temp tables or the documentation
changes, I believe those should be ready for committer.


You added the columns into pg_stat_database, but seem to have forgotten to
update the documentation for pg_stat_database.


Ah right, I totally missed that when I tried to clean up the original POC.


Is it really reasonable to add the columns for vacuum's WAL usage into
pg_stat_database? I'm not sure how useful the information about
the amount of WAL generated by vacuum per database is.


The amount per database isn't really useful, but I didn't have a better idea on
how to expose (auto)vacuum WAL usage until this:


Isn't it better to make VACUUM VERBOSE and the autovacuum log include
that information instead, to see how much WAL each vacuum activity
generates? Sorry if this has already been discussed
upthread.


That's a way better idea!  I'm attaching the full patchset with the 3rd patch
to use this approach instead.  There's a bit of duplicate code for computing the
WalUsage, as I didn't find a better way to avoid that without exposing
WalUsageAccumDiff().

Autovacuum log sample:

2020-03-19 15:49:05.708 CET [5843] LOG:  automatic vacuum of table 
"rjuju.public.t1": index scans: 0
 pages: 0 removed, 2213 remain, 0 skipped due to pins, 0 skipped frozen
 tuples: 25 removed, 25 remain, 0 are dead but not yet 
removable, oldest xmin: 502
 buffer usage: 4448 hits, 4 misses, 4 dirtied
 avg read rate: 0.160 MB/s, avg write rate: 0.160 MB/s
 system usage: CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.19 s
 WAL usage: 6643 records, 4 full page records, 1402679 bytes

VACUUM log sample:

# vacuum VERBOSE t1;
INFO:  vacuuming "public.t1"
INFO:  "t1": removed 5 row versions in 443 pages
INFO:  "t1": found 5 removable, 0 nonremovable row versions in 443 out of 
443 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 512
There were 5 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
1332 WAL records, 4 WAL full page records, 306901 WAL bytes
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s.
INFO:  "t1": truncated 443 to 0 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  vacuuming "pg_toast.pg_toast_16385"
INFO:  index "pg_toast_16385_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_toast_16385": found 0 removable, 0 nonremovable row versions in 0 
out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 513
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
0 WAL records, 0 WAL full page records, 0 WAL bytes
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Note that the 3rd patch is an addition on top of Kirill's original patch, as
this information would have been greatly helpful in some performance issues
I had to investigate recently.  I'd be happy to have it land in v13, but if
that's controversial or too late I'm happy to postpone it to v14, provided the
infrastructure added in Kirill's patches can make it to v13.


Dear all, can we please focus on getting the core patch committed?
Given the uncertainty regarding autovacuum stats, can we please get
parts 1 and 2 into the codebase, and think about exposing autovacuum
stats later?


Here are the comments for 0001 patch.

+    /*
+ * Report a full page image constructed for the WAL record
+ */
+    pgWalUsage.wal_fp_records++;

Isn't it better to use "fpw" or "fpi" for the variable name rather than
"fp" here? In other places, "fpw" and "fpi" are used for full page
writes/image.

ISTM that this counter could be incorrect if XLogInsertRecord() decides to
recompute whether an FPI is necessary or not. No? IOW, this issue could
happen if XLogInsert() calls XLogRecordAssemble() multiple times in
its do-while loop. Isn't this problematic?

+    long    wal_bytes;    /* size of wal records produced */

Isn't it safer to use uint64 (i.e., XLogRecPtr) as the type of this variable
rather than long?

+    shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, bufusage_space);

bufusage_space should be walusage_space here?

/*
  * Finish parallel execution.  We wait for parallel workers to finish, and
  * accumulate their buffer usage.
  */

There are some comments mentioning buffer usage in execParallel.c,
for example the top comment for ExecParallelFinish(), quoted above.
These should be updated.


Here are the 

Re: replay pause vs. standby promotion

2020-03-23 Thread Sergei Kornilov
Hello

(I am trying to find an opportunity to review this patch...)

Consider this test case with streaming replication:

on primary: create table foo (i int);
on standby:

postgres=# select pg_wal_replay_pause();
 pg_wal_replay_pause 
---------------------
 
(1 row)

postgres=# select pg_is_wal_replay_paused();
 pg_is_wal_replay_paused 
-------------------------
 t
(1 row)

postgres=# table foo;
 i 
---
(0 rows)

Execute "insert into foo values (1);" on primary

postgres=# select pg_promote ();
 pg_promote 
------------
 t
(1 row)

postgres=# table foo;
 i 
---
 1

And we replayed one additional change during promotion. I think this is wrong
behavior. It could possibly be fixed by

+if (PromoteIsTriggered()) break;
/* Setup error traceback support for ereport() */
errcallback.callback = rm_redo_error_callback;

regards, Sergei




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-03-23 Thread Ashutosh Bapat
On Tue, Mar 17, 2020 at 1:44 PM Etsuro Fujita  wrote:
>
> > + /*
> > + * If this segment of the join is empty, it means that this segment
> >
> > "partition of the join" looks consistent with other usages than "segment of 
> > the
> > join".
>
> Actually, "segment" is used in the existing comments in the caller
> function try_partitionwise_join(), so I think "segment" is better here
> for consistency.

A segment can be any part of the join relation, not necessarily a
partition. Maybe we should change the caller.

>
> > + /*
> > + * Get a relids set of partition(s) involved in this join segment that
> > + * are from the rel1 side.
> > + */
> > + child_relids1 = bms_intersect(child_joinrel->relids,
> > +  rel1->all_partrels);
> >
> > The partition bounds are sorted by their values. Even for a list partitioned
> > table, the bounds are sorted by the least partition value. We do not allow
> > multiple partitions from one side to be joined with one partition on the 
> > other
> > and vice versa. All this together means that the partitions of the join
> > relation are formed by joining partitions from either side in the order of
> > their bounds. This means that the matching pairs of partitions can be found 
> > by
> > matching relids of partitions of join with those of the joining relation by
> > traversing partitions from all the three relations only once in the order 
> > they
> > appear in the partition bounds of the corresponding relations.
>
> Consider this 2-way join for list partitioned tables:
>
> CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c);
> CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003');
> CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0002');
> INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM') FROM
> generate_series(1, 100) i WHERE i % 10 in (1, 2, 3);
> ANALYZE plt1_ad;
> CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c);
> CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0004');
> CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0003');
> INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM') FROM
> generate_series(1, 100) i WHERE i % 10 IN (2, 3, 4);
> ANALYZE plt2_ad;
>
> EXPLAIN (COSTS OFF)
> SELECT t1.c, t1.a, t2.a FROM plt1_ad t1 INNER JOIN plt2_ad t2 ON (t1.c
> = t2.c) WHERE t1.a < 10  ORDER BY t1.c, t1.a, t2.a;
>  QUERY PLAN
> -------------------------------------
>  Sort
>Sort Key: t1.c, t1.a, t2.a
>->  Append
>  ->  Hash Join
>Hash Cond: (t2_1.c = t1_2.c)
>->  Seq Scan on plt2_ad_p1 t2_1
>->  Hash
>  ->  Seq Scan on plt1_ad_p2 t1_2
>Filter: (a < 10)
>  ->  Hash Join
>Hash Cond: (t2_2.c = t1_1.c)
>->  Seq Scan on plt2_ad_p2 t2_2
>->  Hash
>  ->  Seq Scan on plt1_ad_p1 t1_1
>Filter: (a < 10)
> (15 rows)
>
> As you can see from the EXPLAIN result, the first partition on the
> outer side matches the second partition on the inner side, and the
> second partition on the outer side matches the first partition on the
> inner side.  How does the algorithm you proposed work e.g., when an
> N-way join for list partitioned tables contains this join as its lower
> join?

Hmm, this is a good example. I tried to work out the algorithm based
on the bound ordering. The algorithm worked well when all the bounds
on both sides were included in the join, but it didn't work well
when some bounds vanished. In order to detect whether a bound has
vanished, we need to either compare that bound with the bounds of the join
(an operation costlier than comparing bitmapsets) or compare the relids of
all the partitions of the join. Either way it looks costlier than what
you have right now. Maybe we could improve on that by keeping track of such
lost bounds and the corresponding partitions, but I didn't get time to
work on that part. Anyway, even if such an algorithm exists, we will
have to change just a single function. That could be done later, I
think. So we are good here right now. Thanks.


> > + if (rel1_is_simple)
> >
> > This variable is used only in one place. So instead we should the expression
> > assigning the value to it. Changed in the attached patch.
>
> I don't think that's a good idea, because this check is done
> repeatedly in a for loop.

The compiler's optimizer would optimize it away anyway. But I
won't insist on this.

>
> > - rel->nparts = 0;
> > + rel->nparts = -1;
> >
> > I think we need to add comments around various values that nparts can take. 
> > How
> > about like something attached.
>
> +1
>
> > + case PARTITION_STRATEGY_HASH:
> > + merged_bounds = NULL;
> >
> > I think we need to explain why we aren't merging hash partition bounds. 
> > AFAIU,
> > the reason is thus: when the modulo used for partition mapping (i.e. 
> > maximum
> > 

Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-23 Thread Laurenz Albe
On Fri, 2020-03-20 at 14:43 +0100, Laurenz Albe wrote:
> I.e. with the default settings we will perform a whole-index scan
> > (without visibility map or such) after every 10% growth of the
> > table. Which means that, even if the visibility map prevents repeated
> > tables accesses, increasing the rate of vacuuming for insert-only tables
> > can cause a lot more whole index scans.  Which means that vacuuming an
> > insert-only workload frequently *will* increase the total amount of IO,
> > even if there is not a single dead tuple. Rather than just spreading the
> > same amount of IO over more vacuums.
> > 
> > And both gin and gist just always do a full index scan, regardless of
> > vacuum_cleanup_index_scale_factor (either during a bulk delete, or
> > during the cleanup).  Thus more frequent vacuuming for insert-only
> > tables can cause a *lot* of pain (even an approx quadratic increase of
> > IO?  O(increased_frequency * peak_index_size)?) if you have large
> > indexes - which is very common for gin/gist.
> 
> In the light of that, I agree that we should increase the scale_factor.

Here is version 10 of the patch, which uses a scale factor of 0.2.

Yours,
Laurenz Albe
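
For reference, a hypothetical per-table use of the new reloptions (table name
made up), e.g. to keep an insert-only table vacuumed often enough for
index-only scans:

ALTER TABLE insert_only_log SET (
    autovacuum_vacuum_insert_threshold = 10000,
    autovacuum_vacuum_insert_scale_factor = 0.0
);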
From 2160b241ccfb8cd1a684108db9cec55fb9d6f3c9 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Mon, 23 Mar 2020 14:23:56 +0100
Subject: [PATCH] Autovacuum tables that have received only inserts

Add "autovacuum_vacuum_insert_threshold" and
"autovacuum_vacuum_insert_scale_factor" GUC and reloption.
The default value for the threshold is 1000;
the scale factor defaults to 0.2.

Any table that has received more inserts since it was
last vacuumed (and that is not vacuumed for another
reason) will be autovacuumed.

This avoids the known problem that insert-only tables
are never autovacuumed until they need to have their
anti-wraparound autovacuum, which then can be massive
and disruptive.

The feature can also be used to vacuum insert-only
tables often enough to get an index-only scan.
For that, one would use a lower threshold and scale factor.

To track the number of inserts since the last vacuum,
introduce a StatTabEntry "inserts_since_vacuum" that
gets reset to 0 after a vacuum.  This value is available
in "pg_stat_*_tables" as "n_ins_since_vacuum".

Author: Laurenz Albe, based on a suggestion from Darafei Praliaskouski
Reviewed-by: David Rowley, Justin Pryzby, Masahiko Sawada, Andres Freund
Discussion: https://postgr.es/m/CAC8Q8t+j36G_bLF=+0imo6jgnwnlnwb1tujxujr-+x8zcct...@mail.gmail.com
---
 doc/src/sgml/config.sgml  | 41 +++
 doc/src/sgml/maintenance.sgml | 23 ---
 doc/src/sgml/monitoring.sgml  |  5 +++
 doc/src/sgml/ref/create_table.sgml| 30 ++
 src/backend/access/common/reloptions.c| 22 ++
 src/backend/catalog/system_views.sql  |  1 +
 src/backend/postmaster/autovacuum.c   | 22 --
 src/backend/postmaster/pgstat.c   |  5 +++
 src/backend/utils/adt/pgstatfuncs.c   | 16 
 src/backend/utils/misc/guc.c  | 20 +
 src/backend/utils/misc/postgresql.conf.sample |  4 ++
 src/bin/psql/tab-complete.c   |  4 ++
 src/include/catalog/pg_proc.dat   |  5 +++
 src/include/pgstat.h  |  1 +
 src/include/postmaster/autovacuum.h   |  2 +
 src/include/utils/rel.h   |  2 +
 src/test/regress/expected/rules.out   |  3 ++
 17 files changed, 198 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..e50018c917 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7301,6 +7301,26 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
  
 
+ 
+  autovacuum_vacuum_insert_threshold (integer)
+  
+   autovacuum_vacuum_insert_threshold
+   configuration parameter
+  
+  
+  
+   
+Specifies the number of inserted tuples needed to trigger a
+VACUUM in any one table.
+The default is 1000 tuples.
+This parameter can only be set in the postgresql.conf
+file or on the server command line;
+but the setting can be overridden for individual tables by
+changing table storage parameters.
+   
+  
+ 
+
  
   autovacuum_analyze_threshold (integer)
   
@@ -7342,6 +7362,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
  
 
+ 
+  autovacuum_vacuum_insert_scale_factor (floating point)
+  
+   autovacuum_vacuum_insert_scale_factor
+   configuration parameter
+  
+  
+  
+   
+Specifies a fraction of the table size to add to
+autovacuum_vacuum_insert_threshold
+when deciding whether to trigger a VACUUM.
+The default is 0.2 (20% of table size).
+This parameter can only be set in the 

Negative cost is seen for plan node

2020-03-23 Thread Richard Guo
Hi all,

With the following statements on latest master (c81bd3b9), I find
negative cost for plan nodes.

create table a (i int, j int);
insert into a select i%10, i from generate_series(1,1000000) i;
analyze a;

# explain select i from a group by i;
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=1300.00..-1585.82 rows=102043 width=4)
   Group Key: i
   Planned Partitions: 4
   ->  Seq Scan on a  (cost=0.00..14425.00 rows=1000000 width=4)
(4 rows)

In function cost_agg, when we add the disk costs of hash aggregation
that spills to disk, nbatches is calculated as 1.18 in this case. It is
greater than 1, so there will be a spill. And the depth is calculated as
-1 in this case, with num_partitions being 4. I think this is where
things go wrong.

Thanks
Richard


Re: is somewhere documented x LIKE ANY(ARRAY)?

2020-03-23 Thread Pavel Stehule
On Mon, 23 Mar 2020 at 13:54, Dagfinn Ilmari Mannsåker <
ilm...@ilmari.org> wrote:

> Pavel Stehule  writes:
>
> > Hi
> >
> > I tried to search for a notice about it, to get info about the release date
> > of this feature, but I cannot find it.
>
> It's documented in
> https://www.postgresql.org/docs/current/functions-comparisons.html, and
> has been around since at least 7.4.
>

My customer reports some issues on Postgres 9.3.



> > Regards
> >
> > Pavel
>
> - ilmari
> --
> - Twitter seems more influential [than blogs] in the 'gets reported in
>   the mainstream press' sense at least.   - Matt McLeod
> - That'd be because the content of a tweet is easier to condense down
>   to a mainstream media article.  - Calle Dybedahl
>


Re: is somewhere documented x LIKE ANY(ARRAY)?

2020-03-23 Thread Dagfinn Ilmari Mannsåker
Pavel Stehule  writes:

> Hi
>
> I tried to search for a notice about it, to get info about the release date
> of this feature, but I cannot find it.

It's documented in
https://www.postgresql.org/docs/current/functions-comparisons.html, and
has been around since at least 7.4.

> Regards
>
> Pavel

- ilmari
-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.   - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.  - Calle Dybedahl




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-23 Thread Alexander Korotkov
Dear Alvaro,

On Fri, Mar 13, 2020 at 3:18 AM Michael Paquier  wrote:
> No issues with that either.  Are you fine with the updated version
> attached for 0001?

I think patchset is almost ready for commit.  Michael is waiting for
your answer on whether you're fine with current shape of 0001.  Could
you please provide a feedback?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




is somewhere documented x LIKE ANY(ARRAY)?

2020-03-23 Thread Pavel Stehule
Hi

I tried to search for a notice about it, to get info about the release date
of this feature, but I cannot find it.
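
For reference, the construct in question (data made up); LIKE participates in
the ANY (array) comparison syntax like an ordinary operator:

SELECT n
FROM (VALUES ('alpha'), ('beta'), ('gamma')) AS t(n)
WHERE n LIKE ANY (ARRAY['a%', 'g%']);   -- returns alpha and gamma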

Regards

Pavel


Re: WAL usage calculation patch

2020-03-23 Thread Fujii Masao




On 2020/03/23 7:32, Kirill Bychik wrote:

I'm attaching a v5 with fp records only for temp tables, so there's no risk of
instability.  As I previously said I'm fine with your two patches, so unless
you have objections on the fpi test for temp tables or the documentation
changes, I believe those should be ready for committer.


You added the columns into pg_stat_database, but seem to have forgotten to
update the documentation for pg_stat_database.


Ah right, I totally missed that when I tried to clean up the original POC.


Is it really reasonable to add the columns for vacuum's WAL usage into
pg_stat_database? I'm not sure how useful the information about
the amount of WAL generated by vacuum per database is.


The amount per database isn't really useful, but I didn't have a better idea on
how to expose (auto)vacuum WAL usage until this:


Isn't it better to make VACUUM VERBOSE and the autovacuum log include
that information instead, to see how much WAL each vacuum activity
generates? Sorry if this has already been discussed
upthread.


That's a way better idea!  I'm attaching the full patchset with the 3rd patch
to use this approach instead.  There's a bit of duplicate code for computing the
WalUsage, as I didn't find a better way to avoid that without exposing
WalUsageAccumDiff().

Autovacuum log sample:

2020-03-19 15:49:05.708 CET [5843] LOG:  automatic vacuum of table 
"rjuju.public.t1": index scans: 0
 pages: 0 removed, 2213 remain, 0 skipped due to pins, 0 skipped frozen
 tuples: 25 removed, 25 remain, 0 are dead but not yet 
removable, oldest xmin: 502
 buffer usage: 4448 hits, 4 misses, 4 dirtied
 avg read rate: 0.160 MB/s, avg write rate: 0.160 MB/s
 system usage: CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.19 s
 WAL usage: 6643 records, 4 full page records, 1402679 bytes

VACUUM log sample:

# vacuum VERBOSE t1;
INFO:  vacuuming "public.t1"
INFO:  "t1": removed 5 row versions in 443 pages
INFO:  "t1": found 5 removable, 0 nonremovable row versions in 443 out of 
443 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 512
There were 5 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
1332 WAL records, 4 WAL full page records, 306901 WAL bytes
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s.
INFO:  "t1": truncated 443 to 0 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  vacuuming "pg_toast.pg_toast_16385"
INFO:  index "pg_toast_16385_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_toast_16385": found 0 removable, 0 nonremovable row versions in 0 
out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 513
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
0 WAL records, 0 WAL full page records, 0 WAL bytes
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Note that the 3rd patch is an addition on top of Kirill's original patch, as
this information would have been greatly helpful in some performance issues
I had to investigate recently.  I'd be happy to have it land in v13, but if
that's controversial or too late I'm happy to postpone it to v14, provided the
infrastructure added in Kirill's patches can make it to v13.


Dear all, can we please focus on getting the core patch committed?
Given the uncertainty regarding autovacuum stats, can we please get
parts 1 and 2 into the codebase, and think about exposing autovacuum
stats later?


Here are the comments for 0001 patch.

+	/*
+	 * Report a full page image constructed for the WAL record
+	 */
+	pgWalUsage.wal_fp_records++;

Isn't it better to use "fpw" or "fpi" for the variable name rather than
"fp" here? In other places, "fpw" and "fpi" are used for full page
writes/image.

ISTM that this counter could be incorrect if XLogInsertRecord() decides to
recompute whether an FPI is necessary or not. No? IOW, this issue could
happen if XLogInsert() calls XLogRecordAssemble() multiple times in
its do-while loop. Isn't this problematic?

+	long	wal_bytes;	/* size of wal records produced */

Isn't it safer to use uint64 (i.e., XLogRecPtr) as the type of this variable
rather than long?

+   shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, bufusage_space);

bufusage_space should be walusage_space here?

/*
 * Finish parallel execution.  We wait for parallel workers to finish, and
 * accumulate their buffer usage.
 */

There are some comments mentioning buffer usage in execParallel.c,
for example the top comment for ExecParallelFinish(), quoted above.
These should be updated.


Re: Created feature for to_date() conversion using patterns 'YYYY-WW', 'YYYY-WW-D', 'YYYY-MM-W' and 'YYYY-MM-W-D'

2020-03-23 Thread Mark Lorenz

Hi Tom,

with a bit of distance from this issue, I re-read your comments. I am
beginning to understand what you mean or, better, what's wrong with my
thoughts. If I understand you correctly, you say that WW can start on any
weekday and is not fixed to Sunday, right? In your opinion WW starts with
the weekday of Jan 1st? That could be my problem: I always thought (maybe
triggered by the D pattern) that WW has to start on Sundays. But now I
agree with you, the docs fit your interpretation better:


"the first week starts on the first day of the year"

I interpreted it as: the first week is the week that includes the first
of the year, but starting on the Sunday before.


Did I understand you correctly? In that case, I accept that my patch is
no bugfix (I think it would be one if my interpretation were the
expected behaviour).


But nevertheless, what about making to_date() accept the DAY, D (and
maybe the Q) patterns - in this case, of course, in the uncorrelated
version? to_char() handles them properly. And, from my point of view,
there is no reason why they should give "1" instead of the real day
number. What do you think?
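
A sketch of the asymmetry in question (2020-03-23 was a Monday and day 83 of
the year, hence week 12 under the "first week starts on Jan 1st" reading;
this assumes to_date derives its result from the YYYY and WW fields alone):

SELECT to_char(date '2020-03-23', 'YYYY-WW-D');   -- '2020-12-2'
SELECT to_date('2020-12-2', 'YYYY-WW-D');         -- 2020-03-18, D not honored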





Re: Index Skip Scan

2020-03-23 Thread Andy Fan
>
>
> On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee 
> wrote:
> > I'm unsure which version number to give this patch (to continue with
> numbers from previous skip scan patches, or to start numbering from scratch
> again). It's a rather big change, so one could argue it's mostly a separate
> patch. I guess it mostly depends on how close the original versions were to
> be committable. Thoughts?
>
> I don't know, but from the sidelines, it'd be nice to see the unique
> path part go into PG13, where IIUC it can power the "useless unique
> removal" patch.
>

Actually I posted a patch to remove the distinct clause quite a long time
ago[1], and later it evolved into the UniqueKey work as well; see [2] for the
current status.

[1]
https://www.postgresql.org/message-id/flat/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com#f5d97cc66b9cd330add2fbb004a4d107

[2]
https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL=uafudmf4wgovkel3onbatqju9nsxtuucpp...@mail.gmail.com


Unqualified pg_catalog casts in pg_dump

2020-03-23 Thread Daniel Gustafsson
When looking at something different, I happened to notice that pg_dump is a bit
inconsistent in how it qualifies casts to pg_catalog entities like regclass and
oid.  Most casts are qualified, but not all.  Even though it functionally is
the same, being consistent is a good thing IMO and I can't see a reason not to,
so the attached patch adds qualifications (the unqualified regclass cast in the
TAP test left on purpose).
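
For illustration (made-up one-liners), both spellings resolve identically
under pg_dump's forced empty search_path, so this is purely about consistency:

SELECT 'pg_class'::regclass;              -- unqualified
SELECT 'pg_class'::pg_catalog.regclass;   -- qualified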

cheers ./daniel



pg_dump_casts.patch
Description: Binary data


[PATCH] Keeps tracking the uniqueness with UniqueKey

2020-03-23 Thread Andy Fan
Greetings.

This thread is a follow-up to [1], where I submitted a patch for erasing the
distinct node if we know for sure that the data is unique.  But since the
implementation has changed a lot from the beginning and the two are not very
related, I am starting this new thread to discuss the new strategy, to save
reviewers' time.

As I said above, my original intention was just to erase the distinct clause;
then Tom Lane suggested the function query_is_distinct_for, and I found the
uniqueness can be used for costing, remove_useless_join and
reduce_unique_semijoins.  David suggested maintaining the uniqueness from
bottom to top, across joins, subqueries, group-by, distinct, union and so on
(we call it a UniqueKey).  Ideally the uniqueness will not be lost in any
case.  The current implementation follows David's suggestion; thanks also to
Ashutosh, who reassured me that the cost should be ok when I had concerns
about it at the beginning.
the cost should be ok while I had concerns of this at the beginning.

A new field named uniquekeys was added to the RelOptInfo struct; it is a
list of UniqueKey structs.

typedef struct UniqueKey
{
	NodeTag		type;
	List	   *exprs;
	List	   *positions;
	bool		grantee;
} UniqueKey;

exprs is a list of expressions which are unique if we don't care about the
null values on the current RelOptInfo.

positions is a list of the sequence numbers of the exprs in the current
RelOptInfo, which is used for subqueries, like:

create table t1 (a int primary key, b int);
create table t2 (a int primary key, b int);
select .. from t1,  (select b from t2 group by b) t2 ..;

The UniqueKey for the subquery will be Var(varno=1, varattno=2), but for the
top query the UniqueKey of t2 should be Var(varno=2, varattno=1); the 1 here
needs to be calculated from UniqueKey->positions.

The grantee field is introduced mainly for remove_useless_join &
reduce_unique_semijoins.  Take the above case for example:

-- b is nullable, so "select b from t2" can still result in duplicated rows.
create unique index t2_uk_b on t2(b);

-- the left join can still be removed, since t2.b has a unique index and the
-- nullability doesn't matter here.
select t1.* from t1 left join t2 on (t1.b = t2.b);

so t2.b will still be a UniqueKey for t2, just with grantee = false.

A bunch of functions like populate_xxx_uniquekeys manipulate uniquekeys
for the various cases; xxx may be baserel, joinrel, partitioned table,
unionrel, groupbyrel, distinctrel and so on.  Partitioned tables have some
non-obvious troubles, because users can create indexes on the child rels
directly and differently.  You can check the comments in the code for
details.


When maintaining the uniquekeys of a joinrel, we have a rule that if both
rels have UniqueKeys, then any combination from the two sides is a uniquekey
of the joinrel.  I used two strategies to keep the list of UniqueKeys short.
One is that we only add useful UniqueKeys to RelOptInfo.uniquekeys: if the
expr doesn't appear in rel->reltarget->exprs, it will not be used by others,
so we can ignore it safely.  The other is that if column set A is already a
uniquekey, no superset of A needs to be added as a UniqueKey.


The overall cost of maintaining uniquekeys should be ok.  If you check the
code, you may find many 2- or 3-level foreach loops, but most of them start
from a unique index, and I used UniqueKeyContext and SubqueryUniqueKeyContext
in the joinrel and subquery cases to avoid too many loops.

Now I have used the UniqueKey to erase unnecessary distinct/group by
clauses, and have also changed rel_is_distinct_for to use UniqueKeys, so we
can handle more cases.

create table m1 (a int primary key,  b int, c int);
create table m2 (a int primary key,  b int, c int);
create table m3 (a int primary key,  b int, c int);

With the current patch, we can get:
task3=# explain select  t1.a from m3 t1 left join  (select m1.a from m1, m2
where m1.b = m2.a limit 1) t2 on (t1.a = t2.a);
   QUERY PLAN
---------------------------------------------------------
 Seq Scan on m3 t1  (cost=0.00..32.60 rows=2260 width=4)


Before the patch, we will get:
postgres=# explain select  t1.a from m3 t1 left join  (select m1.a from m1,
m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a)
postgres-# ;
  QUERY PLAN
------------------------------------------------------
 Hash Left Join  (cost=0.39..41.47 rows=2260 width=4)
   Hash Cond: (t1.a = m1.a)
   ->  Seq Scan on m3 t1  (cost=0.00..32.60 rows=2260 width=4)
   ->  Hash  (cost=0.37..0.37 rows=1 width=4)
 ->  Limit  (cost=0.15..0.36 rows=1 width=4)
   ->  Nested Loop  (cost=0.15..470.41 rows=2260 width=4)
 ->  Seq Scan on m1  (cost=0.00..32.60 rows=2260
width=8)
 ->  Index Only Scan using m2_pkey on m2
 (cost=0.15..0.19 rows=1 width=4)
   Index Cond: (a = m1.b)


The "limit 1" here is just to keep pull_up_subqueries from pulling up the
subquery.  I think 

Re: error context for vacuum to include block number

2020-03-23 Thread Amit Kapila
On Sat, Mar 21, 2020 at 1:33 PM Justin Pryzby  wrote:
>
> On Sat, Mar 21, 2020 at 01:00:03PM +0530, Amit Kapila wrote:
> > I have addressed your comments in the attached patch.  Today, while
> > testing error messages from various phases, I noticed that the patch
> > fails to display error context if the error occurs during the truncate
> > phase.  The reason was that we had popped the error stack in
> > lazy_scan_heap due to which it never calls the callback.  I think we
> > need to set up callback at a higher level as is done in the attached
> > patch.  I have done the testing by inducing errors in various phases
> > and it prints the required information.  Let me know what you think of
> > the attached?
>
> Thanks.  My tests with TRUNCATE were probably back when we had multiple
> push/pop cycles of local error callbacks.
>
> This passes my tests.
>

Today, I have done some additional testing with parallel workers and
it seems to display the appropriate errors.  See below:

postgres=# create table t1(c1 int, c2 char(500), c3 char(500));
CREATE TABLE
postgres=# insert into t1 values(generate_series(1,30),'', '');
INSERT 0 30
postgres=# delete from t1 where c1 > 20;
DELETE 10
postgres=# vacuum t1;
ERROR:  Error induced during index vacuum
CONTEXT:  while vacuuming index "idx_t1_c3" of relation "public.t1"
parallel worker
while vacuuming index "idx_t1_c2" of relation "public.t1"

Here, you can see that the index names displayed in the two messages are
different; basically, the leader backend got the error generated in the
worker while it was vacuuming the other index.

I have used the attached patch to induce the error.

I think the patch is in good shape now and I am happy with it.  We can
think of proceeding with this unless we want the further FSM enhancement,
which I am not sure is a good idea.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


induce_error_index.patch
Description: Binary data


Re: Make message at end-of-recovery less scary.

2020-03-23 Thread Peter Eisentraut

On 2020-03-05 08:06, Kyotaro Horiguchi wrote:

| [20866] LOG:  replication terminated by primary server
| [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL:  could not send end-of-streaming message to primary: no COPY 
in progress
| [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive during 
standby mode
| [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0

I changed the above to the below, which looks more adequate.

| [24271]  LOG:  replication terminated by primary server on timeline 1 at 
0/3000240.
| [24271]  FATAL:  could not send end-of-streaming message to primary: no COPY 
in progress
| [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive 
during standby mode
| [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0


Is this the before and after?  That doesn't seem like a substantial 
improvement to me.  You still get the "scary" message at the end.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: error context for vacuum to include block number

2020-03-23 Thread Amit Kapila
On Mon, Mar 23, 2020 at 1:46 PM Justin Pryzby  wrote:
>
> On Mon, Mar 23, 2020 at 04:39:54PM +0900, Masahiko Sawada wrote:
> > I've already commented on the earlier patch, but I personally think we'd
> > do better to report freespace map vacuum as a separate phase. The
> > progress report of the vacuum command is used to track progress, but
> > this error context would be useful for diagnosing failures such as
> > disk corruption. For the visibility map, the visibility map bits
> > processed during vacuum correspond exactly to block numbers, but since
> > freespace map vacuum processes a range of blocks, I've sometimes had
> > trouble identifying the cause of a problem.
>

What extra information can we print that would help?  The main problem I
see is that we would need to sprinkle error-callback updates in many
more places.  We can think of writing wrapper functions for the FSM calls
used in vacuum, but I think those could be used only for vacuum.

> Yea, and it would be misleading if we reported "while scanning block..of
> relation" if we actually failed while writing its FSM.
>
> My previous patches did this:
>
> +   case VACUUM_ERRCB_PHASE_VACUUM_FSM:
> +   errcontext("while vacuuming free space map of 
> relation \"%s.%s\"",
> +   cbarg->relnamespace, cbarg->relname);
> +   break;
>

With what kinds of errors will this help?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: color by default

2020-03-23 Thread Peter Eisentraut

On 2020-03-23 06:04, Michael Paquier wrote:

On Fri, Mar 20, 2020 at 11:22:07PM -0400, Bruce Momjian wrote:

On Fri, Mar 20, 2020 at 11:15:07PM -0400, Tom Lane wrote:

Yeah, but the point is precisely that pg_logging_init()
also responds to PG_COLORS, which is not documented anywhere.


Oh, I thought it was a typo.  OK, so it still an open item.


Yes, I really think that we should have a new section in the docs for
that with more meaningful examples rather than copy-paste that stuff
across more pages of the docs.  Note that 5aaa584 has plugged all the
holes related to PG_COLOR I could find, and that pg_ctl and pg_upgrade
initialize logging with pg_logging_init() but these two cannot use
coloring because they have their own idea of what logging should be.


I'm giving up on making color the default, since there is clearly no 
consensus.


Attached is the documentation patch reworked.

Should we delete all the repetitive mentions on the man pages?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 7164d4196ace14a1acc9077a7a4d32e57131ff6e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 23 Mar 2020 09:22:50 +0100
Subject: [PATCH v2] Document color support

Add a documentation appendix that explains the PG_COLOR and PG_COLORS
environment variables.
---
 doc/src/sgml/color.sgml| 71 ++
 doc/src/sgml/filelist.sgml |  1 +
 doc/src/sgml/postgres.sgml |  1 +
 3 files changed, 73 insertions(+)
 create mode 100644 doc/src/sgml/color.sgml

diff --git a/doc/src/sgml/color.sgml b/doc/src/sgml/color.sgml
new file mode 100644
index 00..c30b1942a5
--- /dev/null
+++ b/doc/src/sgml/color.sgml
@@ -0,0 +1,71 @@
+<!-- doc/src/sgml/color.sgml -->
+
+<appendix id="color">
+ <title>Color Support</title>
+
+ <indexterm zone="color">
+  <primary>color</primary>
+ </indexterm>
+
+ <para>
+  Most programs in the <productname>PostgreSQL</productname> package can produce colorized console
+  output.  This appendix describes how that is configured.
+ </para>
+
+ <sect1 id="color-when">
+  <title>When Color is Used</title>
+
+  <para>
+   To use colorized output, set the environment variable
+   <envar>PG_COLOR</envar><indexterm><primary>PG_COLOR</primary></indexterm>
+   as follows:
+
+   <orderedlist>
+    <listitem>
+     <para>
+      If the value is <literal>always</literal>, then color is used.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      If the value is <literal>auto</literal> and the standard error stream
+      is associated with a terminal device, then color is used.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      Otherwise, color is not used.
+     </para>
+    </listitem>
+   </orderedlist>
+  </para>
+ </sect1>
+
+ <sect1 id="color-which">
+  <title>Configuring the Colors</title>
+
+  <para>
+   The actual colors to be used are configured using the environment variable
+   <envar>PG_COLORS</envar><indexterm><primary>PG_COLORS</primary></indexterm>
+   (note plural).  The value is a colon-separated list of
+   <literal><replaceable>key</replaceable>=<replaceable>value</replaceable></literal>
+   pairs.  The keys specify what the color is to be used for.  The values are
+   SGR (Select Graphic Rendition) specifications, which are interpreted by the
+   terminal.
+  </para>
+
+  <para>
+   The default value is <literal>error=01;31:warning=01;35:locus=01</literal>.
+  </para>
+
+  <tip>
+   <para>
+    This color specification format is also used by other software packages
+    such as <application>GCC</application>, <application>GNU
+    coreutils</application>, and <application>GNU grep</application>.
+   </para>
+  </tip>
+ </sect1>
+</appendix>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 3da2365ea9..1043d0f7ab 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -170,6 +170,7 @@
 
 
 
+<!ENTITY color      SYSTEM "color.sgml">
 
 
 
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index e59cba7997..1f7bd32878 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -278,6 +278,7 @@ Appendixes
   
   
   
+  &color;
 
  
 
-- 
2.25.0



Re: Additional improvements to extended statistics

2020-03-23 Thread Dean Rasheed
On Sat, 21 Mar 2020 at 21:59, Tomas Vondra  wrote:
>
> Ah, right. Yeah, I think that should work. I thought there would be some
> volatility due to groups randomly not making it into the MCV list, but
> you're right it's possible to construct the data in a way to make it
> perfectly deterministic. So that's what I've done in the attached patch.
>

Looking through those new tests, another issue occurred to me -- under
some circumstances this patch can lead to extended stats being applied
twice to the same clause. That is not great, because it involves quite
a lot of extra work, and also because it can lead to overestimates due
to the way that MCV stats are applied as a delta correction to
simple_sel.

The way this comes about is as follows -- if we start with an OR
clause, that will be passed as a single-item implicitly AND'ed list to
clauselist_selectivity(), and from there to
statext_clauselist_selectivity() and then to
statext_mcv_clauselist_selectivity(). This will call
clauselist_selectivity_simple() to get simple_sel, before calling
mcv_clauselist_selectivity(), which will recursively compute all the
MCV corrections. However, the call to clauselist_selectivity_simple()
will call clause_selectivity() for each clause (just a single OR
clause in this example), which will now call
clauselist_selectivity_or(), which will go back into
statext_clauselist_selectivity() with "is_or = true", which will apply
the MCV corrections to the same set of clauses that the outer call was
about to process.

I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.
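
To make the first option concrete, here is a sketch (the extra is_or
parameter is an assumption about how the AND/OR context could be passed
down; it is not part of the current patch):

/*
 * Sketch: when examining members of an implicitly-AND'ed clause list,
 * reject bare OR clauses, so that they are estimated only once -- by
 * the recursive clauselist_selectivity_or() path described above.
 */
static bool
statext_is_compatible_clause(PlannerInfo *root, Node *clause,
                             Index relid, Bitmapset **attnums,
                             bool is_or)
{
    if (!is_or && is_orclause(clause))
        return false;       /* let the recursive call estimate it */

    /* ... the existing compatibility checks would follow here ... */
    return true;
}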

Regards,
Dean




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-03-23 Thread Kyotaro Horiguchi
Thanks for the labour on this.

At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch  wrote in 
> On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> > Pushed, after adding a missing "break" to gist_identify() and tweaking two
..
> The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, but
> parallel workers have zeroes for pendingSyncHash and rd_*Subid.  I hacked up
> the attached patch to understand the scope of the problem (not to commit).  It
> logs a message whenever a parallel worker uses pendingSyncHash or
> RelationNeedsWAL().  Some of the cases happen often enough to make logs huge,
> so the patch suppresses logging for them.  You can see the lower-volume calls
> like this:
> 
>   printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
> 'max_wal_senders = 0' 'force_parallel_mode = regress' 
> >/tmp/minimal_parallel.conf
>   make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
>   find . -name log | xargs grep -rl 'nm0 invalid'
> 
> Not all are actual bugs.  For example, get_relation_info() behaves fine:
> 
>   /* Temporary and unlogged relations are inaccessible during recovery. */
>   if (!RelationNeedsWAL(relation) && RecoveryInProgress())

But the relcache entry shows wrong information about the newness of its
storage, and that is the root cause of all the other problems.

> Kyotaro, can you look through the affected code and propose a strategy for
> good coexistence of parallel query with the WAL skipping mechanism?

Bi-directional communication between leader and workers would be too
much.  It also wouldn't be acceptable to inhibit the problematic
operations, such as heap pruning or btree pin removal, in workers.  And
doing the pending syncs just before worker start would not fix the
issue either.

The attached patch passes a list of pending-sync relfilenodes to workers
at startup.  Workers create an (immature) pending-sync hash from the
list and build relcache entries using that hash.  Given that parallel
workers perform neither transactional operations nor DDL, they need only
the list of relfilenodes.  The list might be long, but I don't think it
is realistic for that many tables to be truncated or created and then
scanned in parallel within a single transaction while wal_level =
minimal.
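
To sketch the mechanism (the function and struct names below are
illustrative, not necessarily those in the attached patch): the leader
flattens pendingSyncHash into the parallel DSM segment, and each worker
rebuilds a hash from that flat list at startup, before touching any
relcache entry.

/*
 * Leader side: serialize pendingSyncHash as an int count followed by
 * that many RelFileNodes.
 */
void
SerializePendingSyncs(Size maxsize, char *start_address)
{
    RelFileNode *dest = (RelFileNode *) (start_address + sizeof(int));
    int          nrels = 0;

    if (pendingSyncHash)
    {
        HASH_SEQ_STATUS scan;
        PendingRelSync *sync;

        hash_seq_init(&scan, pendingSyncHash);
        while ((sync = (PendingRelSync *) hash_seq_search(&scan)) != NULL)
            dest[nrels++] = sync->rnode;
    }

    Assert(sizeof(int) + nrels * sizeof(RelFileNode) <= maxsize);
    memcpy(start_address, &nrels, sizeof(int));
}

The worker side would do the inverse: read the count, then insert each
RelFileNode into a freshly created (read-only) hash.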

> Since I don't expect one strategy to win clearly and quickly, I plan to revert
> the main patch around 2020-03-22 17:30 UTC.  That will give the patch about
> twenty-four hours in the buildfarm, so more animals can report in.  I will
> leave the three smaller patches in place.

Thank you for your trouble and for the check code. Sorry for not
responding in time.

> > fairywren failed differently on 9.5; I have not yet studied it:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2020-03-21%2018%3A01%3A10
> 
> This did not remain specific to 9.5.  On platforms where SIZEOF_SIZE_T==4 or
> SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB.  A simple s/1TB/1GB/ in
> the test should fix this.

Oops. I felt that 2TB looked too large, but didn't take it seriously.
1GB is 1048576 kB, which is less than the said limit of 2097151, so the
attached second patch does that.

The attached is a proposal of fix of the issue on top of the reverted
commit.

- v36-0001-Skip-WAL-for-new-relfilenodes-under-wal_level-mi.patch
 The reverted patch.

- v36-0002-Fix-GUC-value-in-TAP-test.patch
 Change wal_skip_threshold from 2TB to 2GB in the TAP test.

v36-0003-Fix-the-name-of-struct-pendingSyncs.patch
 I found that the struct for pending-sync hash entries is named in a
 different way from the pending-delete hash entry. Change it so that the
 two follow a similar naming convention.

v36-0004-Propagage-pending-sync-information-to-parallel-w.patch

 The proposed fix for the parallel-worker problem.

The make check-world above didn't fail with this patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 7aa0ba8fe7bccda5e24a8d25af925372aae8402f Mon Sep 17 00:00:00 2001
From: Noah Misch 
Date: Sat, 21 Mar 2020 09:38:26 -0700
Subject: [PATCH v36 1/4] Skip WAL for new relfilenodes, under
 wal_level=minimal.

Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT 

Re: shared-memory based stats collector

2020-03-23 Thread Kyotaro Horiguchi
Thank you for looking this.

At Thu, 19 Mar 2020 16:51:59 +1300, Thomas Munro  wrote 
in 
> > This seems like a failure-prone API.
> 
> If I understand correctly, the only purpose of the seqscan_running
> variable is to control that behaviour ^^^.  That is, to make
> dshash_delete_entry() keep the partition lock if you delete an entry
> while doing a seq scan.  Why not get rid of that, and provide a
> separate interface for deleting while scanning?
> dshash_seq_delete(dshash_seq_status *scan, void *entry).  I suppose it
> would be most common to want to delete the "current" item in the seq
> scan, but it could allow you to delete anything in the same partition,
> or any entry if using the "consistent" mode.  Oh, I see that Andres
> said the same thing later.

The attached v25 in [1] is the new version.
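
For reference, such a delete-while-scanning interface might look roughly
like this (a sketch modeled on the internals of src/backend/lib/dshash.c;
the helper names and even the final function name are assumptions and
may differ in whatever gets committed):

/*
 * Delete an entry while a sequential scan is in progress.  The scan
 * already holds the lock on the entry's partition, so no separate
 * lock/unlock dance is needed.
 */
void
dshash_seq_delete(dshash_seq_status *status, void *entry)
{
    dshash_table_item *item = ITEM_FROM_ENTRY(entry);

    /* only entries in the currently locked partition may be deleted */
    Assert(status->curpartition == PARTITION_FOR_HASH(item->hash));

    delete_item(status->hash_table, item);
}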

> > Why does this patch add the consistent mode? There's no users currently?
> > Without it's not clear that we need a seperate _term function, I think?
> 
> +1, let's not do that if we don't need it!

Yes, it is removed.

> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
> 
> Right, I also noticed that:

I think I fixed all cases, except that drop and similar operations still
need an exclusive lock.

> So pgstat_update_tabentry() goes to great trouble to take locks
> conditionally, but then pgstat_update_dbentry() immediately does:
> 
> LWLockAcquire(>lock, LW_EXCLUSIVE);
> dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
> dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
> dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
> dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
> dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
> dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
> dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
> LWLockRelease(>lock);
> 
> Why can't we be "lazy" with the dbentry stats too?  Is it really
> important for the table stats and DB stats to agree with each other?
> Even if it were, your current coding doesn't achieve that: the table
> stats are updated before the DB stat under different locks, so I'm not
> sure why it can't wait longer.

It is now done in a lazy way.

> Hmm.  Even if you change the above code use a conditional lock, I am
> wondering (admittedly entirely without data) if this approach is still
> too clunky: even trying and failing to acquire the lock creates
> contention, just a bit less.  I wonder if it would make sense to make
> readers do more work, so that writers can avoid contention.  For
> example, maybe PgStat_StatDBEntry could hold an array of N sets of
> counters, and readers have to add them all up.  An advanced version of

I considered that kind of solution, but it needs more memory, multiplied
by the number of backends. If the contention turns out not to be
negligible, we can go back to a stats collector process connected via
sockets that shares the results in shared memory. The original
motivation was to avoid the file I/O incurred when backends read the
stats.

> this idea would use a reasonably fresh copy of something like
> sched_getcpu() and numa_node_of_cpu() to select a partition to
> minimise contention and cross-node traffic, with a portable fallback
> based on PID or something.  CPU core/node awareness is something I
> haven't looked into too seriously, but it's been on my mind to solve
> some other problems.

I have been asked about CPU core/node awareness several times.  There
may well be a real need for it.
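
For illustration, the striped-counters idea quoted above could look like
the sketch below (all names here are invented for the sketch; note that
a fixed stripe count bounds the extra memory, unlike one copy per
backend):

#define PGSTAT_NSTRIPES 16      /* fixed, independent of backend count */

typedef struct PgStatDBStripe
{
    pg_atomic_uint64 n_tuples_returned;
    /* ... the remaining per-database counters ... */
} PgStatDBStripe;

/* writer: pick a stripe by backend id; no shared lock is taken */
static inline void
count_tuples_returned(PgStatDBStripe *stripes, uint64 n)
{
    int         i = MyBackendId % PGSTAT_NSTRIPES;

    pg_atomic_fetch_add_u64(&stripes[i].n_tuples_returned, n);
}

/* reader: pays the cost of summing every stripe */
static uint64
sum_tuples_returned(PgStatDBStripe *stripes)
{
    uint64      total = 0;

    for (int i = 0; i < PGSTAT_NSTRIPES; i++)
        total += pg_atomic_read_u64(&stripes[i].n_tuples_returned);
    return total;
}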

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: error context for vacuum to include block number

2020-03-23 Thread Justin Pryzby
On Mon, Mar 23, 2020 at 04:39:54PM +0900, Masahiko Sawada wrote:
> I've already commented on an earlier patch, but I personally think it
> would be better to report freespace map vacuum as a separate phase. The
> progress report of the vacuum command is used to track progress, but
> this error context would be useful for diagnosing failures such as disk
> corruption. For the visibility map, the bits processed during vacuum
> correspond exactly to block numbers, but since freespace map vacuum
> processes a range of blocks, I've sometimes had trouble identifying the
> cause of a problem.

Yea, and it would be misleading if we reported "while scanning block..of
relation" if we actually failed while writing its FSM.

My previous patches did this:

+   case VACUUM_ERRCB_PHASE_VACUUM_FSM:
+       errcontext("while vacuuming free space map of relation \"%s.%s\"",
+                  cbarg->relnamespace, cbarg->relname);
+       break;

Are you suggesting it should report the start (or end?) block number?

-- 
Justin




Re: error context for vacuum to include block number

2020-03-23 Thread Masahiko Sawada
On Sat, 21 Mar 2020 at 16:30, Amit Kapila  wrote:
>
> On Fri, Mar 20, 2020 at 8:24 PM Justin Pryzby  wrote:
> >
> > On Fri, Mar 20, 2020 at 04:58:08PM +0530, Amit Kapila wrote:
> > > See how the attached looks.  I have written a commit message as well;
> > > see if I have missed anyone from the credit list.
> >
> > Thanks for looking again.
> >
> > Couple tweaks:
> >
>
> I have addressed your comments in the attached patch.  Today, while
> testing error messages from various phases, I noticed that the patch
> fails to display error context if the error occurs during the truncate
> phase.  The reason was that we had popped the error stack in
> lazy_scan_heap due to which it never calls the callback.  I think we
> need to set up callback at a higher level as is done in the attached
> patch.  I have done the testing by inducing errors in various phases
> and it prints the required information.  Let me know what you think of
> the attached?

I've looked at the current version of the patch.

+/* Phases of vacuum during which we report error context. */
+typedef enum
+{
+   VACUUM_ERRCB_PHASE_UNKNOWN,
+   VACUUM_ERRCB_PHASE_SCAN_HEAP,
+   VACUUM_ERRCB_PHASE_VACUUM_INDEX,
+   VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+   VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
+   VACUUM_ERRCB_PHASE_TRUNCATE
+} ErrCbPhase;

I've already commented on an earlier patch, but I personally think it
would be better to report freespace map vacuum as a separate phase. The
progress report of the vacuum command is used to track progress, but
this error context would be useful for diagnosing failures such as disk
corruption. For the visibility map, the bits processed during vacuum
correspond exactly to block numbers, but since freespace map vacuum
processes a range of blocks, I've sometimes had trouble identifying the
cause of a problem.
What do you think?
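
Concretely, that would mean one more value in the enum quoted above,
along these lines (a sketch, not taken from the posted patch):

typedef enum
{
    VACUUM_ERRCB_PHASE_UNKNOWN,
    VACUUM_ERRCB_PHASE_SCAN_HEAP,
    VACUUM_ERRCB_PHASE_VACUUM_INDEX,
    VACUUM_ERRCB_PHASE_VACUUM_HEAP,
    VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
    VACUUM_ERRCB_PHASE_TRUNCATE,
    VACUUM_ERRCB_PHASE_VACUUM_FSM   /* separate FSM phase, as suggested */
} ErrCbPhase;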

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: type of some table storage params on doc

2020-03-23 Thread Atsushi Torikoshi
On Mon, Mar 23, 2020 at 1:58 PM Michael Paquier  wrote:

> On Thu, Mar 19, 2020 at 12:04:45AM -0300, Alvaro Herrera wrote:
> > None here.
>
> Thanks Álvaro.  Applied and back-patched down to 9.5 then.
> --
> Michael
>

Thanks for applying the patch.

Regards,

--
 Atsushi Torikoshi


Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-03-23 Thread Fujii Masao




On 2020/03/20 3:39, Andres Freund wrote:

Hi,

On 2020-03-19 17:21:38 +0900, Fujii Masao wrote:

Pushed! Thanks!


FWIW, I'm a bit doubtful that incurring the overhead of this by default
on everybody is a nice thing. On filesystems with high latency and with
a lot of small relations, the overhead of stat()ing a lot of files can
be almost as high as the actual base backup.


Yeah, so if we receive lots of complaints like that during the beta and
RC phases, we should consider changing the default behavior.

Also, maybe I should measure how long the estimation takes in an
environment where, for example, ten thousand tables (i.e., files) exist,
in order to see whether the default behavior is really time-consuming.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Wait event that should be reported while waiting for WAL archiving to finish

2020-03-23 Thread Fujii Masao




On 2020/03/19 19:39, Atsushi Torikoshi wrote:


On Wed, Feb 26, 2020 at 9:19 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:

I have no idea about this. But I wonder how much that change helps to
reduce power consumption, because waiting for WAL archiving during a
backup basically does not happen very often.


+1.
And as far as I have reviewed the patch, I didn't find any problems.


Thanks for the review!
Barring any objection, I will commit this patch.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Internal key management system

2020-03-23 Thread Masahiko Sawada
On Sat, 21 Mar 2020 at 23:50, Bruce Momjian  wrote:
>
> On Sat, Mar 21, 2020 at 10:01:02AM -0400, Bruce Momjian wrote:
> > On Sat, Mar 21, 2020 at 02:12:46PM +0900, Masahiko Sawada wrote:
> > > On Sat, 21 Mar 2020 at 05:30, Bruce Momjian  wrote:
> > > > We should create an SQL-level master key that is different from the
> > > > block-level master key.  By using separate keys, and not deriving them
> > > > from a single key, they keys can be rotated and migrated to a different
> > > > cluster independently.  For example, users might want to create a new
> > > > cluster with a new block-level key, but might want to copy the SQL-level
> > > > key from the old cluster to the new cluster.  Both keys would be
> > > > unlocked with the same passphrase.
> > >
> > > I've updated the patch according to yesterday's meeting. As described
> > > above by Bruce, the current patch has two encryption keys. Previously
> > > we had the master key in pg_control, but since that exceeded the safe
> > > size limit of pg_control, I moved the two keys to a dedicated file
> > > located at global/pg_key. A wrapped key is 128 bytes, and the total
> > > size including the two wrapped keys became 552 bytes, while the safe
> > > limit is 512 bytes.
> > >
> > > During pg_upgrade we copy the key file from the old cluster to the new
> > > cluster, so data that was wrapped on the old cluster can be unwrapped
> > > on the new cluster.
> >
> > I wonder if we should just use two files, one for each key.
>
> Actually, I think we need three files:
>
> *  TDE WAL key file
> *  TDE block key file
> *  SQL-level file
>
> Primaries and standbys have to use the same TDE WAL key file, but can
> use different TDE block key files to allow for key rotation, so having
> separate files makes sense --- maybe they need to be in their own
> directory.

I've considered to have separate key files once but it would make
things complex to update multiple files atomically. Postgres server
will never start if it crashes in the middle of cluster passphrase
rotation. Can we consider to have keys related to TDE after we
introduce the basic key management system? Probably having keys in a
separate file rather than in pg_control file would be better but we
don't need these keys so far.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: replay pause vs. standby promotion

2020-03-23 Thread Fujii Masao




On 2020/03/20 15:22, Atsushi Torikoshi wrote:


On Fri, Mar 6, 2020 at 10:18 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:


OK, so patch attached.

With this patch, if a promotion is triggered while recovery is paused,
the paused state ends and the promotion continues. OTOH, the patch makes
pg_wal_replay_pause() and _resume() throw an error if they are executed
while a promotion is ongoing.

Regarding recovery_target_action, if the recovery target is reached
while a promotion is ongoing, the "pause" setting will act the same as
"promote", i.e., recovery will finish and the server will start to
accept connections.

To implement the above, I added a new shared variable indicating whether
a promotion has been triggered. Only the startup process can update this
shared variable; other processes, like read-only backends, can check via
this variable whether a promotion is ongoing.

I added a new function, PromoteIsTriggered(), that returns true if a
promotion has been triggered. Since the name of this function and that
of the existing function IsPromoteTriggered() are confusingly similar, I
renamed IsPromoteTriggered() to IsPromoteSignaled(), which is a more
appropriate name.
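
A minimal sketch of that arrangement (simplified; the field names and
locking details here are assumptions rather than code from the patch):

/* In XLogCtlData (shared memory); set only by the startup process. */
bool        SharedPromoteIsTriggered;   /* protected by info_lck */

/* Callable from any process, e.g. read-only backends. */
bool
PromoteIsTriggered(void)
{
    bool        triggered;

    SpinLockAcquire(&XLogCtl->info_lck);
    triggered = XLogCtl->SharedPromoteIsTriggered;
    SpinLockRelease(&XLogCtl->info_lck);

    return triggered;
}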


I've confirmed the patch works as you described above.
And I also poked around it a little bit but found no problems.


Thanks for the review!
Barring any objection, I will commit the patch.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters