Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-20 Thread Kyotaro Horiguchi
Hello. I looked through the latest patch.

At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  wrote in 
> > - When reusing an index build, instead of storing the dropped relid in the
> >   IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), 
> > store
> >   the subid fields in the IndexStmt.  This is less code, and I felt
> >   RelationIdGetRelationCache() invited misuse.
> 
> Hmm. I'm not sure that index_create having the new subid parameters is
> good. And the last if(OidIsValid) clause handles storage persistence,
> so I did that there. But I'm not strongly against it.
> 
> Please give me a bit more time to look at it.


The changes to alter_table.sql and create_table.sql are expected to
cause an assertion failure.  Don't we need that kind of explanation in
the comment?

In swap_relation_files, we can omit the rel2-related code when
USE_ASSERT_CHECKING is not defined.

The patch adds a test for createSubid to pg_visibility.out.  It
doesn't fail without CLOBBER_CACHE_ALWAYS during the regression test,
but CLOBBER_CACHE_ALWAYS causes initdb to fail and the added check
won't be reached.  I'm not sure it is useful.

config.sgml:
+When wal_level is minimal and a
+transaction commits after creating or rewriting a permanent table,
+materialized view, or index, this setting determines how to persist

"creating or truncation" a permanent table?  and maybe "refreshing
matview and reindex". I'm not sure that they can be merged that way.
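
Just to be sure we mean the same cases, the operations that paragraph needs
to cover are roughly the following (only an illustration; assume wal_level =
minimal is already set in postgresql.conf):

BEGIN;
CREATE TABLE wt (a int PRIMARY KEY);              -- creating a permanent table
INSERT INTO wt SELECT generate_series(1, 1000);
COMMIT;

BEGIN;
TRUNCATE wt;                                      -- truncating a permanent table
COMMIT;

CREATE MATERIALIZED VIEW wmv AS SELECT * FROM wt; -- creating a matview
REFRESH MATERIALIZED VIEW wmv;                    -- refreshing a matview
REINDEX TABLE wt;                                 -- reindexing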

Fixes for everything other than the pg_visibility.sql item are in the attached patch.

The others look good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3068e1e94a..38a2edf860 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2889,13 +2889,13 @@ include_dir 'conf.d'
   

 When wal_level is minimal and a
-transaction commits after creating or rewriting a permanent table,
-materialized view, or index, this setting determines how to persist
-the new data.  If the data is smaller than this setting, write it to
-the WAL log; otherwise, use an fsync of the data file.  Depending on
-the properties of your storage, raising or lowering this value might
-help if such commits are slowing concurrent transactions.  The default
-is two megabytes (2MB).
+transaction commits after creating or truncating a permanent table,
+refreshing a materialized view, or reindexing, this setting determines
+how to persist the new data.  If the data is smaller than this
+setting, write it to the WAL log; otherwise, use an fsync of the data
+file.  Depending on the properties of your storage, raising or
+lowering this value might help if such commits are slowing concurrent
+transactions.  The default is two megabytes (2MB).

   
  
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 391a8a9ea3..682619c9db 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1118,16 +1118,18 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 */
{
Relationrel1,
-   rel2;
+   rel2 PG_USED_FOR_ASSERTS_ONLY;
 
rel1 = relation_open(r1, NoLock);
+#ifdef USE_ASSERT_CHECKING
rel2 = relation_open(r2, NoLock);
rel2->rd_createSubid = rel1->rd_createSubid;
rel2->rd_newRelfilenodeSubid = rel1->rd_newRelfilenodeSubid;
rel2->rd_firstRelfilenodeSubid = rel1->rd_firstRelfilenodeSubid;
+   relation_close(rel2, NoLock);
+#endif
RelationAssumeNewRelfilenode(rel1);
relation_close(rel1, NoLock);
-   relation_close(rel2, NoLock);
}
 
/*
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 7c2181ac2f..3c500944cd 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1985,7 +1985,8 @@ select * from another;
 
 drop table another;
 -- Create an index that skips WAL, then perform a SET DATA TYPE that skips
--- rewriting the index.
+-- rewriting the index.  Inadvertent changes to rd_createSubid cause an
+-- assertion failure.
 begin;
 create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
 alter table skip_wal_skip_rewrite_index alter c type varchar(20);
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index 6acf31725f..dae7595957 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -332,6 +332,7 @@ ERROR:  set-returning functions are not allowed in DEFAULT 

Re: Autovacuum on partitioned table

2020-02-20 Thread Masahiko Sawada
On Fri, 21 Feb 2020 at 15:14, yuzuko  wrote:
>
> Hello Amit-san,
>
> Thanks for your comments.
>
> > * White-space noise in the diff (space used where tab is expected);
> > please check with git diff --check and fix.
> Fixed it.
>
> > * Names changes_tuples, m_changes_tuples should be changed_tuples and
> > m_changed_tuples, respectively?
> Yes, I modified it.
>
> > * Did you intend to make it so that we now report *all* inherited
> > stats to the stats collector, not just those for partitioned tables?
> > IOW, did you intend the new feature to also cover traditional
> > inheritance parents? I am talking about the following diff:
> >
> I modified it as follows to apply this feature only to declarative partitioning.
>
> - if (!inh)
> -  pgstat_report_analyze(onerel, totalrows, totaldeadrows,
> - (va_cols == NIL));
> + if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> + pgstat_report_analyze(onerel, totalrows, totaldeadrows,
> +(va_cols == NIL));
>
>
> > Having read the relevant diffs again, I think this could be done
> > without duplicating code too much.  You seem to have added the same
> > logic in two places: do_autovacuum() and table_recheck_autovac().
> > More importantly, part of the logic of relation_needs_vacanalyze() is
> > duplicated in both of the aforementioned places, which I think is
> > unnecessary and undesirable if you consider maintainability. I think
> > we could just add the logic to compute reltuples for partitioned
> > tables at the beginning of relation_needs_vacanalyze() and be done.
> >
> Yes, indeed.  Partitioned tables don't need to be vacuumed, so I added a new
> checking process for partitioned tables outside relation_needs_vacanalyze().
> However, a partitioned table's tabentry->n_dead_tuples is always 0, so
> dovacuum is always false.  So I think that checking both autovacuum
> and analyze for partitioned tables doesn't matter.  I merged
> v3_amit_delta.patch into the new patch and found a minor bug: the
> partitioned table's reltuples was overwritten with its
> classForm->reltuples, so I fixed it.
>
> Also, I think partitioned tables' changes_since_analyze should be reported
> only by the autovacuum process.  So I fixed that too.

Thank you for updating the patch. I tested the v4 patch.

After analyze or autoanalyze on a partitioned table, n_live_tup and
n_dead_tup are updated. However, TRUNCATE and VACUUM on the
partitioned table don't change these values until analyze or autoanalyze
is invoked, whereas on normal tables these values are reset or
changed. For example, with your patch:

* Before
 relname | n_live_tup | n_dead_tup | n_mod_since_analyze
---------+------------+------------+---------------------
 c1      |         11 |          0 |                   0
 c2      |         11 |          0 |                   0
 c3      |         11 |          0 |                   0
 c4      |         11 |          0 |                   0
 c5      |         11 |          0 |                   0
 parent  |         55 |          0 |                   0
(6 rows)

* After 'TRUNCATE parent'
 relname | n_live_tup | n_dead_tup | n_mod_since_analyze
---------+------------+------------+---------------------
 c1      |          0 |          0 |                   0
 c2      |          0 |          0 |                   0
 c3      |          0 |          0 |                   0
 c4      |          0 |          0 |                   0
 c5      |          0 |          0 |                   0
 parent  |         55 |          0 |                   0
(6 rows)

* Before
 relname | n_live_tup | n_dead_tup | n_mod_since_analyze
---------+------------+------------+---------------------
 c1      |          0 |         11 |                   0
 c2      |          0 |         11 |                   0
 c3      |          0 |         11 |                   0
 c4      |          0 |         11 |                   0
 c5      |          0 |         11 |                   0
 parent  |          0 |         55 |                   0
(6 rows)

* After 'VACUUM parent'
 relname | n_live_tup | n_dead_tup | n_mod_since_analyze
---------+------------+------------+---------------------
 c1      |          0 |          0 |                   0
 c2      |          0 |          0 |                   0
 c3      |          0 |          0 |                   0
 c4      |          0 |          0 |                   0
 c5      |          0 |          0 |                   0
 parent  |          0 |         55 |                   0
(6 rows)

We can make this work correctly, but I think that, as a first step, we can
skip updating statistics values of partitioned tables other than
n_mod_since_analyze. If we also supported n_live_tup and n_dead_tup,
users might get confused that other statistics values, such as seq_scan
and seq_tup_read, are still not supported.
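
For reference, a minimal script to observe this looks like the following
(the partitioning scheme here is only an assumption and doesn't matter for
the point being made; the parent's row is populated only once the patch is
applied):

-- Set up a partitioned table with five partitions and populate it.
CREATE TABLE parent (i int) PARTITION BY HASH (i);
CREATE TABLE c1 PARTITION OF parent FOR VALUES WITH (MODULUS 5, REMAINDER 0);
CREATE TABLE c2 PARTITION OF parent FOR VALUES WITH (MODULUS 5, REMAINDER 1);
CREATE TABLE c3 PARTITION OF parent FOR VALUES WITH (MODULUS 5, REMAINDER 2);
CREATE TABLE c4 PARTITION OF parent FOR VALUES WITH (MODULUS 5, REMAINDER 3);
CREATE TABLE c5 PARTITION OF parent FOR VALUES WITH (MODULUS 5, REMAINDER 4);
INSERT INTO parent SELECT generate_series(1, 55);
ANALYZE parent;

-- Check the counters before and after TRUNCATE parent / VACUUM parent
-- (allow a moment for the stats collector to catch up).
SELECT relname, n_live_tup, n_dead_tup, n_mod_since_analyze
  FROM pg_stat_user_tables
 WHERE relname IN ('parent', 'c1', 'c2', 'c3', 'c4', 'c5')
 ORDER BY relname;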

Regards,

-- 
Masahiko Sawada   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Global temporary tables

2020-02-20 Thread Prabhat Sahu
Hi,
I have started testing the "Global temporary table" feature
from "gtt_v11-pg13.patch". Below are my findings:

-- session 1:
postgres=# create global temporary table gtt1(a int);
CREATE TABLE

-- session 2:
postgres=# truncate gtt1 ;
ERROR:  could not open file "base/13585/t3_16384": No such file or directory

Is this expected?

On Sun, Feb 16, 2020 at 8:53 PM Pavel Stehule wrote:

>
>
> On Sun, Feb 16, 2020 at 16:15, 曾文旌(义从) wrote:
>
>>
>>
>> On Feb 15, 2020, at 6:06 PM, Pavel Stehule wrote:
>>
>>
>> postgres=# insert into foo select generate_series(1,1);
>>> INSERT 0 1
>>> postgres=# \dt+ foo
>>>   List of relations
>>> ┌────────┬──────┬───────┬───────┬─────────────┬────────┬─────────────┐
>>> │ Schema │ Name │ Type  │ Owner │ Persistence │  Size  │ Description │
>>> ╞════════╪══════╪═══════╪═══════╪═════════════╪════════╪═════════════╡
>>> │ public │ foo  │ table │ pavel │ session     │ 384 kB │             │
>>> └────────┴──────┴───────┴───────┴─────────────┴────────┴─────────────┘
>>> (1 row)
>>>
>>> postgres=# truncate foo;
>>> TRUNCATE TABLE
>>> postgres=# \dt+ foo
>>>   List of relations
>>> ┌────────┬──────┬───────┬───────┬─────────────┬───────┬─────────────┐
>>> │ Schema │ Name │ Type  │ Owner │ Persistence │ Size  │ Description │
>>> ╞════════╪══════╪═══════╪═══════╪═════════════╪═══════╪═════════════╡
>>> │ public │ foo  │ table │ pavel │ session     │ 16 kB │             │
>>> └────────┴──────┴───────┴───────┴─────────────┴───────┴─────────────┘
>>> (1 row)
>>>
>>> I expect zero size after truncate.
>>>
>>> Thanks for review.
>>>
>>> I can explain, I don't think it's a bug.
>>> The current implementation of the truncated GTT retains two blocks of
>>> FSM pages.
>>> The same is true for truncating regular tables in subtransactions.
>>> This is an implementation that truncates the table without changing the
>>> relfilenode of the table.
>>>
>>>
>> This is not an extra-important feature - but it is a little bit of a surprise,
>> because I was not inside a transaction.
>>
>> Changing relfilenode, I think, is necessary, minimally for future VACUUM
>> FULL support.
>>
>> Not allowing relfilenode changes is a current limitation.
>> I think it can be improved, but this is a bit complicated,
>> so I'd like to know the necessity of this improvement.
>> Could you give me more details?
>>
>
> I don't think GTT can be accepted without support for VACUUM FULL, just
> for consistency.
>
> Regards
>
> Pavel
>
>
>>
>> Regards
>>
>> Pavel Stehule
>>
>>
>>>
>>> Wenjing
>>>
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>>
>
> Wenjing
>
>
>
>
> >
> > --
> > Robert Haas
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
>
>
>>>
>>

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-02-20 Thread Michael Paquier
On Thu, Feb 20, 2020 at 05:38:15PM +0100, Bernd Helmle wrote:
> So I propose a different approach, which the attached patch tries to
> implement: instead of just blindly iterating over directory contents
> and filtering them out, reference the tablespace location and
> TABLESPACE_VERSION_DIRECTORY directly. This is done by a new function,
> scan_tablespaces(), which is specialized in just following the
> symlinks/junctions in pg_tblspc and calling scan_directory() with just
> what it has found there. It will also honour directories, just in case
> an experienced DBA has copied the tablespace into pg_tblspc
> directly.

+   if (S_ISREG(st.st_mode))
+   {
+   pg_log_debug("ignoring file %s in pg_tblspc", de->d_name);
+   continue;
+   }
We don't do that for the normal directory scan path, so it does not
strike me as a good idea on consistency grounds.  As a whole, I don't
see much point in having a separate routine which is roughly a
duplicate of scan_directory(), and I think that we had better just add
a check looking for a match with TABLESPACE_VERSION_DIRECTORY
when we find a directory and subdir is "pg_tblspc".  That
also makes the patch much shorter.

+ * the direct path to it and check via lstat wether it exists.
s/wether/whether/, repeated three times.

We should have some TAP tests for that.  The first patch of this
thread from Michael had some, but I would just have added a dummy
tablespace with an empty file in 002_actions.pl, triggering an error
if pg_checksums is not fixed.  Dummy entries around the place where
dummy temp files are added would be fine.
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: Add uri percent-encoding for binary data

2020-02-20 Thread Anders Åstrand
Thanks for keeping this alive even though I disappeared after submitting it!

I can write documentation this weekend.

Thanks again.
//Anders

On Thu, 20 Feb 2020, 23:28 Alvaro Herrera,  wrote:

> On 2019-Oct-07, Anders Åstrand wrote:
>
> > Attached is a patch for adding uri as an encoding option for
> > encode/decode. It uses what's called "percent-encoding" in rfc3986
> > (https://tools.ietf.org/html/rfc3986#section-2.1).
>
> Thanks.  Seems useful.  I made a few cosmetic tweaks and it looks almost
> ready to me; however, documentation is missing.  I added a stub; can you
> please complete that?
>
> To answer Arthur Zakirov's question: yes, the standard recommends
> ("should") to use uppercase characters:
>
> :  For consistency, URI producers and
> :  normalizers should use uppercase hexadecimal digits for all percent-
> :  encodings.
>
> Thanks,
>
> --
> Álvaro Herrera   https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Add PGURI env var for passing connection string to psql in Docker

2020-02-20 Thread Craig Ringer
On Fri, 21 Feb 2020 at 08:03, Michael Leonhard  wrote:
> 1. parse a perfectly good URI

You have a URI with embedded password, which to me is not a perfectly
good URI at all. I think the problem really lies with the input:
separate your secret credentials out to start with, don't munge them
into a URI.

> ~/.pgpass is useful for folks who manually connect to databases.  I'm
> writing deployment, backup, and restore automation tools.  I would
> like to keep these tools simple.  Using pgpass requires extra steps:

That's why we have pg_service.conf, though that only helps libpq applications.

It's a shame that Docker doesn't make it simpler to inject individual
files into containers at "docker run" time. But wrapper dockerfiles
are trivial. -v bind mounting is also an option but then you have the
file sitting around on the host, which is undesirable. You can unlink
the bind mounted dir though.

For Docker you have --env-file to avoid putting the environment on the
command line of the container-host, which helps explain why you are
willing to use an env var for this. I wouldn't be too confident in
assuming there's no way to peek at the environment of the
containerised process(es) from outside the container. Much more likely
than being able to peek at a file, anyway.

Then again, Docker relies on dropping capabilities and likes to run as
root-that-isn't-root-except-when-it's-root, which doesn't thrill me
when it comes to security. At all.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise




Re: Autovacuum on partitioned table

2020-02-20 Thread yuzuko
Hello Amit-san,

Thanks for your comments.

> * White-space noise in the diff (space used where tab is expected);
> please check with git diff --check and fix.
Fixed it.

> * Names changes_tuples, m_changes_tuples should be changed_tuples and
> m_changed_tuples, respectively?
Yes, I modified it.

> * Did you intend to make it so that we now report *all* inherited
> stats to the stats collector, not just those for partitioned tables?
> IOW, did you intend the new feature to also cover traditional
> inheritance parents? I am talking about the following diff:
>
I modified it as follows to apply this feature only to declarative partitioning.

- if (!inh)
-  pgstat_report_analyze(onerel, totalrows, totaldeadrows,
- (va_cols == NIL));
+ if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+(va_cols == NIL));


> Having read the relevant diffs again, I think this could be done
> without duplicating code too much.  You seem to have added the same
> logic in two places: do_autovacuum() and table_recheck_autovac().
> More importantly, part of the logic of relation_needs_vacanalyze() is
> duplicated in both of the aforementioned places, which I think is
> unnecessary and undesirable if you consider maintainability. I think
> we could just add the logic to compute reltuples for partitioned
> tables at the beginning of relation_needs_vacanalyze() and be done.
>
Yes, indeed.  Partitioned tables don't need to be vacuumed, so I added a new
checking process for partitioned tables outside relation_needs_vacanalyze().
However, a partitioned table's tabentry->n_dead_tuples is always 0, so
dovacuum is always false.  So I think that checking both autovacuum
and analyze for partitioned tables doesn't matter.  I merged v3_amit_delta.patch
into the new patch and found a minor bug: the partitioned table's reltuples was
overwritten with its classForm->reltuples, so I fixed it.

Also, I think partitioned tables' changes_since_analyze should be reported
only by the autovacuum process.  So I fixed that too.

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center


v4_autovacuum_on_partitioned_table.patch
Description: Binary data


Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-02-20 Thread Michael Paquier
 On Thu, Feb 20, 2020 at 07:37:15AM -0600, David Steele wrote:
> But since the name includes the backend's pid you would need to get lucky
> and have a new backend with the same pid create the file after a restart.  I
> tried it and the old temp file was left behind after restart and first
> connection to the database.
> 
> I doubt this is a big issue in the field, but it seems like it would be nice
> to do something about it.

The natural area to do that would be around ResetUnloggedRelations().
Still, that would require combining both operations so as not to do any
unnecessary lookups of the data folder paths.

> I'm not excited about the amount of code duplication between these three
> tools.  I know this was because of back-patching various issues in the past,
> but I really think we need to unify these data structures/functions in HEAD.

The lists are duplicated because we have never really figured out how
to combine this code in one place.  The idea was to have all the data
folder path logic and the lists within one header shared between the
frontend and backend but there was not much support for that on HEAD.

>> For now, my proposal is to fix the prefix first, and then let's look
>> at the business with tablespaces where needed.
> 
> OK.

I'll let this patch sit for a couple of extra days, and revisit it at
the beginning of next week.
--
Michael


signature.asc
Description: PGP signature


Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
I wrote:
> Clearly, we gotta do something about that too.  Maybe bumping up
> NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
> accounting for permanently-eaten FDs would be a better idea.  And
> in any case we can't suppose that there's a fixed upper limit on
> the number of postgres_fdw connections.  I'm liking the idea I floated
> earlier of letting postgres_fdw forcibly close the oldest LRU entry.

A late-night glimmer of an idea: suppose we make fd.c keep count,
not just of the open FDs under its control, but also of open FDs
not under its control.  The latter count would include the initial
FDs (stdin/stdout/stderr), and FDs allocated by OpenTransientFile
et al, and we could provide functions for other callers to report
that they just allocated or released a FD.  So postgres_fdw could
report, when it opens or closes a PGconn, that the count of external
FDs should be increased or decreased.  fd.c would then forcibly
close VFDs as needed to keep NUM_RESERVED_FDS worth of slop.  We
still need slop, to provide some daylight for code that isn't aware
of this mechanism.  But we could certainly get all these known
long-lived FDs to be counted, and that would allow fd.c to reduce
the number of open VFDs enough to ensure that some slop remains.

This idea doesn't fix every possible problem.  For instance, if you
have a plperlu function that wants to open a bunch of files, I don't
see any easy way to ensure we release VFDs to make that possible.
But it's sure an improvement on the status quo.

regards, tom lane




Re: client-side fsync() error handling

2020-02-20 Thread Michael Paquier
On Thu, Feb 20, 2020 at 10:10:11AM +0100, Peter Eisentraut wrote:
> OK, added in new patch.

Thanks, that looks good.

> The frontends do neither right now, or at least the error handling is very
> inconsistent and inscrutable.  It would be possible in theory to add a retry
> option, but that would be a very different patch, and given what we have
> learned about fsync(), it probably wouldn't be widely useful.

Perhaps.  Let's have this discussion later if there are drawbacks
about changing things the way your patch does.  If we don't do that,
we'll never know about it either and this patch makes things safer.
--
Michael


signature.asc
Description: PGP signature


Re: allow running parts of src/tools/msvc/ under not Windows

2020-02-20 Thread Michael Paquier
On Thu, Feb 20, 2020 at 09:31:32AM -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
>> The main benefit is that if you make "blind" edits in the Perl files, 
>> you can verify them easily, first by seeing that the Perl code runs, 
>> second, depending on the circumstances, by diffing the created project 
>> files.  Another is that if you do some nontrivial surgery in makefiles, 
>> you can check whether the Perl code can still process them.  So the 
>> benefit is mainly that you can iterate faster when working on build 
>> system related things.  You still need to do a full test on Windows at 
>> the conclusion, but then hopefully with a better chance of success.
> 
> I see.  No objection then.

None from here either, and the patch is working correctly.
--
Michael


signature.asc
Description: PGP signature


Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-02-20 Thread Amit Kapila
On Thu, Feb 20, 2020 at 7:40 PM Alex Malek  wrote:
>
> On Thu, Feb 20, 2020, 6:16 AM Amit Kapila  wrote:
>>
>> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek  wrote:
>> >
>> > Some interesting data points while debugging:
>> > We had lowered the ZFS recordsize from 128K to 32K and for that week the 
>> > issue started happening every other day.
>> > Using xxd and diff we compared "good" and "bad" wal files and the 
>> > differences were not random bad bytes.
>> >
>> > The bad file either had a block of zeros that were not in the good file at 
>> > that position or other data.  Occasionally the bad data has contained 
>> > legible strings not in the good file at that position. At least one of 
>> > those exact strings has existed elsewhere in the files.
>> > However I am not sure if that is the case for all of them.
>> >
>> > This made me think that maybe there was an issue w/ wal file recycling and 
>> > ZFS under heavy load, so we tried lowering
>> > min_wal_size in order to "discourage" wal file recycling but my 
>> > understanding is a low value discourages recycling but it will still
>> > happen (unless setting wal_recycle in psql 12).
>> >
>>
>> We do print a message "recycled write-ahead log file .." in DEBUG2
>> mode.  You either want to run the server with DEBUG2 or maybe change
>> the code to make it LOG and see if that is printed.  If you do that,
>> you can verify if the corrupted WAL is the same as a recycled one.
>
>
> Are you suggesting having the master, the replicas or all in debug mode?
>

The system(s) where you are expecting that wal recycling would have
created some problem.

> How much extra logging would this generate?
>

To some extent, it depends on your workload.  It will certainly
generate much more than when you have not enabled the debug level.
But what other option do you have to identify the root cause, or at
least find out whether your suspicion is right or not?  As mentioned
earlier, if you have the flexibility to change code to find out the
reason, then you can change the code (at the place I mentioned
yesterday) to make the level LOG, in which case you can set
log_min_messages to LOG and it will generate far fewer logs on the
server.
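
For example, without any code change, raising the level on the suspect
systems is enough to see the DEBUG2 message, along these lines:

-- Log DEBUG2-and-above messages, which include
-- "recycled write-ahead log file ...".  Note that this makes the
-- server logs considerably more verbose.
ALTER SYSTEM SET log_min_messages = 'debug2';
SELECT pg_reload_conf();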

> A replica typically consumes over 1 TB of WAL files before a bad wal file is 
> encountered.
>
>
>> >
>> > Any insight into the source of this bug or how to address it?
>> >
>> > Since the master has a good copy of the WAL file, can the replica 
>> > re-request  the file from the master? Or from the archive?
>> >
>>
>> I think we do check in the archive if we get the error during
>> streaming, but archive might also have the same data due to which this
>> problem happens.  Have you checked that the archive WAL file, is it
>> different from the bad WAL?  See the
>
>
> Typically the master, the archive and the other replicas all have a good copy 
> of the WAL file.
>
>> relevant bits of code in
>> WaitForWALToBecomeAvailable especially the code near below comment:
>>
>> "Failure while streaming. Most likely, we got here because streaming
>> replication was terminated, or promotion was triggered. But we also
>> get here if we find an invalid record in the WAL streamed from master,
>> in which case something is seriously wrong. There's little chance that
>> the problem will just go away, but PANIC is not good for availability
>> either, especially in hot standby mode. So, we treat that the same as
>> disconnection, and retry from archive/pg_wal again. The WAL in the
>> archive should be identical to what was streamed, so it's unlikely
>> that it helps, but one can hope..."
>>
>
> Thank you for this comment!
> This made me realize that on the replicas I had mentioned we had removed the 
> restore_command.
> The replica we thought was not having the issue, was actually also 
> getting/producing bad WAL files but was eventually recovering by getting a 
> good WAL file from the archive b/c it had the restore_command defined.
>

Good to know that there is some way to recover from the situation.
But, I think it is better to find the root cause of what led to bad
WAL files so that you can fix it if possible.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Clean up some old cruft related to Windows

2020-02-20 Thread Michael Paquier
On Thu, Feb 20, 2020 at 12:39:56PM +0100, Juan José Santamaría Flecha wrote:
> Actually, you can still use the vcvars% scripts to configure architecture,
> platform_type and winsdk_version with current VS [1].

We still support the build down to MSVC 2013, so I think that it is
good to mention the options available for 2013 and 2015 as well, as
noted here:
https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-options/how-to-set-environment-variables-for-the-visual-studio-command-line
"Visual Studio 2015 and earlier versions used VSVARS32.bat, not
VsDevCmd.bat for the same purpose."

> Please find attached a new version that addresses these issues.
> 
> [1]
> https://docs.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=vs-2019

Thanks, applied after tweaking the text a bit.  I have applied that
down to 12 where we support MSVC from 2013.
--
Michael


signature.asc
Description: PGP signature


Re: Portal->commandTag as an enum

2020-02-20 Thread John Naylor
On Thu, Feb 20, 2020 at 5:16 AM Mark Dilger
 wrote:
>
> > On Feb 19, 2020, at 3:31 AM, John Naylor  
> > wrote:
> > thought it through. For one advantage, I think it might be nicer to
> > have indexing.dat and toasting.dat instead of having to dig the info
> > out of C macros. This was evident while recently experimenting with
> > generating catcache control data.
>
> I guess you mean macros DECLARE_UNIQUE_INDEX and DECLARE_TOAST.  I don’t mind 
> converting that to .dat files, though I’m mindful of Tom’s concern expressed 
> early in this thread about the amount of code churn.  Is there sufficient 
> demand for refactoring this stuff?  There are more reasons in the 
> conversation below to refactor the perl modules and code generation scripts.

Yes, I was referring to those macros, but I did not mean to imply you
should convert them, either now or ever. I was just thinking out loud
about the possibility. :-)

That said, if we ever want Catalog.pm to delegate vivification to
DataFile.pm, there will eventually need to be a way to optionally
preserve comments and whitespace.

> I have taken all this advice in v5 of the patch.  --inputfile and --outputdir 
> (previously named --sourcedir) are now optional with the defaults as you 
> suggested.

+my $inputfile = '../../include/utils/commandtag.dat';

I think we should skip the default for the input file, since the
relative path is fragile and every such script I've seen requires it
to be passed in.

> > DataFiles.pm
> > [...]
> > I'm not familiar with using different IO routines depending on the OS
> > -- what's the benefit of that?
>
> I think you are talking about the slurp_file routine.  That came directly 
> from the TestLib.pm module.  I don't have enough perl-on-windows experience 
> to comment about why it does things that way.

Yes, that one, sorry I wasn't explicit.

> Should I submit a separate patch refactoring the location of perl modules and 
> functions of general interest into src/tools?  src/tools/perl?

We may have accumulated enough disparate functionality by now to
consider this, but it sounds like PG14 material in any case.

> I expect there will have to be a v6 once this has been adequately debated.

So far I haven't heard any justification for why it should remain in
src/backend/catalog, when it has nothing to do with catalogs. We don't
have to have a standard other-place for there to be a better
other-place.

> > --
> > gencommandtag.pl

> > sanity_check_data() seems longer than the main body of the script
> > (minus header boilerplate), and I wonder if we can pare it down some.
> > For one, I have difficulty imagining anyone would accidentally type an
> > unprintable or non-ascii character in a command tag and somehow not
> > realize it.
> > [...]
> > For another, duplicating checks that were done earlier
> > seems like a maintenance headache.
>
> [ detailed rebuttals about the above points ]

Those were just examples that stuck out at me, so even if I were
convinced to your side on those, my broader point was: the sanity
check seems way over-engineered for something that spits out an enum
and an array. Maybe I'm not imaginative enough. I found it hard to
read in any case.

> Catalog.pm writes a temporary file and then moves it to the final file name 
> at the end.  DataFile buffers the output and only writes it after all the 
> code generation has succeeded.  There is no principled basis for these two 
> modules tackling the same problem in two different ways.

Not the same problem. The temp files were originally for parallel Make
hazards, and now to prevent large portions of the backend from being rebuilt.
rebuilt. I actually think partially written files can be helpful for
debugging, so I don't even think it's a problem. But as I said, it
doesn't matter terribly much either way.

> > Speaking of boilerplate, it's better for readability to separate that
> > from actual code such as:
> >
> > typedef enum CommandTag
> > {
> > #define FIRST_COMMANDTAG COMMANDTAG_$sorted[0]->{taglabel})
>
> Good idea.  While I was doing this, I also consolidated the duplicated 
> boilerplate into just one function.  I think this function, too, should go in 
> just one perl module somewhere.  See boilerplate_header() for details.

I like this a lot.

While thinking, I wonder if it makes sense to have a CodeGen module,
which would contain e.g. the new ParseData function,
FindDefinedSymbol, and functions for writing boilerplate.

--
John Naylor   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: custom postgres launcher for tests

2020-02-20 Thread Craig Ringer
On Tue, 11 Feb 2020 at 19:33, Ivan Taranov  wrote:
>
> This patch allows using a custom postgres launcher for the TAP tests
> by setting the environment variable PGLAUNCHER.

I thought I saw a related patch to this that proposed to add a pg_ctl
argument. Was that you too? I can't find it at the moment.

In any case I *very definitely* want the ability to wrap the
'postgres' executable we launch via some wrapper command. I currently
do this with a horrid shellscript hack that uses next-on-path lookups
and a wrapper 'postgres' executable. But because of how
find_other_exec works this is rather far from optimal so I'm a fan of
a cleaner, built-in alternative.

I was initially going to object to using an env-var. But if someone
controls your environment they can already LD_PRELOAD you or PATH-hack
into doing whatever they want, so it's not a security concern. And
you're right that getting a pg_ctl command line or the like into all
the places where we'd want it deep inside TAP tests etc is just too
much hassle otherwise.

I haven't reviewed the code yet, but for the idea a strong +1000 or so.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise




Re: Memory-Bounded Hash Aggregation

2020-02-20 Thread Jeff Davis
On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote:
> 1) explain.c currently does this:
> 
> I wonder if we could show something for plain explain (without
> analyze).
> At least the initial estimate of partitions, etc. I know not showing
> those details until after execution is what e.g. sort does, but I
> find
> it a bit annoying.

Looks like you meant to include some example explain output, but I
think I understand what you mean. I'll look into it.
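
For the record, with the current patch the extra information only shows up
under EXPLAIN ANALYZE.  A rough way to see it (just an illustration; the
query is made up and the numbers will vary):

SET work_mem = '1MB';
SET enable_sort = off;   -- force the hashed path for illustration

EXPLAIN (ANALYZE, COSTS OFF)
SELECT g % 100000 AS k, count(*)
FROM generate_series(1, 1000000) g
GROUP BY k;

-- The HashAggregate node then reports a line like
--   Memory Usage: ...kB  Batches: ...  Disk: ...kB
-- whereas plain EXPLAIN currently shows nothing extra.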

> 2) The ExecBuildAggTrans comment should probably explain "spilled".

Done.

> 3) I wonder if we need to invent new opcodes? Wouldn't it be simpler
> to
> just add a new flag to the agg_* structs instead? I haven't tried
> hacking
> this, so maybe it's a silly idea.

There was a reason I didn't do it this way, but I'm trying to remember
why. I'll look into this, also.

> 4) lookup_hash_entries says
> 
>/* check to see if we need to spill the tuple for this grouping
> set */
> 
> But that seems bogus, because AFAIK we can't spill tuples for
> grouping
> sets. So maybe this should say just "grouping"?

Yes, we can spill tuples for grouping sets. Unfortunately, I think my
tests (which covered this case previously) don't seem to be exercising
that path well now. I am going to improve my tests, too.

> 5) Assert(nbuckets > 0);

I did not repro this issue, but I did set a floor of 256 buckets.

> which fails with segfault at execution time:

Fixed. I was resetting the hash table context without setting the
pointers to NULL.

Thanks! I attached a new, rebased version. The fixes are quick fixes
for now and I will revisit them after I improve my test cases (which
might find more issues).

Regards,
Jeff Davis

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec7..85f559387f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,23 @@ include_dir 'conf.d'
   
  
 
+ 
+  hashagg_mem_overflow (boolean)
+  
+   hashagg_mem_overflow configuration parameter
+  
+  
+  
+   
+ If hash aggregation exceeds work_mem at query
+ execution time, and hashagg_mem_overflow is set
+ to on, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is off.
+   
+  
+ 
+
  
   max_stack_depth (integer)
   
@@ -4476,6 +4493,24 @@ ANY num_sync ( 
+  enable_hashagg_spill (boolean)
+  
+   enable_hashagg_spill configuration parameter
+  
+  
+  
+   
+Enables or disables the query planner's use of hashed aggregation plan
+types when the memory usage is expected to
+exceed work_mem. This only affects the planner
+choice; actual behavior at execution time is dictated by
+. The default
+is on.
+   
+  
+ 
+
  
   enable_hashjoin (boolean)
   
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50e..2923f4ba46d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -104,6 +104,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 			 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1882,6 +1883,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Agg:
 			show_agg_keys(castNode(AggState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (es->analyze)
+show_hashagg_info((AggState *) planstate, es);
 			if (plan->qual)
 show_instrumentation_count("Rows Removed by Filter", 1,
 		   planstate, es);
@@ -2769,6 +2772,55 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 	}
 }
 
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+	Agg		*agg	   = (Agg *)aggstate->ss.ps.plan;
+	long	 memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+
+	Assert(IsA(aggstate, AggState));
+
+	if (agg->aggstrategy != AGG_HASHED &&
+		agg->aggstrategy != AGG_MIXED)
+		return;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(
+			es->str,
+			"Memory Usage: %ldkB",
+			memPeakKb);
+
+		if (aggstate->hash_batches_used > 0)
+		{
+			appendStringInfo(
+es->str,
+"  Batches: %d  Disk: %ldkB",
+aggstate->hash_batches_used, aggstate->hash_disk_used);
+		}
+
+		appendStringInfo(
+			es->str,
+			"\n");
+	}
+	else
+	{
+		ExplainPropertyInteger("Peak Memory Usage", "kB", 

Re: Protocol problem with GSSAPI encryption?

2020-02-20 Thread Andrew Gierth
> "Stephen" == Stephen Frost  writes:

 >> I figure something along these lines for the fix. Anyone in a
 >> position to test this?

 Stephen> At least at first blush, I tend to agree with your analysis
 Stephen> and patch.

 Stephen> I'll see about getting this actually set up and tested in the
 Stephen> next week or so (and maybe there's some way to also manage to
 Stephen> have a regression test for it..).

*poke*

-- 
Andrew (irc:RhodiumToad)




Re: Add PGURI env var for passing connection string to psql in Docker

2020-02-20 Thread Michael Leonhard
Hi Tom,
Thanks for your reply.  A new PGURI env var would have the same
security risks as the existing PGPASSWORD env var, but no more.  It
would be a usability improvement for folks using Docker.  Docker
provides some special security benefits.  I believe that we can
improve security for users by helping them to use Docker.

~/.pgpass is useful for folks who manually connect to databases.  I'm
writing deployment, backup, and restore automation tools.  I would
like to keep these tools simple.  Using pgpass requires extra steps:
1. parse a perfectly good URI
2. join it back together without the secret part
3. write the secret part to a file in a special format
4. protect the file from unauthorized access
5. expose that file to the Docker container
6. pass the secret-less URI to the process
The chances for screwing this up and leaking credentials are real.
Therefore, I believe PGURI will be much safer in practice than
PGPASSWORD.

Your point about ambiguity if the user sets multiple overlapping env
vars is good.  I think it could be solved reasonably by having other
vars override values in PGURI.  A short sentence in the documentation
would eliminate confusion for users.  Example changes to
app-psql.html:
>
+ PGURI (other environment variables override values from this variable)
PGDATABASE
PGHOST
PGPORT
PGUSER
+ PGPASSWORD
Default connection parameters (see Section 33.14).
<

We could get the best of both worlds by adding both PGURI and
PGURIFILE env vars.  What do you think?

-Michael

On Thu, Feb 20, 2020 at 12:20 PM Tom Lane  wrote:
>
> Michael Leonhard  writes:
> > I need to pass a connection string to psql inside Docker [2].  I can
> > pass it as a process argument, but this exposes the password to other
> > processes on my machine:
> > $ docker run --rm -i -t postgres:11 psql "$(cat db_uri)"
>
> Yeah, if you include the password in the URI :-(
>
> > How about adding PGURI to the list of supported environment variables [3]?
>
> That will not fix your security problem, because on a lot of OSes,
> environment variables are *also* visible to other processes.
>
> There are other practical problems with such a proposal, mainly that
> it's not clear how such a variable ought to interact with existing
> connection-control variables (eg, if you set both PGURI and PGHOST,
> which wins?)
>
> The only safe way to deal with a password is to have some other
> out-of-band way to pass it.  That's one reason for the popularity
> of ~/.pgpass files.  Alternatively, look into non-password
> authentication methods.
>
> regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> ... like coypu, where NUM_RESERVED_FDS is
>> the only thing ensuring we have some spare fds.  I don't know the
>> history but it looks like NUM_RESERVED_FDS was deliberately or
>> accidentally tuned to be just enough to be able to run the regression
>> tests (the interesting ones being the ones that use sockets, like
>> postgres_fdw.sql), but with a new long lived kqueue() fd in the
>> picture, it might have to be increased to cover it, no?

> No.  NUM_RESERVED_FDS was set decades ago, long before any of those tests
> existed, and it has never been changed AFAIK.  It is a bit striking that
> we just started seeing it be insufficient with this patch.  Maybe that's
> just happenstance, but I wonder whether there is a plain old FD leak
> involved in addition to the design issue?  I'll take a closer look at
> exactly what's open when we hit the error.

Hmm ... looks like I'm wrong: we've been skating way too close to the edge
for awhile.  Here's a breakdown of the open FDs in the backend at the time
of the failure, excluding stdin/stdout/stderr (which set_max_safe_fds
accounted for) and FDs pointing to database files:

COMMAND    PID USER       FD  TYPE  DEVICE          SIZE/OFF  NODE      NAME
postmaste 2657 postgres    3r FIFO  0,8             0t0       20902158  pipe                postmaster_alive_fds[0]
postmaste 2657 postgres    4u REG   0,9             0         3877      [eventpoll]         FeBeWaitSet's epoll_fd
postmaste 2657 postgres    7u unix  0x880878e21880  0t0       20902664  socket              socket for a PGconn owned by postgres_fdw
postmaste 2657 postgres    8u IPv6  20902171        0t0       UDP       localhost:40795->localhost:40795  pgStatSock
postmaste 2657 postgres    9u unix  0x880876903c00  0t0       20902605  /tmp/.s.PGSQL.5432  MyProcPort->sock
postmaste 2657 postgres   10r FIFO  0,8             0t0       20902606  pipe                selfpipe_readfd
postmaste 2657 postgres   11w FIFO  0,8             0t0       20902606  pipe                selfpipe_writefd
postmaste 2657 postgres  105u unix  0x880878e21180  0t0       20902647  socket              socket for a PGconn owned by postgres_fdw
postmaste 2657 postgres  118u unix  0x8804772c88c0  0t0       20902650  socket              socket for a PGconn owned by postgres_fdw

One thing to notice is that there are only nine, though NUM_RESERVED_FDS
should have allowed ten.  That's because there are 116 open FDs pointing
at database files, though max_safe_fds is 115.  I'm not sure whether
that's OK or an off-by-one error in fd.c's accounting.  One of the 116
is pointing at a WAL segment, and I think we might not be sending that
through the normal VFD path, so it might be "expected".

But anyway, what this shows is that over time we've eaten enough of
the "reserved" FDs that only three are available for random usage like
postgres_fdw's, if the process's back is against the wall FD-wise.
The postgres_fdw regression test is using all three, meaning there's
exactly no daylight in that test.

Clearly, we gotta do something about that too.  Maybe bumping up
NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
accounting for permanently-eaten FDs would be a better idea.  And
in any case we can't suppose that there's a fixed upper limit on
the number of postgres_fdw connections.  I'm liking the idea I floated
earlier of letting postgres_fdw forcibly close the oldest LRU entry.

BTW, you don't need anything very exotic to provoke this error.
The results above are from a Linux box; I just did "ulimit -n 128"
before starting the postmaster.

regards, tom lane




Re: Improve heavyweight locks instead of building new lock managers?

2020-02-20 Thread Thomas Munro
On Thu, Feb 20, 2020 at 5:14 PM Andres Freund  wrote:
>  16 files changed, 569 insertions(+), 1053 deletions(-)

Nice!

Some comments on 0001, 0003, 0004:

> Subject: [PATCH v1 1/6] Add additional prev/next & detached node functions to

+extern void dlist_check(const dlist_head *head);
+extern void slist_check(const slist_head *head);

I approve of the incidental constification in this patch.

+/*
+ * Like dlist_delete(), but also sets next/prev to NULL to signal not being in
+ * list.
+ */
+static inline void
+dlist_delete_thoroughly(dlist_node *node)
+{
+   node->prev->next = node->next;
+   node->next->prev = node->prev;
+   node->next = NULL;
+   node->prev = NULL;
+}

Instead of introducing this strange terminology, why not just have the
callers do ...

dlist_delete(node);
dlist_node_init(node);

..., or perhaps supply dlist_delete_and_reinit(node) that does exactly
that?  That is, reuse the code and terminology.

+/*
+ * Check if node is detached. A node is only detached if it either has been
+ * initialized with dlist_init_node(), or deleted with
+ * dlist_delete_thoroughly().
+ */
+static inline bool
+dlist_node_is_detached(const dlist_node *node)
+{
+Assert((node->next == NULL && node->prev == NULL) ||
+   (node->next != NULL && node->prev != NULL));
+
+return node->next == NULL;
+}

How about dlist_node_is_linked()?  I don't like introducing random new
verbs when we already have 'linked' in various places, and these
things are, y'know, linked lists.

> Subject: [PATCH v1 3/6] Use dlist for syncrep queue.

LGTM.

> Subject: [PATCH v1 4/6] Use dlists for predicate locking.

+dlist_foreach(iter, (SERIALIZABLEXACT *, reader)->outConflicts)

Yuck...  I suppose you could do this:

-dlist_foreach(iter, (SERIALIZABLEXACT *, reader)->outConflicts)
+dlist_foreach_const(iter, >outConflicts)

... given:

+/* Variant for when you have a pointer to const dlist_head. */
+#define dlist_foreach_const(iter, lhead)\
+for (AssertVariableIsOfTypeMacro(iter, dlist_iter),\
+ AssertVariableIsOfTypeMacro(lhead, const dlist_head *),\
+ (iter).end = (dlist_node *) &(lhead)->head,\
+ (iter).cur = (iter).end->next ? (iter).end->next : (iter).end;\
+ (iter).cur != (iter).end;\
+ (iter).cur = (iter).cur->next)
+

... or find a way to make dlist_foreach() handle that itself, which
seems pretty reasonable given its remit to traverse lists without
modifying them, though perhaps that would require a different iterator
type...

Otherwise looks OK to me and passes various tests I threw at it.




Re: PATCH: Add uri percent-encoding for binary data

2020-02-20 Thread Alvaro Herrera
On 2019-Oct-07, Anders Åstrand wrote:

> Attached is a patch for adding uri as an encoding option for
> encode/decode. It uses what's called "percent-encoding" in rfc3986
> (https://tools.ietf.org/html/rfc3986#section-2.1).

Thanks.  Seems useful.  I made a few cosmetic tweaks and it looks almost
ready to me; however, documentation is missing.  I added a stub; can you
please complete that?

To answer Arthur Zakirov's question: yes, the standard recommends
("should") to use uppercase characters:

:  For consistency, URI producers and
:  normalizers should use uppercase hexadecimal digits for all percent-
:  encodings.
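
For the stub, perhaps an example along these lines would work (I worked out
the output by hand from the patch's encoding rules, so it's worth
double-checking):

-- Unreserved ASCII characters pass through; everything else is
-- percent-encoded using uppercase hex digits.
SELECT encode(convert_to('héllo wörld', 'UTF8'), 'uri');
--  h%C3%A9llo%20w%C3%B6rld

SELECT convert_from(decode('h%C3%A9llo%20w%C3%B6rld', 'uri'), 'UTF8');
--  héllo wörld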

Thanks,

-- 
Álvaro Herrera   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 44475f709762ba1a2a881d20345cc6a4cb086f01 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 20 Feb 2020 18:46:15 -0300
Subject: [PATCH v2] URI encode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Author: Anders Åstrand
Discussion: https://postgr.es/m/APwPebtwJnjjt=euusml1zz6w3jvna1cvjezhbouccytjc9...@mail.gmail.com
---
 doc/src/sgml/func.sgml|  16 +++-
 src/backend/utils/adt/encode.c| 129 ++
 src/test/regress/expected/strings.out |  21 +
 src/test/regress/sql/strings.sql  |   7 ++
 4 files changed, 172 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ceda48e0fc..c60ad4f4e2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3180,7 +3180,8 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
format values are:
base64,
escape,
-   hex
+   hex,
+   uri
   
   encode('123\000\001', 'base64')
   MTIzAAE=
@@ -3274,6 +3275,19 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
   
  
 
+
+
+ uri
+ 
+  uri format
+ 
+ 
+  
+   The uri format represents ...
+  
+ 
+
+

   
 
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index b8d9ec7e00..81d4ea8400 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -110,6 +110,7 @@ binary_decode(PG_FUNCTION_ARGS)
  */
 
 static const char hextbl[] = "0123456789abcdef";
+static const char hextbl_upper[] = "0123456789ABCDEF";
 
 static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -512,6 +513,128 @@ esc_dec_len(const char *src, unsigned srclen)
 	return len;
 }
 
+/*
+ * URI percent encoding
+ *
+ * Percent encodes all byte values except the unreserved ASCII characters as
+ * per RFC3986.
+ */
+
+static unsigned
+uri_encode(const char *src, unsigned srclen, char *dst)
+{
+	char	   *d = dst;
+
+	for (const char *s = src; s < src + srclen; s++)
+	{
+		/*
+		 * RFC3986:
+		 *
+		 * unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
+		 */
+		if ((*s >= 'A' && *s <= 'Z') ||
+			(*s >= 'a' && *s <= 'z') ||
+			(*s >= '0' && *s <= '9') ||
+			*s == '-' ||
+			*s == '.' ||
+			*s == '_' ||
+			*s == '~')
+		{
+			*d++ = *s;
+		}
+		else
+		{
+			*d++ = '%';
+			*d++ = hextbl_upper[(*s >> 4) & 0xF];
+			*d++ = hextbl_upper[*s & 0xF];
+		}
+	}
+	return d - dst;
+}
+
+static unsigned
+uri_decode(const char *src, unsigned srclen, char *dst)
+{
+	const char *s = src;
+	const char *srcend = src + srclen;
+	char	   *d = dst;
+	char		val;
+
+	while (s < srcend)
+	{
+		if (*s == '%')
+		{
+			/*
+			 * Verify we have the needed bytes.  This doesn't happen, since
+			 * uri_dec_len already takes care of validation.
+			 */
+			if (s > srcend - 3)
+elog(ERROR, "invalid uri percent encoding");
+
+			/* Skip '%' */
+			s++;
+
+			val = get_hex(*s++) << 4;
+			val += get_hex(*s++);
+			*d++ = val;
+		}
+		else
+			*d++ = *s++;
+	}
+	return d - dst;
+}
+
+static unsigned
+uri_enc_len(const char *src, unsigned srclen)
+{
+	int			len = 0;
+
+	for (const char *s = src; s < src + srclen; s++)
+	{
+		if ((*s >= 'A' && *s <= 'Z') ||
+			(*s >= 'a' && *s <= 'z') ||
+			(*s >= '0' && *s <= '9') ||
+			*s == '-' ||
+			*s == '_' ||
+			*s == '.' ||
+			*s == '~')
+		{
+			len++;
+		}
+		else
+			len += 3;
+	}
+	return len;
+}
+
+static unsigned
+uri_dec_len(const char *src, unsigned srclen)
+{
+	const char *s = src;
+	const char *srcend = src + srclen;
+	int			len = 0;
+
+	while (s < srcend)
+	{
+		if (*s == '%')
+		{
+			if (s > srcend - 3)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg("invalid uri percent encoding"),
+		 errhint("Input data ends prematurely.")));
+			s++;
+			get_hex(*s++);
+			get_hex(*s++);
+		}
+		else
+			s++;
+		len++;
+	}
+
+	return len;
+}
+
 /*
  * Common
  */
@@ -541,6 +664,12 @@ static const struct
 			esc_enc_len, esc_dec_len, esc_encode, esc_decode
 		}
 	},
+	{
+		"uri",
+		{
+			uri_enc_len, uri_dec_len, uri_encode, uri_decode
+		}
+	},
 	{
 		NULL,
 		{

Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
Thomas Munro  writes:
> One thing I've been planning to do for 13 is to get rid of all the
> temporary create/destroy WaitEventSets from the main backend loops.
> My goal was cutting down on stupid system calls, but this puts a new
> spin on it.  I have a patch set to do a bunch of that[1], but now I'm
> thinking that perhaps I need to be even more aggressive about it and
> set up the 'common' long lived WES up front at backend startup, rather
> than doing it on demand, so that there is no chance of failure due to
> lack of fds once you've started up.

+1

> That wouldn't mean that the postgres_fdw.sql can't fail on a ulimit -n
> = 128 system, though, it might just mean that it's postgres_fdw's
> socket() call that hits EMFILE rather than WES's kqueue() call while
> running that test.

Good point.

> I suppose there are two kinds of system: those
> where ulimit -n is higher than max_files_per_process (defaults, on
> Linux: 1024 vs 1000) so you have more allowance for sockets and the
> like, and those where it isn't, like coypu, where NUM_RESERVED_FDS is
> the only thing ensuring we have some spare fds.  I don't know the
> history but it looks like NUM_RESERVED_FDS was deliberately or
> accidentally tuned to be just enough to be able to run the regression
> tests (the interesting ones being the ones that use sockets, like
> postgres_fdw.sql), but with a new long lived kqueue() fd in the
> picture, it might have to be increased to cover it, no?

No.  NUM_RESERVED_FDS was set decades ago, long before any of those tests
existed, and it has never been changed AFAIK.  It is a bit striking that
we just started seeing it be insufficient with this patch.  Maybe that's
just happenstance, but I wonder whether there is a plain old FD leak
involved in addition to the design issue?  I'll take a closer look at
exactly what's open when we hit the error.

The point about possibly hitting EMFILE in libpq's socket() call is
an interesting one.  libpq of course can't do anything to recover
from that (and even if it could, there are lower levels such as a
possible DNS lookup that we're not going to be able to modify).
I'm speculating about having postgres_fdw ask fd.c to forcibly
free one LRU file before it attempts to open a new libpq connection.
That would prevent EMFILE (process-level exhaustion) and it would
also provide some small protection against ENFILE (system-wide
exhaustion), though of course there's no guarantee that someone
else doesn't snap up the FD you so graciously relinquished.

regards, tom lane




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2020-02-20 Thread Peter Geoghegan
On Thu, Feb 20, 2020 at 10:58 AM Peter Geoghegan  wrote:
> I think that there is a good chance that it just won't matter. The
> number of indexes that won't be able to support deduplication will be
> very small in practice. The important exceptions are INCLUDE indexes
> and nondeterministic collations. These exceptions make sense
> intuitively, and will be documented as limitations of those other
> features.

I wasn't clear about the implication of what I was saying here, which
is: I will make the NOTICE a DEBUG1 message, and leave everything else
as-is in the initial committed version.

-- 
Peter Geoghegan




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Thomas Munro
On Fri, Feb 21, 2020 at 8:56 AM Tom Lane  wrote:
> I wrote:
> > It seems fairly obvious now that I look at it, but: the epoll and kqueue
> > variants of CreateWaitEventSet are both *fundamentally* unsafe, because
> > they assume that they can always get a FD when they want one, which is
> > not a property that we generally want backend code to have.  The only
> > reason we've not seen this before with epoll is a lack of testing
> > under lots-of-FDs stress.
> > The fact that they'll likely leak those FDs on subsequent failures is
> > another not-very-nice property.
>
> Hmmm ... actually, there's a third problem, which is the implicit
> assumption that we can have as many concurrently-active WaitEventSets
> as we like.  We can't, if they depend on FDs --- that's a precious
> resource.  It doesn't look like we actually ever have more than about
> two per process right now, but I'm concerned about what will happen
> as the usage of the feature increases.

One thing I've been planning to do for 13 is to get rid of all the
temporary create/destroy WaitEventSets from the main backend loops.
My goal was cutting down on stupid system calls, but this puts a new
spin on it.  I have a patch set to do a bunch of that[1], but now I'm
thinking that perhaps I need to be even more aggressive about it and
set up the 'common' long lived WES up front at backend startup, rather
than doing it on demand, so that there is no chance of failure due to
lack of fds once you've started up.  I also recently figured out how
to handle some more places with the common WES.  I'll post a new patch
set over on that thread shortly.

That wouldn't mean that the postgres_fdw.sql can't fail on a ulimit -n
= 128 system, though, it might just mean that it's postgres_fdw's
socket() call that hits EMFILE rather than WES's kqueue() call while
running that test.  I suppose there are two kinds of system: those
where ulimit -n is higher than max_files_per_process (defaults, on
Linux: 1024 vs 1000) so you have more allowance for sockets and the
like, and those where it isn't, like coypu, where NUM_RESERVED_FDS is
the only thing ensuring we have some spare fds.  I don't know the
history but it looks like NUM_RESERVED_FDS was deliberately or
accidentally tuned to be just enough to be able to run the regression
tests (the interesting ones being the ones that use sockets, like
postgres_fdw.sql), but with a new long lived kqueue() fd in the
picture, it might have to be increased to cover it, no?

About the potential for leaks, Horiguchi-san realised this hazard and
posted a patch[2] to allow WaitEventSets to be cleaned up by the
resource owner machinery.  That's useful for the temporary
WaitEventSet objects that we'd genuinely need in the patch set that's
part of: that's for creating a query-lifetime WES to manage N
connections to remote shards, and it needs to be cleaned up on
failure.  For the temporary ones created by WaitLatch(), I suspect
they don't really belong in a resource owner: instead we should get
rid of it using my WaitMyLatch() patch set, and if there are any
places where we can't for some reason (I hope not), perhaps a
try/catch block should be used to fix that.
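
For the places that really do need a short-lived set, such a try/catch guard
might look roughly like this (a sketch only, not taken from either patch set):

    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 2);

    PG_TRY();
    {
        AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
        AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);

        /* ... WaitEventSetWait() loop doing the actual work ... */
    }
    PG_CATCH();
    {
        /* Close the epoll/kqueue FD instead of leaking it on error. */
        FreeWaitEventSet(set);
        PG_RE_THROW();
    }
    PG_END_TRY();

    FreeWaitEventSet(set);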

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/20191206.171211.1119526746053895900.horikyota.ntt%40gmail.com




Re: Add PGURI env var for passing connection string to psql in Docker

2020-02-20 Thread Tom Lane
Michael Leonhard  writes:
> I need to pass a connection string to psql inside Docker [2].  I can
> pass it as a process argument, but this exposes the password to other
> processes on my machine:
> $ docker run --rm -i -t postgres:11 psql "$(cat db_uri)"

Yeah, if you include the password in the URI :-(

> How about adding PGURI to the list of supported environment variables [3]?

That will not fix your security problem, because on a lot of OSes,
environment variables are *also* visible to other processes.

There are other practical problems with such a proposal, mainly that
it's not clear how such a variable ought to interact with existing
connection-control variables (eg, if you set both PGURI and PGHOST,
which wins?)

The only safe way to deal with a password is to have some other
out-of-band way to pass it.  That's one reason for the popularity
of ~/.pgpass files.  Alternatively, look into non-password
authentication methods.
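
(To be explicit about the out-of-band idea, and only as a sketch: a tiny C
helper in the spirit of libpq's own pgpass lookup, reading the password from a
file whose path you control instead of putting it in the URI or environment.
The function name and the file layout are illustrative, not libpq API.)

    #include <stdio.h>
    #include <string.h>

    /* Read the first line of a mode-0600 file into buf; strip the newline. */
    static int
    read_secret_file(const char *path, char *buf, size_t buflen)
    {
        FILE   *f = fopen(path, "r");

        if (f == NULL || fgets(buf, (int) buflen, f) == NULL)
        {
            if (f)
                fclose(f);
            return -1;
        }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';
        return 0;
    }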

regards, tom lane




Add PGURI env var for passing connection string to psql in Docker

2020-02-20 Thread Michael Leonhard
Hi PostgreSQL Hackers,
Please forgive me if this is not the preferred place to suggest a new
feature.  I found that a lot of items in the psql TODO list [1] were
posted to this email list.

I need to pass a connection string to psql inside Docker [2].  I can
pass it as a process argument, but this exposes the password to other
processes on my machine:
$ docker run --rm -i -t postgres:11 psql "$(cat db_uri)"

The alternative is to parse the URI, remove the password, and provide
the password via environment variable.  It's ugly:

$ PGPASSWORD=$(cat db_uri |grep -oE ':[a-zA-Z0-9]*@' |tr -d :@ ) \
docker run --rm -i -t postgres:11 \
psql "$(cat db_uri |sed 's/:[:alnum:]*@/@/')"

I would prefer to do this:
$ PGURI="$(cat db_uri)" docker run --rm -i -t postgres:11 -e PGURI psql
How about adding PGURI to the list of supported environment variables [3]?

Sincerely,
Michael

[1] https://wiki.postgresql.org/wiki/Todo#psql
[2] https://hub.docker.com/_/postgres
[3] https://www.postgresql.org/docs/devel/app-psql.html#APP-PSQL-ENVIRONMENT




Open Source Hackathon Mentorship Invitation

2020-02-20 Thread Misha Patel
Hello,

My name is Misha Patel and I’m reaching out on behalf of the HackIllinois
Outreach team. HackIllinois is a 36-hour collegiate Open Source hackathon
that takes place annually at the University of Illinois Urbana-Champaign.
This year, it will be from February 28th-March 1st, 2020. Our mission is to
introduce college students to Open Source, while giving back to the
community. We strive to create a collaborative environment in which our
attendees can learn from and work with developers to make their own
contributions. In past years, we’ve had developers from prominent projects
such as npm, Rust, and Apache come to mentor students from our pool of 900+
attendees.

We’d love it if you could pass along this message to the PostgreSQL
community or any individuals you believe would be interested. We will
provide meals throughout the event and can reimburse for travel and lodging
up to a certain amount depending on where in the US people are coming from.
More information on mentorship can be found at hackillinois.org/mentor. You
can also visit opensource.hackillinois.org to see what kinds of projects
were represented at our event last year.

Best,
Misha Patel
HackIllinois 2020 Outreach Director


Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
I wrote:
> It seems fairly obvious now that I look at it, but: the epoll and kqueue
> variants of CreateWaitEventSet are both *fundamentally* unsafe, because
> they assume that they can always get a FD when they want one, which is
> not a property that we generally want backend code to have.  The only
> reason we've not seen this before with epoll is a lack of testing
> under lots-of-FDs stress.
> The fact that they'll likely leak those FDs on subsequent failures is
> another not-very-nice property.

Hmmm ... actually, there's a third problem, which is the implicit
assumption that we can have as many concurrently-active WaitEventSets
as we like.  We can't, if they depend on FDs --- that's a precious
resource.  It doesn't look like we actually ever have more than about
two per process right now, but I'm concerned about what will happen
as the usage of the feature increases.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
[ redirecting to -hackers ]

I wrote:
> Rémi Zara  writes:
> On 20 Feb 2020, at 12:15, Thomas Munro  wrote:
>>> Remi, any chance you could run gmake installcheck under
>>> contrib/postgres_fdw on that host, to see if this is repeatable?  Can
>>> you tell us about the relevant limits?  Maybe ulimit -n (for the user
>>> that runs the build farm), and also sysctl -a | grep descriptors,
>>> sysctl -a | grep maxfiles?

> I have a working NetBSD 8/ppc installation, will try to reproduce there.

Yup, it reproduces fine here.  I see

$ ulimit -a
...
nofiles   (-n descriptors) 128

which squares with the sysctl values:

proc.curproc.rlimit.descriptors.soft = 128
proc.curproc.rlimit.descriptors.hard = 1772
kern.maxfiles = 1772

and also with set_max_safe_fds' results:

2020-02-20 14:29:38.610 EST [2218] DEBUG:  max_safe_fds = 115, usable_fds = 125, already_open = 3

It seems fairly obvious now that I look at it, but: the epoll and kqueue
variants of CreateWaitEventSet are both *fundamentally* unsafe, because
they assume that they can always get a FD when they want one, which is
not a property that we generally want backend code to have.  The only
reason we've not seen this before with epoll is a lack of testing
under lots-of-FDs stress.

The fact that they'll likely leak those FDs on subsequent failures is
another not-very-nice property.

I think we ought to redesign this so that those FDs are handed out by
fd.c, which can ReleaseLruFile() and retry if it gets EMFILE or ENFILE.
fd.c could also be responsible for the resource tracking needed to
prevent leakage.
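
Something along these lines, say (a sketch of the retry shape only; the
function name and its exact home in fd.c are assumptions, ReleaseLruFile()
is the existing static helper there, and epoll_create1() is just the example
allocator):

    #include <errno.h>
    #include <sys/epoll.h>

    /* Sketch: an fd.c-owned helper for the epoll case. */
    static int
    AcquireEpollFDWithRetry(void)
    {
        int         fd;

        for (;;)
        {
            fd = epoll_create1(EPOLL_CLOEXEC);
            if (fd >= 0)
                return fd;      /* fd.c would also register it for cleanup */

            if (errno != EMFILE && errno != ENFILE)
                return -1;      /* unrelated failure, let the caller report */

            /* Out of descriptors: close one LRU virtual file and retry. */
            if (!ReleaseLruFile())
                return -1;      /* nothing left to give back */
        }
    }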

regards, tom lane




Re: plan cache overhead on plpgsql expression

2020-02-20 Thread Pavel Stehule
st 19. 2. 2020 v 8:09 odesílatel Amit Langote 
napsal:

> On Wed, Feb 19, 2020 at 3:56 PM Amit Langote 
> wrote:
> > On Wed, Feb 19, 2020 at 3:38 PM Pavel Stehule 
> wrote:
> > > st 19. 2. 2020 v 7:30 odesílatel Pavel Stehule <
> pavel.steh...@gmail.com> napsal:
> > >> út 18. 2. 2020 v 17:08 odesílatel Amit Langote <
> amitlangot...@gmail.com> napsal:
> > >>> > I updated the patch to do that.
> > >>> >
> > >>> > With the new patch, `select foo()`, with inline-able sql_incr() in
> it,
> > >>> > runs in 679 ms.
> > >>> >
> > >>> > Without any inline-able function, it runs in 330 ms, whereas with
> > >>> > HEAD, it takes 590 ms.
> > >>>
> > >>> I polished it a bit.
> > >>
> > >>
> > >> the performance looks very interesting - on my comp the execution
> time of  1 iterations was decreased from 34 sec to 15 sec,
> > >>
> > >> So it is interesting speedup
> > >
> > > but regress tests fails
> >
> > Oops, I failed to check src/pl/plpgsql tests.
> >
> > Fixed in the attached.
>
> Added a regression test based on examples discussed here too.
>

It is working without problems

I think this patch is very interesting for Postgres 13

Regards

Pavel

>
> Thanks,
> Amit
>


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2020-02-20 Thread Peter Geoghegan
On Thu, Feb 20, 2020 at 7:38 AM Anastasia Lubennikova
 wrote:
> I don't think this patch really needs more nitpicking )

But when has that ever stopped it?   :-)

> User can discover this with a complex query to pg_index and pg_opclass.
> To simplify this, we can probably wrap this into function or some field
> in pg_indexes.

A function isn't a real user interface, though -- it probably won't be noticed.

I think that there is a good chance that it just won't matter. The
number of indexes that won't be able to support deduplication will be
very small in practice. The important exceptions are INCLUDE indexes
and nondeterministic collations. These exceptions make sense
intuitively, and will be documented as limitations of those other
features.

The numeric/float thing doesn't really make intuitive sense, and
numeric is an important datatype. Still, numeric columns and float
columns seem to rarely get indexed. That just leaves container type
opclasses, like anyarray and jsonb.

Nobody cares about indexing container types with a B-Tree index, with
the possible exception of expression indexes on a jsonb column. I
don't see a way around that, but it doesn't seem all that important.
Again, applications are unlikely to have more than one or two of
those. The *overall* space saving will probably be almost as good as
if the limitation did not exist.

> Anyway, I would wait for feedback from pre-release testers.

Right -- let's delay making a final decision on it. Just like the
decision to enable it by default. It will work this way in the
committed version, but that isn't supposed to be the final word on it.

-- 
Peter Geoghegan




Re: Bug in pg_restore with EventTrigger in parallel mode

2020-02-20 Thread Fabrízio de Royes Mello
On Thu, Feb 20, 2020 at 4:52 AM Michael Paquier  wrote:
>
> That sounds right, as event triggers could interact with GRANT and
> REFRESH of matviews, so they should be logically last.  Looking at the
> recent commit history, this would be similar to 3eb9a5e as we don't
> really have a way to treat event triggers as dependency-sortable
> objects.
>

Indeed... event triggers should be the last thing to be restored.

>  What kind of errors did you see in this customer
> environment?  Errors triggered by one or more event triggers blocking
> some commands based on a tag match?
>

By error I meant the weird behavior I described before: in parallel mode
pg_restore creates the event triggers, other objects are created after that,
and then the event triggers are fired during the restore...

Have a look at the new attached patch.

Regards,

--
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 77bf9edaba..faa93aaf9a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -670,11 +670,13 @@ RestoreArchive(Archive *AHX)
 	{
 		/*
 		 * In serial mode, process everything in three phases: normal items,
-		 * then ACLs, then matview refresh items.  We might be able to skip
-		 * one or both extra phases in some cases, eg data-only restores.
+		 * then ACLs, then matview refresh items, then event triggers.  We
+		 * might be able to skip one or more extra phases in some cases, eg
+		 * data-only restores.
 		 */
 		bool		haveACL = false;
 		bool		haveRefresh = false;
+		bool		haveEventTrigger = false;
 
 		for (te = AH->toc->next; te != AH->toc; te = te->next)
 		{
@@ -692,6 +694,9 @@ RestoreArchive(Archive *AHX)
 case RESTORE_PASS_REFRESH:
 	haveRefresh = true;
 	break;
+case RESTORE_PASS_EVENT_TRIGGER:
+	haveEventTrigger = true;
+	break;
 			}
 		}
 
@@ -714,6 +719,16 @@ RestoreArchive(Archive *AHX)
 	(void) restore_toc_entry(AH, te, false);
 			}
 		}
+
+		if (haveEventTrigger)
+		{
+			for (te = AH->toc->next; te != AH->toc; te = te->next)
+			{
+if ((te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0 &&
+	_tocEntryRestorePass(te) == RESTORE_PASS_EVENT_TRIGGER)
+	(void) restore_toc_entry(AH, te, false);
+			}
+		}
 	}
 
 	if (ropt->single_txn)
@@ -3084,6 +3099,8 @@ _tocEntryRestorePass(TocEntry *te)
 		return RESTORE_PASS_ACL;
 	if (strcmp(te->desc, "MATERIALIZED VIEW DATA") == 0)
 		return RESTORE_PASS_REFRESH;
+	if (strcmp(te->desc, "EVENT TRIGGER") == 0)
+		return RESTORE_PASS_EVENT_TRIGGER;
 	return RESTORE_PASS_MAIN;
 }
 
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 6768f92976..204019b514 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -220,9 +220,10 @@ typedef enum
 {
 	RESTORE_PASS_MAIN = 0,		/* Main pass (most TOC item types) */
 	RESTORE_PASS_ACL,			/* ACL item types */
-	RESTORE_PASS_REFRESH		/* Matview REFRESH items */
+	RESTORE_PASS_REFRESH,		/* Matview REFRESH items */
+	RESTORE_PASS_EVENT_TRIGGER	/* Event Trigger items */
 
-#define RESTORE_PASS_LAST RESTORE_PASS_REFRESH
+#define RESTORE_PASS_LAST RESTORE_PASS_EVENT_TRIGGER
 } RestorePass;
 
 typedef enum


Re: error context for vacuum to include block number

2020-02-20 Thread Tom Lane
Alvaro Herrera  writes:
> Another point is that this patch seems to be leaking memory each time
> you set relation/index/namespace name, since you never free those and
> they are changed over and over.

One other point is that this code seems to be trying to ensure that
the error context callback itself won't need to touch the catalog cache or
relcache, which is an important safety feature ... but it's failing at
that goal, because RelationGetRelationName() is going to hand back a
pointer to a string in the relcache.  You need another pstrdup for that.
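
In other words, something like this in the setup code (names are illustrative,
following the cbarg struct used in the patch):

    /*
     * Copy the name out of the relcache entry up front, so that the error
     * context callback itself never has to dereference the relcache.
     */
    cbarg->relname = pstrdup(RelationGetRelationName(rel));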

regards, tom lane




Removing obsolete configure checks

2020-02-20 Thread Tom Lane
Over in the thread at [1] we agreed to remove assorted code that copes
with missing , on the grounds that C99 requires that header
so we should not have to cater anymore for platforms without it.
This logic could obviously be carried further.  I scraped the buildfarm
configure logs to see what other tests seem pointless (on the grounds that
every active animal reports the same result) and found a fair number.

I think we can just remove these tests, and the corresponding
src/port/ files where there is one:

fseeko
isinf
memmove
rint
signed types
utime
utime.h
wchar.h

All of the above are required by C99 and/or SUSv2, and the configure-using
buildfarm members are unanimous in reporting that they have them, and
msvc/Solution.pm expects Windows to have them.  Removing src/port/isinf.c
will let us get rid of a few more configure tests too:

  # Look for a way to implement a substitute for isinf()
  AC_CHECK_FUNCS([fpclass fp_class fp_class_d class], [break])

although that code path is never taken so it won't save much.

I believe that we can also get rid of these tests:

flexible array members
cbrt
intptr_t
uintptr_t

as these features are likewise required by C99.  Solution.pm thinks that
MSVC does not have the above, but I suspect its information is out of
date.  We could soon find out from the buildfarm, of course.

I also noted that these header checks are passing everywhere,
which is unsurprising because they're required by C99 and/or POSIX:

ANSI C header files
inttypes.h
memory.h
stdlib.h
string.h
strings.h
sys/stat.h
sys/types.h
unistd.h

Unfortunately we're not actually asking for any of those to be probed
for --- it looks like Autoconf just up and does that of its own accord.
So we can't get rid of the tests and save configure cycles thereby.
But we can skip testing the HAVE_FOO_H symbols for them.  We mostly
were already, but there's one or two exceptions.

There are a few other tests that are getting the same results in
all buildfarm configure checks, but Solution.pm is injecting different
results for Windows, such as what to expand "inline" to.  Conceivably
we could hard-code that based on the WIN32 #define and remove the
configure probes, but I'm inclined to think it's not worth the
trouble and possible loss of flexibility.

Barring objections I'll go make this happen.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/5d398bbb-262a-5fed-d839-d0e5cff3c0d7%402ndquadrant.com




Re: error context for vacuum to include block number

2020-02-20 Thread Alvaro Herrera
This is, by far, the most complex error context callback we've tried to
write ... Easy stuff first:

In the error context function itself, you don't need the _() around the
strings: errcontext() is marked as a gettext trigger and it does the
translation itself, so the manually added _() is just cruft.

When reporting index names, make sure to attach the namespace to the
table, not to the index.  Example:

case PROGRESS_VACUUM_PHASE_INDEX_CLEANUP:
-   errcontext(_("while cleaning up index \"%s.%s\" of relation \"%s\""),
-  cbarg->relnamespace, cbarg->indname, cbarg->relname);
+   errcontext("while cleaning up index \"%s\" of relation \"%s.%s\"",
+  cbarg->indname, cbarg->relnamespace, cbarg->relname);

I think it would be worthwhile to have the "truncate wait" phase as a
separate thing from the truncate itself, since it requires acquiring a
possibly taken lock.  This suggests that using the progress enum is not
a 100% solution ... or maybe it suggests that the progress enum too
needs to report the truncate-wait phase separately.  (I like the latter
myself, actually.)

On 2020-Feb-19, Justin Pryzby wrote:

> Also, I was thinking that lazy_scan_heap doesn't needs to do this:
> 
> +   /* Pop the error context stack while calling vacuum */
> +   error_context_stack = errcallback.previous;
> ...
> +   /* Set the error context while continuing heap scan */
> +   error_context_stack = &errcallback;
> 
> It seems to me that's not actually necessary, since lazy_vacuum_heap will just
> *push* a context handler onto the stack, and then pop it back off.

So if you don't pop before pushing, you'll end up with two context
lines, right?

I find that arrangement a bit confusing.  I think it would make sense to
initialize the context callback just *once* for a vacuum run, and from
that point onwards, just update the errcbarg struct to match what
you're currently doing -- not continually pop/push error callback stack
entries.  See below ...

(This means you need to pass the "cbarg" as new argument to some of the
called functions, so that they can update it.)

Another point is that this patch seems to be leaking memory each time
you set relation/index/namespace name, since you never free those and
they are changed over and over.

In init_vacuum_error_callback() you don't need the "switch(phase)" bit;
instead, test rel->rd_rel->relkind, and if it's RELKIND_INDEX then you
put the relname as indexname, otherwise set it to NULL (after freeing
the previous value, if there's one).  Note that with this, you only need
to set the relation name (table name) in the first call!  IOW you should
split init_vacuum_error_callback() in two functions: one "init" to call
at start of vacuum, where you set relnamespace and relname; the other
function is update_vacuum_error_callback() (or you find a better name
for that) and it sets the phase, and optionally the block number and
index name (these last two get reset to InvalidBlkNum/ NULL if not
passed by caller).  I'm not really sure what this means for the parallel
index vacuuming stuff; probably you'll need a special case for that: the
parallel children will need to "init" on their own, right?
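
A rough sketch of that split (field and function shapes are illustrative, not
lifted from the patch):

    typedef struct vacuum_error_cbarg
    {
        char       *relnamespace;
        char       *relname;
        char       *indname;        /* NULL unless working on an index */
        BlockNumber blkno;          /* InvalidBlockNumber when not applicable */
        int         phase;
    } vacuum_error_cbarg;

    /* Called once at the start of the vacuum run. */
    static void
    init_vacuum_error_callback(vacuum_error_cbarg *cbarg, Relation rel, int phase)
    {
        cbarg->relnamespace = get_namespace_name(RelationGetNamespace(rel));
        cbarg->relname = pstrdup(RelationGetRelationName(rel));
        cbarg->indname = NULL;
        cbarg->blkno = InvalidBlockNumber;
        cbarg->phase = phase;
    }

    /* Called whenever the phase, block or index being worked on changes. */
    static void
    update_vacuum_error_callback(vacuum_error_cbarg *cbarg, int phase,
                                 BlockNumber blkno, Relation indrel)
    {
        cbarg->phase = phase;
        cbarg->blkno = blkno;

        if (cbarg->indname)
        {
            pfree(cbarg->indname);
            cbarg->indname = NULL;
        }
        if (indrel != NULL)
            cbarg->indname = pstrdup(RelationGetRelationName(indrel));
    }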

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Fwd: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-02-20 Thread Alex Malek
On Thu, Feb 20, 2020, 6:16 AM Amit Kapila  wrote:

> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek  wrote:
> >
> >
> > Hello Postgres Hackers -
> >
> > We are having a reoccurring issue on 2 of our replicas where replication
> stops due to this message:
> > "incorrect resource manager data checksum in record at ..."
> > This has been occurring on average once every 1 to 2 weeks during large
> data imports (100s of GBs being written)
> > on one of two replicas.
> > Fixing the issue has been relatively straight forward: shutdown replica,
> remove the bad wal file, restart replica and
> > the good wal file is retrieved from the master.
> > We are doing streaming replication using replication slots.
> > However twice now, the master had already removed the WAL file so the
> file had to retrieved from the wal archive.
> >
> > The WAL log directories on the master and the replicas are on ZFS file
> systems.
> > All servers are running RHEL 7.7 (Maipo)
> > PostgreSQL 10.11
> > ZFS v0.7.13-1
> >
> > The issue seems similar to
> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
> and to https://github.com/timescale/timescaledb/issues/1443
> >
> > One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
> sees our array as a single device.
> >
> > Right before the issue started we did some upgrades and altered some
> postgres configs and ZFS settings.
> > We have been slowly rolling back changes but so far the the issue
> continues.
> >
> > Some interesting data points while debugging:
> > We had lowered the ZFS recordsize from 128K to 32K and for that week the
> issue started happening every other day.
> > Using xxd and diff we compared "good" and "bad" wal files and the
> differences were not random bad bytes.
> >
> > The bad file either had a block of zeros that were not in the good file
> at that position or other data.  Occasionally the bad data has contained
> legible strings not in the good file at that position. At least one of
> those exact strings has existed elsewhere in the files.
> > However I am not sure if that is the case for all of them.
> >
> > This made me think that maybe there was an issue w/ wal file recycling
> and ZFS under heavy load, so we tried lowering
> > min_wal_size in order to "discourage" wal file recycling but my
> understanding is a low value discourages recycling but it will still
> > happen (unless setting wal_recycle in psql 12).
> >
>
> We do print a message "recycled write-ahead log file .." in DEBUG2
> mode.  You either want to run the server with DEBUG2 or maybe change
> the code to make it LOG and see if that is printed.  If you do that,
> you can verify if the corrupted WAL is the same as a recycled one.
>

Are you suggesting having the master, the replicas or all in debug mode?
How much extra logging would this generate?
A replica typically consumes over 1 TB of WAL files before a bad wal file
is encountered.



> > There is a third replica where this bug has not (yet?) surfaced.
> > This leads me to guess the bad data does not originate on the master.
> > This replica is older than the other replicas, slower CPUs, less RAM,
> and the WAL disk array is spinning disks.
> > The OS, version of Postgres, and version of ZFS are the same as the
> other replicas.
> > This replica is not using a replication slot.
> > This replica does not serve users so load/contention is much lower than
> the others.
> > The other replicas often have 100% utilization of the disk array that
> houses the (non-wal) data.
> >
> > Any insight into the source of this bug or how to address it?
> >
> > Since the master has a good copy of the WAL file, can the replica
> re-request  the file from the master? Or from the archive?
> >
>
> I think we do check in the archive if we get the error during
> streaming, but archive might also have the same data due to which this
> problem happens.  Have you checked that the archive WAL file, is it
> different from the bad WAL?  See the


Typically the master, the archive and the other replicas all have a good
copy of the WAL file.

> relevant bits of code in
> WaitForWALToBecomeAvailable especially the code near below comment:
>
> "Failure while streaming. Most likely, we got here because streaming
> replication was terminated, or promotion was triggered. But we also
> get here if we find an invalid record in the WAL streamed from master,
> in which case something is seriously wrong. There's little chance that
> the problem will just go away, but PANIC is not good for availability
> either, especially in hot standby mode. So, we treat that the same as
> disconnection, and retry from archive/pg_wal again. The WAL in the
> archive should be identical to what was streamed, so it's unlikely
> that it helps, but one can hope..."
>
>
Thank you for this comment!
This made me realize that on the replicas I had mentioned we had removed
the restore_command.
The replica we thought was not having the issue, 

Re: Parallel copy

2020-02-20 Thread David Fetter
On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:
> > On Thu, Feb 20, 2020 at 5:12 AM David Fetter  wrote:
> > > 
> > > On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> > > > This work is to parallelize the copy command and in particular "Copy
> > > >  from 'filename' Where ;" command.
> > > 
> > > Apropos of the initial parsing issue generally, there's an interesting
> > > approach taken here: https://github.com/robertdavidgraham/wc2
> > > 
> > 
> > Thanks for sharing.  I might be missing something, but I can't figure
> > out how this can help here.  Does this in some way help to allow
> > multiple workers to read and tokenize the chunks?
> 
> I think the wc2 is showing that maybe instead of parallelizing the
> parsing, we might instead try using a different tokenizer/parser and
> make the implementation more efficient instead of just throwing more
> CPUs on it.

That was what I had in mind.

> I don't know if our code is similar to what wc does, maybe parsing
> csv is more complicated than what wc does.

CSV parsing differs from wc in that there are more states in the state
machine, but I don't see anything fundamentally different.
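
To illustrate, a toy byte-at-a-time counter in the wc2 style (not PostgreSQL
code, just to show that quoting only adds a couple of states on top of what
wc needs):

    #include <stddef.h>

    typedef enum { UNQUOTED, QUOTED, QUOTE_IN_QUOTED } CsvState;

    static void
    count_csv(const char *buf, size_t len, long *nfields, long *nrecords)
    {
        CsvState    state = UNQUOTED;

        for (size_t i = 0; i < len; i++)
        {
            char        c = buf[i];

            if (state == QUOTED)
            {
                if (c == '"')
                    state = QUOTE_IN_QUOTED;    /* closing quote or half of "" */
                continue;
            }
            if (state == QUOTE_IN_QUOTED)
            {
                if (c == '"')
                {
                    state = QUOTED;             /* escaped quote, stay quoted */
                    continue;
                }
                state = UNQUOTED;               /* quoted section really ended */
            }
            /* UNQUOTED */
            if (c == '"')
                state = QUOTED;
            else if (c == ',')
                (*nfields)++;
            else if (c == '\n')
            {
                (*nfields)++;
                (*nrecords)++;
            }
        }
    }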

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-02-20 Thread Bernd Helmle
Am Dienstag, den 18.02.2020, 15:15 +0900 schrieb Michael Paquier:
> Fair point.  Now, while the proposed patch is right to use
> TABLESPACE_VERSION_DIRECTORY, shouldn't we use strncmp based on the
> length of TABLESPACE_VERSION_DIRECTORY instead of de->d_name?  It
> seems also to me that the code as proposed is rather fragile, and
> that
> we had better be sure that the check only happens when we are
> scanning
> entries within pg_tblspc.
> 

Yes, after thinking about it and playing around with it a little, I share your
position.  You can still easily cause pg_checksums to error out just by
having arbitrary files lying around in the referenced tablespace locations.
Though I don't think this is a big issue, it looks strange and misleading
if pg_checksums complains about files not belonging to the scanned
PostgreSQL data directory (even though we explicitly note in the docs that
tablespace locations are somewhat taboo for DBAs to put other files and/or
directories in).

So I propose a different approach, which the attached patch tries to
implement: instead of blindly iterating over directory contents and
filtering them out, reference the tablespace location and
TABLESPACE_VERSION_DIRECTORY directly.  This is done by a new function
scan_tablespaces(), which specializes in following the
symlinks/junctions in pg_tblspc and calling scan_directory() with just
what it has found there.  It will also honour plain directories, just in case
an experienced DBA has copied the tablespace into pg_tblspc
directly.

> The issue with pg_internal.init.XX is quite different, so I think
> that
> it would be better to commit that separately first.

Agreed.

Thanks,
Bernd
From 4b6d369f731f111216f9ed6613ad1b41872fb0ea Mon Sep 17 00:00:00 2001
From: Bernd Helmle 
Date: Thu, 20 Feb 2020 17:18:10 +0100
Subject: [PATCH] Make scanning pg_tblspc more robust.

Currently we dive blindly into the subdirectories referenced by
the symlinks (or under Windows via file junctions) located in pg_tblspc,
which has various problems. We need to filter out any other possible tablespace
location found in the referenced subdirectory, which could likely happen
if you have e.g. an upgraded instance via pg_upgrade or other installations using
the same tablespace location.

Instead of filtering out any of these possible foreign directories, try to
resolve the locations in pg_tblspc directly. This is done with the new
scan_tablespaces() helper function, which indirectly calls scan_directory()
after resolving the links in pg_tblspc. It explicitly follows directories found
there too, since it might happen that users copy over tablespace locations directly
into pg_tblspc.
---
 src/bin/pg_checksums/pg_checksums.c | 99 -
 1 file changed, 97 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 46ee1f1dc3..4e2e9cb3e2 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -108,6 +108,15 @@ static const char *const skip[] = {
 	NULL,
 };
 
+/*
+ * Forwarded declarations.
+ */
+static int64
+scan_directory(const char *basedir, const char *subdir, bool sizeonly);
+
+static int64
+scan_tablespaces(const char *basedir, bool sizeonly);
+
 /*
  * Report current progress status.  Parts borrowed from
  * src/bin/pg_basebackup/pg_basebackup.c.
@@ -267,6 +276,91 @@ scan_file(const char *fn, BlockNumber segmentno)
 	close(f);
 }
 
+/*
+ * Scans pg_tblspc of the given base directory for either links
+ * or directories and examines whether an existing tablespace version
+ * directory exists there. If true, scan_directory() will do the checksum
+ * legwork. sizeonly specifies whether we want to compute the directory size
+ * only, required for progress reporting.
+ */
+static int64
+scan_tablespaces(const char *basedir, bool sizeonly)
+{
+	int64  dirsize = 0;
+	char   path[MAXPGPATH];
+	DIR   *dir;
+	struct dirent *de;
+
+	snprintf(path, sizeof(path), "%s/%s", basedir, "pg_tblspc");
+	dir = opendir(path);
+
+	if (!dir)
+	{
+		pg_log_error("could not open directory \"%s\": %m", path);
+		exit(1);
+	}
+
+	/*
+	 * Scan through the pg_tblspc contents. We expect either a symlink
+	 * (or Windows junction) or a directory. The latter might exist due to
+	 * setups where administrators copy over the tablespace directory
+	 * directly into PGDATA, e.g. on recovery systems.
+	 */
+	while((de = readdir(dir)) != NULL)
+	{
+		char tblspc_path[MAXPGPATH];
+		char fn[MAXPGPATH];
+		struct stat st;
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(fn, sizeof(fn), "%s/%s", path, de->d_name);
+		if (lstat(fn, &st) < 0)
+		{
+			pg_log_error("could not stat file \"%s\": %m", fn);
+			exit(1);
+		}
+
+		if (S_ISREG(st.st_mode))
+		{
+			pg_log_debug("ignoring file %s in pg_tblspc", de->d_name);
+			continue;
+		}
+
+#ifndef WIN32
+		if (S_ISDIR(st.st_mode) || S_ISLNK(st.st_mode))

Re: Fix compiler warnings on 64-bit Windows

2020-02-20 Thread Tom Lane
Juan José Santamaría Flecha  writes:
> On Mon, Feb 17, 2020 at 4:52 PM Tom Lane  wrote:
>> Anyway, I'll have a go at updating gaur to use 4.5.x.  There is a
>> sane-looking stdint.h on my second-oldest dinosaur, prairiedog.
>> Don't know about the situation on Windows, though.  We might want
>> to take a close look at NetBSD, too, based on the GCC notes.

> As for Windows, stdint.h was included in VS2010, and currently Postgres
> supports VS2013 to 2019.

I've now updated gaur to gcc 4.5.4 (took a little more hair-pulling
than I would have wished).  I confirm that 0001-Require-stdint.h.patch
works in that environment, so I think you can go ahead and push it.

I think there is room for more extensive trimming of no-longer-useful
configure checks, but I'll start a separate thread about that.

regards, tom lane




Re: WAL usage calculation patch

2020-02-20 Thread Kirill Bychik
> On Tue, 18 Feb 2020 at 06:23, Thomas Munro  wrote:
> > On Mon, Feb 10, 2020 at 8:20 PM Craig Ringer  wrote:
> > > On Wed, 5 Feb 2020 at 21:36, Kirill Bychik  
> > > wrote:
> > > > Patch is separated in two parts: core changes and pg_stat_statements
> > > > additions. Essentially the extension has its schema updated to allow
> > > > two more fields, docs updated to reflect the change. Patch is prepared
> > > > against master branch.
> > > >
> > > > Please provide your comments and/or code findings.
> > >
> > > I like the concept, I'm a big fan of anything that affordably improves
> > > visibility into Pg's I/O and activity.
> >
> > +1
> >
> > > To date I've been relying on tools like systemtap to do this sort of
> > > thing. But that's a bit specialised, and Pg currently lacks useful
> > > instrumentation for it so it can be a pain to match up activity by
> > > parallel workers and that sort of thing. (I aim to find time to submit
> > > a patch for that.)
> >
> > (I'm interested in seeing your conference talk about that!  I did a
> > bunch of stuff with static probes to measure PHJ behaviour around
> > barrier waits and so on but it was hard to figure out what stuff like
> > that to put in the actual tree, it was all a bit
> > use-once-to-test-a-theory-and-then-throw-away.)
> >
> > Kirill, I noticed that you included a regression test that is failing.  Can
> > this possibly be stable across machines or even on the same machine?
> > Does it still pass for you or did something change on the master
> > branch to add a new WAL record since you posted the patch?
>
> Thank you for testing the patch and running extension checks. I assume
> the patch applies without problems.
>
> As for the regr test, it apparently requires some rework. I didn't pay
> attention enough to make sure the data I check is actually meaningful
> and isolated enough to be repeatable.
>
> Please consider the extension part of the patch as WIP, I'll resubmit
> the patch once I get a stable and meaningful test up. Thanks for
> finding it!
>

I have reworked the extension regression test to be more isolated.
Apparently, something merged into the master branch shifted my numbers.

PFA the new patch. Core part didn't change a bit, the extension part
has regression test SQL and expected log changed.

Looking forward for new comments.


wal_stats.core.patch
Description: Binary data


wal_stats.ext.patch
Description: Binary data


Re: Disallow cancellation of waiting for synchronous replication

2020-02-20 Thread Michail Nikolaev
Hello.

Just want to share some thoughts about how this looks from the perspective
of a high-availability web-service application developer,
because sometimes things look different from the other side.  And
everything looks like a disaster, to be honest.

But let's take it one at a time.

First - the problem is not related to upsert queries only. It could be
reproduced by plain INSERTS or UPDATES. For example:

* client 1 inserts new records and waits for synchronous replication
* client 1 cancels the query
* clients 2, 3, 4 and 5 see new data and perform some actions outside
of the database in external systems
* master is switched to the replica with no WAL of new records replicated yet

As a result, the newly inserted data is just gone, but external systems
already rely on it.
And this is just a huge pain for the application and its developer.

Second - this is not at all about the client who canceled the query.  That
client may be super clever and totally understand all of the tricky aspects
and risks of such an action.
But it is about *other* clients, who become able to see the
"non-existing" data.  They don't even have a way to detect such a
situation.

Yes, currently there are a few ways for non-synchronous-replicated
data to become visible (for complex reasons of course):
1) client cancels the query while waiting synchronous replications
2) database restart
3) kill -9 of backend waiting synchronous replications

What is the main difference between 1 on the one hand and 2 and 3 on the
other?  1 is performed not only by humans; moreover, it is performed mostly
by applications.  And it is happening right now on thousands of servers!

Check [1] and [2].  This is the official JDBC driver for PostgreSQL, and I am
sure it is the most popular way to communicate with PostgreSQL these
days.
Its implementation of Statement::setQueryTimeout creates a timer that sends
a cancellation after the timeout.  This is the official and recommended way
to limit statement execution to some interval in JDBC.
In my project (and its libraries) it is used in dozens of places. It
is also possible to search GitHub [4] to understand how widely it's
used.
For example it is used by Spring framework[5], probably the most
popular framework in the world for the rank-2 programming language.

And the situation is even worse.  When does setQueryTimeout start cancelling
queries stuck in synchronous replication like crazy?
Exactly at the moment the connection between the master and the sync
replica is lost (because all backends are now stuck in synrep).  A new master
will be elected in a few seconds (or may already be elected and
working).
And... totally correct code cancels hundreds of queries stuck in
synrep, making "non-existing" data available to be read by other
clients in the same availability zone.

It is just a nightmare, to be honest.

These days almost every web service needs HA for Postgres.  And
practically speaking, if your code (or some library code) calls
Statement::setQueryTimeout, your HA setup (for example, Patroni) is
broken.
And it is really not easy to control setQueryTimeout calls in a modern
application with thousands of third-party libraries.  Also, a lot of
applications are in the support phase.

As for me - I am going to hack the Postgres JDBC driver to ignore
setQueryTimeout entirely for now.

>> I think proper solution here would be to add GUC to disallow cancellation of 
>> synchronous replication.
> This sounds entirely insane to me.  There is no possibility that you
> can prevent a failure from occurring at this step.

Yes, maybe it is insane, but it looks like the whole Java-Postgres-HA world
(and maybe others) is going down.  So I believe we should even backport
such an insane knob.


As developers of distributed systems we don't have many things to rely
on. And they are:
1) if the database clearly says something is committed - it is committed
with ACID guarantees
2) anything else - it may be committed, may not be committed, or may be
waiting to be committed

And we have practically just lost the letter D from ACID.

Thanks,
Michail.

[1] 
https://github.com/pgjdbc/pgjdbc/blob/23cce8ad35d9af6e2a1cb97fac69fdc0a7f94b42/pgjdbc/src/main/java/org/postgresql/core/QueryExecutorBase.java#L164-L200
[2] 
https://github.com/pgjdbc/pgjdbc/blob/ed09fd1165f046ae956bf21b6c7882f1267fb8d7/pgjdbc/src/main/java/org/postgresql/jdbc/PgStatement.java#L538-L540
[3] 
https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setQueryTimeout(int)
[4] https://github.com/search?l=Java=statement.setQueryTimeout=Code
[5] 
https://github.com/spring-projects/spring-framework/blob/master/spring-jdbc/src/main/java/org/springframework/jdbc/datasource/DataSourceUtils.java#L329-L343




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2020-02-20 Thread Anastasia Lubennikova

On 19.02.2020 22:16, Peter Geoghegan wrote:

On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova
 wrote:

Thank you for this work. I've looked through the patches and they seem
to be ready for commit.
I haven't yet read recent documentation and readme changes, so maybe
I'll send some more feedback tomorrow.


The only thing I found is a typo in the comment

+  int  nhtids;  /* Number of heap TIDs in nhtids array */

s/nhtids/htids

I don't think this patch really needs more nitpicking )




In my opinion, this message is too specific for default behavior. It
exposes internal details without explanation and may look to the user like
something went wrong.

You're probably right about that. I just wish that there was some way
of showing the same information that was discoverable, and didn't
require the use of pageinspect. If I make it a DEBUG1 message, then it
cannot really be documented.


A user can discover this with a complex query against pg_index and pg_opclass.
To simplify this, we can probably wrap it into a function or some field
in pg_indexes.

Anyway, I would wait for feedback from pre-release testers.





Re: allow running parts of src/tools/msvc/ under not Windows

2020-02-20 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-02-13 16:36, Tom Lane wrote:
>> Yeah, I'm wondering exactly how this helps.  IME the typical sort of
>> breakage is "the MSVC build doesn't know that file X needs to be
>> included when building Y".  It seems like just building the project
>> files will teach one nothing about that type of omission.

> The main benefit is that if you make "blind" edits in the Perl files, 
> you can verify them easily, first by seeing that the Perl code runs, 
> second, depending on the circumstances, by diffing the created project 
> files.  Another is that if you do some nontrivial surgery in makefiles, 
> you can check whether the Perl code can still process them.  So the 
> benefit is mainly that you can iterate faster when working on build 
> system related things.  You still need to do a full test on Windows at 
> the conclusion, but then hopefully with a better chance of success.

I see.  No objection then.

regards, tom lane




Re: Minor improvement to partition_bounds_copy()

2020-02-20 Thread Julien Rouhaud
On Thu, Feb 20, 2020 at 09:38:26PM +0900, Amit Langote wrote:
> Fujita-san,
>
> On Thu, Feb 20, 2020 at 8:36 PM Etsuro Fujita  wrote:
> > partition_bounds_copy() sets the hash_part and natts variable in each
> > iteration of a loop to copy the datums in the datums array, which
> > would not be efficient.  Attached is small patch for avoiding that.
>
> That looks good to me.

Looks good to me too!




Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-02-20 Thread David Steele

On 2/20/20 12:55 AM, Michael Paquier wrote:

On Wed, Feb 19, 2020 at 12:37:00PM -0600, David Steele wrote:


As far as I can see, the pg_internal.init.XX will not be cleaned up by
PostgreSQL on startup.  I've only tested this in 9.6 so far, but I don't see
any differences in the code since then that would lead to better behavior.
Perhaps that's also something we should fix?


Not sure that it is worth spending cycles on that at the beginning of
recovery as when a mapping file is written its temporary entry
truncates any existing one present matching its name.


But since the name includes the backend's pid you would need to get 
lucky and have a new backend with the same pid create the file after a 
restart.  I tried it and the old temp file was left behind after restart 
and first connection to the database.


I doubt this is a big issue in the field, but it seems like it would be 
nice to do something about it.



I'm really not a fan of a blind prefix match.  I think we should stick with
only excluding files that are created by Postgres.


Thinking more on that, excluding any backup_label with a custom suffix
worries me as it could cause a potential breakage for existing backup
solutions.  So attached is an updated patch making things in a
smarter way: I have added to the exclusion lists the possibility to
match an entry based on its prefix, or not, the choice being optional.
This solves the problem with pg_internal.PID and is careful to not
exclude unnecessary entries like suffixed backup labels or such.  This
leads to some extra duplication within pg_rewind, basebackup.c and
pg_checksums but I think we can live with that, and that makes
back-patching simpler.  Refactoring is still tricky though as it
relates to the use of paths across the backend and the frontend..


I'm not excited about the amount of code duplication between these three 
tools.  I know this was because of back-patching various issues in the 
past, but I really think we need to unify these data 
structures/functions in HEAD.



So backup_label.old and
tablespace_map.old should just be added to the exclude list.  That's how we
have it in pgBackRest.


That would be a behavior change.  We could change that on HEAD, but I
don't think that this can be back-patched as this does not cause an
actual problem.


Right, that should be in HEAD.


For now, my proposal is to fix the prefix first, and then let's look
at the business with tablespaces where needed.


OK.

Regards,
--
-David
da...@pgmasters.net




Re: Parallel copy

2020-02-20 Thread Tomas Vondra

On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:

On Thu, Feb 20, 2020 at 5:12 AM David Fetter  wrote:


On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> This work is to parallelize the copy command and in particular "Copy
>  from 'filename' Where ;" command.

Apropos of the initial parsing issue generally, there's an interesting
approach taken here: https://github.com/robertdavidgraham/wc2



Thanks for sharing.  I might be missing something, but I can't figure
out how this can help here.  Does this in some way help to allow
multiple workers to read and tokenize the chunks?



I think the wc2 is showing that maybe instead of parallelizing the
parsing, we might instead try using a different tokenizer/parser and
make the implementation more efficient instead of just throwing more
CPUs on it.

I don't know if our code is similar to what wc does, maybe parsing
csv is more complicated than what wc does.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Minor improvement to partition_bounds_copy()

2020-02-20 Thread Amit Langote
Fujita-san,

On Thu, Feb 20, 2020 at 8:36 PM Etsuro Fujita  wrote:
> partition_bounds_copy() sets the hash_part and natts variable in each
> iteration of a loop to copy the datums in the datums array, which
> would not be efficient.  Attached is small patch for avoiding that.

That looks good to me.

Thanks,
Amit




Re: Experimenting with transactional memory for SERIALIZABLE

2020-02-20 Thread Thomas Munro
On Thu, Feb 20, 2020 at 11:38 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Can you share some numbers about how not well it perform and how many
> hardware transactions were aborted with a fallback? I'm curious because
> from this paper [1] I've got an impression that the bigger (in terms of
> memory) and longer a transaction is, the higher the chances for it to get
> aborted. So I wonder if it needs to be taken into account, or using it
> for SSI as presented in the patch somehow implicitely minimize those
> chances? Otherwise not only conflicting transactions will cause
> fallback, but also those that e.g. span too much memory.

Good questions, and I don't have good enough numbers to share right
now; to be clear, the stage this work is at is: "wow, I think this new
alien technology might actually be producing the right answers at
least some of the time, now maybe we could start to think about
analysing its behaviour some more", and I wanted to share early and
see if anyone else was interested in the topic too :-)

Thanks for that paper, added to my reading list.  The HTM
transactions' size is not linked to the size of database transactions,
which would certainly be too large.  It's just used for lower level
operations that need to be atomic and serializable, replacing a bunch
of LWLocks.  I see from skimming the final paragraph of that paper
that they're also not mapping database transactions directly to HTM.
So, the amount of memory you touch depends on the current size of
various lists in SSI's internal book keeping, and I haven't done the
work to figure out at which point space runs out (_XABORT_CAPACITY) in
any test workloads etc, or to consider whether some operations that I
covered with one HTM transaction could be safely broken up into
multiple transactions to minimise transaction size, though I am aware
of at least one opportunity like that.
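
For reference, the basic shape of one such lower-level operation under RTM
looks something like this (a simplified sketch using the Intel intrinsics, not
code from the patch; a real elision scheme must also read the fallback lock's
state inside the transaction to stay correct):

    #include <immintrin.h>          /* _xbegin/_xend, needs -mrtm */

    static void
    ssi_critical_section(LWLock *fallback_lock)
    {
        unsigned int status = _xbegin();

        if (status == _XBEGIN_STARTED)
        {
            /*
             * NB: a correct version would check the fallback lock here and
             * _xabort() if it is held; omitted for brevity.
             */

            /* ... touch the shared SSI bookkeeping ... */

            _xend();
            return;
        }

        /* Aborted (conflict, capacity, etc.): fall back to the plain lock. */
        LWLockAcquire(fallback_lock, LW_EXCLUSIVE);
        /* ... same work ... */
        LWLockRelease(fallback_lock);
    }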

> Another interesting for me question is how much is it affected by TAA
> vulnerability [2], and what are the prospects of this approach in the
> view of many suggests to disable TSX due to that (there are mitigations
> ofcourse, but if I understand correctly e.g. for Linux it's similar to
> MDS, where a mitigation is done via flushing cpu buffers on entering the
> kernel space, but in between speculative access still could be
> performed).

Yeah, the rollout of TSX has been a wild ride since the beginning.  I
didn't want to comment on that aspect because I just don't know enough
about it and at this point it's frankly pretty confusing.   As far as
I know from limited reading, as of late last year a few well known
OSes are offering easy ways to disable TSX due to Zombieload v2 if you
would like to, but not doing so by default.  I tested with the Debian
intel-microcode package version 3.20191115.2~deb10u1 installed which I
understand to the be latest and greatest, and made no relevant
modifications, and the instructions were available.  I haven't read
anywhere that TSX itself is ending.  Other ISAs have comparable
technology[1][2], and the concept has been worked on for over 20
years, so... I just don't know.

[1] 
https://developer.arm.com/docs/101028/0008/transactional-memory-extension-tme-intrinsics
[2] 
https://www.ibm.com/developerworks/aix/library/au-aix-ibm-xl-compiler-built-in-functions/index.html




Re: Clean up some old cruft related to Windows

2020-02-20 Thread Juan José Santamaría Flecha
On Thu, Feb 20, 2020 at 4:23 AM Michael Paquier  wrote:

>
> +  You can change certain build options, such as the targeted CPU
> +  architecture, build type, and the selection of the SDK by using either
> +  VSVARS32.bat or VsDevCmd.bat
> depending
> +  on your Visual Studio release. All commands
> +  should be run from the src\tools\msvc directory.
>

I think more parts of this paragraph need tuning, like:

"In Visual Studio, start the Visual Studio Command Prompt. If you wish to
build a 64-bit version, you must use the 64-bit version of the command, and
vice versa."

This is what VsDevCmd.bat does, setting up the Visual Studio Command Prompt,
but from the command-line.

Also the following:

"In the Microsoft Windows SDK, start the CMD shell listed under the SDK on
the Start Menu."

This is not the case; you would be working in the CMD environment previously
set up from Visual Studio.


>  And I am not actually sure which
> environment variable to touch when using VSVARS32.bat or
> VSVARSALL.bat with MSVC <= 2017.
>

Actually, you can still use the vcvars% scripts to configure architecture,
platform_type and winsdk_version with current VS [1].

Both commands have different ways of doing things, and don't really
> shine with their documentation
>

I hear you.

Please find attached a new version that addresses these issues.

[1]
https://docs.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=vs-2019

Regards,

Juan José Santamaría Flecha


install-windows_doc_cleanup-v2.patch
Description: Binary data


Minor improvement to partition_bounds_copy()

2020-02-20 Thread Etsuro Fujita
partition_bounds_copy() sets the hash_part and natts variables in each
iteration of a loop to copy the datums in the datums array, which
would not be efficient.  Attached is a small patch for avoiding that.
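
The idea, roughly (a sketch of the hoisting only, not the attached patch):

    /* Both values are loop-invariant, so compute them once up front. */
    hash_part = (key->strategy == PARTITION_STRATEGY_HASH);
    natts = hash_part ? 2 : partnatts;

    for (i = 0; i < ndatums; i++)
    {
        dest->datums[i] = (Datum *) palloc(sizeof(Datum) * natts);
        /* ... copy each of the natts datums as before ... */
    }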

Best regards,
Etsuro Fujita


partition_bounds_copy.patch
Description: Binary data


Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-02-20 Thread Amit Kapila
On Thu, Feb 20, 2020 at 3:06 AM Alex Malek  wrote:
>
>
> Hello Postgres Hackers -
>
> We are having a reoccurring issue on 2 of our replicas where replication 
> stops due to this message:
> "incorrect resource manager data checksum in record at ..."
> This has been occurring on average once every 1 to 2 weeks during large data 
> imports (100s of GBs being written)
> on one of two replicas.
> Fixing the issue has been relatively straight forward: shutdown replica, 
> remove the bad wal file, restart replica and
> the good wal file is retrieved from the master.
> We are doing streaming replication using replication slots.
> However twice now, the master had already removed the WAL file so the file 
> had to retrieved from the wal archive.
>
> The WAL log directories on the master and the replicas are on ZFS file 
> systems.
> All servers are running RHEL 7.7 (Maipo)
> PostgreSQL 10.11
> ZFS v0.7.13-1
>
> The issue seems similar to 
> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
>  and to https://github.com/timescale/timescaledb/issues/1443
>
> One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS sees 
> our array as a single device.
>
> Right before the issue started we did some upgrades and altered some postgres 
> configs and ZFS settings.
> We have been slowly rolling back changes but so far the the issue continues.
>
> Some interesting data points while debugging:
> We had lowered the ZFS recordsize from 128K to 32K and for that week the 
> issue started happening every other day.
> Using xxd and diff we compared "good" and "bad" wal files and the differences 
> were not random bad bytes.
>
> The bad file either had a block of zeros that were not in the good file at 
> that position or other data.  Occasionally the bad data has contained legible 
> strings not in the good file at that position. At least one of those exact 
> strings has existed elsewhere in the files.
> However I am not sure if that is the case for all of them.
>
> This made me think that maybe there was an issue w/ wal file recycling and 
> ZFS under heavy load, so we tried lowering
> min_wal_size in order to "discourage" wal file recycling but my understanding 
> is a low value discourages recycling but it will still
> happen (unless setting wal_recycle in psql 12).
>

We do print a message "recycled write-ahead log file .." in DEBUG2
mode.  You either want to run the server with DEBUG2 or maybe change
the code to make it LOG and see if that is printed.  If you do that,
you can verify if the corrupted WAL is the same as a recycled one.
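
(The message comes from the WAL segment recycling path in xlog.c, in
RemoveXlogFile() if I recall correctly; the quick hack would just be bumping
its level there, e.g.:

    ereport(LOG,    /* was DEBUG2 */
            (errmsg("recycled write-ahead log file \"%s\"", segname)));

and rebuilding.)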

> There is a third replica where this bug has not (yet?) surfaced.
> This leads me to guess the bad data does not originate on the master.
> This replica is older than the other replicas, slower CPUs, less RAM, and the 
> WAL disk array is spinning disks.
> The OS, version of Postgres, and version of ZFS are the same as the other 
> replicas.
> This replica is not using a replication slot.
> This replica does not serve users so load/contention is much lower than the 
> others.
> The other replicas often have 100% utilization of the disk array that houses 
> the (non-wal) data.
>
> Any insight into the source of this bug or how to address it?
>
> Since the master has a good copy of the WAL file, can the replica re-request  
> the file from the master? Or from the archive?
>

I think we do check the archive if we get this error during streaming, but
the archive might also have the same data, in which case the problem would
persist.  Have you checked whether the archived WAL file is different from
the bad one?  See the relevant bits of code in
WaitForWALToBecomeAvailable, especially the code near the comment below:

"Failure while streaming. Most likely, we got here because streaming
replication was terminated, or promotion was triggered. But we also
get here if we find an invalid record in the WAL streamed from master,
in which case something is seriously wrong. There's little chance that
the problem will just go away, but PANIC is not good for availability
either, especially in hot standby mode. So, we treat that the same as
disconnection, and retry from archive/pg_wal again. The WAL in the
archive should be identical to what was streamed, so it's unlikely
that it helps, but one can hope..."


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Add PostgreSQL home page to --help output

2020-02-20 Thread Daniel Gustafsson
> On 20 Feb 2020, at 10:53, Daniel Gustafsson  wrote:
> 
>> On 20 Feb 2020, at 10:15, Peter Eisentraut 
>>  wrote:
>> 
>> On 2020-02-13 14:24, Greg Stark wrote:
>>> Sounds like a fine idea. But personally I would prefer it without the <> 
>>> around it, just a URL on a line by itself. I think it would be clearer, 
>>> look cleaner, and be easier to select to copy/paste elsewhere.
>> 
>> I'm on the fence about this one, but I like the delimiters because it would 
>> also work consistently if we put a URL into running text where it might be 
>> immediately adjacent to other characters.  So I was actually going for 
>> easier to copy/paste here, but perhaps in other environments it's not easier?
> 
> For URLs completely on their own, not using <> makes sense.  Copy pasting 
> 
> into the location bar of Safari makes it load the url, but Firefox and Chrome
> turn it into a search engine query (no idea about Windows browsers).
> 
> For URLs in running text it's not uncommon to have <> around the URL for the
> very reason you mention.  Looking at --help and manpages from random open
> source tools there seems to be roughly a 50/50 split on using <> or not.

RFC3986 discusses this as well, with
the content mostly carried over from RFC2396 appendix E.

cheers ./daniel



Re: Parallel copy

2020-02-20 Thread Amit Kapila
On Thu, Feb 20, 2020 at 5:12 AM David Fetter  wrote:
>
> On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> > This work is to parallelize the copy command and in particular "Copy
> >  from 'filename' Where ;" command.
>
> Apropos of the initial parsing issue generally, there's an interesting
> approach taken here: https://github.com/robertdavidgraham/wc2
>

Thanks for sharing.  I might be missing something, but I can't figure
out how this can help here.  Does this in some way help to allow
multiple workers to read and tokenize the chunks?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Experimenting with transactional memory for SERIALIZABLE

2020-02-20 Thread Dmitry Dolgov
> On Thu, Feb 20, 2020 at 04:55:12PM +1300, Thomas Munro wrote:
> Hello hackers,
>
> Here's a *highly* experimental patch set that tries to skip the LWLock
> protocol in predicate.c and use HTM[1] instead.  HTM is itself a sort
> of hardware-level implementation of SSI for shared memory.  My
> thinking was that if your workload already suits the optimistic nature
> of SSI, perhaps it could make sense to go all-in and remove the rather
> complicated pessimistic locking it's built on top of.  It falls back
> to an LWLock-based path at compile time if you don't build with
> --enable-htm, or at runtime if a startup test discovered that your CPU
> doesn't have the Intel TSX instruction set (microarchitectures older
> than Skylake, and some mobile and low power variants of current ones),
> or if a hardware transaction is aborted for various reasons.

Thanks, that sounds cool!

> The good news is that it seems to produce correct results in simple
> tests (well, some lock-held-by-me assertions can fail in an
> --enable-cassert build, that's trivial to fix).  The bad news is that
> it doesn't perform very well yet, and I think the reason for that is
> that there are some inherently serial parts of the current design that
> cause frequent conflicts.

Can you share some numbers on how poorly it performs and how many
hardware transactions were aborted with a fallback? I'm curious because
from this paper [1] I got the impression that the bigger (in terms of
memory) and longer a transaction is, the higher the chances of it getting
aborted. So I wonder whether that needs to be taken into account, or
whether using it for SSI as presented in the patch somehow implicitly
minimizes those chances? Otherwise not only conflicting transactions will
cause a fallback, but also those that e.g. touch too much memory.
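
(As an aside, to make the fallback mechanics concrete for anyone following
along: the usual RTM pattern is that any abort, whether from a data conflict,
a capacity overflow, or an interrupt, transfers control back to _xbegin()
with a status code, and the software then takes a non-transactional path.
Below is a tiny standalone sketch using the GCC TSX intrinsics; it is only an
illustration of that pattern and has nothing to do with the actual patch,
which elides the predicate-lock LWLocks.)

/* Build with: gcc -mrtm rtm_sketch.c (needs a TSX-capable CPU at runtime). */
#include <immintrin.h>
#include <stdatomic.h>
#include <stdio.h>

static long counter;
static _Atomic int fallback_lock;	/* 0 = free, 1 = held */

static void
increment(void)
{
	unsigned int status = _xbegin();

	if (status == _XBEGIN_STARTED)
	{
		/* Reading the lock adds it to the transaction's read set, so a
		 * concurrent fallback holder aborts us instead of racing with us. */
		if (atomic_load(&fallback_lock))
			_xabort(1);
		counter++;				/* transactional update */
		_xend();
	}
	else
	{
		/* Any abort (conflict, capacity, interrupt, explicit) lands here. */
		int		expected = 0;

		while (!atomic_compare_exchange_weak(&fallback_lock, &expected, 1))
			expected = 0;		/* spin until the lock is ours */
		counter++;
		atomic_store(&fallback_lock, 0);
	}
}

int
main(void)
{
	increment();
	printf("counter = %ld\n", counter);
	return 0;
}

The part relevant to my question is the else branch: a transaction whose
read/write set outgrows what the hardware can buffer will always end up
there, no matter how conflict-free the workload is.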

Another question that interests me is how much this is affected by the TAA
vulnerability [2], and what the prospects of this approach are, given that
many suggest disabling TSX because of it (there are mitigations of course,
but if I understand correctly, e.g. on Linux it's similar to MDS, where the
mitigation flushes CPU buffers on entering kernel space, but in between
speculative accesses could still be performed).

[1]: https://db.in.tum.de/~leis/papers/HTM.pdf
[2]: 
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html




Re: Add PostgreSQL home page to --help output

2020-02-20 Thread Daniel Gustafsson
> On 20 Feb 2020, at 10:15, Peter Eisentraut  
> wrote:
> 
> On 2020-02-13 14:24, Greg Stark wrote:
>> Sounds like a fine idea. But personally I would prefer it without the <> 
>> around it, just a URL on a line by itself. I think it would be clearer, 
>> look cleaner, and be easier to select to copy/paste elsewhere.
> 
> I'm on the fence about this one, but I like the delimiters because it would 
> also work consistently if we put a URL into running text where it might be 
> immediately adjacent to other characters.  So I was actually going for easier 
> to copy/paste here, but perhaps in other environments it's not easier?

For URLs completely on their own, not using <> makes sense.  Copy pasting 
into the location bar of Safari makes it load the url, but Firefox and Chrome
turn it into a search engine query (no idea about Windows browsers).

For URLs in running text it's not uncommon to have <> around the URL for the
very reason you mention.  Looking at --help and manpages from random open
source tools there seems to be roughly a 50/50 split on using <> or not.

cheers ./daniel



RE: [PoC] Non-volatile WAL buffer

2020-02-20 Thread Takashi Menjo
Dear Amit,

Thank you for your advice.  Exactly, it's so to speak "do as the hackers do 
when in pgsql"...

I'm rebasing my branch onto master.  I'll submit an updated patchset and 
performance report later.

Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center

> -Original Message-
> From: Amit Langote 
> Sent: Monday, February 17, 2020 5:21 PM
> To: Takashi Menjo 
> Cc: Robert Haas ; Heikki Linnakangas 
> ; PostgreSQL-development
> 
> Subject: Re: [PoC] Non-volatile WAL buffer
> 
> Hello,
> 
> On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo 
>  wrote:
> > Hello Amit,
> >
> > > I apologize for not having any opinion on the patches themselves,
> > > but let me point out that it's better to base these patches on HEAD
> > > (master branch) than REL_12_0, because all new code is committed to
> > > the master branch, whereas stable branches such as REL_12_0 only receive 
> > > bug fixes.  Do you have any
> specific reason to be working on REL_12_0?
> >
> > Yes, because I think it's human-friendly to reproduce and discuss 
> > performance measurement.  Of course I know
> all new accepted patches are merged into master's HEAD, not stable branches 
> and not even release tags, so I'm
> aware of rebasing my patchset onto master sooner or later.  However, if 
> someone, including me, says that s/he
> applies my patchset to "master" and measures its performance, we have to pay 
> attention to which commit the
> "master" really points to.  Although we have sha1 hashes to specify which 
> commit, we should check whether the
> specific commit on master has patches affecting performance or not because 
> master's HEAD gets new patches day
by day.  On the other hand, a release tag clearly points to a commit that we 
all probably know.  Also we can check more
> easily the features and improvements by using release notes and user manuals.
> 
> Thanks for clarifying. I see where you're coming from.
> 
> While I do sometimes see people reporting numbers with the latest stable 
> release' branch, that's normally just one
> of the baselines.
> The more important baseline for ongoing development is the master branch's 
> HEAD, which is also what people
> volunteering to test your patches would use.  Anyone who reports would have 
> to give at least two numbers --
> performance with a branch's HEAD without patch applied and that with patch 
> applied -- which can be enough in
> most cases to see the difference the patch makes.  Sure, the numbers might 
> change on each report, but that's fine
> I'd think.  If you continue to develop against the stable branch, you might 
> miss to notice impact from any relevant
> developments in the master branch, even developments which possibly require 
> rethinking the architecture of your
> own changes, although maybe that rarely occurs.
> 
> Thanks,
> Amit






Re: Autovacuum on partitioned table

2020-02-20 Thread Amit Langote
On Thu, Feb 20, 2020 at 5:32 PM Amit Langote  wrote:
> On Thu, Feb 20, 2020 at 4:50 PM Amit Langote  wrote:
> > * I may be missing something, but why doesn't do_autovacuum() fetch a
> > partitioned table's entry from pgstat instead of fetching that for
> > individual children and adding? That is, why do we need to do the
> > following:
> >
> > +/*
> > + * If the relation is a partitioned table, we check it
> > using reltuples
> > + * added up childrens' and changes_since_analyze tracked
> > by stats collector.
>
> Oh, it's only adding up children's pg_class.reltuples, not pgstat
> stats.  We need to do that because a partitioned table's
> pg_class.reltuples is always 0 and correctly so.  Sorry for not
> reading the patch properly.

Having read the relevant diffs again, I think this could be done
without duplicating code too much.  You seem to have added the same
logic in two places: do_autovacuum() and table_recheck_autovac().
More importantly, part of the logic of relation_needs_vacanalyze() is
duplicated in both of the aforementioned places, which I think is
unnecessary and undesirable if you consider maintainability. I think
we could just add the logic to compute reltuples for partitioned
tables at the beginning of relation_needs_vacanalyze() and be done.  I
have attached a delta patch to show what I mean.  Please check and
tell me what you think.
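
To make that concrete, the shape I have in mind at the top of
relation_needs_vacanalyze() is roughly as follows (only a hand-written sketch
of the idea behind the attached delta patch; locking, syscache lookup
failures, and the analyze-threshold logic are omitted):

	/*
	 * Partitioned tables have no storage and their pg_class.reltuples is
	 * always 0, so add up the children's reltuples instead.
	 */
	if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
	{
		List	   *children = find_all_inheritors(relid, AccessShareLock, NULL);
		ListCell   *lc;

		reltuples = 0;
		foreach(lc, children)
		{
			Oid			childOID = lfirst_oid(lc);
			HeapTuple	childtuple;
			Form_pg_class childclass;

			if (childOID == relid)
				continue;		/* skip the parent itself */

			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
			childclass = (Form_pg_class) GETSTRUCT(childtuple);

			/* foreign partitions have no local storage to count */
			if (childclass->relkind != RELKIND_FOREIGN_TABLE)
				reltuples += childclass->reltuples;

			heap_freetuple(childtuple);
		}
	}
	else
		reltuples = classForm->reltuples;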

Thanks,
Amit
diff --git a/src/backend/postmaster/autovacuum.c 
b/src/backend/postmaster/autovacuum.c
index 7d0a5ce30d..ca6996e448 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2089,103 +2089,19 @@ do_autovacuum(void)
continue;
}
 
-   if (classForm->relkind == RELKIND_RELATION ||
-   classForm->relkind == RELKIND_MATVIEW)
-   {
-   /* Fetch reloptions and the pgstat entry for this table 
*/
-   relopts = extract_autovac_opts(tuple, pg_class_desc);
-   tabentry = get_pgstat_tabentry_relid(relid, 
classForm->relisshared,
-   
 shared, dbentry);
-
-   /* Check if it needs vacuum or analyze */
-   relation_needs_vacanalyze(relid, relopts, classForm, 
tabentry,
- 
effective_multixact_freeze_max_age,
- 
, , );
-
-   /* Relations that need work are added to table_oids */
-   if (dovacuum || doanalyze)
-   table_oids = lappend_oid(table_oids, relid);
-   }
-   else
-   {
-   /*
-* If the relation is a partitioned table, we check it 
using reltuples
-* added up childrens' and changes_since_analyze 
tracked by stats collector.
-* We check only auto analyze because partitioned 
tables don't need to vacuum.
-*/
-   List *tableOIDs;
-   ListCell *lc;
-   bool  av_enabled;
-   int   anl_base_thresh;
-   float4all_reltuples = 0,
- anl_scale_factor,
- anlthresh,
- reltuples,
- anltuples;
-
-   /* Find all members of inheritance set taking 
AccessShareLock */
-   tableOIDs = find_all_inheritors(relid, AccessShareLock, 
NULL);
-
-   foreach(lc, tableOIDs)
-   {
-   OidchildOID = lfirst_oid(lc);
-   HeapTuple  childtuple;
-   Form_pg_class childclassForm;
-
-   /* Ignore the parent table */
-   if (childOID == relid)
-   continue;
-
-   childtuple = SearchSysCacheCopy1(RELOID, 
ObjectIdGetDatum(childOID));
-   childclassForm = (Form_pg_class) 
GETSTRUCT(childtuple);
-
-   /* Skip foreign partitions */
-   if (childclassForm->relkind == 
RELKIND_FOREIGN_TABLE)
-   continue;
-
-   /* Sum up the child's reltuples for its parent 
table */
-   all_reltuples += childclassForm->reltuples;
-   elog(NOTICE, "[parent:%s] child%s has %.0f 
tuples", NameStr(classForm->relname),NameStr(childclassForm->relname), 
childclassForm->reltuples);
- 

Re: Add PostgreSQL home page to --help output

2020-02-20 Thread Peter Eisentraut

On 2020-02-13 14:24, Greg Stark wrote:
Sounds like a fine idea. But personally I would prefer it without the <> 
around it, just a URL on a line by itself. I think it would be 
clearer, look cleaner, and be easier to select to copy/paste elsewhere.


I'm on the fence about this one, but I like the delimiters because it 
would also work consistently if we put a URL into running text where it 
might be immediately adjacent to other characters.  So I was actually 
going for easier to copy/paste here, but perhaps in other environments 
it's not easier?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: client-side fsync() error handling

2020-02-20 Thread Peter Eisentraut

On 2020-02-13 12:52, Michael Paquier wrote:

durable_rename() calls fsync_fname(), so it would be covered by this change.
The other file access calls in there can be handled by normal error
handling, I think.  Is there any specific scenario you have in mind?


The old file flush is handled by your patch, but not the new one if
it exists, and it seems to me that we should handle failures
consistently so that they are easier to reason about, as the top of the
function actually says :)


OK, added in new patch.


Another point that we could consider is if fsync_fname() should have
an option to not trigger an immediate exit when facing a failure.  The
backend has that option thanks to fsync_fname_ext() with its elevel
argument.  Your choice to default to a failure is fine for most cases
because that's what we want.  However, I am questioning if this change
would be surprising for some client applications or not, and if we
should have the option to choose one behavior or the other.


The option in the backend is between panicking and retrying.  The old 
behavior was to always retry but we have learned that that usually 
doesn't work.


The frontends do neither right now, or at least the error handling is 
very inconsistent and inscrutable.  It would be possible in theory to 
add a retry option, but that would be a very different patch, and given 
what we have learned about fsync(), it probably wouldn't be widely useful.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 91bd9bee1de18fbb1842d0223b693cff21eb2ec6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 20 Feb 2020 09:52:55 +0100
Subject: [PATCH v2] Change client-side fsync_fname() to report errors fatally

Given all we have learned about fsync() error handling in the last few
years, reporting an fsync() error non-fatally is not useful,
unless you don't care much about the file, in which case you probably
don't need to use fsync() in the first place.

Change fsync_fname() and durable_rename() to exit(1) on fsync() errors
other than those that we specifically chose to ignore.

This affects initdb, pg_basebackup, pg_checksums, pg_dump, pg_dumpall,
and pg_rewind.

Discussion: 
https://www.postgresql.org/message-id/flat/d239d1bd-aef0-ca7c-dc0a-da14bdcf0392%402ndquadrant.com
---
 src/common/file_utils.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 413fe4eeb1..7584c1f2fb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -51,8 +51,6 @@ static void walkdir(const char *path,
  * fsyncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be fsync'd.
- *
- * Errors are reported but not considered fatal.
  */
 void
 fsync_pgdata(const char *pg_data,
@@ -250,8 +248,8 @@ pre_sync_fname(const char *fname, bool isdir)
  * fsync_fname -- Try to fsync a file or directory
  *
  * Ignores errors trying to open unreadable files, or trying to fsync
- * directories on systems where that isn't allowed/required.  Reports
- * other errors non-fatally.
+ * directories on systems where that isn't allowed/required.  All other errors
+ * are fatal.
  */
 int
 fsync_fname(const char *fname, bool isdir)
@@ -294,9 +292,9 @@ fsync_fname(const char *fname, bool isdir)
 */
if (returncode != 0 && !(isdir && (errno == EBADF || errno == EINVAL)))
{
-   pg_log_error("could not fsync file \"%s\": %m", fname);
+   pg_log_fatal("could not fsync file \"%s\": %m", fname);
(void) close(fd);
-   return -1;
+   exit(EXIT_FAILURE);
}
 
(void) close(fd);
@@ -364,9 +362,9 @@ durable_rename(const char *oldfile, const char *newfile)
{
if (fsync(fd) != 0)
{
-   pg_log_error("could not fsync file \"%s\": %m", 
newfile);
+   pg_log_fatal("could not fsync file \"%s\": %m", 
newfile);
close(fd);
-   return -1;
+   exit(EXIT_FAILURE);
}
close(fd);
}
-- 
2.25.0



Re: Autovacuum on partitioned table

2020-02-20 Thread Amit Langote
On Thu, Feb 20, 2020 at 4:50 PM Amit Langote  wrote:
> * I may be missing something, but why doesn't do_autovacuum() fetch a
> partitioned table's entry from pgstat instead of fetching that for
> individual children and adding? That is, why do we need to do the
> following:
>
> +/*
> + * If the relation is a partitioned table, we check it
> using reltuples
> + * added up childrens' and changes_since_analyze tracked
> by stats collector.

Oh, it's only adding up children's pg_class.reltuples, not pgstat
stats.  We need to do that because a partitioned table's
pg_class.reltuples is always 0 and correctly so.  Sorry for not
reading the patch properly.

Thanks,
Amit




RE: Complete data erasure

2020-02-20 Thread asaba.takan...@fujitsu.com
Hello Tom,

From: Tom Lane 
> Tomas Vondra  writes:
> > I think it depends on how exactly it's implemented. As Tom pointed out in
> > his message [1], we can't do the erasure itself in the post-commit phase,
> > as it is not able to handle errors. But if the files are renamed durably,
> > and the erasure happens in a separate process, that could be OK. The
> > COMMIT may wait for it or not, that's mostly irrelevant I think.
> 
> How is requiring a file rename to be completed post-commit any less
> problematic than the other way?  You still have a non-negligible
> chance of failure.

I think that the errors of rename(2) listed in [1] either cannot occur here
or can be handled.  What do you think?

[1] http://man7.org/linux/man-pages/man2/rename.2.html


Regards,

--
Takanori Asaba






RE: Complete data erasure

2020-02-20 Thread asaba.takan...@fujitsu.com
Greetings,

From: asaba.takan...@fujitsu.com 
> Hello Stephen,
> 
> From: Stephen Frost 
> > I disagree- it's a feature that's been asked for multiple times and does
> > have value in some situations.
> 
> I'm rethinking the need for this feature, although I think that it improves
> security.
> You said that this feature has value in some situations.
> Could you tell me about those situations?
> 
> Regards,
> 
> --
> Takanori Asaba
> 

I think that the use case is to ensure that no data remains.
This feature will give users peace of mind.

There is a risk of leakage as long as data remains.
I think that there are some things that users are worried about.
For example, there is a possibility that attackers could decrypt encrypted 
data, even if it takes years.
Or some users may be concerned about disk management in cloud environments.
These concerns will be resolved if they can erase data themselves.

I think that this feature is valuable, so I would appreciate your continued 
cooperation.

Regards,

--
Takanori Asaba






Re: allow running parts of src/tools/msvc/ under not Windows

2020-02-20 Thread Peter Eisentraut

On 2020-02-13 16:36, Tom Lane wrote:

Julien Rouhaud  writes:

On Thu, Feb 13, 2020 at 02:24:43PM +0100, Peter Eisentraut wrote:

When making build system changes that risk breaking the MSVC build system,
it's useful to be able to run the part of the MSVC build tools that read the
makefiles and produce the project files under a not-Windows platform.



With v2 I'm able to successfully run mkvcbuild.pl on linux and macos.  I don't
have any knowledge on compiling with windows, so I can't really judge what it's
been doing.


Yeah, I'm wondering exactly how this helps.  IME the typical sort of
breakage is "the MSVC build doesn't know that file X needs to be
included when building Y".  It seems like just building the project
files will teach one nothing about that type of omission.


The main benefit is that if you make "blind" edits in the Perl files, 
you can verify them easily, first by seeing that the Perl code runs, 
second, depending on the circumstances, by diffing the created project 
files.  Another is that if you do some nontrivial surgery in makefiles, 
you can check whether the Perl code can still process them.  So the 
benefit is mainly that you can iterate faster when working on build 
system related things.  You still need to do a full test on Windows at 
the conclusion, but then hopefully with a better chance of success.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services