Re: Planning counters in pg_stat_statements (using pgss_store)

2019-02-14 Thread Sergei Kornilov
Hi

>>  +#define PG_STAT_STATEMENTS_COLS_V1_4 25
>
> I thought it was needed when adding new columns, isn't it ?

Yes, this is needed. I mean it should be PG_STAT_STATEMENTS_COLS_V1_8, because 
this change is being made for pg_stat_statements version 1.8. The same applies 
to the other version-specific places.
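
For context, a minimal sketch of the convention in pg_stat_statements.c (the
existing macros follow this pattern; the _V1_8 line is only an illustration of
what the patch would add, 25 being the new total column count from the patch):

#define PG_STAT_STATEMENTS_COLS_V1_8	25	/* previous columns + planning counters */
#define PG_STAT_STATEMENTS_COLS		25	/* must match the newest version above */

The function building the result set then switches on the API version and
emits only the columns known to that version of the extension's SQL objects.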

regards, Sergei



Re: ToDo: show size of partitioned table

2019-02-14 Thread Amit Langote
Hi Pavel,

Thanks for updating the patch.

On 2019/02/08 17:26, Pavel Stehule wrote:
> I renamed originally calculated column "size" to "direct partitions size"
> .. see Alvaro's comment. Then I introduced new column "total partitions
> size" that is calculated like you propose.
> 
> Now the result of \dPn+ looks like
> 
>                                      List of partitioned relations
> ┌────────┬────────┬───────┬─────────────┬────────────────────────┬───────────────────────┬─────────────┐
> │ Schema │  Name  │ Owner │ Parent name │ Direct partitions size │ Total partitions size │ Description │
> ╞════════╪════════╪═══════╪═════════════╪════════════════════════╪═══════════════════════╪═════════════╡
> │ public │ p      │ pavel │             │ 8192 bytes             │ 24 kB                 │             │
> │ public │ p_1    │ pavel │ p           │ 8192 bytes             │ 16 kB                 │             │
> │ public │ p_1_bc │ pavel │ p_1         │ 8192 bytes             │ 8192 bytes            │             │
> └────────┴────────┴───────┴─────────────┴────────────────────────┴───────────────────────┴─────────────┘
> (3 rows)

OK, so for each listed partitioned table (root and nested), this shows the
total size of the directly attached leaf partitions *and* the total size
of all partitions in its (sub-) tree.
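
For readers who want to try this out, here is a minimal sketch (not taken from
the thread; the leaf-partition names and bounds are made up) that produces a
similar three-level hierarchy for \dPn+ to display:

CREATE TABLE p (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE p_0 PARTITION OF p FOR VALUES FROM (0) TO (10);            -- leaf directly under p
CREATE TABLE p_1 PARTITION OF p FOR VALUES FROM (10) TO (20)
    PARTITION BY RANGE (b);
CREATE TABLE p_1_a PARTITION OF p_1 FOR VALUES FROM (0) TO (10);        -- leaf directly under p_1
CREATE TABLE p_1_bc PARTITION OF p_1 FOR VALUES FROM (10) TO (20)
    PARTITION BY RANGE (b);
CREATE TABLE p_1_bc_x PARTITION OF p_1_bc FOR VALUES FROM (10) TO (20); -- leaf under p_1_bc
INSERT INTO p VALUES (0, 0), (10, 0), (10, 10);  -- one row per leaf so each gets a page
\dPn+

Each partitioned table should then show one page of directly attached leaf
storage, while the totals accumulate up the tree, as in the quoted output.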

By the way, what I think Alvaro meant by "local size" is not what the
"direct partition size" above shows.  I think "local size" means the size
of the storage assigned to the table itself, not to partitions attached to
it, which are distinct relations.  We don't implement that concept in
Postgres today, but may in the future.  I'm not sure whether, when we do
implement that concept, we'll add another column to show "local size", or
whether Alvaro meant that there should only be a "local size" column
(which would always show 0 for now, instead of "direct partition size")
alongside the "total partition size" column.


Anyway, I have a few more suggestions to improve the patch, but instead of
sending the minute-level changes in the email, I've gone ahead and made
those changes myself.  I've attached a delta patch that applies on top of
your v9 patch.  Summary of the changes I made is as follows:

* Documentation rewording here and there (also mentioned the "direct
partitions size" and "total partitions size" division in the \dPn output
per the latest patch)

* Wrapped some lines in code so that they don't look too wide

* Renamed show_nested_partitions to show_nested

* Changed "Partitioned relations" in the output headers to say
"Partitioned tables" where appropriate

* Fixed quiet-mode output to use the correct word (object_name vs.
objects_name)

Please merge these changes if you think they are reasonable.

Thanks,
Amit
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 76a6ceb0dd..9fb632b0bd 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1664,19 +1664,23 @@ testdb=
 \dP[n+] [ pattern ]
 
 
-Lists partitioned relations. If pattern is specified, only
-entries whose relation name or schema name matches
-the pattern are listed. If the form \dP+
-is used, the sum of sizes of related partitions (including the
-table and indexes, if any) and associated description
-are also displayed.
+Lists partitioned relations.  If pattern is specified, only entries
+whose name matches the pattern are listed.  By default, only
+partitioned tables are listed; supply a pattern to also include
+partitioned indexes.  If the form \dP+
+is used, the sum of sizes of table's partitions (including their
+indexes) and associated description are also displayed.
 
 
 
- If modifier n is used, then non-root partition
- tables are displayed too. The size is calculated just for directly
- assigned partitions (not for nested partitions).
+ If modifier n (which stands for
+ nested) is used, then non-root partitioned tables are
+ displayed too.  The displayed size is divided into two columns in
+ this case: one that shows the total size of only the directly
+ attached leaf partitions and another that shows total size of all
+ partitions, also considering other sub-partitioned partitions, for
+ each partitioned tables that's displayed.
 
 
   
@@ -1687,17 +1691,15 @@ testdb=
 
 
 Lists partitioned indexes. If pattern is specified, only
-entries whose index name or schema name matches
-the pattern are listed. If the form \dPi+
-is used, the sum of sizes of related indexes and associated
-description are also displayed.
+class="parameter">pattern is specified, only entries
+whose name matches the pattern are listed. If the form
+\dPi+ is used, the sum of sizes of 

Re: Delay locking partitions during INSERT and UPDATE

2019-02-14 Thread John Naylor
On 1/25/19, Tomas Vondra  wrote:
> Yes, I don't see why would the patch change that and I've been looking
> for such cases. I think David was looking at that this week too, and I
> assume he'll send an update if he finds anything. If not, I plan to get
> it committed soon-ish (possibly next week after the FOSDEM dev meeting,
> unless there are some objections).

I forgot to mark it ready for committer, so I've done that now.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: libpq host/hostaddr/conninfo inconsistencies

2019-02-14 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 14 Feb 2019 22:51:40 +0100 (CET), Fabien COELHO  
wrote in 
> 
> > On 2018-10-26 09:21:51 +0200, Fabien COELHO wrote:
> >> (1) you are somehow against changing the current implementation, eg
> >> erroring
> >> out on possibly misleading configurations, because you do not think it
> >> is
> >> really useful to help users in those cases.
> >
> > I find this formulation somewhat passive aggressive.
> 
> I do not understand what you mean by that expression.
> 
> I was just trying to sum-up Robert's opposition to erroring on
> misleading configurations (eg "host=1.2.3.4 hostaddr=4.3.2.1") instead
> of complying to it whatever, as is currently done. Probably my
> phrasing could be improved, but I do not think that I misrepresented
> Robert's position.
> 
> Note that the issue is somewhat mitigated by 6e5f8d489a: \conninfo now
> displays more precise information, so that at least you are not told
> that you are connected to a socket when you are really connected to an
> ip, or to one ip when you are really connected to another.

I'm rather on (perhaps) Robert's side here: I don't oppose editing it,
but the documentation should stay plain as long as it is not too
misleading for average readers.  From the same viewpoint, documentation
should be written general-and-important-first, then special cases and
trivia.

From that standpoint, the first hunk in the patch caught my eye.

   host
   

-Name of host to connect to.host 
name
-If a host name begins with a slash, it specifies Unix-domain
-communication rather than TCP/IP communication; the value is the
-name of the directory in which the socket file is stored.
+Comma-separated list of hosts to connect to.host 
name
+Each specified host will be tried in turn in the order given.
+See  for details.
+Each item may be a host name that will be resolved with a look-up,
+a numeric IP address (IPv4 in the standard format, e.g.,
+172.28.40.9, or IPv6 if supported by your machine)
+that will be used directly, or
+the name of a directory which contains the socket file for Unix-domain
+communication rather than TCP/IP communication
+(the specification must then begin with a slash);
+   

I don't think this is user-friendly, since almost all users don't write
multiple hosts there.  So I prefer the previous organization.  The
description of the IP-address case also looks too verbose; in particular,
we don't need to explain what an IP address is here.
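
(For concreteness, the comma-separated form that the reworded paragraph
describes looks like the following; the host names and address are made up,
and each list element is interpreted independently as a host name, a numeric
address, or a Unix-socket directory:)

    host=db1.example.com,/var/run/postgresql,172.28.40.9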

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Early WIP/PoC for inlining CTEs

2019-02-14 Thread Andreas Karlsson

On 14/02/2019 16.11, Tom Lane wrote:

... so, have we beaten this topic to death yet?  Can we make a decision?

Personally, I'd be happy with either of the last two patch versions
I posted (that is, either AS [[NOT] MATERIALIZED] or
AS [MATERIALIZE [ON|OFF]] syntax).  But we gotta pick something.
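
(For reference, the two candidate spellings would look roughly as follows;
purely illustrative, the table name t is made up:)

WITH w AS MATERIALIZED (SELECT * FROM t) SELECT * FROM w;      -- keep the optimization fence
WITH w AS NOT MATERIALIZED (SELECT * FROM t) SELECT * FROM w;  -- allow the CTE to be inlined

versus the alternative proposal:

WITH w AS MATERIALIZE ON (SELECT * FROM t) SELECT * FROM w;
WITH w AS MATERIALIZE OFF (SELECT * FROM t) SELECT * FROM w;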


I am happy with either of those.

Andreas



Re: pg11.1: dsa_area could not attach to segment

2019-02-14 Thread Thomas Munro
On Fri, Feb 15, 2019 at 2:31 AM Sergei Kornilov  wrote:
> I can not reproduce the bug after a 30-minute-long test. (Without the patch,
> the bug appeared after a minute or two.)

Thank you Justin and Sergei for all your help reproducing and testing this.

Fix pushed to all supported releases.  It's lightly refactored from
the version I posted yesterday.  Just doing s/break/continue/ made for
a cute patch, but this way the result is easier to understand IMHO.  I
also didn't bother with the non-essential change.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Kyotaro HORIGUCHI
At Thu, 14 Feb 2019 16:45:38 -0500, Tom Lane  wrote in 
<822.1550180...@sss.pgh.pa.us>
> Andres Freund  writes:
> > On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:
> >> Hah, I just realized you have to add -mlzcnt in order for these builtins
> >> to use the lzcnt instructions.  It goes from something like
> >> 
> >> bsrq   %rax, %rax
> >> xorq   $63, %rax
> 
> > I'm confused how this is a general count leading zero operation? Did you
> > use constants or something that allowed ot infer a range in the test? If
> > so the compiler probably did some optimizations allowing it to do the
> > above.
> 
> No.  If you compile
> 
> int myclz(unsigned long long x)
> {
>   return __builtin_clzll(x);
> }
> 
> at -O2, on just about any x86_64 gcc, you will get
> 
> myclz:
> .LFB1:
> .cfi_startproc
> bsrq%rdi, %rax
> xorq$63, %rax
> ret
> .cfi_endproc
> 

I understand that the behavior of __builtin_c[tl]z(0) is
undefined for that reason: they compile down to bs[rf].  So if we use
these builtins, an additional check is required.

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

> Built-in Function: int __builtin_clz (unsigned int x)
>   Returns the number of leading 0-bits in x, starting at the most
>   significant bit position. If x is 0, the result is undefined.
> 
> Built-in Function: int __builtin_ctz (unsigned int x)
>   Returns the number of trailing 0-bits in x, starting at the
>   least significant bit position. If x is 0, the result is
>   undefined.


And even worse, lzcntx accidentally "falls back" to bsrx on
unsupported CPUs, which leads to bogus results:
__builtin_clzll(8) would yield 3 where it should be 60.
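
A minimal, self-contained sketch (not PostgreSQL code; it assumes a GCC/Clang
style compiler that provides the builtin) of the kind of additional check that
using these builtins requires, since their result is undefined for a zero
input:

#include <assert.h>
#include <stdint.h>

/* 0-based position of the most significant set bit; caller must pass
 * word != 0, because __builtin_clz(0) is undefined. */
static inline int
leftmost_one_pos32(uint32_t word)
{
	assert(word != 0);
	return 31 - __builtin_clz(word);
}

/* A zero-tolerant wrapper, one way to add the "additional check". */
static inline int
leftmost_one_pos32_or(uint32_t word, int result_for_zero)
{
	return word == 0 ? result_for_zero : 31 - __builtin_clz(word);
}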

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: proposal: pg_restore --convert-to-text

2019-02-14 Thread Andreas Karlsson

On 14/02/2019 01.31, Euler Taveira wrote:

On Wed, Feb 13, 2019 at 19:56, Andrew Gierth
 wrote:

I propose we add a new option: --convert-to-text or some such name, and
then make pg_restore throw a usage error if neither -d nor the new
option is given.


However, I agree that pg_restore writing to stdout if -d wasn't specified is
not a good default. The current behavior is the same as "-f -"
(however, pg_restore doesn't currently allow - as meaning stdout). Shouldn't
we error out if neither -d nor -f was specified? If we go down this road,
the -f option should accept - (stdout) as its argument.


Agreed, "-f -" would be acceptable. I use pg_restore to stdout a lot, 
but almost always manually and would have to have to remember and type 
"--convert-to-text".


Andreas



RE: idle-in-transaction timeout error does not give a hint

2019-02-14 Thread Jamison, Kirk
Hi,

On Monday, February 4, 2019 2:15 AM +, Michael Paquier wrote:
> On Tue, Dec 04, 2018 at 04:07:34AM +, Ideriha, Takeshi wrote:
> > Sure. I didn't have a strong opinion about it, so it's ok.

> From what I can see this is waiting input from a native English speaker, so 
> for now I have moved this entry to next CF.

Although I also use English in daily life,
I am not from a native English-speaking country.
But I'm giving it a try, since the current patch is not that complex to check.

> HINT: In a moment you should be able to reconnect to the
>   database and restart your transaction.

I think this error hint message makes sense for idle-in-transaction timeout.
It's grammatically correct and gives a clearer hint for that.
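
Roughly what the patch under discussion adds, sketched here only to make the
hint concrete (the surrounding errcode/errmsg already exist in postgres.c; the
exact placement and wording are the patch author's):

	ereport(FATAL,
			(errcode(ERRCODE_IDLE_IN_TRANSACTION_SESSION_TIMEOUT),
			 errmsg("terminating connection due to idle-in-transaction timeout"),
			 errhint("In a moment you should be able to reconnect to the"
					 " database and restart your transaction.")));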

I agree that "reconnect to database" implies "restart session",
so it may not be the best word for errhint message.

Now the next point raised by Ideriha-san is whether the
other places in the code should also be changed.

I also checked the source code where the other errhints appear,
besides idle-in-transaction.
(ERRCODE_CRASH_SHUTDOWN and ERRCODE_T_R_SERIALIZATION_FAILURE)

Those errcodes use "repeat your command" for the hint.
> errhint("In a moment you should be able to reconnect to the"
> " database and repeat your command.")));

(1) ERRCODE_CRASH_SHUTDOWN
 As per its errdetail, the transaction is rolled back,
 so "restart transaction" also makes sense.

(2) ERRCODE_T_R_SERIALIZATION_FAILURE
 The second one is under the clause below...
 > if (RecoveryConflictPending && DoingCommandRead)
 ...so using "repeat your command" makes sense.
 
I'm fine with retaining the other two hint messages as they are.
Regardless of whether we want to change the other similarly-worded
errhint messages or add errhint messages for other errcodes
that don't have one, I think the latest patch, which provides an errhint
message for the idle-in-transaction timeout, can be marked
"Ready for Committer", as it still applies and the tests
pass successfully.

Regards,
Kirk Jamison



Re: Early WIP/PoC for inlining CTEs

2019-02-14 Thread Bruce Momjian
On Thu, Feb 14, 2019 at 04:25:27PM +0100, Peter Eisentraut wrote:
> On 06/02/2019 10:00, Bruce Momjian wrote:
> > I think "materialize" is the right word since "materialized" would be
> > past tense.
> 
> It's an adjective.  The sentence is, "with foo as the materialized
> $query, do the $main_query".
> 
> It's the same as "materialized view".

Agreed, thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: pg_basebackup ignores the existing data directory permissions

2019-02-14 Thread Kyotaro HORIGUCHI
At Fri, 15 Feb 2019 08:15:24 +0900, Michael Paquier  wrote 
in <20190214231524.gc2...@paquier.xyz>
> On Thu, Feb 14, 2019 at 11:21:19PM +1100, Haribabu Kommi wrote:
> > On Thu, Feb 14, 2019 at 8:57 PM Magnus Hagander  wrote:
> >> I think it could be argued that neither initdb *or* pg_basebackup should
> >> change the permissions on an existing directory, because the admin may have
> >> done that intentionally. But when they do create the directory, they should
> >> follow the same patterns.
> > 
> > Hmm, even if the administrator set some specific permissions to the data
> > directory, PostgreSQL server doesn't allow server to start if the
> > permissions are not (0700) for versions less than 11 and (0700 or
> > 0750) for version 11 or later.
> 
> Yes, particularly with pg_basebackup -R this adds an extra step in the
> user flow.

I disagree with making pg_basebackup reject directories that don't have
specific permissions, since it is just a binary backup tool and is not
used exclusively for setting up a replication standby.  It ought to be
runnable, and actually is runnable, by any OS user, even root, for whom
postgres refuses to start.  As mentioned upthread, the server refusing
to run on such a directory is a failure on the safe side.

> > To let the user to use the PostgreSQL server, user must change the
> > permissions of the data directory. So, I don't see a problem in
> > changing the permissions by these tools.
> 
> I certainly agree with the point of Magnus that both tools should
> behave consistently, and I cannot actually imagine why it would be
> useful for an admin to keep a more permissive data folder while all
> the contents already have umasks set at the same level as the primary
> (or what initdb has been told to use), but perhaps I lack imagination.
> If we doubt about potential user impact, the usual, best, answer is to
> let back-branches behave the way they do now, and only do something on
> HEAD.

initdb is meant to create a directory for the server to work in, and it
rather rejects an existing directory, so I think the "inconsistency"
seems fine.

I can live with some new options, say --create-New-directory or
--check-directory-Permission.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Alvaro Herrera
On 2019-Feb-14, Tom Lane wrote:

> I think we need a clean test for __builtin_popcount(), and to be willing
> to use it if available, independently of -mpopcnt.  Then separately we
> should test to see if -mpopcnt works, probably with the same
> infrastructure we use for checking for other compiler flags, viz
> 
># Optimization flags for specific files that benefit from vectorization
>PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops])
>PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize])
> +  # Optimization flags for bit-twiddling
> +  PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt])
># We want to suppress clang's unhelpful unused-command-line-argument 
> warnings
> 
> Then the correct test to see if we want to build pg_popcount.c (BTW,
> please pick a less generic name for that) and the choose function
> is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
> CFLAGS_POPCNT.

Yeah, this works.  I'll post the patch tomorrow.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Ryu floating point output patch

2019-02-14 Thread Andrew Gierth
> "Andrew" == Andrew Dunstan  writes:

 Andrew> Rather than give you the files, I will just tell you, in both
 Andrew> cases it is matching float8.out, not the small-is-zero file.

OK, thanks. That lets me fix the float4 failures in a reasonably
straightforward way.

-- 
Andrew (irc:RhodiumToad)



Re: [Suspect SPAM] Better error messages when lacking connection slots for autovacuum workers and bgworkers

2019-02-14 Thread Michael Paquier
On Thu, Feb 14, 2019 at 09:04:37PM +0900, Kyotaro HORIGUCHI wrote:
> I agree with having distinct messages, but the autovacuum and
> bgworker cases are a kind of internal error, and they are not
> "connection"s.  I feel that elog is more suitable for them.

I used ereport() for consistency with the existing code, still you are
right that we ought to use elog() as this is the case of a problem
which should not happen.
--
Michael




Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Tom Lane
Alvaro Herrera  writes:
> That leads me to the attached patch.  It creates a new file
> pg_popcount.c which is the only one compiled with -mpopcnt (if
> available); if there's no compiler switch to enable POPCNT, we just
> don't compile the file.  I'm not sure that's kosher -- in particular I'm
> not sure if it can fail when POPCNT is enabled by other flags and
> -mpopcnt is not needed at all.  I think our c-compiler.m4 stuff is a bit
> too simplistic there: it just assumes that -mpopcnt is always required.

Yes, the configure test for this stuff is really pretty broken.
It's conflating two nearly independent questions: (1) does the compiler
have __builtin_popcount(), and (2) does the compiler accept -mpopcnt.
It is certainly the case that (1) may hold without (2); in fact, every
recent non-x86_64 gcc is a counterexample to how that's done in HEAD.

I think we need a clean test for __builtin_popcount(), and to be willing
to use it if available, independently of -mpopcnt.  Then separately we
should test to see if -mpopcnt works, probably with the same
infrastructure we use for checking for other compiler flags, viz

   # Optimization flags for specific files that benefit from vectorization
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops])
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize])
+  # Optimization flags for bit-twiddling
+  PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt])
   # We want to suppress clang's unhelpful unused-command-line-argument warnings

Then the correct test to see if we want to build pg_popcount.c (BTW,
please pick a less generic name for that) and the choose function
is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
CFLAGS_POPCNT.

I don't think this'd be fooled by user-specified CFLAGS.  The worst
possible outcome is that it builds a function that we intended would
use POPCNT but it's falling back to some other implementation, in
case the compiler has a switch named -mpopcnt but it doesn't do what
we think it does, or the user overrode things by adding -mno-popcnt.
That would really be nearly cost-free, other than the overhead of
the choose function the first time through: both of the execution
functions would be reducing to __builtin_popcount(), for whatever
version of that the compiler is giving us, so the choice wouldn't
matter.
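
A condensed sketch of the runtime-dispatch pattern being discussed (plain C,
not the actual patch; it assumes GCC/Clang on x86_64 with <cpuid.h>, and the
"fast" variant would be compiled with -mpopcnt in the real build):

#include <cpuid.h>
#include <stdint.h>

static int popcount32_choose(uint32_t word);
static int popcount32_builtin(uint32_t word) { return __builtin_popcount(word); }
static int popcount32_slow(uint32_t word)
{
	int			n = 0;

	while (word)
	{
		n += word & 1;
		word >>= 1;
	}
	return n;
}

/* starts out pointing at the chooser; re-pointed on first call */
int			(*pg_popcount32_ptr) (uint32_t) = popcount32_choose;

static int
popcount32_choose(uint32_t word)
{
	unsigned int eax, ebx, ecx = 0, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & bit_POPCNT))
		pg_popcount32_ptr = popcount32_builtin;
	else
		pg_popcount32_ptr = popcount32_slow;
	return pg_popcount32_ptr(word);
}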

regards, tom lane



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Alvaro Herrera
On 2019-Feb-14, Tom Lane wrote:

> static inline int
> pg_clz(...)

Hmm, I missed this bit.  So we put all these functions in the header, as
in the attached.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit d4265b36754720f85bf2ac6fd9ae2c58b8e1abc2
Author: Alvaro Herrera 
AuthorDate: Thu Feb 14 15:18:03 2019 -0300
CommitDate: Thu Feb 14 19:41:41 2019 -0300

fix popcount etc

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 148c5550573..72dfd1d2695 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -10,17 +10,175 @@
  *
  * -
  */
-
 #ifndef PG_BITUTILS_H
 #define PG_BITUTILS_H
 
-extern int (*pg_popcount32) (uint32 word);
-extern int (*pg_popcount64) (uint64 word);
-extern int (*pg_rightmost_one32) (uint32 word);
-extern int (*pg_rightmost_one64) (uint64 word);
-extern int (*pg_leftmost_one32) (uint32 word);
-extern int (*pg_leftmost_one64) (uint64 word);
-
+extern int	(*pg_popcount32) (uint32 word);
+extern int	(*pg_popcount64) (uint64 word);
 extern uint64 pg_popcount(const char *buf, int bytes);
 
+/* in pg_popcount.c */
+extern int	pg_popcount32_sse42(uint32 word);
+extern int	pg_popcount64_sse42(uint64 word);
+
+
+#ifndef HAVE__BUILTIN_CTZ
+/*
+ * Array marking the position of the right-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 rightmost_one_pos[256] = {
+	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
+};
+#endif			/* !HAVE__BUILTIN_CTZ */
+
+/*
+ * pg_rightmost_one32
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static inline int
+pg_rightmost_one32(uint32 word)
+{
+	int			result = 0;
+
+	Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+	result = __builtin_ctz(word);
+#else
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+#endif			/* HAVE__BUILTIN_CTZ */
+
+	return result;
+}
+
+/*
+ * pg_rightmost_one64
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static inline int
+pg_rightmost_one64(uint64 word)
+{
+	int			result = 0;
+
+	Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_ctzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_ctzll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+#else			/* HAVE__BUILTIN_CTZ */
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+#endif
+
+	return result;
+}
+
+#ifndef HAVE__BUILTIN_CLZ
+/*
+ * Array marking the position of the left-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 leftmost_one_pos[256] = {
+	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
+};
+#endif			/* !HAVE_BUILTIN_CLZ */
+
+/*
+ * pg_leftmost_one32
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static 

Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Alvaro Herrera
On 2019-Feb-14, Tom Lane wrote:

> I'd bet a fair amount of money that we'd be better off *not* using
> lzcnt, even if available, because then we could just expose things
> along this line:
> 
> static inline int
> pg_clz(...)
> {
> #ifdef HAVE__BUILTIN_CLZ
> return __builtin_clz(x);
> #else
> handwritten implementation;
> #endif
> }
> 
> Avoiding a function call (that has to indirect through a pointer) probably
> saves much more than the difference between lzcnt and the other way.

I see ... makes sense.

That leads me to the attached patch.  It creates a new file
pg_popcount.c which is the only one compiled with -mpopcnt (if
available); if there's no compiler switch to enable POPCNT, we just
don't compile the file.  I'm not sure that's kosher -- in particular I'm
not sure if it can fail when POPCNT is enabled by other flags and
-mpopcnt is not needed at all.  I think our c-compiler.m4 stuff is a bit
too simplistic there: it just assumes that -mpopcnt is always required.
But what if the user passes it in CFLAGS?

I left CPUID alone for the CLZ/CTZ builtins.  So we either use the
table, or the builtins.  We never try the instructions.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 6c771a4f43da0409ae5fa9ff1b1579f381c451c0
Author: Alvaro Herrera 
AuthorDate: Thu Feb 14 15:18:03 2019 -0300
CommitDate: Thu Feb 14 19:24:03 2019 -0300

fix popcount etc

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 148c5550573..635eb9331af 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -10,17 +10,20 @@
  *
  * -
  */
-
 #ifndef PG_BITUTILS_H
 #define PG_BITUTILS_H
 
-extern int (*pg_popcount32) (uint32 word);
-extern int (*pg_popcount64) (uint64 word);
-extern int (*pg_rightmost_one32) (uint32 word);
-extern int (*pg_rightmost_one64) (uint64 word);
-extern int (*pg_leftmost_one32) (uint32 word);
-extern int (*pg_leftmost_one64) (uint64 word);
+extern int	pg_rightmost_one32(uint32 word);
+extern int	pg_rightmost_one64(uint64 word);
+extern int	pg_leftmost_one32(uint32 word);
+extern int	pg_leftmost_one64(uint64 word);
 
+extern int	(*pg_popcount32) (uint32 word);
+extern int	(*pg_popcount64) (uint64 word);
 extern uint64 pg_popcount(const char *buf, int bytes);
 
+/* in pg_popcount.c */
+extern int	pg_popcount32_sse42(uint32 word);
+extern int	pg_popcount64_sse42(uint64 word);
+
 #endif			/* PG_BITUTILS_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 2da73260a13..d7290573c65 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -41,6 +41,13 @@ OBJS = $(LIBOBJS) $(PG_CRC32C_OBJS) chklocale.o erand48.o inet_net_ntop.o \
 	qsort.o qsort_arg.o quotes.o snprintf.o sprompt.o strerror.o \
 	tar.o thread.o
 
+# If the compiler supports a flag for the POPCOUNT instruction, we compile
+# pg_popcount.o with it.  (Whether to actually use the functions therein is
+# determined at runtime by testing CPUID flags.)
+ifneq ($(CFLAGS_POPCNT),)
+OBJS += pg_popcount.o
+endif
+
 # libpgport.a, libpgport_shlib.a, and libpgport_srv.a contain the same files
 # foo.o, foo_shlib.o, and foo_srv.o are all built from foo.c
 OBJS_SHLIB = $(OBJS:%.o=%_shlib.o)
@@ -78,10 +85,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 
-# pg_bitutils.c needs CFLAGS_POPCNT
-pg_bitutils.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+# pg_popcount.c needs CFLAGS_POPCNT
+pg_popcount.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
 
 #
 # Shared library versions of object files
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index aac394fe927..23d317d111e 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -10,7 +10,6 @@
  *
  *-
  */
-
 #include "postgres.h"
 
 #ifdef HAVE__GET_CPUID
@@ -23,61 +22,21 @@
 
 #include "port/pg_bitutils.h"
 
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+#ifdef HAVE__BUILTIN_POPCOUNT
 static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount32_sse42(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount64_sse42(uint64 word);
-#endif
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-static bool pg_lzcnt_available(void);
-#endif
-
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-static int pg_rightmost_one32_choose(uint32 word);
-static int pg_rightmost_one32_abm(uint32 

Re: Ryu floating point output patch

2019-02-14 Thread Andrew Dunstan


On 2/14/19 12:42 PM, Andrew Dunstan wrote:
> On 2/14/19 12:24 PM, Andrew Gierth wrote:
>> Andrew,
>>
>> Is there any chance you can get me the regress/results/float[48].out
>> files from lorikeet and jacana? It would help a lot.
>>
>> Seeing the diffs isn't enough, because I want to know if the float8 test
>> (which passes, so there's no diff) is matching the standard file or the
>> -small-is-zero file.
>>
> Yeah, later on today or tomorrow.
>
>
> Maybe we should have pg_regress report on the files it matches ...
>
>


Rather than give you the files, I will just tell you, in both cases it
is matching float8.out, not the small-is-zero file.  As you can see in
the buildfarm web, the float4 tests are failing and being compared
against float4.out


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: libpq host/hostaddr/conninfo inconsistencies

2019-02-14 Thread Fabien COELHO




On 2018-10-26 09:21:51 +0200, Fabien COELHO wrote:

(1) you are somehow against changing the current implementation, eg erroring
out on possibly misleading configurations, because you do not think it is
really useful to help users in those cases.


I find this formulation somewhat passive aggressive.


I do not understand what you mean by that expression.

I was just trying to sum-up Robert's opposition to erroring on misleading 
configurations (eg "host=1.2.3.4 hostaddr=4.3.2.1") instead of complying 
to it whatever, as is currently done. Probably my phrasing could be 
improved, but I do not think that I misrepresented Robert's position.


Note that the issue is somewhat mitigated by 6e5f8d489a: \conninfo now 
displays more precise information, so that at least you are not told 
that you are connected to a socket when you are really connected to an ip, 
or to one ip when you are really connected to another.


--
Fabien.



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Tom Lane
Andres Freund  writes:
> On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:
>> Hah, I just realized you have to add -mlzcnt in order for these builtins
>> to use the lzcnt instructions.  It goes from something like
>> 
>> bsrq %rax, %rax
>> xorq $63, %rax

> I'm confused how this is a general count-leading-zeros operation? Did you
> use constants or something that allowed it to infer a range in the test? If
> so the compiler probably did some optimizations allowing it to do the
> above.

No.  If you compile

int myclz(unsigned long long x)
{
  return __builtin_clzll(x);
}

at -O2, on just about any x86_64 gcc, you will get

myclz:
.LFB1:
.cfi_startproc
bsrq%rdi, %rax
xorq$63, %rax
ret
.cfi_endproc

regards, tom lane



Re: Planning counters in pg_stat_statements (using pgss_store)

2019-02-14 Thread legrand legrand
Thank you Sergei for your comments,

> Did you register patch in CF app? I did not found entry.
created today: https://commitfest.postgresql.org/22/1999/

> Currently we have pg_stat_statements 1.7 version and this patch does not
> apply... 
will rebase and create a 1.8 version

> -  errmsg("could not read file \"%s\": %m",
> +  errmsg("could not read pg_stat_statement file \"%s\": 
> %m",
this is a mistake, will fix

> +#define PG_STAT_STATEMENTS_COLS_V1_4 25
I thought it was needed when adding new columns, isn't it ?

> And this patch does not have documentation changes.
will fix

and will provide some kind of benchmark to compare with the current version.

Regards
PAscal 



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: pgbench - add pseudo-random permutation function

2019-02-14 Thread Fabien COELHO



Hello Andres,


+# PGAC_C_BUILTIN_CLZLL


I think this has been partially superceded by

commit 711bab1e4d19b5c9967328315a542d93386b1ac5
Author: Alvaro Herrera 
Date:   2019-02-13 16:10:06 -0300


Indeed, the patch needs a rebase & conflict resolution. I'll do it. Later.


   
+Function pr_perm implements a pseudo-random permutation.
+It allows to mix the output of non uniform random functions so that
+values drawn more often are not trivially correlated.
+It permutes integers in [0, size) using a seed by applying rounds of
+simple invertible functions, similarly to an encryption function,
+although beware that it is not at all cryptographically secure.
+Compared to hash functions discussed above, the function
+ensures that a perfect permutation is applied: there are no collisions
+nor holes in the output values.
+Values outside the interval are interpreted modulo the size.
+The function errors if size is not positive.
+If no seed is provided, :default_seed is used.
+For a given size and seed, the function is fully deterministic: if two
+permutations on the same size must not be correlated, use distinct seeds
+as outlined in the previous example about hash functions.
+  


This doesn't really explain why we want this in pgbench.


Who is "we"?

If someone runs non-uniform tests, i.e. with random_exp/zipf/gauss, close 
values are drawn with a similar frequency, thus correlated, inducing an 
undeserved correlation at the page level (e.g. for reads) and better 
performance than would be the case if relative frequencies were not 
correlated to key values.


So the function allows more realistic non-uniform tests, whereas 
currently we can only have non-uniform tests with very unrealistically 
correlated values at the key level, and possibly at the page level, meaning 
non-representative performance because of this induced bias.
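
As a concrete illustration (a sketch only: pr_perm is the function proposed by
the patch, so its name and signature are not settled; the scale arithmetic
follows pgbench's built-in scripts), the permutation spreads the frequently
drawn small values of the zipfian draw across the whole key space:

\set size 100000 * :scale
\set r random_zipfian(1, :size, 1.5)
\set aid 1 + pr_perm(:r, :size)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;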


This is under the assumption that pgbench should allow more realistic 
performance test scenarios, which I believe is a desirable purpose. If 
someone disagrees with this purpose, then they would consider both the 
non-uniform random functions and this proposed pseudo-random permutation 
function useless, as probably most other additions to pgbench.


--
Fabien.



Re: libpq debug log

2019-02-14 Thread Jacob Champion
On Thu, Feb 14, 2019 at 10:17 AM Andres Freund  wrote:
> On 2018-11-28 23:20:03 +0100, Peter Eisentraut wrote:
> > This does not excite me.  It seems mostly redundant with using tcpdump.
>
> I think the one counter-argument to this is that using tcpdump in
> real-world scenarios has become quite hard, due to encryption.

+1. Another difficulty is having the OS permissions to do the raw
packet dumps in the first place.

--Jacob



Re: pg11.1: dsa_area could not attach to segment

2019-02-14 Thread Thomas Munro
On Fri, Feb 15, 2019 at 5:36 AM Sergei Kornilov  wrote:
> > Do you think that plausibly explains and resolves symptoms of bug#15585, 
> > too?
>
> I think yes. Bug#15585 was raised only after "dsa_area could not attach to 
> segment" in a different parallel worker. The leader is stuck because it is 
> waiting for all parallel workers, but one worker has unexpected recursion in 
> dsm_backend_shutdown [1] and will never shut down. The backtrace shows a 
> previous error in this backend, "cannot unpin a segment that is not pinned"; 
> the root cause is earlier and in a different process.

Agreed.  Even though it's an unpleasant failure mode, I'm not entirely
sure if it's a good idea to make changes to avoid it.  We could move
the code around so that the error is raised after releasing the lock,
but then you'd either blow the stack or loop forever due to longjmp (I
haven't checked which).  To avoid that you'd have to clean out the
book-keeping in shared memory eagerly so that at the next level of
error recursion you've at least made progress (and admittedly there
are examples of things like that in the code), but how far should we
go to tolerate cases that shouldn't happen?  Practically, if we had
that behaviour and this bug, you'd eventually eat all the DSM slots
with leaked segments of shared memory, and your system wouldn't work
too well.  For now I think it's better to treat the root cause of the
unexpected error.

-- 
Thomas Munro
http://www.enterprisedb.com



2019-03 CF Summary / Review - Tranche #1

2019-02-14 Thread Andres Freund
Hi,

As last year [1], I'll try to summarize all commitfest items in 2019-03
to see which I think could realistically be put into 12.

Going through all non bugfix CF entries. Here's the summary for the
entries I could stomach today:


RFC: ready for committer
NR: needs review
WOA: waiting on author.


- pgbench - another attempt at tap test for time-related options

  NR. This was already around last year.   I think it'd be fair to argue
  that there's not been a ton of push to get this committed.


- Psql patch to show access methods info

  This adds \ commands to display many properties of [index ]access methods.

  NR.  This patch has gotten a fair bit of committer feedback via
  Alvaro.  I personally am not particularly convinced this is
  functionality that many people are going to use.


- Show size of partitioned table

  NR.  There seems to have been plenty discussion over details. Feels
  like this ought to be committable for v12?


- pgbench - add pseudo-random permutation function

  WOA.  I'm not clear as to why we'd want to add this to pgbench. To
  revive a discussion from last year's thread, I feel like we're adding
  more code to pgbench than we can actually usefully use.


- libpq host/hostaddr consistency

  NR.  I think the patches in this needs a few more committer eyes. It's
  possible we just should fix the documentation, or go further and
  change the behaviour.  Feels like without more senior attention,
  this'll not be resolved.

- pg_dump multi VALUES INSERT

  NR.  There seems to be some tentative agreement, excepting Tom, that
  we probably want this feature. There has been plenty review /
  improvements.

  Seems like it ought to be possible to get this into v12.

- libpq trace log

  NR.  There seems to be considerable debate about what exactly this
  feature should do, and the code hasn't yet seen a lot of review. I
  think we ought to just target v13 here, there seems to be some
  agreement that there's a problem, just not exactly what the solution
  is.

  Andres: punt to v13.


- pg_dumpall --exclude-database option

  RFC, and author is committer.


- Add sqlstate output mode to VERBOSITY

  RFC, there seems to be agreement.


- DECLARE STATEMENT syntax support in ECPG

  NR.  There seems to be some tentative agreement that this is
  desirable. But the patch was only recently (2018-12-16) revived, and
  hasn't yet gotten a ton of review.  I pinged Michael Meskes, to make
  him aware of this work.

  Andres: punt to v13.


- A new data type 'bytea' for ECPG host variable

  NR: This seems to be much closer to being ready than the
  above. Michael has done a few review cycles.  Not sure if it'll be
  done by 12, but it seems to have a good chance.


- \describe: verbose commands in psql

  NR: This seems like a relatively large change, and only submitted
  2019-01-24. Per our policy to not add nontrivial work to the last CF,
  I think we should not consider this a candidate for v12.

  Andres: punt to v13.


- documenting signal handling with readme

  WOA: I'm very unclear as to why this is something worth documenting in
  this manner. Right now I'm not clear what the audience of this is
  supposed to be.


- First SVG graphic

  NR: My reading of the discussion makes this look like we'll probably
  have graphics in v12's docs. Neat.


- Update INSTALL file

  WOA: It seems we've not really come to much of a conclusion what this
  ought to contain.


- Make installcheck-world in a clean environment

  NR: I don't feel we really have agreement on what we want here. Thus
  I'm doubtful this is likely to be resolvable in short order.

  Andres: lightly in favor of punting to v13


- Avoid creation of the free space map for small tables

  NR: the majority of this patch has been committed, I assume the
  remaining pieces will too.


- Adding a TAP test checking data consistency on standby with
  minRecoveryPoint

  NR: Seems like it ought to be committable, Andrew Gierth did provide
  some feedback.


- idle-in-transaction timeout error does not give a hint

  NR: Seems trivial enough.


- commontype and commontypearray polymorphic types

  NR: To me this seems like a potentially large change, with user
  visible impact. As it was only submitted in December, and is clearly
  not yet reviewed much, I think we ought to punt on this for v12.

  Andres: punt to v13


- EXPLAIN with information about modified GUC values

  NR: The patch seems to be getting closer to completion, but I'm not
  sure how many actually agree that we want this.

  Andres: aim for v12, but we probably should discuss soon whether we
  actually want this.


- Include all columns in default names for foreign key constraints.

  WOA: This patch has been submitted to the last CF first. I think it's
  straddling the line after which we should just refuse that pretty
  closely.  Not sure.


- Shared-memory based stats collector

  WOA: I think this patch has quite some promise and solves quite a few
  issues, 

Re: log bind parameter values on error

2019-02-14 Thread Andres Freund
Hi,

tiny scroll-through review.

On 2019-01-28 00:15:51 +, Alexey Bashtanov wrote:
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index b6f5822..997e6e8 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -6274,6 +6274,23 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
>
>   
>  
> +  xreflabel="log_parameters_on_error">
> +  log_parameters_on_error (boolean)
> +  
> +   log_parameters_on_error configuration 
> parameter
> +  
> +  
> +  
> +   
> +Controls whether the statement is logged with bind parameter values. 
> +It adds some overhead, as postgres will cache textual
> +representations of parameter values in memory for all statements,
> +even if they eventually do not get logged. The default is
> +off. Only superusers can change this setting.
> +   
> +  
> + 

This needs a bit of language polishing.



> @@ -31,6 +31,8 @@
>   * set of parameter values.  If dynamic parameter hooks are present, we
>   * intentionally do not copy them into the result.  Rather, we forcibly
>   * instantiate all available parameter values and copy the datum values.
> + *
> + * We don't copy textual representations here.
>   */

That probably needs to be expanded on. Including a comment about what
guarantees that there are no memory lifetime issues.


> - /* Free result of encoding conversion, if any */
> - if (pstring && pstring != pbuf.data)
> - pfree(pstring);
> + if (pstring)
> + {
> + if (need_text_values)
> + {
> + if (pstring == pbuf.data)
> + {
> + /*
> +  * Copy textual 
> representation to portal context.
> +  */
> + 
> params->params[paramno].textValue =
> + 
> pstrdup(pstring);
> + }
> + else
> + {
> + /* Reuse the result of 
> encoding conversion for it */
> + 
> params->params[paramno].textValue = pstring;
> + }
> + }
> + else
> + {
> + /* Free result of encoding 
> conversion */
> + if (pstring != pbuf.data)
> + pfree(pstring);
> + }
> + }

So the parameters we log here haven't actually gone through the output
function? Isn't that an issue? I mean that'll cause the contents to
differ from the non-error case, no? And differs from the binary protocol
case?
>   else
>   {
> + /*
> +  * We do it for non-xact commands only, as an xact command
> +  * 1) shouldn't have any parameters to log;
> +  * 2) may have the portal dropped early.
> +  */
> + Assert(current_top_portal == NULL);
> + current_top_portal = portal;
> + portalParams = NULL;
> +

"it"? ISTM the comment doesn't really stand on its own?



> +/*
> + * get_portal_bind_parameters
> + *   Get the string containing parameters data, is used for logging.
> + *
> + * Can return NULL if there are no parameters in the portal
> + * or the portal is not valid, or the text representations of the parameters 
> are
> + * not available. If returning a non-NULL value, it allocates memory
> + * for the returned string in the current context, and it's the caller's
> + * responsibility to pfree() it if needed.
> + */
> +char *
> +get_portal_bind_parameters(ParamListInfo params)
> +{
> + StringInfoData param_str;
> +
> + /* No parameters to format */
> + if (!params || params->numParams == 0)
> + return NULL;
> +
> + elog(WARNING, "params->hasTextValues=%d, 
> IsAbortedTransactionBlockState()=%d",
> +params->hasTextValues && 
> IsAbortedTransactionBlockState());

Err, huh?


Greetings,

Andres Freund



Re: log bind parameter values on error

2019-02-14 Thread Andres Freund
Hi,

On 2018-12-14 23:04:26 +, Alexey Bashtanov wrote:
> Unfortunately, when enabled, the feature comes with some memory and CPU
> overhead,
> as we cannot convert certain values to text when in aborted transaction.

Have you analyzed how invasive it'd be to delay that till we actually
can safely start a [sub]transaction to do that logging? Probably too
expensive, but it seems like something that ought to be analyzed.

- Andres



Re: Log a sample of transactions

2019-02-14 Thread Andres Freund
Hi,

On 2019-01-26 11:44:58 +0100, Adrien NAYRAT wrote:
> +  xreflabel="log_transaction_sample_rate">
> +  log_transaction_sample_rate 
> (real)
> +  
> +   log_transaction_sample_rate configuration 
> parameter
> +  
> +  
> +   
> +
> + Set the fraction of transactions whose statements are logged. It 
> applies
> + to each new transaction regardless of their duration. The default is
> + 0, meaning do not log statements from this 
> transaction.
> + Setting this to 1 logs all statements for all 
> transactions.
> + Logging can be disabled inside a transaction by setting this to 0.
> + log_transaction_sample_rate is helpful to track a
> + sample of transaction.
> +
> +   
> +  
> +
>   

I wonder if this doesn't need a warning, explaining that using this when
there are large transactions could lead to slowdowns.


>  
> diff --git a/src/backend/access/transam/xact.c 
> b/src/backend/access/transam/xact.c
> index 7c3a9c1e89..1a52c10c39 100644
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -1821,6 +1821,9 @@ StartTransaction(void)
>   s->state = TRANS_START;
>   s->transactionId = InvalidTransactionId;/* until assigned */
>  
> + /* Determine if we log statements in this transaction */
> + xact_is_sampled = random() <= log_xact_sample_rate * MAX_RANDOM_VALUE;
> +
>   /*
>* initialize current transaction state fields
>*
> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
> index e773f20d9f..a11667834e 100644
> --- a/src/backend/tcop/postgres.c
> +++ b/src/backend/tcop/postgres.c
> @@ -100,7 +100,8 @@ int   max_stack_depth = 100;
>  /* wait N seconds to allow attach from a debugger */
>  int  PostAuthDelay = 0;
>  
> -
> +/* flag for logging statements in a transaction */
> +bool xact_is_sampled = false;

It seems pretty weird to have this in postgres.c, given you declared it
in xact.h?


> @@ -2218,7 +2221,8 @@ check_log_statement(List *stmt_list)
>  int
>  check_log_duration(char *msec_str, bool was_logged)
>  {
> - if (log_duration || log_min_duration_statement >= 0)
> + if (log_duration || log_min_duration_statement >= 0 ||
> + (xact_is_sampled && log_xact_sample_rate > 0))

Why are both of these checked? ISTM once xact_is_sampled is set, we
ought not to respect a different value of log_xact_sample_rate for the
rest of that transaction.
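
A one-line sketch of what that suggests (using the variable names from the
quoted patch; only illustrative):

	if (log_duration || log_min_duration_statement >= 0 || xact_is_sampled)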


As far as I can tell xact_is_sampled is not properly reset in case of
errors?


Greetings,

Andres Freund



Re: more unconstify use

2019-02-14 Thread Peter Eisentraut
On 13/02/2019 19:51, Mark Dilger wrote:
> Peter, so sorry I did not review this patch before you committed.  There
> are a few places where you unconstify the argument to a function where
> changing the function to take const seems better to me.  I argued for
> something similar in 2016,

One can consider unconstify a "todo" marker.  Some of these could be
removed by some API changes.  It needs case-by-case consideration.
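
For readers unfamiliar with it, a simplified sketch of the idea (the real
macro in c.h adds a compile-time check that the expression really has the
const-qualified type, which is omitted here):

#define unconstify(underlying_type, expr)  ((underlying_type) (expr))

	/* marks a spot where constness is deliberately cast away, e.g.: */
	md5_calc(unconstify(uint8 *, input + i), ctxt);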

> Your change:
> 
> - md5_calc((uint8 *) (input + i), ctxt);
> + md5_calc(unconstify(uint8 *, (input + i)), ctxt);
> 
> Perhaps md5_calc's signature should change to
> 
>   md5_calc(const uint8 *, md5_ctxt *)
> 
> since it doesn't modify the first argument.

Fixed, thanks.

> Your change:
> 
> - if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) newtup, 
> newsz))
> + if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) 
> unconstify(BrinTuple *, newtup), newsz))
> 
> Perhaps PageIndexTupleOverwrite's third argument should be const, since it
> only uses it as the const source of a memcpy.  (This is a bit harder than
> for md5_calc, above, since the third argument here is of type Item, which
> is itself a typedef to Pointer, and there exists no analogous ConstPointer
> or ConstItem definition in the sources.)

Yeah, typedefs to a pointer are a poor fit for this.  We have been
getting rid of them from time to time, but I don't know a general solution.

> Your change:
> 
> - XLogRegisterBufData(0, (char *) newtup, newsz);
> + XLogRegisterBufData(0, (char *) unconstify(BrinTuple *, 
> newtup), newsz);
> 
> Perhaps the third argument to XLogRegisterBufData can be changed to const,
> with the extra work of changing XLogRecData's data field to const.  I'm not
> sure about the amount of code churn this would entail.

IIRC, the XLogRegister stuff is a web of lies with respect to constness.
 Resolving this properly is tricky.

> Your change:
> 
> - result = pg_be_scram_exchange(scram_opaq, input, inputlen,
> + result = pg_be_scram_exchange(scram_opaq, unconstify(char *, 
> input), inputlen,
> 
> , ,
> 
> logdetail);
> 
> pg_be_scram_exchange passes the second argument into two functions,
> read_client_first_message and read_client_final_message, both of which take
> it as a non-const argument but immediately pstrdup it such that it might
> as well have been const.  Perhaps we should just clean up this mess rather
> than unconstifying.

Also fixed!

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

2019-02-14 Thread Andres Freund
Hi Kyotaro,

On 2019-02-07 13:10:08 -0800, Andres Freund wrote:
> I don't think this is all that close to being committable :(

Are you planning to update this soon? I think this needs to be improved
pretty quickly to have any shot at getting into v12. I'm willing to put
in some resources towards that, but I definitely don't have the
resources to entirely polish it from my end.

Greetings,

Andres Freund



Re: explain plans with information about (modified) gucs

2019-02-14 Thread Andres Freund
Hi,

On 2019-01-15 02:39:49 +0100, Tomas Vondra wrote:
> 
> 
> On 1/14/19 11:13 PM, Alvaro Herrera wrote:
> > On 2019-Jan-14, Tomas Vondra wrote:
> > 
> >> The one slightly annoying issue is that currently all the options are
> >> formatted as text, including e.g. cpu_tuple_cost. That's because the GUC
> >> options may have show hook, so I can't access the value directly (not
> >> sure if there's an option around it).
> > 
> > I think the problem is that you'd have to know how to print the value,
> > which can be in one of several different C types.  You'd have to grow
> > some more infrastructure in the GUC tables, I think, and that doesn't
> > seem worth the trouble.  Printing as text seems enough.
> > 
> 
> I don't think the number of formats is such a big issue - the range of
> formats is quite limited: PGC_BOOL, PGC_INT, PGC_REAL, PGC_STRING and
> PGC_ENUM. But the show hook simply returns a string, and I'm not sure it's
> guaranteed it matches the raw value (afaik the assign/show hooks can do
> all kinds of funny stuff).

Yea, but the underlying values are quite useless for
humans. E.g. counting shared_buffers, wal_buffers etc in weird units is
not helpful. So you'd need to support the different units for each.

Greetings,

Andres Freund



Re: Per-tablespace autovacuum settings

2019-02-14 Thread Oleksii Kliukin
Tom Lane  wrote:

> Oleksii Kliukin  writes:
>> Is there any interest in making autovacuum parameters available on a
>> tablespace level in order to apply those to all vacuumable objects in the
>> tablespace?
> 
> I understand what you want to accomplish, and it doesn't seem
> unreasonable.  But I just want to point out that the situation with
> vacuuming parameters is on the edge of unintelligible already; adding
> another scope might push it over the edge.  In particular there's no
> principled way to decide whether an autovacuum parameter set at an outer
> scope should override a plain-vacuum parameter set at a narrower scope.

My naive understanding is that vacuum and autovacuum should decide
independently which scope applies, going from the most specific (per-table
for autovacuum, per-DB for vacuum) to the broader scopes, ending with
configuration parameters at the outermost scope.  Both *_cost_limit and
*_cost_delay should be taken from the current vacuum scope only if the
effective autovacuum settings yield -1.

> And it's really questionable which of database-wide and tablespace-wide
> should be seen as a narrower scope in the first place.

AFAIK we don’t allow setting autovacuum options per-database; nor do I
suggest enabling plain vacuum to be configured per-tablespace; as a result,
we won’t be deciding between databases and tablespaces, unless we want to do
cross-lookups from autovacuum to the outer scope of plain vacuum options
before considering autovacuum’s own outer scope, and I don’t see any reason
to do that.

> I don't know how to make this better, but I wish we'd take a step
> back and think about it rather than just accreting more and more
> complexity.

I am willing to do the refactoring if necessary; is there any particular
place in the code that is indicative of the issue?

Regards,
Oleksii Kliukin




Re: make installcheck-world in a clean environment

2019-02-14 Thread Andres Freund
Hi,

On 2019-02-04 11:11:03 +0900, Michael Paquier wrote:
> On Mon, Dec 03, 2018 at 11:58:13AM +0300, Alexander Lakhin wrote:
> > Rebased the patch once more after d3c09b9b.
> 
> The patch is ready for committer, so it has not attracted much
> attention.  Perhaps Teodor or Alexander could look at it?

I'm confused. I don't see the patch as ready, given the discussion, nor
does
https://commitfest.postgresql.org/22/1672/
contain a record of it ever being set to RFC? Did you intend to reply to
a different thread?

Greetings,

Andres Freund



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Andres Freund
Hi,

On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:
> On 2019-Feb-14, Tom Lane wrote:
> 
> > Some further thoughts here ...
> > 
> > Does the "lzcnt" runtime probe actually do anything useful?
> > On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz
> > and __builtin_ctz compile to sequences involving bsrq and bsfq
> > regardless of -mpopcnt.  It's fairly hard to see how lzcnt would
> > buy anything over those sequences even if there were zero overhead
> > involved in using it.
> 
> Hah, I just realized you have to add -mlzcnt in order for these builtins
> to use the lzcnt instructions.  It goes from something like
> 
>   bsrq    %rax, %rax
>   xorq    $63, %rax

I'm confused how this is a general count-leading-zeros operation. Did you
use constants or something that allowed it to infer a range in the test? If
so, the compiler probably did some optimizations allowing it to do the
above.


> to
>   lzcntq  %rax, %rax
> 
> Significant?

If I understand Agner's tables correctly, then no, this isn't faster
than the two instructions above.
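
For what it's worth, the bsrq/xorq pair is a general clz for nonzero inputs:
bsr gives the index of the highest set bit, clz(x) = 63 - bsr(x) for 64-bit x,
and since bsr(x) is in [0, 63], 63 - bsr(x) equals 63 ^ bsr(x), which is what
the xorq $63 computes.  A quick standalone check (assuming a gcc/clang-style
__builtin_clzll):

#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit, like x86 BSR (undefined for x == 0). */
static int
bsr64(uint64_t x)
{
	int			i = 0;

	while (x >>= 1)
		i++;
	return i;
}

int
main(void)
{
	for (uint64_t x = 1; x != 0; x <<= 1)
	{
		uint64_t	v = x | 1;	/* nonzero values with varying top bit */

		/* clz = 63 - bsr, and 63 - n == 63 ^ n for 0 <= n <= 63 */
		assert(__builtin_clzll(v) == (63 ^ bsr64(v)));
	}
	return 0;
}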


Greetings,

Andres Freund



Re: Early WIP/PoC for inlining CTEs

2019-02-14 Thread Merlin Moncure
On Thu, Feb 14, 2019 at 10:02 AM Alvaro Herrera
 wrote:
>
> On 2019-Feb-14, Peter Eisentraut wrote:
>
> > On 14/02/2019 16:11, Tom Lane wrote:
> > > ... so, have we beaten this topic to death yet?  Can we make a decision?
> > >
> > > Personally, I'd be happy with either of the last two patch versions
> > > I posted (that is, either AS [[NOT] MATERIALIZED] or
> > > AS [MATERIALIZE [ON|OFF]] syntax).  But we gotta pick something.
> >
> > If we're not really planning to add any more options, I'd register a
> > light vote for MATERIALIZED.  It reads easier, seems more grammatically
> > correct, and uses an existing word.
>
> +1 for MATERIALIZED, as I proposed in
> https://postgr.es/m/20170503173305.fetj4tz7kd56tjlr@alvherre.pgsql

Seconded!

merlin



Re: Proposal for Signal Detection Refactoring

2019-02-14 Thread Andres Freund
Hi,

On 2019-01-23 11:55:09 +0100, Chris Travers wrote:
> +Implementation Notes on Globals and Signal/Event Handling
> +=
> +
> +The approch to signal handling in PostgreSQL is designed to strictly conform
> +with the C89 standard and designed to run with as few platform assumptions as
> +possible.

I'm not clear as to what that means. For one, we don't target C89
anymore; for another, a lot of the signal handling stuff we do isn't
defined by C99 but by versions of POSIX.


> +The primary constraint in signal handling is that things are interruptable.
> +This means that the signal handler may be interrupted at any point and that
> +execution may return to it at a later point in time.

That seems wrong. The primary problem is that *non*-signal-handler code
can be interrupted at any time. Sure, signal handlers interrupting each
other is a thing, but comparatively that doesn't add a ton of
complexity.


> +How PostgreSQL Handles Signals
> +--
> +
> +Most signals (except SIGSEGV, and SIGKILL) are blocked by PostgreSQL
> +during process startup.  Certain signals are given specific meaning and
> +trapped by signal handlers.  The primary signal handlers of the backends, 
> +located in src/backend/tcop/postgres.c, typically just write to variables of
> +sig_atomic_t (as documented below) and return control back to the main code.

Note that the other absolutely crucial thing is to save/restore errno.

I don't really think that signal handlers "return control back to the
main code".


> +An exception is made for SIGQUIT which is used by the postmaster to terminate
> +backend sessions quickly when another backend dies so that the postmaster
> +may re-initialize shared memory and otherwise return to a known-good state.
> +
> +The signals are then checked later when the CHECK_FOR_INTERRUPTS() macro is

s/signals/variables/


> +called.  This macro conditionally calls CheckPendingInterrupts in 
> +src/backend/tcop/postgres.c if InterruptPending is set.  This allows for
> +query cancellation and process termination to be done, under ordiary cases,
> +in a timely and orderly way, without posing problems for shared resources 
> such
> +as shard memory and semaphores.

s/shard/shared/


> +CHECK_FOR_INTERRUPTS() may be called only when there is no more cleanup to do
> +because the query or process may be aborted.  However query cancellation
> +requests and SIGTERM signals will not be processed until the next time this 
> is
> +called.

I don't think that's correct. Most resources can be cleaned up at
transaction abort.


> +Checking and Handling Interrupts
> +
> +
> +CHECK_FOR_SIGNALS() may cause a non-local exit because the function it wraps
> +utilizes PostgreSQL's built-in exception framework (ereport) to abort queries
> +and can call exit() to exit a orocess.

You mean CHECK_FOR_INTERRUPTS?


> +For this reason, it is imperative that CHECK_FOR_INTERRUPTS() is not called 
> in
> +places that require additional cleanup, such as dynamic shared memory, 
> temporary
> +files, or any other shared resource.  Instead CHECK_FOR_INTERRUPTS() should 
> be
> +called at the next point when it is safe to do so.

That's largely wrong; there are cleanup mechanisms for most of the above.


Greetings,

Andres Freund



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Tom Lane
Alvaro Herrera  writes:
> Hah, I just realized you have to add -mlzcnt in order for these builtins
> to use the lzcnt instructions.  It goes from something like

>   bsrq    %rax, %rax
>   xorq    $63, %rax

> to
>   lzcntq  %rax, %rax

> Significant?

I'd bet a fair amount of money that we'd be better off *not* using
lzcnt, even if available, because then we could just expose things
along this line:

static inline int
pg_clz(...)
{
#ifdef HAVE__BUILTIN_CLZ
return __builtin_clz(x);
#else
handwritten implementation;
#endif
}
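
One possible portable fallback for that handwritten branch (just a sketch,
not necessarily what would end up in pg_bitutils.c):

#include <stdint.h>

/*
 * Count leading zeros in a nonzero 64-bit word without compiler builtins:
 * locate the highest set bit by halving the search range.
 */
int
pg_clz64_fallback(uint64_t x)
{
	int			n = 0;

	if ((x >> 32) == 0) { n += 32; x <<= 32; }
	if ((x >> 48) == 0) { n += 16; x <<= 16; }
	if ((x >> 56) == 0) { n += 8;  x <<= 8; }
	if ((x >> 60) == 0) { n += 4;  x <<= 4; }
	if ((x >> 62) == 0) { n += 2;  x <<= 2; }
	if ((x >> 63) == 0) { n += 1; }

	return n;
}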

Avoiding a function call (that has to indirect through a pointer) probably
saves much more than the difference between lzcnt and the other way.

The tradeoff might be different for popcount, though, especially since
it looks like __builtin_popcount() is not nearly as widely available
as the clz/ctz builtins.

regards, tom lane



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-14 Thread Alvaro Herrera
On 2019-Feb-14, Tom Lane wrote:

> Some further thoughts here ...
> 
> Does the "lzcnt" runtime probe actually do anything useful?
> On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz
> and __builtin_ctz compile to sequences involving bsrq and bsfq
> regardless of -mpopcnt.  It's fairly hard to see how lzcnt would
> buy anything over those sequences even if there were zero overhead
> involved in using it.

Hah, I just realized you have to add -mlzcnt in order for these builtins
to use the lzcnt instructions.  It goes from something like

	bsrq    %rax, %rax
	xorq    $63, %rax

to
lzcntq  %rax, %rax

Significant?

I have this patch now, written before I realized the above; I'll augment
it to cater to this (adding -mlzcnt and a new set of functions --
perhaps a new file "lzcnt.c" or maybe put the lot in pg_popcount.c and
rename it?) and resubmit after an errand I have to run now.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 477cc89802effed5d80d4a82891c7ef3b5e61e63
Author: Alvaro Herrera 
AuthorDate: Thu Feb 14 15:18:03 2019 -0300
CommitDate: Thu Feb 14 15:31:32 2019 -0300

fix popcount etc

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 148c5550573..53787a7ef32 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -23,4 +23,8 @@ extern int (*pg_leftmost_one64) (uint64 word);
 
 extern uint64 pg_popcount(const char *buf, int bytes);
 
+/* in pg_popcount.c */
+extern int pg_popcount32_sse42(uint32 word);
+extern int pg_popcount64_sse42(uint64 word);
+
 #endif			/* PG_BITUTILS_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 2da73260a13..d7290573c65 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -41,6 +41,13 @@ OBJS = $(LIBOBJS) $(PG_CRC32C_OBJS) chklocale.o erand48.o inet_net_ntop.o \
 	qsort.o qsort_arg.o quotes.o snprintf.o sprompt.o strerror.o \
 	tar.o thread.o
 
+# If the compiler supports a flag for the POPCOUNT instruction, we compile
+# pg_popcount.o with it.  (Whether to actually use the functions therein is
+# determined at runtime by testing CPUID flags.)
+ifneq ($(CFLAGS_POPCNT),)
+OBJS += pg_popcount.o
+endif
+
 # libpgport.a, libpgport_shlib.a, and libpgport_srv.a contain the same files
 # foo.o, foo_shlib.o, and foo_srv.o are all built from foo.c
 OBJS_SHLIB = $(OBJS:%.o=%_shlib.o)
@@ -78,10 +85,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 
-# pg_bitutils.c needs CFLAGS_POPCNT
-pg_bitutils.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+# pg_popcount.c needs CFLAGS_POPCNT
+pg_popcount.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
 
 #
 # Shared library versions of object files
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index aac394fe927..fc8de518791 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -10,7 +10,6 @@
  *
  *-
  */
-
 #include "postgres.h"
 
 #ifdef HAVE__GET_CPUID
@@ -23,61 +22,52 @@
 
 #include "port/pg_bitutils.h"
 
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+#ifdef HAVE__BUILTIN_POPCOUNT
 static bool pg_popcount_available(void);
 static int pg_popcount32_choose(uint32 word);
-static int pg_popcount32_sse42(uint32 word);
+static int pg_popcount32_builtin(uint32 word);
 static int pg_popcount64_choose(uint64 word);
-static int pg_popcount64_sse42(uint64 word);
-#endif
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-static bool pg_lzcnt_available(void);
-#endif
-
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-static int pg_rightmost_one32_choose(uint32 word);
-static int pg_rightmost_one32_abm(uint32 word);
-static int pg_rightmost_one64_choose(uint64 word);
-static int pg_rightmost_one64_abm(uint64 word);
-#endif
-static int pg_rightmost_one32_slow(uint32 word);
-static int pg_rightmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-static int pg_leftmost_one32_choose(uint32 word);
-static int pg_leftmost_one32_abm(uint32 word);
-static int pg_leftmost_one64_choose(uint64 word);
-static int pg_leftmost_one64_abm(uint64 word);
-#endif
-static int pg_leftmost_one32_slow(uint32 word);
-static int pg_leftmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+static int pg_popcount64_builtin(uint64 word);
 int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 

Re: \describe*

2019-02-14 Thread Andres Freund
Hi,

On 2019-01-24 20:37:48 -0500, Corey Huinker wrote:
> Attached is a patch to add verbose \describe commands to compliment our
> existing but slightly cryptic family of \d commands.

Given that this patch has been added to the last commitfest for v12, I
think we should mark it as targeting 13, so it can be skipped over by
people looking to get things into v12.  Even leaving fairness aside, I
don't think it's likely to be ready quickly enough...

Greetings,

Andres Freund



Re: libpq host/hostaddr/conninfo inconsistencies

2019-02-14 Thread Andres Freund
Hi,

On 2018-10-26 09:21:51 +0200, Fabien COELHO wrote:
> (1) you are somehow against changing the current implementation, eg erroring
> out on possibly misleading configurations, because you do not think it is
> really useful to help users in those cases.

I find this formulation somewhat passive aggressive.


> (2) you are not against improving the documentation, although you find it
> clear enough already, but you agree that some people could get confused.
> 
> The attached patch v4 only improves the documentation so that it reflects
> what the implementation really does. I think it too bad to leave out the
> user-friendly aspects of the patch, though.

Robert, any chance you could opine on the doc patch, given that's your
suggested direction?

- Andres



Re: libpq debug log

2019-02-14 Thread Andres Freund
Hi,

On 2018-11-28 23:20:03 +0100, Peter Eisentraut wrote:
> This does not excite me.  It seems mostly redundant with using tcpdump.

I think the one counter-argument to this is that using tcpdump in
real-world scenarios has become quite hard, due to encryption. Even with
access to the private key you cannot decrypt the stream.  Wonder if the
solution to that would be an option to write out the decrypted data into
a .pcap or such.

Greetings,

Andres Freund



Re: Per-tablespace autovacuum settings

2019-02-14 Thread Oleksii Kliukin
Andres Freund  wrote:

> Hi,
> 
> On 2019-02-14 17:56:17 +0100, Oleksii Kliukin wrote:
>> Is there any interest in making autovacuum parameters available on a
>> tablespace level in order to apply those to all vacuumable objects in the
>> tablespace?
>> 
>> We have a set of tables running on ZFS, where autovacuum does almost no good
>> to us (except for preventing anti-wraparound) due to the nature of ZFS (FS
>> fragmentation caused by copy-on-write leads to sequential scans doing random
>> access) and the fact that our tables there are append-only. Initially, the
>> team in charge of the application just disabled autovacuum globally, but
>> that lead to a huge system catalog bloat.
>> 
>> At present, we have to re-enable autovacuum globally and then disable it
>> per-table using table storage parameters, but that is inelegant and requires
>> doing it once for existing tables and modifying the script that periodically
>> creates new ones (the whole system is a Postgres-based replacement of an
>> ElasticSearch cluster and we have to create new partitions regularly).
> 
> Won't that a) lead to periodic massive anti-wraparound sessions? b)
> prevent any use of index only scans?

The wraparound is hardly an issue there, as the data is transient and only
exists for 14 days (I think the entire date-based partition is dropped,
that’s how we ended up with pg_class catalog bloat). The index-only scan can
be an issue, although, IIRC, there is some manual vacuum that runs from time
to time, perhaps following your advice below.

> ISTM you'd be better off running vacuum rarely, with large
> thresholds. That way it'd do most of the writes in one pass, hopefully
> leading to less fragmentation, and it'd set the visibilitymap bits to
> prevent further need to touch those. By doing it only rarely, vacuum
> should process pages sequentially, reducing the fragmentation.
> 
> 
>> Grouping tables by tablespaces for the purpose of autovacuum configuration
>> seems natural, as tablespaces are often placed on another filesystems/device
>> that may require changing how often does autovacuum run, make it less/more
>> aggressive depending on the I/O performance or require disabling it
>> altogether as in my example above. Furthermore, given that we allow
>> cost-based options per-tablespace the infrastructure is already there and
>> the task is mostly to teach autovacuum to look at tablespaces in addition to
>> the relation storage options (in case of a conflict, relation options should
>> always take priority).
> 
> While I don't buy the reasoning above, I think this'd be useful for
> other cases.

Even if we don’t want to disable autovacuum completely, we might want to
make it much less frequent by increasing the thresholds or costs/delays to
reduce the I/O strain for a particular tablespace.

Regards,
Oleksii Kliukin




Re: pgbench - add pseudo-random permutation function

2019-02-14 Thread Andres Freund
Hi,

On 2019-02-10 17:46:15 +, Hironobu SUZUKI wrote:
> I updated the patch. And also I added some explanations and simple examples
> in the modular_multiply function.

It'd be good to update the commitfest entry to say 'needs review' the
next time.


> +# PGAC_C_BUILTIN_CLZLL
> +# -
> +# Check if the C compiler understands __builtin_clzll(),
> +# and define HAVE__BUILTIN_CLZLL if so.
> +# Both GCC & CLANG seem to have one.
> +AC_DEFUN([PGAC_C_BUILTIN_CLZLL],
> +[AC_CACHE_CHECK(for __builtin_clzll, pgac_cv__builtin_clzll,
> +[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
> +[static unsigned long int x = __builtin_clzll(0xaabbccddeeff0011);]
> +)],
> +[pgac_cv__builtin_clzll=yes],
> +[pgac_cv__builtin_clzll=no])])
> +if test x"$pgac_cv__builtin_clzll" = xyes ; then
> +AC_DEFINE(HAVE__BUILTIN_CLZLL, 1,
> +  [Define to 1 if your compiler understands __builtin_clzll.])
> +fi])# PGAC_C_BUILTIN_CLZLL

I think this has been partially superseded by

commit 711bab1e4d19b5c9967328315a542d93386b1ac5
Author: Alvaro Herrera 
Date:   2019-02-13 16:10:06 -0300

Add basic support for using the POPCNT and SSE4.2s LZCNT opcodes

could you make sure it's integrated appropriately?

>
> +Function pr_perm implements a pseudo-random 
> permutation.
> +It allows to mix the output of non uniform random functions so that
> +values drawn more often are not trivially correlated.
> +It permutes integers in [0, size) using a seed by applying rounds of
> +simple invertible functions, similarly to an encryption function,
> +although beware that it is not at all cryptographically secure.
> +Compared to hash functions discussed above, the 
> function
> +ensures that a perfect permutation is applied: there are no collisions
> +nor holes in the output values.
> +Values outside the interval are interpreted modulo the size.
> +The function errors if size is not positive.
> +If no seed is provided, :default_seed is used.
> +For a given size and seed, the function is fully deterministic: if two
> +permutations on the same size must not be correlated, use distinct seeds
> +as outlined in the previous example about hash functions.
> +  

This doesn't really explain why we want this in pgbench.

Greetings,

Andres Freund



Re: Ryu floating point output patch

2019-02-14 Thread Andrew Dunstan


On 2/14/19 12:24 PM, Andrew Gierth wrote:
> Andrew,
>
> Is there any chance you can get me the regress/results/float[48].out
> files from lorikeet and jacana? It would help a lot.
>
> Seeing the diffs isn't enough, because I want to know if the float8 test
> (which passes, so there's no diff) is matching the standard file or the
> -small-is-zero file.
>

Yeah, later on today or tomorrow.


Maybe we should have pg_regress report on the files it matches ...


cheers


andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: libpq compression

2019-02-14 Thread Konstantin Knizhnik




On 14.02.2019 19:45, Dmitry Dolgov wrote:

For the records, I'm really afraid of interfering with the conversation at this
point, but I believe it's necessary for the sake of a good feature :)


On Wed, Feb 13, 2019 at 4:03 PM Konstantin Knizhnik  
wrote:

1. When decompressor has not enough data to produce any extra output, it
doesn't return error. It just not moving forward output position in the
buffer.

I'm confused, because here is what I see in zlib:

 // zlib.h
 inflate() returns Z_OK if some progress has been made, ... , Z_BUF_ERROR
 if no progress was possible or if there was not enough room in the output
 buffer when Z_FINISH is used.  Note that Z_BUF_ERROR is not fatal, and
 inflate() can be called again with more input and more output space to
 continue decompressing.

So, sounds like if no moving forward happened, there should be Z_BUF_ERROR. But
I haven't experimented with this yet to figure out.


Looks like I really missed this case.
I need to handle Z_BUF_ERROR in zlib_read:

            if (rc != Z_OK && rc != Z_BUF_ERROR)
            {
                return ZPQ_DECOMPRESS_ERROR;
            }

Strange that it works without it. It looks like a written compressed message
is only very rarely split because of socket buffer overflow.
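
For illustration, a stripped-down read path that treats Z_BUF_ERROR as "need
more input" rather than as a hard failure (buffer management simplified, and
this is not the actual zpq code from the patch):

#include <sys/types.h>
#include <zlib.h>

#define ZPQ_DECOMPRESS_ERROR (-2)

/*
 * Decompress whatever is currently available in 'in' into 'out'.  Returns
 * the number of decompressed bytes, 0 if more input is needed, or
 * ZPQ_DECOMPRESS_ERROR on a real decompression failure.
 */
ssize_t
decompress_some(z_stream *zs, char *in, size_t in_len,
		char *out, size_t out_len)
{
	int			rc;

	zs->next_in = (Bytef *) in;
	zs->avail_in = (uInt) in_len;
	zs->next_out = (Bytef *) out;
	zs->avail_out = (uInt) out_len;

	rc = inflate(zs, Z_NO_FLUSH);

	/*
	 * Z_BUF_ERROR just means no progress was possible with the input seen
	 * so far; the caller should read more bytes from the socket and retry.
	 */
	if (rc != Z_OK && rc != Z_STREAM_END && rc != Z_BUF_ERROR)
		return ZPQ_DECOMPRESS_ERROR;

	return (ssize_t) (out_len - zs->avail_out);
}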





Ok, but still I think that it is better to pass tx/rx function to stream.
There are two important advantages:
1. It eliminates code duplication.

Ok.


2. It allows to use (in future) this streaming compression not only for
libpq for for other streaming data.

Can you elaborate on this please?


All the logic for fetching enough data to successfully decompress a data
chunk is implemented in one place - in zpq_stream - and does not need to be
repeated in all the places where compression is used.
IMHO passing rx/tx functions to the compressor stream is quite a natural
model - it is the "decorator design pattern"
https://en.wikipedia.org/wiki/Decorator_pattern
(it is how, for example, streams are implemented in Java).
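
To make the shape of that interface concrete, a rough sketch of the
decorator idea with callback types (names are illustrative; the real
zpq_stream API in the patch may differ):

#include <stdlib.h>
#include <sys/types.h>

/* I/O callbacks supplied by the caller, e.g. wrappers around secure_read/secure_write. */
typedef ssize_t (*zpq_rx_func) (void *arg, void *buf, size_t size);
typedef ssize_t (*zpq_tx_func) (void *arg, const void *buf, size_t size);

typedef struct ZpqStreamSketch
{
	zpq_rx_func rx;
	zpq_tx_func tx;
	void	   *arg;			/* passed back to the callbacks */
	/* ... compression state (z_stream, buffers) would live here ... */
} ZpqStreamSketch;

/* Wrap raw transport callbacks in a compressing/decompressing stream. */
ZpqStreamSketch *
zpq_create_sketch(zpq_rx_func rx, zpq_tx_func tx, void *arg)
{
	ZpqStreamSketch *zs = malloc(sizeof(ZpqStreamSketch));

	if (zs == NULL)
		return NULL;
	zs->rx = rx;
	zs->tx = tx;
	zs->arg = arg;
	return zs;
}

/*
 * A zpq_read() would then loop: call zs->rx() to pull raw bytes, feed them
 * to the decompressor, and repeat until at least one decompressed byte is
 * available - without the caller knowing where the bytes come from.
 */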





Concerning "layering violation" may be it is better to introduce some other
functions something like inflate_read, deflate_write and call them instead of
secure_read. But from my point of view it will not improve readability and
modularity of code.

If we will unwrap the current compression logic to not contain tx/rx functions,
isn't it going to be the same as you describing it anyway, just from the higher
point of view? What I'm saying is that there is a compression logic, for it
some data would be read or written from it, just not right here an now by
compression code itself, but rather by already existing machinery (which could
be even beneficial for the patch implementation).


I do not understand why passing rx/tx functions to zpq_create violates the
existing machinery.






And I do not see any disadvantages.

The main disadvantage, as I see it, is that there is no agreement about this
approach. Probably in such situations it makes sense to experiment with
different suggestions, to see how would they look like - let's be flexible.


Well, from my point of view the approach with rx/tx is more flexible and
modular.
But if most other developers think that using read/read_drain is preferable,
then I will not complain about using your approach.





On Wed, Feb 13, 2019 at 8:34 PM Robbie Harwood  wrote:

In order to comply with your evident desires, consider this message a
courtesy notice that I will no longer be reviewing this patch or
accepting future code from you.

I'm failing to see why miscommunication should necessarily lead to such
statements, but it's your decision after all.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Ryu floating point output patch

2019-02-14 Thread Andrew Gierth
Andrew,

Is there any chance you can get me the regress/results/float[48].out
files from lorikeet and jacana? It would help a lot.

Seeing the diffs isn't enough, because I want to know if the float8 test
(which passes, so there's no diff) is matching the standard file or the
-small-is-zero file.

-- 
Andrew (irc:RhodiumToad)



Re: Per-tablespace autovacuum settings

2019-02-14 Thread Tom Lane
Oleksii Kliukin  writes:
> Is there any interest in making autovacuum parameters available on a
> tablespace level in order to apply those to all vacuumable objects in the
> tablespace?

I understand what you want to accomplish, and it doesn't seem
unreasonable.  But I just want to point out that the situation with
vacuuming parameters is on the edge of unintelligible already; adding
another scope might push it over the edge.  In particular there's no
principled way to decide whether an autovacuum parameter set at an outer
scope should override a plain-vacuum parameter set at a narrower scope.
And it's really questionable which of database-wide and tablespace-wide
should be seen as a narrower scope in the first place.

I don't know how to make this better, but I wish we'd take a step
back and think about it rather than just accreting more and more
complexity.

regards, tom lane



Re: Per-tablespace autovacuum settings

2019-02-14 Thread Andres Freund
Hi,

On 2019-02-14 17:56:17 +0100, Oleksii Kliukin wrote:
> Is there any interest in making autovacuum parameters available on a
> tablespace level in order to apply those to all vacuumable objects in the
> tablespace?
> 
> We have a set of tables running on ZFS, where autovacuum does almost no good
> to us (except for preventing anti-wraparound) due to the nature of ZFS (FS
> fragmentation caused by copy-on-write leads to sequential scans doing random
> access) and the fact that our tables there are append-only. Initially, the
> team in charge of the application just disabled autovacuum globally, but
> that lead to a huge system catalog bloat.
> 
> At present, we have to re-enable autovacuum globally and then disable it
> per-table using table storage parameters, but that is inelegant and requires
> doing it once for existing tables and modifying the script that periodically
> creates new ones (the whole system is a Postgres-based replacement of an
> ElasticSearch cluster and we have to create new partitions regularly).

Won't that a) lead to periodic massive anti-wraparound sessions? b)
prevent any use of index only scans?

ISTM you'd be better off running vacuum rarely, with large
thresholds. That way it'd do most of the writes in one pass, hopefully
leading to less fragmentation, and it'd set the visibilitymap bits to
prevent further need to touch those. By doing it only rarely, vacuum
should process pages sequentially, reducing the fragmentation.


> Grouping tables by tablespaces for the purpose of autovacuum configuration
> seems natural, as tablespaces are often placed on another filesystems/device
> that may require changing how often does autovacuum run, make it less/more
> aggressive depending on the I/O performance or require disabling it
> altogether as in my example above. Furthermore, given that we allow
> cost-based options per-tablespace the infrastructure is already there and
> the task is mostly to teach autovacuum to look at tablespaces in addition to
> the relation storage options (in case of a conflict, relation options should
> always take priority).

While I don't buy the reasoning above, I think this'd be useful for
other cases.

Greetings,

Andres Freund



Per-tablespace autovacuum settings

2019-02-14 Thread Oleksii Kliukin
Hello,

Is there any interest in making autovacuum parameters available on a
tablespace level in order to apply those to all vacuumable objects in the
tablespace?

We have a set of tables running on ZFS, where autovacuum does almost no good
to us (except for preventing anti-wraparound) due to the nature of ZFS (FS
fragmentation caused by copy-on-write leads to sequential scans doing random
access) and the fact that our tables there are append-only. Initially, the
team in charge of the application just disabled autovacuum globally, but
that led to a huge system catalog bloat.

At present, we have to re-enable autovacuum globally and then disable it
per-table using table storage parameters, but that is inelegant and requires
doing it once for existing tables and modifying the script that periodically
creates new ones (the whole system is a Postgres-based replacement of an
ElasticSearch cluster and we have to create new partitions regularly).

Grouping tables by tablespaces for the purpose of autovacuum configuration
seems natural, as tablespaces are often placed on other filesystems/devices
that may require changing how often autovacuum runs, making it less/more
aggressive depending on the I/O performance, or disabling it altogether as
in my example above. Furthermore, given that we already allow cost-based
options per tablespace, the infrastructure is already there and the task is
mostly to teach autovacuum to look at tablespaces in addition to the
relation storage options (in case of a conflict, relation options should
always take priority).


Regards,
Oleksii Kliukin




Re: libpq compression

2019-02-14 Thread Dmitry Dolgov
For the record, I'm really afraid of interfering with the conversation at this
point, but I believe it's necessary for the sake of a good feature :)

> On Wed, Feb 13, 2019 at 4:03 PM Konstantin Knizhnik 
>  wrote:
>
> 1. When decompressor has not enough data to produce any extra output, it
> doesn't return error. It just not moving forward output position in the
> buffer.

I'm confused, because here is what I see in zlib:

// zlib.h
inflate() returns Z_OK if some progress has been made, ... , Z_BUF_ERROR
if no progress was possible or if there was not enough room in the output
buffer when Z_FINISH is used.  Note that Z_BUF_ERROR is not fatal, and
inflate() can be called again with more input and more output space to
continue decompressing.

So, sounds like if no moving forward happened, there should be Z_BUF_ERROR. But
I haven't experimented with this yet to figure out.

> Ok, but still I think that it is better to pass tx/rx function to stream.
> There are two important advantages:
> 1. It eliminates code duplication.

Ok.

> 2. It allows to use (in future) this streaming compression not only for
> libpq for for other streaming data.

Can you elaborate on this please?

> Concerning "layering violation" may be it is better to introduce some other
> functions something like inflate_read, deflate_write and call them instead of
> secure_read. But from my point of view it will not improve readability and
> modularity of code.

If we unwrap the current compression logic so that it does not contain tx/rx
functions, isn't it going to be the same as what you are describing anyway,
just from a higher point of view? What I'm saying is that there is compression
logic, and some data would be read or written for it, just not right here and
now by the compression code itself, but rather by the already existing
machinery (which could even be beneficial for the patch implementation).

> And I do not see any disadvantages.

The main disadvantage, as I see it, is that there is no agreement about this
approach. Probably in such situations it makes sense to experiment with
different suggestions, to see how they would look - let's be flexible.

> On Wed, Feb 13, 2019 at 8:34 PM Robbie Harwood  wrote:
>
> In order to comply with your evident desires, consider this message a
> courtesy notice that I will no longer be reviewing this patch or
> accepting future code from you.

I'm failing to see why miscommunication should necessarily lead to such
statements, but it's your decision after all.



Inconsistencies between dependency.c and objectaddress.c

2019-02-14 Thread Tom Lane
In <26527.1549572...@sss.pgh.pa.us> I speculated about adding a
function to objectaddress.c that would probe to see if an object
with a given ObjectAddress (still) exists.  I started to implement
this, but soon noticed that objectaddress.c doesn't cover all the
object classes that dependency.c knows.  This seems bad; is there
a reason for it?  The omitted object classes are

AccessMethodOperatorRelationId
AccessMethodProcedureRelationId
AttrDefaultRelationId
DefaultAclRelationId
PublicationRelRelationId
UserMappingRelationId

What's potentially a lot worse, the two subsystems do not agree
as to the object class OID to be used for large objects:
dependency.c has LargeObjectRelationId but what's in objectaddress.c
is LargeObjectMetadataRelationId.  How did we get to that, and why
isn't it causing serious problems?

regards, tom lane



Re: pg11.1: dsa_area could not attach to segment

2019-02-14 Thread Sergei Kornilov
Hi

> Do you think that plausibly explains and resolves symptoms of bug#15585, too?

I think yes. Bug #15585 arose only after "dsa_area could not attach to segment"
in a different parallel worker. The leader is stuck waiting for all parallel
workers, but one worker has unexpected recursion in dsm_backend_shutdown [1]
and will never shut down. The backtrace shows a previous error in this backend:
"cannot unpin a segment that is not pinned" - the root cause is earlier and in
a different process.

* 
https://www.postgresql.org/message-id/70942611548327380%40myt6-7734411c649e.qloud-c.yandex.net

regards, Sergei



Re: WAL insert delay settings

2019-02-14 Thread Andres Freund
On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> Greetings,
> 
> On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
> 
> > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > But if you add extra sleep() calls somewhere (say because there's also
> > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > general. Yet it'll continue with the cost-based throttling, but it will
> > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > limit drops to 4 MB/s.
> >
> > I think at a first approximation, you probably don't want to add WAL
> > delays to vacuum jobs, since they are already slowed down, so the rate
> > of WAL they produce might not be your first problem.  The problem is
> > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> >
> > That leads to an alternative idea of expanding the existing cost-based
> > vacuum delay system to other commands.
> >
> > We could even enhance the cost system by taking WAL into account as an
> > additional factor.
> 
> 
> This is really what I was thinking- let’s not have multiple independent
> ways of slowing down maintenance and similar jobs to reduce their impact on
> I/o to the heap and to WAL.

I think that's a bad idea, both because the current vacuum code is
*terrible* if you desire higher rates (since CPU and IO time aren't
taken into account), and because it's extremely hard to control.  And it
seems entirely valuable to be able to limit the amount of WAL generated
for replication, but still try to get the rest of the work done as
quickly as reasonably possible wrt local IO.

Greetings,

Andres Freund



Re: pg11.1: dsa_area could not attach to segment

2019-02-14 Thread Justin Pryzby
On Fri, Feb 15, 2019 at 01:12:35AM +1300, Thomas Munro wrote:
> The problem is that a DSM handle (ie a random number) can be reused
> for a new segment immediately after the shared memory object has been
> destroyed but before the DSM slot has been released.  Now two DSM
> slots have the same handle, and dsm_attach() can be confused by the
> old segment and give up.
> 
> Here's a draft patch to fix that.  It also clears the handle in a case
> where it wasn't previously cleared, but that wasn't strictly
> necessary.  It just made debugging less confusing.

Thanks.

Do you think that plausibly explains and resolves symptoms of bug#15585, too?

Justin



Re: WAL insert delay settings

2019-02-14 Thread Andres Freund
Hi,

On 2019-02-14 16:16:05 +0100, Peter Eisentraut wrote:
> On 13/02/2019 16:40, Andres Freund wrote:
> > On February 13, 2019 4:39:21 PM GMT+01:00, Peter Eisentraut 
> >  wrote:
> >> On 13/02/2019 13:18, Andres Freund wrote:
> >>> But I don't think the way you did it is acceptable - we can't just
> >> delay while holding buffer locks, in critical sections, while not
> >> interruptible.
> >>
> >> The code I added to XLogInsertRecord() is not inside the critical
> >> section.
> > 
> > Most callers do xlog insertions inside crit sections though.
> 
> Is it a problem that pg_usleep(CommitDelay) is inside a critical section?

Well: We can't make things sleep for considerable time while holding
crucial locks and not make such sleeps interruptible. And holding
lwlocks will make it noninterruptible (but still it could throw an
error), but with crit sections, we can't even error out if we somehow
got that error.

Consider throttled code writing to a popular index or btree page - they'd
suddenly be stalled and everyone else would also queue up in an
uninterruptible manner via lwlocks. You'd throttle the whole system.

Greetings,

Andres Freund



Re: Early WIP/PoC for inlining CTEs

2019-02-14 Thread Alvaro Herrera
On 2019-Feb-14, Peter Eisentraut wrote:

> On 14/02/2019 16:11, Tom Lane wrote:
> > ... so, have we beaten this topic to death yet?  Can we make a decision?
> > 
> > Personally, I'd be happy with either of the last two patch versions
> > I posted (that is, either AS [[NOT] MATERIALIZED] or
> > AS [MATERIALIZE [ON|OFF]] syntax).  But we gotta pick something.
> 
> If we're not really planning to add any more options, I'd register a
> light vote for MATERIALIZED.  It reads easier, seems more grammatically
> correct, and uses an existing word.

+1 for MATERIALIZED, as I proposed in
https://postgr.es/m/20170503173305.fetj4tz7kd56tjlr@alvherre.pgsql

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-14 Thread John Naylor
I wrote:

> On 2/8/19, Tom Lane  wrote:
>> A script such as you suggest might be a good way to reduce the temptation
>> to get lazy at the last minute.  Now that the catalog data is pretty
>> machine-readable, I suspect it wouldn't be very hard --- though I'm
>> not volunteering either.  I'm envisioning something simple like "renumber
>> all OIDs in range - into range -", perhaps with the
>> ability to skip any already-used OIDs in the target range.
>
> This might be something that can be done inside reformat_dat_files.pl.
> It's a little outside it's scope, but better than the alternatives.

Along those lines, here's a draft patch to do just that. It handles
array type oids as well. Run it like this:

perl  reformat_dat_file.pl  --map-from 9000  --map-to 2000  *.dat

There is some attempt at documentation. So far it doesn't map by
default, but that could be changed if we agreed on the convention of
9000 or whatever.

-- 
John Naylor    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index 3c5830bc8a..27b9b52e33 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -392,6 +392,22 @@
 at compile time.)

 
+   
+For a variety of reasons, newly assigned OIDs should occupy the lower
+end of the reserved range.  Because multiple patches considered for
+inclusion into PostgreSQL could be using
+the same new OIDs, there is the possibility of conflict.  To mitigate
+this, there is a convention that OIDs 9000 and over are reserved for
+development.  This reduces the effort in rebasing over the HEAD branch.
+To enable committers to move these OIDs to the lower range easily,
+reformat_dat_file.pl can map OIDs as in the
+following, which maps any OIDs 9000 and over to unused OIDs starting
+at 2000:
+
+$ perl  reformat_dat_file.pl  --map-from 9000  --map-to 2000  *.dat
+
+   
+

 The OID counter starts at 1 at the beginning of a bootstrap run.
 If a row from a source other than postgres.bki
diff --git a/src/include/catalog/Makefile b/src/include/catalog/Makefile
index ae797d2465..33fbcf7677 100644
--- a/src/include/catalog/Makefile
+++ b/src/include/catalog/Makefile
@@ -13,19 +13,16 @@ subdir = src/include/catalog
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-# location of Catalog.pm
-catalogdir = $(top_srcdir)/src/backend/catalog
-
 # 'make reformat-dat-files' is a convenience target for rewriting the
 # catalog data files in our standard format.  This includes collapsing
 # out any entries that are redundant with a BKI_DEFAULT annotation.
 reformat-dat-files:
-	$(PERL) -I $(catalogdir) $(srcdir)/reformat_dat_file.pl -o $(srcdir) $(srcdir)/pg_*.dat
+	$(PERL) $(srcdir)/reformat_dat_file.pl --output $(srcdir) $(srcdir)/pg_*.dat
 
 # 'make expand-dat-files' is a convenience target for expanding out all
 # default values in the catalog data files.  This should be run before
 # altering or removing any BKI_DEFAULT annotation.
 expand-dat-files:
-	$(PERL) -I $(catalogdir) $(srcdir)/reformat_dat_file.pl -o $(srcdir) $(srcdir)/pg_*.dat --full-tuples
+	$(PERL) $(srcdir)/reformat_dat_file.pl --output $(srcdir) $(srcdir)/pg_*.dat --full-tuples
 
 .PHONY: reformat-dat-files expand-dat-files
diff --git a/src/include/catalog/reformat_dat_file.pl b/src/include/catalog/reformat_dat_file.pl
index 4835c2e41b..05c157ed48 100755
--- a/src/include/catalog/reformat_dat_file.pl
+++ b/src/include/catalog/reformat_dat_file.pl
@@ -19,10 +19,14 @@
 
 use strict;
 use warnings;
+use Getopt::Long;
+
+# Must run in src/include/catalog
+use FindBin;
+chdir $FindBin::RealBin or die "could not cd to $FindBin::RealBin: $!\n";
 
 # If you copy this script to somewhere other than src/include/catalog,
 # you'll need to modify this "use lib" or provide a suitable -I switch.
-use FindBin;
 use lib "$FindBin::RealBin/../../backend/catalog/";
 use Catalog;
 
@@ -34,35 +38,27 @@ use Catalog;
 my @METADATA =
   ('oid', 'oid_symbol', 'array_type_oid', 'descr', 'autogenerated');
 
-my @input_files;
+my $FirstGenbkiObjectId =
+  Catalog::FindDefinedSymbol('access/transam.h', '..',
+	'FirstGenbkiObjectId');
+
 my $output_path = '';
 my $full_tuples = 0;
 
-# Process command line switches.
-while (@ARGV)
-{
-	my $arg = shift @ARGV;
-	if ($arg !~ /^-/)
-	{
-		push @input_files, $arg;
-	}
-	elsif ($arg =~ /^-o/)
-	{
-		$output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
-	}
-	elsif ($arg eq '--full-tuples')
-	{
-		$full_tuples = 1;
-	}
-	else
-	{
-		usage();
-	}
-}
+# Don't map any OIDs by default
+my $min_reserved_oid = $FirstGenbkiObjectId;
+my $first_target_oid = 1;
+
+GetOptions(
+	'output:s'=> \$output_path,
+	'full-tuples' => \$full_tuples,
+	'map-from:i'  => \$min_reserved_oid,
+	'map-to:i'=> \$first_target_oid) || usage();
 
 # Sanity check arguments.
-die "No input files.\n"
-  if !@input_files;
+die 

Re: WAL insert delay settings

2019-02-14 Thread Stephen Frost
Greetings,

On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 14/02/2019 11:03, Tomas Vondra wrote:
> > But if you add extra sleep() calls somewhere (say because there's also
> > limit on WAL throughput), it will affect how fast VACUUM works in
> > general. Yet it'll continue with the cost-based throttling, but it will
> > never reach the limits. Say you do another 20ms sleep somewhere.
> > Suddenly it means it only does 25 rounds/second, and the actual write
> > limit drops to 4 MB/s.
>
> I think at a first approximation, you probably don't want to add WAL
> delays to vacuum jobs, since they are already slowed down, so the rate
> of WAL they produce might not be your first problem.  The problem is
> more things like CREATE INDEX CONCURRENTLY that run at full speed.
>
> That leads to an alternative idea of expanding the existing cost-based
> vacuum delay system to other commands.
>
> We could even enhance the cost system by taking WAL into account as an
> additional factor.


This is really what I was thinking - let’s not have multiple independent
ways of slowing down maintenance and similar jobs to reduce their impact on
I/O to the heap and to WAL.

Thanks!

Stephen


Re: Protect syscache from bloating with negative cache entries

2019-02-14 Thread 'Bruce Momjian'
On Thu, Feb 14, 2019 at 01:31:49AM +, Tsunakawa, Takayuki wrote:
> From: Bruce Momjian [mailto:br...@momjian.us]
> > > That being said, having a "minimal size" threshold before starting
> > > with the time-based eviction may be a good idea.
> >
> > Agreed.  I see the minimal size as a way to keep the systems tables
> > in cache, which we know we will need for the next query.
>
> Isn't it the maximum size, not minimal size?  Maximum size allows
> to keep desired amount of system tables in memory as well as to
> control memory consumption to avoid out-of-memory errors (OS crash!).
> I'm wondering why people want to take a different approach to
> catcache, which is unlike other PostgreSQL memory e.g. shared_buffers,
> temp_buffers, SLRU buffers, work_mem, and other DBMSs.

Well, that is an _excellent_ question, and one I had to think about.

I think, in general, smaller is better, as long as making something
smaller doesn't remove data that is frequently accessed.  Having a timer
to expire only old entries seems like it accomplishes this goal.

Having a minimum size and not taking it to zero size makes sense if we
know we will need certain entries like pg_class in the next query. 
However, if the session is idle for hours, we should just probably
remove everything, so maybe the minimum doesn't make sense --- just
remove everything.
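
To illustrate the kind of clock-based pruning being discussed (names and
structure invented for this sketch; they do not match the actual catcache
code or the patch):

#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

typedef struct CacheEntry
{
	struct CacheEntry *next;
	time_t		last_access;	/* updated on every cache hit */
	bool		negative;		/* a "does not exist" entry */
	/* ... key and tuple data ... */
} CacheEntry;

/*
 * Drop entries that have not been touched for 'ttl' seconds, but keep at
 * least 'min_entries' around so the next query does not start cold.
 */
int
prune_cache(CacheEntry **head, int nentries, int min_entries, int ttl)
{
	time_t		now = time(NULL);
	int			removed = 0;
	CacheEntry **prev = head;

	while (*prev != NULL && nentries - removed > min_entries)
	{
		CacheEntry *e = *prev;

		if (now - e->last_access > ttl)
		{
			*prev = e->next;	/* unlink and free the stale entry */
			free(e);
			removed++;
		}
		else
			prev = &e->next;
	}
	return removed;
}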

As for why we don't do this with everything --- we can't do it with
shared_buffers since we can't change its size while the server is
running.  For work_mem, we assume all the work_mem data is for the
current query, and therefore frequently accessed.  Also, work_mem is not
memory we can just free if it is not used since it contains intermediate
results required by the current query.  I think temp_buffers, since it
can be resized in the session, actually could use a similar minimizing
feature, though that would mean it behaves slightly differently from
shared_buffers, and it might not be worth it.  Also, I assume the value
of temp_buffers was mostly for use by the current query --- yes, it can
be used for cross-query caching, but I am not sure if that is its
primary purpose.  I thought its goal was to prevent shared_buffers from
being populated with temporary per-session buffers.

I don't think other DBMSs are a good model since they have a reputation
for requiring a lot of tuning --- tuning that we have often automated.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: ALTER TABLE on system catalogs

2019-02-14 Thread Peter Eisentraut
On 08/02/2019 04:04, Kyotaro HORIGUCHI wrote:
> By the way, I'm confused to see that attributes that don't want
> to go external are marked as 'x' in system catalogs. Currently
> (putting aside its necessity) the following operation ends with
> successful attaching a new TOAST relation, which we really don't
> want.
> 
> ALTER TABLE pg_attribute ALTER COLUMN attrelid SET STORAGE plain;
> 
> Might be silly, but couldn't we have another storage class? Say,
> Compression, which means try compress but don't go external.

That already exists: 'm': Value can be stored compressed inline

I agree that it seems we should be using that for those tables that
don't have a toast table.  Maybe the genbki stuff could do it
automatically for the appropriate catalogs.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Early WIP/PoC for inlining CTEs

2019-02-14 Thread Peter Eisentraut
On 06/02/2019 10:00, Bruce Momjian wrote:
> I think "materialize" is the right word since "materialized" would be
> past tense.

It's an adjective.  The sentence is, "with foo as the materialized
$query, do the $main_query".

It's the same as "materialized view".

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: COPY support for parameters

2019-02-14 Thread Tom Lane
Adrian Phinney  writes:
> Does Postgres support COPY with parameters?

No.  In general you can only use parameters in DML statements
(SELECT/INSERT/UPDATE/DELETE); utility statements don't cope,
mainly because most of them lack expression eval capability
altogether.

Perhaps the special case of COPY from a SELECT could be made
to allow parameters inside the SELECT, but I don't think
anyone has tried.
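
For illustration (a hedged sketch with error handling trimmed; it assumes a
table t and a local database to connect to): PQexecParams happily takes $1
in plain DML, while the same parameter inside a COPY is rejected today.

#include <libpq-fe.h>
#include <stdio.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	const char *values[1] = {"42"};
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}

	/* Parameters work in DML ... */
	res = PQexecParams(conn, "SELECT * FROM t WHERE id = $1",
			   1, NULL, values, NULL, NULL, 0);
	printf("SELECT: %s\n", PQresStatus(PQresultStatus(res)));
	PQclear(res);

	/* ... but not in a utility statement: this COPY fails (the exact
	 * error text depends on the server version). */
	res = PQexecParams(conn, "COPY (SELECT * FROM t WHERE id = $1) TO STDOUT",
			   1, NULL, values, NULL, NULL, 0);
	printf("COPY: %s\n", PQresStatus(PQresultStatus(res)));
	PQclear(res);

	PQfinish(conn);
	return 0;
}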

regards, tom lane



Re: WAL insert delay settings

2019-02-14 Thread Peter Eisentraut
On 14/02/2019 11:03, Tomas Vondra wrote:
> But if you add extra sleep() calls somewhere (say because there's also
> limit on WAL throughput), it will affect how fast VACUUM works in
> general. Yet it'll continue with the cost-based throttling, but it will
> never reach the limits. Say you do another 20ms sleep somewhere.
> Suddenly it means it only does 25 rounds/second, and the actual write
> limit drops to 4 MB/s.

I think at a first approximation, you probably don't want to add WAL
delays to vacuum jobs, since they are already slowed down, so the rate
of WAL they produce might not be your first problem.  The problem is
more things like CREATE INDEX CONCURRENTLY that run at full speed.

That leads to an alternative idea of expanding the existing cost-based
vacuum delay system to other commands.

We could even enhance the cost system by taking WAL into account as an
additional factor.
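
A rough sketch of what folding WAL into that cost accounting could look
like (invented parameter names and numbers, not proposed GUC values):

#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Invented cost parameters, in the spirit of vacuum_cost_* */
int			cost_limit = 200;	/* budget to burn before sleeping */
int			cost_delay_ms = 20; /* how long to sleep once over budget */
int			cost_page_hit = 1;
int			cost_page_miss = 10;
int			cost_wal_kb = 2;	/* new factor: cost per kB of WAL written */

int			cost_balance = 0;

/* Charge work done; called from buffer access and WAL insertion paths. */
void
charge_cost(int pages_hit, int pages_missed, int wal_kb)
{
	cost_balance += pages_hit * cost_page_hit
		+ pages_missed * cost_page_miss
		+ wal_kb * cost_wal_kb;
}

/* Sleep at safe points only - never while holding buffer locks or inside a critical section. */
void
maybe_sleep(void)
{
	if (cost_balance >= cost_limit)
	{
		struct timespec ts = {cost_delay_ms / 1000,
			(cost_delay_ms % 1000) * 1000000L};

		nanosleep(&ts, NULL);
		cost_balance = 0;
	}
}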

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Ryu floating point output patch

2019-02-14 Thread Andrew Dunstan


On 2/13/19 12:09 PM, Andrew Gierth wrote:
>> "Andrew" == Andrew Gierth  writes:
>  Andrew> 2. How far do we need to go to support existing uses of
>  Andrew>extra_float_digits? For example, do we need a way for clients to
>  Andrew>request that they get the exact same output as previously for
>  Andrew>extra_float_digits=2 or 3, rather than just assuming that any
>  Andrew>round-trip-exact value will do?
>
>  Andrew>   (this would likely mean adding a new GUC, rather than basing
>  Andrew>   everything off extra_float_digits as the patch does now)
>
> So it turns out, for reasons that should have been obvious in advance,
> that _of course_ we need this, because otherwise there's no way to do
> the cross-version upgrade regression check.
>
> So we need something like exact_float_output='on' (default) that can be
> selectively set to 'off' to restore the previous behavior of
> extra_float_digits.
>
> Thoughts?



I applied this:


diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 9edc7b9a02..a6b804ad2c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -1062,7 +1062,15 @@ setup_connection(Archive *AH, const char *dumpencoding,
 * Set extra_float_digits so that we can dump float data exactly (given
 * correctly implemented float I/O code, anyway)
 */
-   if (AH->remoteVersion >= 9)
+   if (getenv("PGDUMP_EXTRA_FLOAT_DIGITS"))
+   {
+   PQExpBuffer q = createPQExpBuffer();
+   appendPQExpBuffer(q, "SET extra_float_digits TO %s",
+ getenv("PGDUMP_EXTRA_FLOAT_DIGITS"));
+   ExecuteSqlStatement(AH, q->data);
+   destroyPQExpBuffer(q);
+   }
+   else if (AH->remoteVersion >= 9)
    ExecuteSqlStatement(AH, "SET extra_float_digits TO 3");
    else
    ExecuteSqlStatement(AH, "SET extra_float_digits TO 2");


Then I got the buildfarm cross-version-upgrade module to set the
environment value to 0 in the appropriate cases. All the tests passed,
as far back as 9.2, no new GUC needed.

We might not want to use that in more real-world cases of pg_dump use,
but I think for this purpose it should be fine.


cheers


andrew


-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Early WIP/PoC for inlining CTEs

2019-02-14 Thread Tom Lane
... so, have we beaten this topic to death yet?  Can we make a decision?

Personally, I'd be happy with either of the last two patch versions
I posted (that is, either AS [[NOT] MATERIALIZED] or
AS [MATERIALIZE [ON|OFF]] syntax).  But we gotta pick something.

regards, tom lane



Re: [WIP] CREATE SUBSCRIPTION with FOR TABLES clause (table filter)

2019-02-14 Thread Evgeniy Efimkin
Hi!
Thanks for the comments!
I fixed the build and restored the lines about the subscription apply process
in the docs.

 
Efimkin Evgeny
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 3f2f674a1a..5d211646bf 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -522,7 +522,8 @@
   
 
   
-   To create a subscription, the user must be a superuser.
+   To add tables to a subscription, the user must have ownership rights on the
+   table.
   
 
   
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4d3e..f0a368f90c 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -24,6 +24,9 @@ PostgreSQL documentation
 ALTER SUBSCRIPTION name CONNECTION 'conninfo'
 ALTER SUBSCRIPTION name SET PUBLICATION publication_name [, ...] [ WITH ( set_publication_option [= value] [, ... ] ) ]
 ALTER SUBSCRIPTION name REFRESH PUBLICATION [ WITH ( refresh_option [= value] [, ... ] ) ]
+ALTER SUBSCRIPTION name ADD TABLE table_name [, ...]
+ALTER SUBSCRIPTION name SET TABLE table_name [, ...]
+ALTER SUBSCRIPTION name DROP TABLE table_name [, ...]
 ALTER SUBSCRIPTION name ENABLE
 ALTER SUBSCRIPTION name DISABLE
 ALTER SUBSCRIPTION name SET ( subscription_parameter [= value] [, ... ] )
@@ -44,9 +47,7 @@ ALTER SUBSCRIPTION name RENAME TO <
   
You must own the subscription to use ALTER SUBSCRIPTION.
To alter the owner, you must also be a direct or indirect member of the
-   new owning role. The new owner has to be a superuser.
-   (Currently, all subscription owners must be superusers, so the owner checks
-   will be bypassed in practice.  But this might change in the future.)
+   new owning role.
   
  
 
@@ -137,6 +138,35 @@ ALTER SUBSCRIPTION name RENAME TO <
 

 
+   
+ADD TABLE table_name
+
+ 
+  The ADD TABLE clause will add new tables to the subscription; the tables
+  must be present in the publication.
+ 
+
+   
+
+   
+SET TABLE table_name
+
+ 
+  The SET TABLE clause will replace the list of tables in
+  the subscription with the specified one.
+ 
+
+   
+
+   
+DROP TABLE table_name
+
+ 
+   The DROP TABLE clause will remove tables from the subscription.
+ 
+
+   
+

 ENABLE
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..511fec53a1 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -24,6 +24,7 @@ PostgreSQL documentation
 CREATE SUBSCRIPTION subscription_name
 CONNECTION 'conninfo'
 PUBLICATION publication_name [, ...]
+[ FOR TABLE table_name [, ...]
 [ WITH ( subscription_parameter [= value] [, ... ] ) ]
 
  
@@ -88,6 +89,16 @@ CREATE SUBSCRIPTION subscription_name

 
+   
+FOR TABLE
+
+ 
+  Specifies a list of tables to add to the subscription. All tables listed in a clause
+  must be present in the publication.
+ 
+
+   
+

 WITH ( subscription_parameter [= value] [, ... ] )
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..c23d6eb0bd 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,27 @@ CREATE VIEW pg_stat_progress_vacuum AS
 FROM pg_stat_get_progress_info('VACUUM') AS S
 		LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_user_subscriptions AS
+SELECT
+S.oid,
+		S.subdbid,
+S.subnameAS subname,
+CASE WHEN S.subowner = 0 THEN
+'public'
+ELSE
+A.rolname
+END AS usename,
+		S.subenabled,
+CASE WHEN (S.subowner <> 0 AND A.rolname = current_user)
+OR (SELECT rolsuper FROM pg_authid WHERE rolname = current_user)
+THEN S.subconninfo
+ ELSE NULL END AS subconninfo,
+		S.subslotname,
+		S.subsynccommit,
+		S.subpublications
+FROM pg_subscription S
+LEFT JOIN pg_authid A ON (A.oid = S.subowner);
+
 CREATE VIEW pg_user_mappings AS
 SELECT
 U.oid   AS umid,
@@ -938,7 +959,8 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublications)
+GRANT SELECT (tableoid, oid, subdbid, subname,
+	subowner, subenabled, subslotname, subpublications, subsynccommit)
 ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index a60a15193a..57eadf6f83 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -30,6 +30,7 @@
 #include "catalog/pg_subscription.h"
 #include 

Re: Protect syscache from bloating with negative cache entries

2019-02-14 Thread Bruce Momjian
On Thu, Feb 14, 2019 at 12:40:10AM -0800, Andres Freund wrote:
> Hi,
> 
> On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
> > Instead, I added an accounting(?) interface function.
> > 
> > | MemoryContextGettConsumption(MemoryContext cxt);
> > 
> > The API returns the current consumption in this memory
> > context. This allows "real" memory accounting almost without
> > overhead.
> 
> That's definitely *NOT* almost without overhead. This adds additional
> instructions to one of postgres' hottest sets of code paths.
> 
> I think you're not working incrementally enough here. I strongly suggest
> solving the negative cache entry problem, and then incrementally go from
> there after that's committed. The likelihood of this patch ever getting
> merged otherwise seems extremely small.

Agreed --- the patch is going in the wrong direction.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



[Patch] checksumming-related buglets in pg_verify_checksums/pg_basebackup TAP tests

2019-02-14 Thread Michael Banck
Hi,

while hacking on pg_verify_checksums and looking at hexdumps, I noticed
that the TAP tests of pg_verify_checksums (and pg_basebackup, from which
it was copy-pasted) actually write "305c305c[...]" (i.e. literal
backslashes and '0' characters) instead of zero bytes into the to-be-
corrupted relfilenodes, due to wrong quoting (the \0 escapes are
single-quoted, so Perl does not interpolate them).

This also revealed a second bug in the pg_basebackup test suite where
the offset for the corruption in the second file was wrong, so it
actually never got corrupted, and the tests only passed because the
quoting bug above wrote twice as many bytes as expected. The write()
probably overflowed into an adjacent block so that the total number of
corrupted blocks was as expected by accident. Oops, my bad, patch
attached.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz
index 3e1c3863c4..33869fecc9 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -499,7 +499,7 @@ my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');
 system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
 open $file, '+<', "$pgdata/$file_corrupt1";
 seek($file, $pageheader_size, 0);
-syswrite($file, '\0\0\0\0\0\0\0\0\0');
+syswrite($file, "\0\0\0\0\0\0\0\0\0");
 close $file;
 system_or_bail 'pg_ctl', '-D', $pgdata, 'start';
 
@@ -518,7 +518,7 @@ for my $i (1 .. 5)
 {
 	my $offset = $pageheader_size + $i * $block_size;
 	seek($file, $offset, 0);
-	syswrite($file, '\0\0\0\0\0\0\0\0\0');
+	syswrite($file, "\0\0\0\0\0\0\0\0\0");
 }
 close $file;
 system_or_bail 'pg_ctl', '-D', $pgdata, 'start';
@@ -534,8 +534,8 @@ rmtree("$tempdir/backup_corrupt2");
 # induce corruption in a second file
 system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
 open $file, '+<', "$pgdata/$file_corrupt2";
-seek($file, 4000, 0);
-syswrite($file, '\0\0\0\0\0\0\0\0\0');
+seek($file, $pageheader_size, 0);
+syswrite($file, "\0\0\0\0\0\0\0\0\0");
 close $file;
 system_or_bail 'pg_ctl', '-D', $pgdata, 'start';
 
diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl
index 5250b5a728..74ad5ad723 100644
--- a/src/bin/pg_verify_checksums/t/002_actions.pl
+++ b/src/bin/pg_verify_checksums/t/002_actions.pl
@@ -45,7 +45,7 @@ sub check_relation_corruption
 	# Time to create some corruption
 	open my $file, '+<', "$pgdata/$file_corrupted";
 	seek($file, $pageheader_size, 0);
-	syswrite($file, '\0\0\0\0\0\0\0\0\0');
+	syswrite($file, "\0\0\0\0\0\0\0\0\0");
 	close $file;
 
 	# Checksum checks on single relfilenode fail


Re: pg11.1: dsa_area could not attach to segment

2019-02-14 Thread Sergei Kornilov
Hi!

Great work, thank you!

I cannot reproduce the bug after a 30-minute-long test. (Without the patch,
the bug appeared after a minute or two.)

regards Sergei



Re: [WIP] CREATE SUBSCRIPTION with FOR TABLES clause (table filter)

2019-02-14 Thread Andrey Borodin
Hi!

> On 11 Feb 2019, at 16:30, Evgeniy Efimkin  wrote:
> 
The patch seems good to me but cfbot is not happy. Can you please investigate 
what's wrong with this build?
https://travis-ci.org/postgresql-cfbot/postgresql/builds/492912868

Also, I'm not sure we should drop these lines from the docs:

>   The subscription apply process will run in the local database with the
>  privileges of a superuser.

Thanks!

Best regards, Andrey Borodin.


Re: ALTER TABLE on system catalogs

2019-02-14 Thread Chris Travers
I have a couple of thoughts here.

On Fri, Feb 8, 2019 at 4:35 AM Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> At Fri, 08 Feb 2019 12:03:31 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI  wrote in <
> 20190208.120331.167280496.horiguchi.kyot...@lab.ntt.co.jp>
> > By the way, I'm confused to see that attributes that don't want
> > to go external are marked as 'x' in system catalogs. Currently
> > (putting aside its necessity) the following operation ends with
> > successful attaching a new TOAST relation, which we really don't
> > want.
> >
> > ALTER TABLE pg_attribute ALTER COLUMN attrelid SET STORAGE plain;
> >
> > Might be silly, we may need another storage class, say,
> > Compression, which means try compress but don't go external.
> >
> > The attached patch does that.
> >
> > - All varlen fields of pg_class and pg_attribute are marked as
> >   'c'.  (Catalog.pm, genbki.pl, genbki.h, pg_attribute/class.h)
> >
> > - Try compress but don't try external for 'c' storage.
> >   (tuptoaster.c, toasting.c)
> >
> >
> > We could remove toast relation when no toastable attribute
> > remains after ATLER TABLE ALTER COLUMN SET STOAGE, but it doesn't
> > seem that simple. So the storage class is for internal use only.
>

I guess there is a serious question as to why this is for internal use only.
Are there any particular cases where setting this would be problematic for
those who know what they are doing?  If not, maybe it should be included
in the docs?

>
> I found that "ALTER TABLE.. SET STORAGE plain" doesn't remove
> toast relation, so it's not so bad even if compress does the
> same?
>
> The attached 0001 is fixed from the previous version. 0002 is the
> syntax.
>

I agree that we need to support setting options on tables in system
catalogs.  We have some cases where we want to disable autovacuum on some
table spaces, but set it aggressively on a couple of system catalogs.
Currently this is not really doable in any sane way.

I also think that if the current catalogs violate expectations regarding
precondition checks, this needs to be corrected rather than special-cased,
and this seems to be the best way forward.

>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: Cache relation sizes?

2019-02-14 Thread Kyotaro HORIGUCHI
On Thu, Feb 14, 2019 at 20:41, Kyotaro HORIGUCHI (horiguchi.kyot...@lab.ntt.co.jp) wrote:

> At Wed, 13 Feb 2019 05:48:28 +, "Jamison, Kirk" <
> k.jami...@jp.fujitsu.com> wrote in
> 
> > On February 6, 2019, 8:57 AM +, Andres Freund wrote:
> > > Maybe I'm missing something here, but why is it actually necessary to
> > > have the sizes in shared memory, if we're just talking about caching
> > > sizes?  It's pretty darn cheap to determine the filesize of a file that
> > > has been recently stat()/lseek()/ed, and introducing per-file shared
> > > data adds *substantial* complexity, because the amount of shared memory
> > > needs to be pre-determined.  The reason I want to put per-relation data
> > > into shared memory is different, it's about putting the buffer mapping
> > > into shared memory, and that, as a prerequisite, also need per-relation
> > > data. And there's a limit of the number of relations that need to be
> > > open (one per cached page at max), and space can be freed by evicting
> > > pages.
> >
> > Ahh.. You are right about the logic of putting it in the shared memory.
> > As for Thomas' toy patch, multiple files share one counter in shmem.
> > Although it currently works, it might likely to miss.
> > Though his eventual plan of the idea is to use an array of N counters
> > and map relation OIDs onto them.
> > But as your point about complexity says, in shared memory we cannot
> > share the same area with multiple files, so that needs an area to
> > allocate depending on the number of files.
> >
> > Regarding the allocation of per-relation data in shared memory, I
> > thought it can be a separated component at first so I asked for
> > validity of the idea. But now I consider the point raised.
>
> I still believe that one shared memory element for every
> non-mapped relation is not only too complex but also too much, as
> Andres (and implicitly I) wrote. I feel that just one flag for
> all relations would work fine, but partitioned flags (that is, relations
> or files that correspond to the same hash value share one flag) can
> reduce the shared memory elements to a fixed small number.
>
> Note: I'm still not sure how much lseek impacts performance.
>

Of course all the "flag"s above actually are "update counter"s :)

>
-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: pg_basebackup ignores the existing data directory permissions

2019-02-14 Thread Haribabu Kommi
On Thu, Feb 14, 2019 at 8:57 PM Magnus Hagander  wrote:

> On Thu, Feb 14, 2019 at 9:10 AM Michael Paquier 
> wrote:
>
>> On Thu, Feb 14, 2019 at 06:34:07PM +1100, Haribabu Kommi wrote:
>> > we have an application that is used to create the data directory with
>>
>> Well, initdb would do that happily, so there is no actual need to
>> do that to begin with.  Anyway..
>>
>> > owner access (0700), but with initdb group permissions option, it
>> > automatically
>> > converts to (0750) by the initdb. But pg_basebackup doesn't change it
>> when
>> > it tries to do a backup from a group access server.
>>
>> So that's basically the opposite of the case I was thinking about,
>> where you create a path for a base backup with permissions strictly
>> higher than 700, say 755, and the base backup path does not have
>> enough restrictions.  And in your case the permissions are too
>> restrictive because of the application creating the folder itself but
>> they should be relaxed if group access is enabled.  Actually, that's
>> something that we may want to do consistently across all branches.  If
>> an application calls pg_basebackup after creating a path, they most
>> likely change the permissions anyway to allow the postmaster to
>> start.
>>
>
> I think it could be argued that neither initdb *nor* pg_basebackup should
> change the permissions on an existing directory, because the admin may have
> done that intentionally. But when they do create the directory, they should
> follow the same patterns.
>

Hmm, even if the administrator sets specific permissions on the data
directory, the PostgreSQL server doesn't allow the server to start unless
the permissions are 0700 for versions older than 11, or 0700/0750 for
version 11 or later.

To be able to use the PostgreSQL server, the user must change the
permissions of the data directory anyway.  So I don't see a problem with
these tools changing the permissions.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: pg11.1: dsa_area could not attach to segment

2019-02-14 Thread Thomas Munro
On Tue, Feb 12, 2019 at 10:15 PM Sergei Kornilov  wrote:
> I still have error with parallel_leader_participation = off.

Justin very kindly set up a virtual machine similar to the one where
he'd seen the problem so I could experiment with it.  Eventually I
also managed to reproduce it locally, and have finally understood the
problem.

It doesn't happen on master (hence some of my initial struggle to
reproduce it) because of commit 197e4af9, which added srandom() to set
a different seed for each parallel worker.  Perhaps you see where
this is going already...

The problem is that a DSM handle (ie a random number) can be reused
for a new segment immediately after the shared memory object has been
destroyed but before the DSM slot has been released.  Now two DSM
slots have the same handle, and dsm_attach() can be confused by the
old segment and give up.

Here's a draft patch to fix that.  It also clears the handle in a case
where it wasn't previously cleared, but that wasn't strictly
necessary.  It just made debugging less confusing.


--
Thomas Munro
http://www.enterprisedb.com
From 5d94eac7d11e74c280aa32a98d14eaeca40679de Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 15 Feb 2019 00:50:52 +1300
Subject: [PATCH] Fix race in dsm_attach() when handles are recycled.

DSM handles can be recycled as soon as the underlying shared memory
object has been destroyed.  That means that for a brief moment we
might have two DSM slots with the same handle.  While trying to attach,
if we encounter a slot with refcnt == 1, meaning that it is being
destroyed, we should continue our search in case the same handle
exists in another slot.

The failure was more likely in the back branches, due to the lack
of distinct seed for random(), so that handle reuse is more likely.

In passing, clear the handle when dsm_unpin_segment() frees a segment.
That wasn't known to be an active bug, but it was untidy.

Back-patch to 10.

Author: Thomas Munro
Reported-by: Justin Pryzby
Discussion: 20190207014719.GJ29720@telsasoft.com">https://postgr.es/m/20190207014719.GJ29720@telsasoft.com
---
 src/backend/storage/ipc/dsm.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index cab7ae74ca..b01aea9de9 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -580,10 +580,11 @@ dsm_attach(dsm_handle h)
 		/*
 		 * If the reference count is 1, the slot is still in use, but the
 		 * segment is in the process of going away.  Treat that as if we
-		 * didn't find a match.
+		 * didn't find a match.  Keep looking though; it's possible for the
+		 * handle to have been reused already.
 		 */
 		if (dsm_control->item[i].refcnt == 1)
-			break;
+			continue;
 
 		/* Otherwise we've found a match. */
 		dsm_control->item[i].refcnt++;
@@ -909,6 +910,7 @@ dsm_unpin_segment(dsm_handle handle)
 			Assert(dsm_control->item[control_slot].handle == handle);
 			Assert(dsm_control->item[control_slot].refcnt == 1);
 			dsm_control->item[control_slot].refcnt = 0;
+			dsm_control->item[control_slot].handle = DSM_HANDLE_INVALID;
 			LWLockRelease(DynamicSharedMemoryControlLock);
 		}
 	}
-- 
2.20.1



Re: [Suspect SPAM] Better error messages when lacking connection slots for autovacuum workers and bgworkers

2019-02-14 Thread Kyotaro HORIGUCHI
Hello.

At Wed, 13 Feb 2019 14:13:09 +0900, Michael Paquier  wrote 
in <20190213051309.gf5...@paquier.xyz>
> Hi all,
> 
> When lacking connection slots, the backend returns a simple "sorry,
> too many clients", which is something that has been tweaked by recent
> commit ea92368 for WAL senders.  However, the same message would show
> up for autovacuum workers and bgworkers.  Based on the way autovacuum
> workers are spawned by the launcher, and the way bgworkers are added,
> it seems that this cannot actually happen.  Still, in my opinion,
> providing more context can be helpful for people trying to work on
> features related to such code so as they can get more information than
> what would normally happen for normal backends.

I agree with the distinctive messages, but the autovacuum and
bgworker cases are a kind of internal error, and they are not
"connections".  I feel that elog is more suitable for them.

> I am wondering as well if this could not help for issues like this
> one, which has popped out today and makes me wonder if we don't have
> race conditions with the way dynamic bgworkers are spawned:
> https://www.postgresql.org/message-id/CAONUJSM5X259vAnnwSpqu=vnrecfgsj-cgrhys4p5yyrvwk...@mail.gmail.com
> 
> There are four code paths which report the original "sorry, too many
> clients", and the one of the patch attached can easily track the
> type of connection slot used which helps for better context messages
> when a process is initialized.
> Would that be helpful for at least debugging purposes?

I agree that it is helpful. Errors with different causes ought to
show distinctive error messages.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Cache relation sizes?

2019-02-14 Thread Kyotaro HORIGUCHI
At Wed, 13 Feb 2019 05:48:28 +, "Jamison, Kirk"  
wrote in 
> On February 6, 2019, 8:57 AM +, Andres Freund wrote:
> > Maybe I'm missing something here, but why is it actually necessary to
> > have the sizes in shared memory, if we're just talking about caching
> > sizes?  It's pretty darn cheap to determine the filesize of a file that
> > has been recently stat()/lseek()/ed, and introducing per-file shared
> > data adds *substantial* complexity, because the amount of shared memory
> > needs to be pre-determined.  The reason I want to put per-relation data
> > into shared memory is different, it's about putting the buffer mapping
> > into shared memory, and that, as a prerequisite, also need per-relation
> > data. And there's a limit of the number of relations that need to be
> > open (one per cached page at max), and space can be freed by evicting
> > pages.
> 
> Ahh.. You are right about the logic of putting it in the shared memory.
> As for Thomas' toy patch, multiple files share one counter in shmem.
> Although it currently works, it might likely to miss. 
> Though his eventual plan of the idea is to use an array of N counters
> and map relation OIDs onto them. 
> But as your point about complexity says, in shared memory we cannot
> share the same area with multiple files, so that needs an area to
> allocate depending on the number of files.
> 
> Regarding the allocation of per-relation data in shared memory, I
> thought it can be a separated component at first so I asked for
> validity of the idea. But now I consider the point raised.

I still believe that one shared memory element for every
non-mapped relation is not only too complex but also too much, as
Andres (and implicitly I) wrote. I feel that just one flag for
all relations would work fine, but partitioned flags (that is, relations
or files that correspond to the same hash value share one flag) can
reduce the shared memory elements to a fixed small number.

Note: I'm still not sure how much lseek impacts performance.
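For illustration, a rough sketch of that partitioned scheme (all names below
are made up; this is not taken from any posted patch): a small fixed array of
update counters in shared memory, where all relations hashing to the same
slot share one counter, so a backend only re-lseek()s when "its" counter has
moved since the size was last cached locally.

#include "postgres.h"
#include "port/atomics.h"

#define REL_SIZE_SLOTS 64               /* fixed, small shared array */

typedef struct RelSizeCounters
{
    pg_atomic_uint64 change_count[REL_SIZE_SLOTS];
} RelSizeCounters;

/* bumped from places that change a relation's length, e.g. smgrextend() */
static inline void
rel_size_bump(RelSizeCounters *shared, Oid relNode)
{
    pg_atomic_fetch_add_u64(&shared->change_count[relNode % REL_SIZE_SLOTS], 1);
}

/* a locally cached size is still usable if the slot's counter has not moved */
static inline bool
rel_size_cache_valid(RelSizeCounters *shared, Oid relNode, uint64 seen)
{
    return pg_atomic_read_u64(&shared->change_count[relNode % REL_SIZE_SLOTS]) == seen;
}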

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: [PROPOSAL]a new data type 'bytea' for ECPG

2019-02-14 Thread Matsumura, Ryo
Meskes-san

> Yes, I agree with this. But it does not explain why we cannot just add
> a length parameter. And it neither explains why we need so many if
> (!bytea) { thisandthat } else { somethingelse } blocks. I would prefer
> the integration to be smoother. Hopefully that is possible.

I agree that the special routes are ugly, but I cannot remove them
completely.  I will try to implement Idea-2.  At the same time, I will try to
move the if(bytea) blocks into a new function for readability.

e.g. move the following into a new function set_data_attr():

 if (var->type != ECPGt_bytea)
     desc_item->is_binary = false;
 else
 {
     struct ECPGgeneric_varchar *variable =
         (struct ECPGgeneric_varchar *) (var->value);

     desc_item->is_binary = true;
     desc_item->data_len = variable->len;
 }
 ecpg_free(desc_item->data);
 desc_item->data = (char *) tobeinserted;
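A sketch of what that helper might look like (the signature is only a guess;
the field names follow the fragment above from the work-in-progress bytea
patch, so this is not meant as the final code):

static void
set_data_attr(struct descriptor_item *desc_item, struct variable *var,
              char *tobeinserted)
{
    if (var->type != ECPGt_bytea)
        desc_item->is_binary = false;
    else
    {
        struct ECPGgeneric_varchar *variable =
            (struct ECPGgeneric_varchar *) (var->value);

        desc_item->is_binary = true;
        desc_item->data_len = variable->len;
    }

    /* take ownership of the prepared buffer, as in the fragment above */
    ecpg_free(desc_item->data);
    desc_item->data = (char *) tobeinserted;
}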

Regards
Ryo Matsumura

> -Original Message-
> From: Michael Meskes [mailto:mes...@postgresql.org]
> Sent: Wednesday, February 13, 2019 9:09 PM
> To: Matsumura, Ryo/松村 量 
> Cc: Tsunakawa, Takayuki/綱川 貴之 ;
> pgsql-hackers@lists.postgresql.org
> Subject: Re: [PROPOSAL]a new data type 'bytea' for ECPG
> 
> Matsumura-san,
> 
> > Current architecture:
> > Internal expression of varchar is C-string that includes length
> > information implicitly
> > because the length can be computed by strlen().
> > ECPGdo(ecpg_build_params) assumes that the data in descriptor is C-
> > string encoded.
> >
> > In other hand, bytea data is binary that doesn't include any length
> > information.
> > And the merit of my proposal is that bytea data can be sent to
> > backend without
> > C-string encodeing overhead. They are different points from varchar.
> 
> Yes, I agree with this. But it does not explain why we cannot just add
> a length parameter. And it neither explains why we need so many if
> (!bytea) { thisandthat } else { somethingelse } blocks. I would prefer
> the integration to be smoother. Hopefully that is possible.
> 
> > My Idea-2 is that:
> > - ECPGset_desc copies data to descriptor_item.data, set the length to
> >   dscriptor_item.data_len and set type information to
> > descriptor_item.is_binary.
> > - ecpg_build_params only memcpy as folowing without ecpg_store_input:
> >
> >   if (descriptor_item.is_binary)
> > memcpy(, descriptor_item.data,
> > descriptor_item.data_len)
> 
> Isn't that a better way then? This looks more smoothly to me.
> 
> Michael
> --
> Michael Meskes
> Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
> Meskes at (Debian|Postgresql) dot Org
> Jabber: michael at xmpp dot meskes dot org
> VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL
> 
> 



Re: [HACKERS] Block level parallel vacuum

2019-02-14 Thread Masahiko Sawada
On Wed, Feb 13, 2019 at 9:32 PM Haribabu Kommi  wrote:
>
>
> On Sat, Feb 9, 2019 at 11:47 PM Masahiko Sawada  wrote:
>>
>> On Tue, Feb 5, 2019 at 12:14 PM Haribabu Kommi  
>> wrote:
>> >
>> >
>> > On Fri, Feb 1, 2019 at 8:19 AM Masahiko Sawada  
>> > wrote:
>> >>
>> >>
>> >> The passing stats = NULL to amvacuumcleanup and ambulkdelete means the
>> >> first time execution. For example, btvacuumcleanup skips cleanup if
>> >> it's not NULL.In the normal vacuum we pass NULL to ambulkdelete or
>> >> amvacuumcleanup when the first time calling. And they store the result
>> >> stats to the memory allocated int the local memory. Therefore in the
>> >> parallel vacuum I think that both worker and leader need to move it to
>> >> the shared memory and mark it as updated as different worker could
>> >> vacuum different indexes at the next time.
>> >
>> >
>> > OK, understood the point. But for btbulkdelete whenever the stats are NULL,
>> > it allocates the memory. So I don't see a problem with it.
>> >
>> > The only problem is with btvacuumcleanup, when there are no dead tuples
>> > present in the table, the btbulkdelete is not called and directly the 
>> > btvacuumcleanup
>> > is called at the end of vacuum, in that scenario, there is code flow 
>> > difference
>> > based on the stats. so why can't we use the deadtuples number to 
>> > differentiate
>> > instead of adding another flag?
>>
>> I don't understand your suggestion. What do we compare deadtuples
>> number to? Could you elaborate on that please?
>
>
> The scenario where the stats should pass NULL to btvacuumcleanup function is
> when there no dead tuples, I just think that we may use that deadtuples 
> structure
> to find out whether stats should pass NULL or not while avoiding the extra
> memcpy.
>

Thank you for your explanation.  I understood.  Maybe I'm worrying too
much, but I'm concerned about compatibility; currently we handle indexes
individually.  So if there is an index access method whose ambulkdelete
returns NULL on the first call but returns a palloc'd struct on the
second or later calls, that wouldn't work correctly.

The documentation says that the passed-in 'stats' is NULL on the first
call of ambulkdelete, but doesn't say anything about subsequent calls.
Index access methods may expect that the passed-in 'stats' is the same
as what they returned last time.  So I think we should add an extra flag
to keep compatibility.
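Just to illustrate the shape of that (names are invented, not taken from the
patch; IndexBulkDeleteResult comes from access/genam.h): each index gets a
slot in the DSM segment holding a copy of the bulk-delete stats plus a flag
recording whether any call has stored stats yet, so that NULL can still be
passed on the very first ambulkdelete()/amvacuumcleanup() call.

typedef struct SharedIndexStats
{
    bool        updated;            /* have the stats below been filled in? */
    IndexBulkDeleteResult stats;    /* copied from the AM's palloc'd result */
} SharedIndexStats;

/* what the leader/worker would pass to index_bulk_delete() for index i */
static IndexBulkDeleteResult *
stats_for_call(SharedIndexStats *shared_stats, int i)
{
    return shared_stats[i].updated ? &shared_stats[i].stats : NULL;
}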

>>
>> > And also this scenario is not very often, so avoiding
>> > memcpy for normal operations would be better. It may be a small gain, just
>> > thought of it.
>> >
>>
>> This scenario could happen periodically on an insert-only table.
>> Additional memcpy is executed once per indexes in a vacuuming but I
>> agree that the avoiding memcpy would be good.
>
>
> Yes, understood. If possible removing the need of memcpy would be good.
> The latest patch doesn't apply anymore. Needs a rebase.
>

Thank you. Attached the rebased patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


v15-0001-Add-parallel-option-to-VACUUM-command.patch
Description: Binary data


v15-0002-Add-P-option-to-vacuumdb-command.patch
Description: Binary data


Re: WAL insert delay settings

2019-02-14 Thread Tomas Vondra



On 2/14/19 10:36 AM, Andres Freund wrote:
> 
> 
> On February 14, 2019 10:31:57 AM GMT+01:00, Tomas Vondra 
>  wrote:
>>
>>
>> On 2/14/19 10:06 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
 On 2/13/19 4:31 PM, Stephen Frost wrote:
> Greetings,
>
> * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can
>> create
>> a lot of WAL.  A lot of WAL at once can cause delays in
>> replication.
>
> Agreed, though I think VACUUM should certainly be included in this.
>

 Won't these two throttling criteria interact in undesirable and/or
 unpredictable way? With the regular vacuum throttling (based on
 hit/miss/dirty) it's possible to compute rough read/write I/O
>> limits.
 But with the additional sleeps based on amount-of-WAL, we may sleep
>> for
 one of two reasons, so we may not reach either limit. No?
>>>
>>> Well, it'd be max rates for either, if done right. I think we only
>>> should start adding delays for WAL logging if we're exceeding the WAL
>>> write rate.
>>
>> Not really, I think. If you add additional sleep() calls somewhere,
>> that
>> may affect the limits in vacuum, making it throttle before reaching the
>> derived throughput limits.
> 
> I don't understand. Obviously, if you have two limits, the scarcer
> resource can limit full use of the other resource. That seems OK? The
> thing I think we need to be careful about is not to limit in a way,
> e.g. by adding sleeps even when below the limit, that a WAL limit
> causes throttling of normal IO before the WAL limit is reached.
> 

With the vacuum throttling, rough I/O throughput maximums can be
computed by counting the number of pages you can read/write between
sleeps. For example, with the defaults (200 credits, 20ms sleeps, miss
cost 10 credits) this means 20 writes/round, with 50 rounds/second, so
8MB/s. But this is based on the assumption that the work between sleeps
takes almost no time - that's not perfect, but generally works.

But if you add extra sleep() calls somewhere (say, because there's also a
limit on WAL throughput), it will affect how fast VACUUM works in
general. It'll continue with the cost-based throttling, but it will
never reach the limits. Say you do another 20ms sleep somewhere.
Suddenly it means it only does 25 rounds/second, and the actual write
limit drops to 4 MB/s.
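The arithmetic above, written out (a toy model only: default cost limit 200,
page miss cost 10, 20ms naps, 8kB pages; extra_sleep_ms stands for a
hypothetical additional WAL-based sleep per round):

#include <stdio.h>

static double
vacuum_rate_mb_per_sec(int cost_limit, int page_cost,
                       double nap_ms, double extra_sleep_ms)
{
    double pages_per_round = (double) cost_limit / page_cost;
    double rounds_per_sec = 1000.0 / (nap_ms + extra_sleep_ms);

    return pages_per_round * rounds_per_sec * 8192.0 / (1024.0 * 1024.0);
}

int
main(void)
{
    /* defaults only: ~8 MB/s, matching the estimate above */
    printf("%.1f MB/s\n", vacuum_rate_mb_per_sec(200, 10, 20.0, 0.0));
    /* one extra 20ms sleep per round: the effective limit halves to ~4 MB/s */
    printf("%.1f MB/s\n", vacuum_rate_mb_per_sec(200, 10, 20.0, 20.0));
    return 0;
}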

> 
>>> That's obviously more complicated than the stuff we do for
>>> the current VACUUM throttling, but I can't see two such systems
>>> interacting well. Also, the current logic just doesn't work well when
>>> you consider IO actually taking time, and/or process scheduling
>> effects
>>> on busy systems.
>>>
>>
>> True, but making it even less predictable is hardly an improvement.
> 
> I don't quite see the problem here. Could you expand?
> 

All I'm saying is that you can currently estimate how much read/write I/O
vacuum does. With the extra sleeps (due to the additional throttling
mechanism) it will be harder.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_basebackup ignores the existing data directory permissions

2019-02-14 Thread Magnus Hagander
On Thu, Feb 14, 2019 at 9:10 AM Michael Paquier  wrote:

> On Thu, Feb 14, 2019 at 06:34:07PM +1100, Haribabu Kommi wrote:
> > we have an application that is used to create the data directory with
>
> Well, initdb would do that happily, so there is no actual need to
> do that to begin with.  Anyway..
>
> > owner access (0700), but with initdb group permissions option, it
> > automatically
> > converts to (0750) by the initdb. But pg_basebackup doesn't change it
> when
> > it tries to do a backup from a group access server.
>
> So that's basically the opposite of the case I was thinking about,
> where you create a path for a base backup with permissions strictly
> higher than 700, say 755, and the base backup path does not have
> enough restrictions.  And in your case the permissions are too
> restrictive because of the application creating the folder itself but
> they should be relaxed if group access is enabled.  Actually, that's
> something that we may want to do consistently across all branches.  If
> an application calls pg_basebackup after creating a path, they most
> likely change the permissions anyway to allow the postmaster to
> start.
>

I think it could be argued that neither initdb *nor* pg_basebackup should
change the permissions on an existing directory, because the admin may have
done that intentionally. But when they do create the directory, they should
follow the same patterns.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Regarding participating in GSOC 2019 with PostgreSQL

2019-02-14 Thread Abhishek Agrawal
Hello,

My name is Abhishek Agrawal and I am a CSE UG student of IIT Patna.

I am interested in doing GSoC with PostgreSQL if you are applying for it this
year. If anyone could direct me to the proper links or a channel to prepare
for it, that would be really helpful. I have some questions, such as: what
will the projects be this year, what are the requirements for the respective
projects, and who will be the mentors?

This will be really helpful to me.

Sincerely,

Abhishek Agrawal
Github Link: https://github.com/Abhishek-dev2


useless argument of ATAddForeignKeyConstraint

2019-02-14 Thread Amit Langote
Hi,

While reviewing the foreign keys referencing partitioned tables patch, I
noticed that the parentConstr argument of ATAddForeignKeyConstraint is
rendered useless by the following commit:

commit 0325d7a5957ba39a0dce90835ab54a08ab8bf762
Author: Alvaro Herrera 
Date:   Fri Jan 18 14:49:40 2019 -0300

Fix creation of duplicate foreign keys on partitions

The above commit added another function specialized for recursively adding a
given foreign key constraint to partitions, so ATAddForeignKeyConstraint is
no longer called recursively.

Maybe remove that argument in HEAD ?  Attached a patch.

Thanks,
Amit
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 715c6a221c..f225c494bd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -404,7 +404,7 @@ static ObjectAddress ATAddCheckConstraint(List **wqueue,
 bool recurse, bool recursing, bool 
is_readd,
 LOCKMODE lockmode);
 static ObjectAddress ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo 
*tab,
- Relation rel, Constraint 
*fkconstraint, Oid parentConstr,
+ Relation rel, Constraint 
*fkconstraint,
  bool recurse, bool recursing,
  LOCKMODE lockmode);
 static void CloneForeignKeyConstraints(Oid parentId, Oid relationId,
@@ -7077,7 +7077,7 @@ ATExecAddConstraint(List **wqueue, AlteredTableInfo *tab, 
Relation rel,

 NIL);
 
address = ATAddForeignKeyConstraint(wqueue, tab, rel,
-   
newConstraint, InvalidOid,
+   
newConstraint,

recurse, false,

lockmode);
break;
@@ -7236,7 +7236,7 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo 
*tab, Relation rel,
  */
 static ObjectAddress
 ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
- Constraint *fkconstraint, Oid 
parentConstr,
+ Constraint *fkconstraint,
  bool recurse, bool recursing, 
LOCKMODE lockmode)
 {
Relationpkrel;
@@ -7607,7 +7607,7 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo 
*tab, Relation rel,
  
fkconstraint->deferrable,
  
fkconstraint->initdeferred,
  
fkconstraint->initially_valid,
- 
parentConstr,
+ 
InvalidOid,
  
RelationGetRelid(rel),
  
fkattnum,
  
numfks,


Re: Problems with plan estimates in postgres_fdw

2019-02-14 Thread Etsuro Fujita

(2019/02/12 20:43), Antonin Houska wrote:

Etsuro Fujita  wrote:

Here are my review comments:

root->upper_targets[UPPERREL_FINAL] = final_target;
+   root->upper_targets[UPPERREL_ORDERED] = final_target;
root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;

I think that's a good idea.  I think we'll soon need the PathTarget for
UPPERREL_DISTINCT, so how about saving that as well like this?

 root->upper_targets[UPPERREL_FINAL] = final_target;
+   root->upper_targets[UPPERREL_ORDERED] = final_target;
+   root->upper_targets[UPPERREL_DISTINCT] = sort_input_target;
 root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
 root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;


I see nothing wrong about this.


Cool!


Another is about this:

/*
+* Set reltarget so that it's consistent with the paths. Also it's more
+* convenient for FDW to find the target here.
+*/
+   ordered_rel->reltarget = target;

I think this is correct if there are no SRFs in the targetlist, but otherwise
I'm not sure this is really correct because the relation's reltarget would not
match the target of paths added to the relation after adjust_paths_for_srfs().


I wouldn't expect a problem here. create_grouping_paths() also sets the
reltarget, although adjust_paths_for_srfs() can be called on grouped_rel
afterwards.


Yeah, but as I said in a previous email, I think the root->upper_targets
change would be enough, at least currently, for FDWs to consider
performing post-UPPERREL_GROUP_AGG steps remotely, as shown in my
patches, so I'm inclined to leave the reltarget change for future work.


Best regards,
Etsuro Fujita




Re: WAL insert delay settings

2019-02-14 Thread Andres Freund



On February 14, 2019 10:31:57 AM GMT+01:00, Tomas Vondra 
 wrote:
>
>
>On 2/14/19 10:06 AM, Andres Freund wrote:
>> Hi,
>> 
>> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
>>> On 2/13/19 4:31 PM, Stephen Frost wrote:
 Greetings,

 * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can
>create
> a lot of WAL.  A lot of WAL at once can cause delays in
>replication.

 Agreed, though I think VACUUM should certainly be included in this.

>>>
>>> Won't these two throttling criteria interact in undesirable and/or
>>> unpredictable way? With the regular vacuum throttling (based on
>>> hit/miss/dirty) it's possible to compute rough read/write I/O
>limits.
>>> But with the additional sleeps based on amount-of-WAL, we may sleep
>for
>>> one of two reasons, so we may not reach either limit. No?
>> 
>> Well, it'd be max rates for either, if done right. I think we only
>> should start adding delays for WAL logging if we're exceeding the WAL
>> write rate.
>
>Not really, I think. If you add additional sleep() calls somewhere,
>that
>may affect the limits in vacuum, making it throttle before reaching the
>derived throughput limits.

I don't understand. Obviously, if you have two limits, the scarcer resource can 
limit full use of the other resource. That seems OK? The thing I think we need 
to be careful about is not to limit in a way, e.g. by adding sleeps even when 
below the limit, that a WAL limit causes throttling of normal IO before the WAL 
limit is reached.


>> That's obviously more complicated than the stuff we do for
>> the current VACUUM throttling, but I can't see two such systems
>> interacting well. Also, the current logic just doesn't work well when
>> you consider IO actually taking time, and/or process scheduling
>effects
>> on busy systems.
>> 
>
>True, but making it even less predictable is hardly an improvement.

I don't quite see the problem here. Could you expand?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: WAL insert delay settings

2019-02-14 Thread Tomas Vondra



On 2/14/19 10:06 AM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
>> On 2/13/19 4:31 PM, Stephen Frost wrote:
>>> Greetings,
>>>
>>> * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
 Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
 a lot of WAL.  A lot of WAL at once can cause delays in replication.
>>>
>>> Agreed, though I think VACUUM should certainly be included in this.
>>>
>>
>> Won't these two throttling criteria interact in undesirable and/or
>> unpredictable way? With the regular vacuum throttling (based on
>> hit/miss/dirty) it's possible to compute rough read/write I/O limits.
>> But with the additional sleeps based on amount-of-WAL, we may sleep for
>> one of two reasons, so we may not reach either limit. No?
> 
> Well, it'd be max rates for either, if done right. I think we only
> should start adding delays for WAL logging if we're exceeding the WAL
> write rate.

Not really, I think. If you add additional sleep() calls somewhere, that
may affect the limits in vacuum, making it throttle before reaching the
derived throughput limits.

> That's obviously more complicated than the stuff we do for
> the current VACUUM throttling, but I can't see two such systems
> interacting well. Also, the current logic just doesn't work well when
> you consider IO actually taking time, and/or process scheduling effects
> on busy systems.
> 

True, but making it even less predictable is hardly an improvement.

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL insert delay settings

2019-02-14 Thread Andres Freund
Hi,

On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
> On 2/13/19 4:31 PM, Stephen Frost wrote:
> > Greetings,
> > 
> > * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> >> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> >> a lot of WAL.  A lot of WAL at once can cause delays in replication.
> > 
> > Agreed, though I think VACUUM should certainly be included in this.
> > 
> 
> Won't these two throttling criteria interact in undesirable and/or
> unpredictable way? With the regular vacuum throttling (based on
> hit/miss/dirty) it's possible to compute rough read/write I/O limits.
> But with the additional sleeps based on amount-of-WAL, we may sleep for
> one of two reasons, so we may not reach either limit. No?

Well, it'd be max rates for either, if done right. I think we only
should start adding delays for WAL logging if we're exceeding the WAL
write rate. That's obviously more complicated than the stuff we do for
the current VACUUM throttling, but I can't see two such systems
interacting well. Also, the current logic just doesn't work well when
you consider IO actually taking time, and/or process scheduling effects
on busy systems.
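To make the intent concrete, here is an illustrative-only sketch (not taken
from any posted patch; everything below is invented) of what "delay only when
the WAL rate is exceeded" could look like: bytes about to be inserted are
charged against a budget that refills with elapsed time, and the process
sleeps only once the budget is exhausted.

#include <stdint.h>
#include <time.h>
#include <unistd.h>

typedef struct WalRateLimiter
{
    double      max_bytes_per_sec;  /* configured WAL rate limit */
    double      budget;             /* bytes we may still insert "for free" */
    struct timespec last_refill;
} WalRateLimiter;

static void
wal_limit_charge(WalRateLimiter *lim, uint64_t nbytes)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    /* refill the budget for the elapsed time, capped at one second's worth */
    lim->budget += lim->max_bytes_per_sec *
        ((now.tv_sec - lim->last_refill.tv_sec) +
         (now.tv_nsec - lim->last_refill.tv_nsec) / 1e9);
    if (lim->budget > lim->max_bytes_per_sec)
        lim->budget = lim->max_bytes_per_sec;
    lim->last_refill = now;

    lim->budget -= (double) nbytes;
    if (lim->budget < 0)
    {
        /* over the limit: sleep just long enough for the deficit to refill */
        usleep((useconds_t) (-lim->budget / lim->max_bytes_per_sec * 1e6));
    }
}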

Greetings,

Andres Freund



Re: WAL insert delay settings

2019-02-14 Thread Peter Geoghegan
On Thu, Feb 14, 2019 at 12:53 AM Peter Geoghegan  wrote:
> I didn't mention that the utility they used would send SIGSTOP and
> SIGCONT in close succession. (Yeah, I know.)

Actually, it SIGSTOP'd backends, not the WAL writer or background writer.


-- 
Peter Geoghegan



Re: WAL insert delay settings

2019-02-14 Thread Tomas Vondra



On 2/13/19 4:31 PM, Stephen Frost wrote:
> Greetings,
> 
> * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>> a lot of WAL.  A lot of WAL at once can cause delays in replication.
> 
> Agreed, though I think VACUUM should certainly be included in this.
> 

Won't these two throttling criteria interact in undesirable and/or
unpredictable way? With the regular vacuum throttling (based on
hit/miss/dirty) it's possible to compute rough read/write I/O limits.
But with the additional sleeps based on amount-of-WAL, we may sleep for
one of two reasons, so we may not reach either limit. No?

> I'm all for the idea though it seems like a different approach is needed
> based on the down-thread discussion.  Ultimately, having a way to have
> these activities happen without causing issues for replicas is a great
> idea and would definitely address a practical issue that a lot of people
> run into.
> 

+1


-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL insert delay settings

2019-02-14 Thread Peter Geoghegan
On Thu, Feb 14, 2019 at 12:52 AM Peter Geoghegan  wrote:
> I imagine that it held the critical locks briefly.

I didn't mention that the utility they used would send SIGSTOP and
SIGCONT in close succession. (Yeah, I know.)

-- 
Peter Geoghegan



Re: WAL insert delay settings

2019-02-14 Thread Peter Geoghegan
On Thu, Feb 14, 2019 at 12:42 AM Andres Freund  wrote:
> That can't have been the workaround - either you'd interrupt it while
> holding critical locks (in which case nobody could write WAL anymore),
> or you'd just move all the writing to backends, no?

I imagine that it held the critical locks briefly. I'm not endorsing
that approach, obviously, but apparently it more or less worked. It
was something that was used in rare cases, only when there was no
application-specific way to throttle writes, and only when the server
was in effect destabilized by writing out WAL too quickly.

-- 
Peter Geoghegan



Re: WAL insert delay settings

2019-02-14 Thread Andres Freund
Hi,

On 2019-02-13 23:21:39 -0800, Peter Geoghegan wrote:
> There would occasionally be cases where ops
> would find it useful to throttle WAL writing using their own terrible
> kludge (it involved sending SIGSTOP to the WAL writer).

That can't have been the workaround - either you'd interrupt it while
holding critical locks (in which case nobody could write WAL anymore),
or you'd just move all the writing to backends, no?

Greetings,

Andres Freund



Re: Protect syscache from bloating with negative cache entries

2019-02-14 Thread Andres Freund
Hi,

On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
> Instead, I added an accounting(?) interface function.
> 
> | MemoryContextGettConsumption(MemoryContext cxt);
> 
> The API returns the current consumption in this memory
> context. This allows "real" memory accounting almost without
> overhead.

That's definitely *NOT* almost without overhead. This adds additional
instructions to one of postgres' hottest sets of code paths.

I think you're not working incrementally enough here. I strongly suggest
solving the negative cache entry problem, and then incrementally go from
there after that's committed. The likelihood of this patch ever getting
merged otherwise seems extremely small.

Greetings,

Andres Freund



RE: Protect syscache from bloating with negative cache entries

2019-02-14 Thread Ideriha, Takeshi
>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>
>
>(2) Another new patch v15-0005 on top of previous design of
>  limit-by-number-of-a-cache feature converts it to
>  limit-by-size-on-all-caches feature, which I think is
>  Tsunakawa-san wanted.
Yeah, size looks better to me.

>As far as I can see no significant degradation is found in usual (as long as 
>pruning
>doesn't happen) code paths.
>
>About the new global-size based evicition(2), cache entry creation becomes 
>slow after
>the total size reached to the limit since every one new entry evicts one or 
>more old (=
>not-recently-used) entries. Because of not needing knbos for each cache, it 
>become
>far realistic. So I added documentation of "catalog_cache_max_size" in 0005.

Now I'm also trying to benchmark, which will be posted in another email.

Here are things I noticed:

[1] compiler warning (a possible initializer that avoids it is sketched
after this list):
catcache.c:109:1: warning: missing braces around initializer [-Wmissing-braces]
 dlist_head cc_lru_list = {0};
 ^
catcache.c:109:1: warning: (near initialization for ‘cc_lru_list.head’)
[-Wmissing-braces]

[2] catalog_cache_max_size does not appear in postgresql.conf.sample

[3] the global LRU list and global size could be included in CatCacheHeader,
which seems to me a good place because that structure contains global cache
information regardless of the kind of CatCache

[4] when applying the patch with git am, there are several warnings about
trailing whitespace in v15-0003
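Regarding [1], one way the warning could be avoided (just a suggestion, using
the existing initializer macro from lib/ilist.h; the variable name comes from
the patch) is:

#include "lib/ilist.h"

dlist_head cc_lru_list = DLIST_STATIC_INIT(cc_lru_list);

Alternatively, dlist_init(&cc_lru_list) could be called once during cache
initialization.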

Regards,
Takeshi Ideriha



Re: [PATCH] xlogreader: do not read a file block twice

2019-02-14 Thread Arthur Zakirov

On 14.02.2019 09:51, Michael Paquier wrote:

Now I don't actually agree that this qualifies as a bug fix.  As
things stand, a page may finish by being read more than once if what has
been read previously equals what is requested, however this does not
prevent the code from working correctly.  The performance gain is also
heavily dependent on the callback reading a page and the way the WAL
reader is used.  How do you actually read WAL pages in your own
plugin with compressed data?  It begins by reading a full page once,
then it moves on to a per-record read after making sure that the page
has been read?


Yes, the application reads whole WAL pages at a time. It is done within 
SimpleXLogPageRead() (the read_page callback passed to 
XLogReaderAllocate()), which returns XLOG_BLCKSZ.


Here is the part of the code, not sure that it will be useful though:

SimpleXLogPageRead(...)
{
...
targetPageOff = targetPagePtr % private_data->xlog_seg_size;
...
if (gzseek(private_data->gz_xlogfile, (z_off_t) targetPageOff,
   SEEK_SET) == -1)
...
if (gzread(private_data->gz_xlogfile, readBuf, XLOG_BLCKSZ) !=
   XLOG_BLCKSZ)
...
return XLOG_BLCKSZ;
}

So we read the whole page of size XLOG_BLCKSZ. The full code is here:
https://github.com/postgrespro/pg_probackup/blob/c052651b8c8864733bcabbc2660c387b792229d8/src/parsexlog.c#L1074

Here is the little optimization I made. Mainly I just added a buffer to 
store the previously read page:

https://github.com/postgrespro/pg_probackup/blob/c052651b8c8864733bcabbc2660c387b792229d8/src/parsexlog.c#L1046
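In outline it amounts to something like the sketch below (names invented;
the real code is at the link above, and XLogRecPtr/XLOG_BLCKSZ come from the
backend headers): keep the last page around and skip the gzseek()/gzread()
when the reader asks for the same page again.

#include "postgres.h"
#include "access/xlogdefs.h"

typedef struct LastPageCache
{
    XLogRecPtr  cached_page_ptr;    /* start LSN of the page kept in buf */
    bool        valid;
    char        buf[XLOG_BLCKSZ];
} LastPageCache;

static int
read_page_cached(LastPageCache *cache, XLogRecPtr targetPagePtr, char *readBuf,
                 int (*read_raw) (XLogRecPtr ptr, char *buf))
{
    if (cache->valid && cache->cached_page_ptr == targetPagePtr)
    {
        memcpy(readBuf, cache->buf, XLOG_BLCKSZ);   /* served from the cache */
        return XLOG_BLCKSZ;
    }

    /* read_raw() stands in for the gzseek()+gzread() done in the real code */
    if (read_raw(targetPagePtr, cache->buf) != XLOG_BLCKSZ)
        return -1;

    cache->cached_page_ptr = targetPagePtr;
    cache->valid = true;
    memcpy(readBuf, cache->buf, XLOG_BLCKSZ);
    return XLOG_BLCKSZ;
}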

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: pg_basebackup ignores the existing data directory permissions

2019-02-14 Thread Michael Paquier
On Thu, Feb 14, 2019 at 06:34:07PM +1100, Haribabu Kommi wrote:
> we have an application that is used to create the data directory with

Well, initdb would do that happily, so there is no actual need to
do that to begin with.  Anyway..

> owner access (0700), but with initdb group permissions option, it
> automatically
> converts to (0750) by the initdb. But pg_basebackup doesn't change it when
> it tries to do a backup from a group access server.

So that's basically the opposite of the case I was thinking about,
where you create a path for a base backup with permissions strictly
higher than 700, say 755, and the base backup path does not have
enough restrictions.  And in your case the permissions are too
restrictive because of the application creating the folder itself but
they should be relaxed if group access is enabled.  Actually, that's
something that we may want to do consistently across all branches.  If
an application calls pg_basebackup after creating a path, they most
likely change the permissions anyway to allow the postmaster to
start.
--
Michael


signature.asc
Description: PGP signature