get_controlfile() can leak fds in the backend

2019-02-26 Thread Michael Paquier
Hi all,
(CC-ing Joe as of dc7d70e)

I was just looking at the offline checksum patch, and noticed some
sloppy coding in controldata_utils.c.  The control file gets opened in
get_controlfile(), and if it generates an error then the file
descriptor remains open.  As a basic coding rule in the backend we should
only use BasicOpenFile() when opening files, so I think that the issue
should be fixed as attached, even if this requires including fd.h for
the backend compilation, which is kind of ugly.

Another possibility would be to just close the file descriptor before
raising any error, saving errno appropriately before issuing the %m
portion, but that still does not respect the backend policy regarding
open().
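For illustration, that alternative would look roughly like this inside
get_controlfile() (a sketch only; the error texts are made up and not
taken from the attached patch):

    r = read(fd, ControlFile, sizeof(ControlFileData));
    if (r != sizeof(ControlFileData))
    {
        int         save_errno = errno;

        close(fd);              /* release the fd before erroring out */
        errno = save_errno;     /* keep %m meaningful */
#ifndef FRONTEND
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read file \"%s\": %m", ControlFilePath)));
#else
        fprintf(stderr, _("%s: could not read file \"%s\": %s\n"),
                progname, ControlFilePath, strerror(save_errno));
        exit(EXIT_FAILURE);
#endif
    }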

We also do not have a wait event for the read() call; maybe we should
have one, but knowing that this gets called only from the SQL-level
functions accessing the control file, I don't really think it's worth
it.
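If we did add one, it would presumably just be a matter of wrapping the
read() call, something like this (a sketch, reusing the existing
ControlFileRead wait event in the backend case):

    #ifndef FRONTEND
        pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_READ);
    #endif
        r = read(fd, ControlFile, sizeof(ControlFileData));
    #ifndef FRONTEND
        pgstat_report_wait_end();
    #endif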

Thoughts?
--
Michael
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index abfe7065f5..0b011741e3 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -27,6 +27,9 @@
 #include "catalog/pg_control.h"
 #include "common/controldata_utils.h"
 #include "port/pg_crc32c.h"
+#ifndef FRONTEND
+#include "storage/fd.h"
+#endif
 
 /*
  * get_controlfile(char *DataDir, const char *progname, bool *crc_ok_p)
@@ -45,13 +48,20 @@ get_controlfile(const char *DataDir, const char *progname, bool *crc_ok_p)
 	char		ControlFilePath[MAXPGPATH];
 	pg_crc32c	crc;
 	int			r;
+	int			flags = O_RDONLY | PG_BINARY;
 
 	AssertArg(crc_ok_p);
 
 	ControlFile = palloc(sizeof(ControlFileData));
 	snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
 
-	if ((fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0)) == -1)
+#ifndef FRONTEND
+	fd = BasicOpenFile(ControlFilePath, flags);
+#else
+	fd = open(ControlFilePath, flags, 0);
+#endif
+
+	if (fd < 0)
 #ifndef FRONTEND
 		ereport(ERROR,
 (errcode_for_file_access(),




Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Laurenz Albe
Fujii Masao wrote:
> So, let me clarify the situations;
> 
> (3) If backup_label.pending exists but recovery.signal doesn't, the server
>    ignores (or removes) backup_label.pending and does the recovery
>    starting from the pg_control's REDO location. This case can happen if
>the server crashes while an exclusive backup is in progress.
>So crash-in-the-middle-of-backup doesn't prevent the recovery from
>starting in this idea

That's where I see the problem with your idea.

It is fairly easy for someone to restore a backup and then just start
the server without first creating "recovery.signal", and that would
lead to data corruption.

Indeed, if I didn't know much about databases and were faced with
the task of restoring a PostgreSQL database from backup in a hurry,
that is probably what I would do.

Having thought some about the whole problem area, I believe that there
is no safe way to disambiguate a backup from a server crashed in backup
mode.

The non-exclusive backup mode actually suffers from a similar problem ---
if you don't add "backup_label" to the backup afterwards, it will lead
to database corruption.

The only fool-proof tool we have in core is pg_basebackup.

Any really good backup solution that wishes to integrate with a third-
party backup tool will have to get that tool to include "backup_label"
with the backup itself.

For that, it would need to know the contents of "backup_label" *before*
pg_stop_backup().  Is there a good reason why it is pg_stop_backup()
and not pg_start_backup() that provides that information?

Yours,
Laurenz Albe




Re: [HACKERS] generated columns

2019-02-26 Thread Michael Paquier
On Wed, Feb 27, 2019 at 08:05:02AM +0100, Peter Eisentraut wrote:
> There is something like that at the top of
> src/test/regress/sql/generated.sql.  I can expand that.

I think you mean identity.sql.  I would think that this would live
better with some other sanity checks.  I am fine with your final
decision on the matter.

> But it only covers the tests.  For run time checks, you'd want
> something like pg_catcheck.

Sure.
--
Michael




Re: Offline enabling/disabling of data checksums

2019-02-26 Thread Michael Paquier
On Wed, Feb 27, 2019 at 07:59:31AM +0100, Fabien COELHO wrote:
> The renaming implies quite a few changes (eg in the documentation,
> makefiles, tests) which warrants a review, so it should be a patch. Also,
> ISTM that the renaming only make sense when adding the enable/disable
> feature, so I'd say that it belongs to this patch. Opinions?

I would think that the rename should happen first, but it is possible
to make git diffs less noisy as well for files copied, so merging
things is technically doable.

> About tests: I'd run a check on a disabled cluster to check that the command
> fails because checksums are disabled.

While I look at that...  could you also split the refactoring into a
separate, first patch?
--
Michael




Re: [HACKERS] generated columns

2019-02-26 Thread Peter Eisentraut
On 2019-02-26 06:12, Michael Paquier wrote:
> Hm.  Does the SQL standard mention more features which could be merged
> with stored values, virtual values, default expressions and identity
> columns?  I don't know the last trends in this area but I am wondering
> if there are any other column-specific, expression-like features like
> that associated with a column.  That could strengthen the case for
> having one column in pg_attribute to govern them all.  Well, assuming
> that something else is implemented of course.  That's a lot of
> assumptions, and it's not like the current implementation is wrong
> either.

The system-versioned tables feature might be the most adjacent one.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] generated columns

2019-02-26 Thread Peter Eisentraut
On 2019-02-26 06:30, Michael Paquier wrote:
> +   if (attgenerated)
> +   {
> +       /*
> +        * Generated column: Dropping anything that the generation expression
> +        * refers to automatically drops the generated column.
> +        */
> +       recordDependencyOnSingleRelExpr(, expr, RelationGetRelid(rel),
> +                                       DEPENDENCY_AUTO,
> +                                       DEPENDENCY_AUTO, false);
> +   }
> A CCI is not necessary here, I think, but the recent thread about
> automatic dependencies with identity columns had a very similar
> pattern...

Yeah, worth taking another look.

> 
> +   else if (generated[0] == ATTRIBUTE_GENERATED_VIRTUAL)
> +       default_str = psprintf("generated always as (%s)", PQgetvalue(res, i, attrdef_col));
> Nit: I would add VIRTUAL instead of relying on the default option.

I suppose we'll decide that when the virtual columns are actually added.
 I see your point.

> Another thing I was thinking about: could it be possible to add a
> sanity check in sanity_check.sql to verify that a column does not have
> more than one of attidentity, attgenerated and atthasdef set at the same time?

There is something like that at the top of
src/test/regress/sql/generated.sql.  I can expand that.  But it only
covers the tests.  For run time checks, you'd want something like
pg_catcheck.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Offline enabling/disabling of data checksums

2019-02-26 Thread Fabien COELHO



Hallo Michael,


- * src/bin/pg_verify_checksums/pg_verify_checksums.c
+ * src/bin/pg_checksums/pg_checksums.c
That's lacking a rename, or this comment is incorrect.


Right, I started the rename, but then backed off pending further 
discussion whether I should submit that or whether the committer will 
just do it.


ISTM that there is an all-clear for renaming.

The renaming implies quite a few changes (eg in the documentation, 
makefiles, tests) which warrants a review, so it should be a patch. Also, 
ISTM that the renaming only make sense when adding the enable/disable 
feature, so I'd say that it belongs to this patch. Opinions?


About v4:

Patch applies cleanly, compiles, global & local "make check" are ok.

Doc: "enable, disable or verifies" -> "checks, enables or disables"?
Spurious space: "> ." -> ">.".

As checksums are now checked, the doc could use "check" instead of 
"verify", especially if there is a rename and the "verify" word 
disappears.


I'd be less terse when documenting actions, eg: "Disable checksums" -> 
"Disable checksums on cluster."


Doc should state that checking is the default action, eg "Check checksums 
on cluster. This is the default action."


Help string could say that -c is the default action. There is a spurious 
help line remaining from the previous "--action" implementation.


open: I'm positively unsure about the precedence of ?: relative to |, and 
probably not the only one, so I'd add parentheses around the former.
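For what it's worth, in C the bitwise | binds tighter than ?:, so the two
spellings below mean different things; explicit parentheses make the intent
obvious (standalone example, not code from the patch):

    #include <stdio.h>

    int
    main(void)
    {
        int     a = 1,
                b = 2;

        /* | binds tighter than ?:, so this is (a | b) ? 10 : 20, i.e. 10 */
        printf("%d\n", a | b ? 10 : 20);

        /* parenthesize if the ternary result should be OR'ed in: 1 | 10 = 11 */
        printf("%d\n", a | (b ? 10 : 20));

        return 0;
    }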


I'm at odds with the "postmaster.pid" check, which would not prevent an 
issue if a cluster is started with "postmaster". I still think that the 
enabling-in-progress should be stored in the cluster state.


ISTM that the control-file read/update cycle should somehow lock the file 
being modified. However, other commands do not seem to do anything 
about it.


I do not think that enabling when already enabled, or disabling when already 
disabled, should exit(1); I think it is a no-op and should simply exit(0).


About tests: I'd run a check on a disabled cluster to check that the 
command fails because checksums are disabled.


--
Fabien.



RE: Problem with default partition pruning

2019-02-26 Thread Yuzuko Hosoya
Amit-san,

> From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> Sent: Wednesday, February 27, 2019 11:22 AM
> 
> Hosoya-san,
> 
> On 2019/02/22 17:14, Yuzuko Hosoya wrote:
> > Hi,
> >
> > I found the bug of default partition pruning when executing a range query.
> >
> > -
> > postgres=# create table test1(id int, val text) partition by range
> > (id); postgres=# create table test1_1 partition of test1 for values
> > from (0) to (100); postgres=# create table test1_2 partition of test1
> > for values from (150) to (200); postgres=# create table test1_def
> > partition of test1 default;
> >
> > postgres=# explain select * from test1 where id > 0 and id < 30;
> >QUERY PLAN
> > 
> >  Append  (cost=0.00..11.83 rows=59 width=11)
> >->  Seq Scan on test1_1  (cost=0.00..5.00 rows=58 width=11)
> >  Filter: ((id > 0) AND (id < 30))
> >->  Seq Scan on test1_def  (cost=0.00..6.53 rows=1 width=12)
> >  Filter: ((id > 0) AND (id < 30))
> > (5 rows)
> >
> > There is no need to scan the default partition, but it's scanned.
> > -
> >
> > In the current implementation, whether the default partition is scanned or
> > not is determined according to each condition of given WHERE clause at
> > get_matching_range_bounds().  In this example, scan_default is set
> > true according to id > 0 because id >= 200 matches the default
> > partition.  Similarly, according to id < 30, scan_default is set true.
> > Then, these results are combined according to AND/OR at 
> > perform_pruning_combine_step().
> > In this case, final result's scan_default is set true.
> >
> > The modifications I made are as follows:
> > - get_matching_range_bounds() determines only offsets of range bounds
> >   according to each condition
> > - These results are combined at perform_pruning_combine_step()
> > - Whether the default partition is scanned or not is determined at
> >   get_matching_partitions()
> >
> > Attached the patch.  Any feedback is greatly appreciated.
> 
> Thank you for reporting.  Can you please add this to March CF in Bugs
> category so as not to lose track of this?
> 
> I will try to send review comments soon.
> 
Thank you for your reply.  I added this to March CF.

Regards,
Yuzuko Hosoya
NTT Open Source Software Center





Re: TupleTableSlot abstraction

2019-02-26 Thread Andres Freund
On 2019-02-27 15:42:50 +0900, Michael Paquier wrote:
> On Tue, Feb 26, 2019 at 10:38:45PM -0800, Andres Freund wrote:
> > I'm not sure I understand. How can adding a memory context + reset to
> > ctas and matview receivers negatively impact other dest receivers?
> 
> I don't think you got my point here: imagine that a plugin uses the
> current receiveSlot logic from createas.c or matview.c, and forgets to
> free the copied tuple.  On v11, that works fine.  On current HEAD,
> it silently gains a new leak.

The copy was made in intorel_receive()?



Re: Libpq support to connect to standby server as priority

2019-02-26 Thread Haribabu Kommi
On Mon, Feb 25, 2019 at 11:38 AM Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> Hi Hari-san,
>
> I've reviewed all files.  I think I'll proceed to testing when I've
> reviewed the revised patch and the patch for target_server_type.
>
>
Thanks for the review.


>
> (1) patch 0001
> 	CONNECTION_CHECK_WRITABLE,	/* Check if we could make a writable
> 								 * connection. */
> +	CONNECTION_CHECK_TARGET,	/* Check if we have a proper target
> +								 * connection */
> 	CONNECTION_CONSUME			/* Wait for any pending message and consume
> 								 * them. */
>
> According to the following comment, a new enum value should be added at
> the end.
>
> /*
>  * Although it is okay to add to these lists, values which become unused
>  * should never be removed, nor should constants be redefined - that would
>  * break compatibility with existing code.
>  */
>

fixed.


>
>
> (2) patch 0002
> It seems to align better with the existing code to set the default value
> to pg_conn.requested_session_type explicitly in makeEmptyPGconn(), even if
> the default value is 0.  Although I feel it's redundant, other member
> variables do so.
>
>
fixed.


>
> (3) patch 0003
>  IntervalStyle was not reported by releases before 8.4;
> -application_name was not reported by releases before 9.0.)
> +application_name was not reported by releases before 9.0
> +transaction_read_only was not reported by releases before 12.0.)
>
> ";" is missing at the end of the third line.
>
>
fixed.


>
> (4) patch 0004
> -	/* Type of connection to make.  Possible values: any, read-write. */
> +	/* Type of connection to make.  Possible values: any, read-write, perfer-read. */
> 	char	   *target_session_attrs;
>
> perfer -> prefer
>
>
fixed.


>
> (5) patch 0004
> @@ -3608,6 +3691,9 @@ makeEmptyPGconn(void)
> conn = NULL;
> }
>
> +   /* Initial value */
> +   conn->read_write_host_index = -1;
>
> The new member should be initialized earlier in this function.  Otherwise,
> as you can see in the above fragment, conn can be NULL in an out-of-memory
> case.
>
>
fixed.



>
> (6) patch 0004
> Don't we add read-only as well as prefer-read, which corresponds to Slave
> or Secondary of PgJDBC's targetServerType?  I thought the original proposal
> was to add both.
>
>
Added read-only option.



>
> (7) patch 0004
> @@ -2347,6 +2367,7 @@ keep_going: /* We will come back to here until there is
> 					conn->try_next_addr = true;
> 					goto keep_going;
> 				}
> +
> 				appendPQExpBuffer(&conn->errorMessage,
> 								  libpq_gettext("could not create socket: %s\n"),
>
> Is this an unintended newline addition?
>
>
removed.


>
> (8) patch 0004
> +			const char *type = (conn->requested_session_type == SESSION_TYPE_PREFER_READ) ?
> +				"read-only" : "writable";
>
> I'm afraid these strings are not translatable into other languages.
>
>
OK. I added two separate prepared error messages, one for read-write and
one for read-only, so it shouldn't be a problem.



>
> (9) patch 0004
> +					/* Record read-write host index */
> +					if (conn->read_write_host_index == -1)
> +						conn->read_write_host_index = conn->whichhost;
>
> At this point, the session can be either read-write or read-only, can't
> it?  Add the check "!conn->transaction_read_only" in this if condition?
>

Yes, fixed.


>
> (10) patch 0004
> +					if ((conn->target_session_attrs != NULL) &&
> +						(conn->requested_session_type == SESSION_TYPE_PREFER_READ) &&
> +						(conn->read_write_host_index != -2))
>
> The first condition is not necessary because the second one exists.
>
> The parenthesis surrounding each of these conditions are redundant.  It
> would be better to remove them for readability.  This also applies to the
> following part:
>
> +					if (((strncmp(val, "on", 2) == 0) &&
> +						 (conn->requested_session_type == SESSION_TYPE_READ_WRITE)) ||
> +						((strncmp(val, "off", 3) == 0) &&
> +						 (conn->requested_session_type == SESSION_TYPE_PREFER_READ) &&
> +						 (conn->read_write_host_index != -2)))
>

fixed.

Attached are the updated patches.
The target_server_type option is yet to be implemented.
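For reviewers following along, the intended client-side usage would presumably
look like the following (assuming the prefer-read value proposed in these
patches is committed under that name):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* prefer a standby among the listed hosts, fall back to the primary */
        PGconn     *conn = PQconnectdb("host=host1,host2 dbname=postgres "
                                       "target_session_attrs=prefer-read");

        if (PQstatus(conn) != CONNECTION_OK)
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

        PQfinish(conn);
        return 0;
    }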

Regards,
Haribabu Kommi
Fujitsu Australia


0003-Make-transaction_read_only-as-GUC_REPORT-varaible.patch
Description: Binary data


0002-New-TargetSessionAttrsType-enum.patch

Re: TupleTableSlot abstraction

2019-02-26 Thread Michael Paquier
On Tue, Feb 26, 2019 at 10:38:45PM -0800, Andres Freund wrote:
> I'm not sure I understand. How can adding a memory context + reset to
> ctas and matview receivers negatively impact other dest receivers?

I don't think you got my point here: imagine that a plugin uses the
current receiveSlot logic from createas.c or matview.c, and forgets to
free the copied tuple.  On v11, that works fine.  On current HEAD,
it silently gains a new leak.
--
Michael




Re: TupleTableSlot abstraction

2019-02-26 Thread Andres Freund
Hi,

On 2019-02-27 15:34:07 +0900, Michael Paquier wrote:
> On Tue, Feb 26, 2019 at 09:42:38PM -0800, Andres Freund wrote:
> > I'm not so sure that's the architecturally correct fix however. Is it
> > actually guaranteed, given expanded tuples, toasting, etc, that there's
> no other memory leak here? I wonder if we shouldn't work towards using
> a short-lived memory context here. Note how e.g. printtup() uses a
> short-lived context for its work.
> 
> Perhaps.  I got to wonder if this change would not impact code using
> their own DestReceiver, resulting in similar leaks when they insert
> tuples on-the-fly.

I'm not sure I understand. How can adding a memory context + reset to
ctas and matview receivers negatively impact other dest receivers?


> I was playing a bit with some refactoring of relation creation for
> CTAS in the scope of temporary matviews, and noticed this issue on the
> CF list, so that was a bit annoying, and issues like that tend to be
> easily forgotten..

It's been 10 days since the report, nobody pinged, and obviously I'm
working on pluggable storage, so ...

Greetings,

Andres Freund



Re: Autovaccuum vs temp tables crash

2019-02-26 Thread Michael Paquier
On Tue, Feb 26, 2019 at 07:21:40PM -0500, Tom Lane wrote:
> The existing state of affairs is that a superuser who really needs to drop
> a temp schema can do so, if she's careful that it's not active.  Pinning
> things would break that, or at least add an additional roadblock.  If it's
> some sort of virtual pin rather than a regular pg_depend entry, then it
> *would* be impossible to get around (mumble ... DELETE FROM pg_namespace
> ... mumble).  As against that, what problem are we fixing by preventing
> superusers from doing that?  A careless superuser can screw things up
> arbitrarily badly in any case, so I'm not that fussed about the hazard
> that the namespace isn't idle.

And when you try to do surgery on a corrupted cluster, it can be on
the contrary very useful to be able to work with objects and
manipulate them more freely as a superuser.
--
Michael




Re: TupleTableSlot abstraction

2019-02-26 Thread Michael Paquier
On Tue, Feb 26, 2019 at 09:42:38PM -0800, Andres Freund wrote:
> I'm not so sure that's the architecturally correct fix however. Is it
> actually guaranteed, given expanded tuples, toasting, etc, that there's
> no other memory leak here? I wonder if we shouldn't work towards using
> a short-lived memory context here. Note how e.g. printtup() uses a
> short-lived context for its work.

Perhaps.  I got to wonder if this change would not impact code using
their own DestReceiver, resulting in similar leaks when they insert
tuples on-the-fly.  Such issues can be surprising for fork and plugin
developers.  I was playing a bit with some refactoring of relation
creation for CTAS in the scope of temporary matviews, and noticed this
issue on the CF list, so that was a bit annoying, and issues like that
tend to be easily forgotten..
--
Michael




Re: psql display of foreign keys

2019-02-26 Thread Michael Paquier
On Tue, Feb 26, 2019 at 07:27:57PM -0300, Alvaro Herrera wrote:
> Thanks for committing pg_partition_root ... but it turns out to be
> useless for this purpose.

Well, what's done is done.  The thing is useful by itself in my
opinion.

> It turns out that we need to obtain the list
> of *ancestors* of the table being displayed, which pg_partition_tree
> does not easily give you.  So I ended up adding yet another auxiliary
> function, pg_partition_ancestors, which in its current formulation
> returns just a lits of OIDs of ancestor tables.  This seems generally
> useful, and can be used in conjunction with pg_partition_tree():
> 
> alvherre=# select t.* from pg_partition_tree(pg_partition_root('pk11')) t
>join pg_partition_ancestors('pk11', true) a on (t.relid = a.relid);
>  relid | parentrelid | isleaf | level 
> -------+-------------+--------+-------
>  pk| | f  | 0
>  pk1   | pk  | f  | 1
>  pk11  | pk1 | t  | 2
> (3 rows)
> 
> (A small tweak is to change the return type from OID to regclass.
> Docbook additions missing also.)

In the second patch, pg_partition_ancestors always sets include_self
to true.  What's the use case you have in mind to set it to false?  In
the other existing functions we always include the argument itself, so
we may want to keep things consistent.

I think that you should make the function return a record of regclass
elements instead of OIDs to be consistent.  This could save casts for
the callers.

Adding the self-member at the beginning of the record set is more
consistent with the order of the results returned by
get_partition_ancestors().

It would be nice to add some tests in partition_info.sql for tables
and indexes (both work).

> Anyway, given this function, it's possible to fix the psql display to be
> as I showed previously.  Patches attached.

+"  FROM pg_constraint, pg_partition_ancestors('%s', 't')\n"
+" WHERE confrelid = relid AND contype = 'f' AND conparentid = 0\n"
A JOIN would have been cleaner in my opinion, but feel free with the
style you think is more adapted.

For the meaning of using pg_partition_ancestors, I see...  Not only do
you want to show the foreign keys defined in the top-most parent, but
also these defined in intermediate layers.  That makes sense.  Using
only pg_partition_root would have been enough to show FKs in the
top-most parent, but the intermediate ones would be missed (using only
pg_partition_root() would miss the FKs fk_partitioned_fk_5_a_fkey1 and
fk_partitioned_fk_5_a_fkey when doing "\d fk_partitioned_fk_5_1" based
on the test set).
--
Michael




Re: When is the MessageContext released?

2019-02-26 Thread Andres Freund
On 2019-02-27 14:08:47 +0800, Andy Fan wrote:
> Hi :
>   I run a query like "select * from t" and set the break like this:
> 
> break exec_simple_query
> break MemoryContextDelete
>   commands
>p context->name
>c
>   end
> 
> I can see that most of the MemoryContexts are released, but never
> MessageContext; when will it be released?

It's released above exec_simple_query, as it actually contains the data
that led us to call exec_simple_query().  See the main for loop in
PostgresMain():

/*
 * Release storage left over from prior query cycle, and create 
a new
 * query input buffer in the cleared MessageContext.
 */
MemoryContextSwitchTo(MessageContext);
MemoryContextResetAndDeleteChildren(MessageContext);

Greetings,

Andres Freund



Re: NOT IN subquery optimization

2019-02-26 Thread Richard Guo
On Wed, Feb 27, 2019 at 4:52 AM David Rowley 
wrote:

> On Wed, 27 Feb 2019 at 03:07, Jim Finnerty  wrote:
>
> If you're proposing to do that for this thread then I can take my
> planner only patch somewhere else.  I only posted my patch as I pretty
> much already had what I thought you were originally talking about.
> However, please be aware there are current patents around adding
> execution time smarts in this area, so it's probably unlikely you'll
> find a way to do this in the executor that does not infringe on those.
> Probably no committer would want to touch it.  I think my patch covers
> a good number of use cases and as far as I understand, does not go
> near any current patents.
>
Thanks for pointing out the patent concerns. I was not aware of that
before.  Could you please provide some clue as to where I can find more
info about the patents?

Thanks
Richard


When is the MessageContext released?

2019-02-26 Thread Andy Fan
Hi :
  I run a query like "select * from t" and set the break like this:

break exec_simple_query
break MemoryContextDelete
  commands
   p context->name
   c
  end

I can see that most of the MemoryContexts are released, but never
MessageContext; when will it be released?

/*
* Create the memory context we will use in the main loop.
*
* MessageContext is reset once per iteration of the main loop, ie, upon
* completion of processing of each command message from the client.
*/

I'm hacking the code with the latest commit.
https://github.com/postgres/postgres/commit/414a9d3cf34c7aff1c63533df4c40ebb63bd0840


Thanks!


Re: Index INCLUDE vs. Bitmap Index Scan

2019-02-26 Thread Markus Winand



> On 27.02.2019, at 02:00, Justin Pryzby  wrote:
> 
> On Tue, Feb 26, 2019 at 09:07:01PM +0100, Markus Winand wrote:
>> CREATE INDEX idx ON tbl (a, b, c);
>> Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual 
>> time=0.021..0.021 rows=1 loops=1)
>>   Recheck Cond: ((a = 1) AND (c = 1))
>>   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual 
>> time=0.018..0.018 rows=1 loops=1)
>> Index Cond: ((a = 1) AND (c = 1))
>> 
>> (As a side node: I also dislike it how Bitmap Index Scan mixes search 
>> conditions and filters in “Index Cond”)
> 
> I don't think it's mixing them;  it's using index scan on leading *and*
> nonleading column.  That's possible even if frequently not efficient.

The distinction leading / non-leading is very important for performance. Other 
database products use different names in the execution plan so that it is 
immediately visible (without knowing the index definition).

- Oracle: access vs. filter predicates
- SQL Server: “seek predicates” vs. “predicates”
- Db2: START/STOP vs. SARG
- MySQL/MariaDB show how many leading columns of the index are used — the rest 
is just “filtering"

PostgreSQL: no difference visible in the execution plan.

CREATE INDEX idx ON tbl (a,b,c);

EXPLAIN (analyze, buffers)
SELECT *
 FROM tbl
WHERE a = 1
  AND c = 1;

 QUERY PLAN

 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual 
time=0.017..0.018 rows=1 loops=1)
   Recheck Cond: ((a = 1) AND (c = 1))
   Heap Blocks: exact=1
   Buffers: shared hit=1 read=1
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual 
time=0.014..0.014 rows=1 loops=1)
 Index Cond: ((a = 1) AND (c = 1))
 Buffers: shared read=1
 Planning Time: 0.149 ms
 Execution Time: 0.035 ms


DROP INDEX idx;
CREATE INDEX idx ON tbl (a, c, b); -- NOTE: column “c” is second


 QUERY PLAN

 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual 
time=0.013..0.013 rows=1 loops=1)
   Recheck Cond: ((a = 1) AND (c = 1))
   Heap Blocks: exact=1
   Buffers: shared hit=1 read=1
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual 
time=0.009..0.009 rows=1 loops=1)
 Index Cond: ((a = 1) AND (c = 1))
 Buffers: shared read=1
 Planning Time: 0.262 ms
 Execution Time: 0.036 ms
(9 rows)

-markus




Re: TupleTableSlot abstraction

2019-02-26 Thread Andres Freund
Hi,

On 2019-02-27 14:21:52 +0900, Michael Paquier wrote:
> On Sat, Feb 16, 2019 at 05:07:44PM -0500, Jeff Janes wrote:
> > By blind analogy to the changes made to matview.c, I think that createas.c
> > is missing a heap_freetuple, as attached.

First, sorry to have been slow answering. I was whacking around code
further modifying this, and thought I'd just solve the immediate issue
here by committing the followup work that removes getting the tuple out
of the slot entirely. That took longer than planned, so it makes sense
to commit an interim fix.


> I think that you are right.  CTAS and materialized views share a lot
> when it comes to relation creation and initial table loading.  I have
> reproduced the leak and could notice that your fix is correct.  So
> committed.

I'm not so sure that's the architecturally correct fix however. Is it
actually guaranteed, given expanded tuples, toasting, etc, that there's
no other memory leak here? I wonder if we shouldn't work towards using
a short-lived memory context here. Note how e.g. printtup() uses a
short-lived context for its work.
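For anyone not familiar with the printtup() pattern being referenced here,
the shape of it is roughly the following (a sketch with made-up names, not
the committed code; the context would be created in the receiver's rStartup
callback and deleted in rShutdown):

    typedef struct MyReceiverState
    {
        DestReceiver  pub;          /* callbacks visible to the executor */
        MemoryContext perTupleCxt;  /* short-lived, reset after every tuple */
    } MyReceiverState;

    static bool
    myreceiver_receive(TupleTableSlot *slot, DestReceiver *self)
    {
        MyReceiverState *myState = (MyReceiverState *) self;
        MemoryContext    oldcxt;

        oldcxt = MemoryContextSwitchTo(myState->perTupleCxt);

        /* ... receiver-specific per-tuple work that may allocate
         * (copying, detoasting, inserting) goes here ... */

        MemoryContextSwitchTo(oldcxt);

        /* whatever was allocated for this tuple is released here */
        MemoryContextReset(myState->perTupleCxt);

        return true;
    }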

Greetings,

Andres Freund



Re: pgsql: Avoid creation of the free space map for small heap relations, t

2019-02-26 Thread John Naylor
On Wed, Feb 27, 2019 at 5:06 AM Amit Kapila  wrote:
> I have tried this test many times (more than 1000 times) by varying
> thread count, but couldn't reproduce it.  My colleague, Kuntal has
> tried a similar test overnight, but the issue didn't reproduce which
> is not surprising to me seeing the nature of the problem.  As I could
> reproduce it via the debugger, I think we can go ahead with the fix.
> I have improved the comments in the attached patch and I have also
> tried to address Tom's concern w.r.t comments by adding additional
> comments atop of data-structure used to maintain the local map.

The flaw in my thinking was treating extension too similarly to
finding an existing block. With your patch clearing the local map in
the correct place, it seems the call at hio.c:682 is now superfluous?
It wasn't sufficient before, so should this call be removed so as not
to mislead? Or maybe keep it to be safe, with a comment like "This
should already be cleared by now, but make sure it is."

+ * To maintain good performance, we only mark every other page as available
+ * in this map.

I think this implementation detail is not needed to comment on the
struct. I think the pointer to the README is sufficient.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Index INCLUDE vs. Bitmap Index Scan

2019-02-26 Thread Markus Winand



> On 27.02.2019, at 00:22, Tom Lane  wrote:
> 
> Markus Winand  writes:
>> I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, 
>> which it doesn’t at the moment (tested on master as of yesterday).
> 
> Regular index scans don't do what you're imagining either (i.e., check
> filter conditions in advance of visiting the heap).  There's a roadblock
> to implementing such behavior, which is that we might end up applying
> filter expressions to dead rows.  That could make users unhappy.
> For example, given a filter condition like "1.0/c > 0.1", people
> would complain if it still got zero-divide failures even after they'd
> deleted all rows with c=0 from their table.

Ok, but I don’t see how this case different for key columns vs. INCLUDE columns.

When I test this with the (a, b, c) index (no INCLUDE), different plans are 
produced for "c=1" (my original example) vs. "1.0/c > 0.1”.

The second one postpones this condition to the Bitmap Heap Scan.

 QUERY PLAN

 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=4) (actual 
time=0.023..0.028 rows=8 loops=1)
   Recheck Cond: (a = 1)
   Filter: ((1.0 / (c)::numeric) > 0.1)
   Heap Blocks: exact=2
   Buffers: shared hit=3
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual 
time=0.007..0.007 rows=8 loops=1)
 Index Cond: (a = 1)
 Buffers: shared hit=1
 Planning Time: 0.053 ms
 Execution Time: 0.044 ms

I’ve never noticed that behaviour before, but if it is there to prevent the 
exception-on-dead-tuple problem, the same could be applied to INCLUDE columns?

I realise that this will not cover all use cases I can imagine but it would be 
consistent for key and non-key columns. 

-markus




RE: Protect syscache from bloating with negative cache entries

2019-02-26 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>>>* 0001: add dlist_push_tail() ... as is
>>>* 0002: memory accounting, with correction based on feedback
>>>* 0003: merge the original 0003 and 0005, with correction based on
>>>feedback
>>
>>Attached are simpler version based on Horiguchi san's ver15 patch,
>>which means cache is pruned by both time and size.
>>(Still cleanup function is complex but it gets much simpler.)
>
>I don't mean to disregard what Horiguchi san and others have developed and
>discussed.
>But I refactored the v15 patch again to reduce its complexity, because it
>seems to me that one of the reasons for dropping the prune-by-size feature
>stems from code complexity.
>
>Another thing is that memory accounting overhead has been discussed, but
>the overhead hasn't actually been measured in this thread. So I'd like to
>measure it.

I measured the memory context accounting overhead using Tomas's tool
palloc_bench, which he made a while ago in a similar discussion.
https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz 

This tool is a little bit outdated so I fixed it, but basically I followed
his approach. Things I did (see the sketch below):
- make one MemoryContext
- run both palloc() and pfree() on a 32kB area 1,000,000 times
- measure the elapsed time
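Roughly, the measured loop looks like the following (my reconstruction of
the benchmark, not Tomas's exact code):

    MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
                                              "accounting bench",
                                              ALLOCSET_DEFAULT_SIZES);

    for (int i = 0; i < 1000000; i++)
    {
        char   *p = MemoryContextAlloc(cxt, 32 * 1024);

        pfree(p);
    }
    MemoryContextDelete(cxt);

With the v15 patch, every MemoryContextAlloc()/pfree() pair also updates
the accounting counters, which is what this loop stresses.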

The result shows that master is 30 times faster than the patched one,
so as Andres mentioned upthread, it seems the accounting has overhead.

[master (without v15 patch)]
61.52 ms
60.96 ms
61.40 ms
61.42 ms
61.14 ms

[with v15 patch]
1838.02 ms
1754.84 ms
1755.83 ms
1789.69 ms
1789.44 ms

Regards,
Takeshi Ideriha


Re: TupleTableSlot abstraction

2019-02-26 Thread Michael Paquier
On Sat, Feb 16, 2019 at 05:07:44PM -0500, Jeff Janes wrote:
> By blind analogy to the changes made to matview.c, I think that createas.c
> is missing a heap_freetuple, as attached.

I think that you are right.  CTAS and materialized views share a lot
when it comes to relation creation and initial table loading.  I have
reproduced the leak and could notice that your fix is correct.  So
committed.
--
Michael




Re: pgsql: Avoid creation of the free space map for small heap relations, t

2019-02-26 Thread John Naylor
On Tue, Feb 26, 2019 at 10:28 AM Amit Kapila  wrote:
> John, others, can you review my findings and patch?

I can confirm your findings with your debugging steps. Nice work!

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Early WIP/PoC for inlining CTEs

2019-02-26 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 >> I also thought about that. But what I thought about it on reflection
 >> was: if the user explicitly wrote NOT MATERIALIZED, then we should
 >> assume they mean it.

 Tom> Ah, but the example I gave also had MATERIALIZED on the inner WITH.
 Tom> Why should the user not also mean that?

The inner WITH does get materialized, it just gets materialized twice.
If the user doesn't want that, then they can avoid using NOT MATERIALIZED
on the outer CTE; but if we force it to materialize the outer query,
then that leaves the user without recourse.

Consider a case like:

create view foo as
  with s as materialized (select something)
  select * from large l
  where l.foo in (select * from s) or l.bar in (select * from s);

with
  bar as not materialized (select * from foo)
select * from bar b1, bar b2 where b1.col='x' and b2.col='y';

In a case like this, materializing "s" twice may be far less expensive
than materializing the result of "select * from large..." without
benefit of pushed-down quals.

-- 
Andrew (irc:RhodiumToad)



Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-02-26 Thread Pavan Deolasee
On Wed, Feb 27, 2019 at 7:05 AM Jeff Janes  wrote:

>
>
> After doing a truncation and '\copy ... with (freeze)' of a table with
> long data, I find that the associated toast table has a handful of unfrozen
> blocks.  I don't know if that is an actual problem, but it does seem a bit
> odd, and thus suspicious.
>
>
Hi Jeff, thanks for looking at it and the test. I can reproduce the problem,
and quite curiously block number 1 and then every 32672nd block is getting
skipped.

postgres=# select * from pg_visibility('pg_toast.pg_toast_16384') where all_visible = 'f';
 blkno  | all_visible | all_frozen | pd_all_visible
--------+-------------+------------+----------------
      1 | f           | f          | f
  32673 | f           | f          | f
  65345 | f           | f          | f
  98017 | f           | f          | f
 130689 | f           | f          | f
 163361 | f           | f          | f

Having investigated this a bit, I see that a relcache invalidation arrives
after the 1st and then after every 32672nd block is filled. That clears the
rel->rd_smgr field and we lose the information about the saved target
block. The code then moves to extend the relation again and thus skips the
previously less-than-half filled block, losing the free space in that block.

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_16384', 0));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/15B37748 |        0 |     4 |    40 |    64 |    8192 |     8192 |       4 |         0
(1 row)

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_16384', 1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/15B39A28 |        0 |     4 |    28 |  7640 |    8192 |     8192 |       4 |         0
(1 row)

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_16384', 2));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/15B3BE08 |        0 |     4 |    40 |    64 |    8192 |     8192 |       4 |         0
(1 row)

So the block 1 has a large amount of free space (upper - lower), which
never gets filled.

I am not yet sure what causes the relcache invalidation at regular
intervals. But if I have to guess, it could be because of a new VM (or
FSM?) page getting allocated. I am a bit puzzled because this issue seems to
occur only with toast tables; I tested the patch on a regular table while
writing it and did not see any block remaining unfrozen. I tested only
up to 450 blocks, but that shouldn't matter because with your test, we see
the problem with block 1 as well. So something to look into in more detail.

While we could potentially fix this by what you'd done in the original
patch and what Kuntal also suggested, i.e. by setting the PD_ALL_VISIBLE
bit during page initialisation itself, I am a bit circumspect about that
approach for two reasons:

1. It requires us to then add extra logic to avoid clearing the bit during
insertions
2. It requires us to also update the VM bit during page init or risk having
divergent views on the page-level bit and the VM bit.

And even if we do that, this newly discovered problem of less-than-half
filled intermediate blocks remains. I wonder if we should instead track the
last used block in BulkInsertState and, if the relcache invalidation flushes
the smgr, start inserting again from the last saved block. In fact, we already
track the last used buffer in BulkInsertState and that's enough to know the
last used block.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Early WIP/PoC for inlining CTEs

2019-02-26 Thread Tom Lane
Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:
>  Tom> Also, I thought of a somewhat-related scenario that the code isn't
>  Tom> accounting for: you can break the restrictions about single
>  Tom> evaluation with nested WITHs, like

> I also thought about that. But what I thought about it on reflection
> was: if the user explicitly wrote NOT MATERIALIZED, then we should
> assume they mean it.

Ah, but the example I gave also had MATERIALIZED on the inner WITH.
Why should the user not also mean that?

regards, tom lane



RE: Timeout parameters

2019-02-26 Thread Nagaura, Ryohei
Hi, kirk-san.

Thank you for review.

> From: Jamison, Kirk 
> Your socket_timeout patch still does not apply either with git or patch 
> command. It
> says it's still corrupted.
> I'm not sure about the workaround, because the --ignore-space-change and
> --ignore-whitespace did not work for me.
> Maybe it might have something to do with your editor when creating the patch.
> Could you confirm?
> The CFbot is also another way to check if your latest patches work.
> http://commitfest.cputube.org/
It may have been caused by running the git diff command with no options.
The new patches are created with git diff -w.

About socket_timeout:
Your feedback looks good, so I adopted it. Please confirm.

About TCP_USER_TIMEOUT:
There are no changes to the sources;
only the command changed from "git diff" to "git diff -w".

Best regards,
-
Ryohei Nagaura



TCP_backend_v7.patch
Description: TCP_backend_v7.patch


TCP_interface_v7.patch
Description: TCP_interface_v7.patch


socket_timeout_v7.patch
Description: socket_timeout_v7.patch


Re: pgsql: Avoid creation of the free space map for small heap relations, t

2019-02-26 Thread Amit Kapila
On Tue, Feb 26, 2019 at 2:58 PM Amit Kapila  wrote:
>
> On Mon, Feb 25, 2019 at 10:32 PM Tom Lane  wrote:
> >
>
> To fix this symptom, we can ensure that once we didn't get any block
> from local map, we must clear it.  See the attached patch.  I will try
> to evaluate this code path to see if there is any similar race
> condition and will also try to generate a reproducer.  I don't have
> any great idea to write a reproducer for this issue apart from trying
> some thing similar to ecpg/test/thread/prep.pgc, if you can think of
> any, I am all ears.
>

I have tried this test many times (more than 1000 times) by varying
thread count, but couldn't reproduce it.  My colleague, Kuntal has
tried a similar test overnight, but the issue didn't reproduce which
is not surprising to me seeing the nature of the problem.  As I could
reproduce it via the debugger, I think we can go ahead with the fix.
I have improved the comments in the attached patch and I have also
tried to address Tom's concern w.r.t comments by adding additional
comments atop of data-structure used to maintain the local map.

Let me know what you think?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_loc_map_clear_2.patch
Description: Binary data


Re: Problem with default partition pruning

2019-02-26 Thread Amit Langote
Hosoya-san,

On 2019/02/22 17:14, Yuzuko Hosoya wrote:
> Hi,
> 
> I found the bug of default partition pruning when executing a range query.
> 
> -
> postgres=# create table test1(id int, val text) partition by range (id); 
> postgres=# create table test1_1 partition of test1 for values from (0) to 
> (100); 
> postgres=# create table test1_2 partition of test1 for values from (150) to 
> (200);
> postgres=# create table test1_def partition of test1 default; 
> 
> postgres=# explain select * from test1 where id > 0 and id < 30;
>QUERY PLAN   
> 
>  Append  (cost=0.00..11.83 rows=59 width=11)
>->  Seq Scan on test1_1  (cost=0.00..5.00 rows=58 width=11)
>  Filter: ((id > 0) AND (id < 30))
>->  Seq Scan on test1_def  (cost=0.00..6.53 rows=1 width=12)
>  Filter: ((id > 0) AND (id < 30))
> (5 rows)
> 
> There is no need to scan the default partition, but it's scanned.
> -
> 
> In the current implementation, whether the default partition is scanned
> or not is determined according to each condition of given WHERE
> clause at get_matching_range_bounds().  In this example, scan_default
> is set true according to id > 0 because id >= 200 matches the default
> partition.  Similarly, according to id < 30, scan_default is set true.
> Then, these results are combined according to AND/OR at 
> perform_pruning_combine_step().
> In this case, final result's scan_default is set true.
> 
> The modifications I made are as follows:
> - get_matching_range_bounds() determines only offsets of range bounds
>   according to each condition 
> - These results are combined at perform_pruning_combine_step()
> - Whether the default partition is scanned or not is determined at 
>   get_matching_partitions()
> 
> Attached the patch.  Any feedback is greatly appreciated.

Thank you for reporting.  Can you please add this to March CF in Bugs
category so as not to lose track of this?

I will try to send review comments soon.

Regards,
Amit




Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's

2019-02-26 Thread James Coleman
On Mon, Feb 18, 2019 at 4:53 PM Tom Lane  wrote:
> So ... this is actively wrong.
>
> This case shows that you can't ignore the empty-array possibility
> for a ScalarArrayOpExpr with useOr = false, because
> "SELECT null::int = all(array[]::int[])" yields TRUE:
>
> contrib_regression=# select * from test_predtest($$
> select x is not null, x = all(array(select i from generate_series(1,0) i))
> from integers
> $$);
> WARNING:  strong_implied_by result is incorrect
> -[ RECORD 1 ]-+--
> strong_implied_by | t
> weak_implied_by   | f
> strong_refuted_by | f
> weak_refuted_by   | f
> s_i_holds | f
> w_i_holds | f
> s_r_holds | f
> w_r_holds | f
>
> This case shows that you can't ignore the distinction between null
> and false results once you've descended through strict function(s):
>
> contrib_regression=# select * from test_predtest($$
> select x is not null, strictf(true, x = any(array(select i from 
> generate_series(1,0) i)))
> from integers
> $$);
> WARNING:  strong_implied_by result is incorrect
> -[ RECORD 1 ]-+--
> strong_implied_by | t
> weak_implied_by   | f
> strong_refuted_by | f
> weak_refuted_by   | f
> s_i_holds | f
> w_i_holds | f
> s_r_holds | f
> w_r_holds | f
>
> (In this usage, the strictf() function from the test_predtest.sql
> will simply return the NOT of its second input; so it gives different
> answers depending on whether the SAOP returned false or null.)
>
> I think you could probably repair the second problem by adding
> an additional argument to clause_is_strict_for() indicating whether
> it has to prove the clause to be NULL or merely non-TRUE; when
> recursing you'd always have to ask for the former.

I've implemented this fix as suggested. The added argument I've
initially called "allow_false"; I don't love the name, but it is
directly what it does. I'd appreciate other suggestions (if you prefer
it change).

> The first problem could be resolved by insisting on being able to
> prove the array non-empty when !useOr, but I'm not sure if it's
> really worth the trouble, as opposed to just not trying to prove
> anything for !useOr cases.  I'm not sure that "x op ALL(ARRAY)"
> occurs often enough in the wild to justify expending code on.
>
> The other case where being able to prove the array nonempty might
> be worth something is when you've recursed and hence need to be
> able to prove the SAOP's result to be NULL not just not-TRUE.
> Again though, I don't know how often that occurs in practice,
> so even the combination of those two motivations might not be
> worth having code to check the array argument.

I've implemented this also; the thing that puts it over the edge for
me on the question of "is it worth it in the wild" is cases like "x
!op ALL(ARRAY)", since it seems plausible to me that that's (unlike
the non-negated case) a reasonably common operation. At the very least
it seems likely to be a more logically interesting operation, which
I'm taking as a proxy for common, I suppose.

> It'd also be worth spending some brain cells on what happens
> when the SAOP's array argument is null overall, which causes
> its result to be null (not false).  Maybe the logic goes
> through without needing any explicit test for that case
> (and indeed I couldn't break it in testing that), but it'd
> likely be worth a comment.

Since all SAOPs (both ALL and ANY) return NULL when the array is NULL
regardless of the LHS argument, this case qualifies as strict.
I've included this in the above fix since NULL constant arrays and
non-empty arrays can both be proven strict with strict operators.

Ideally I think this case would be handled elsewhere in the optimizer
by directly replacing the saop and a constant NULL array with NULL,
but this isn't the patch to do that, and at any rate I'd have to do a
bit more investigation to understand how to do such (if it's easily
possible).

> I don't especially care for the proposed function name
> "clause_proves_for_null_test".  The only thing that brings to my
> mind is the question "proves WHAT, exactly?" --- and as these
> examples show, being crystal clear on what it proves is essential.

Given the new argument, I've reverted the name change.

I also updated the tests with a new helper function "opaque_array" to
make it easy to test cases where the planner can't inspect the array.
That way we don't need to rely on CTEs as an optimization fence --
thus improving both maintainability and readability. I also noticed
there were also some tests where the array was already opaque yet I
was still using CTEs as an optimization fence; I've cleaned those up.

I've updated the comments significantly to reflect the above changes.

I've attached an updated (and rebased) patch.

James Coleman
commit 2cf8f951ca6524d4eb4c927d9903736a0751c89c
Author: jcoleman 
Date:   Wed Nov 14 16:49:52 2018 +

Extend IS [NOT] NULL predicate proofs for arrays

For the purposes of predicate 

Re: Pluggable Storage - Andres's take

2019-02-26 Thread Haribabu Kommi
On Wed, Feb 27, 2019 at 11:10 AM Andres Freund  wrote:

> Hi,
>
> On 2019-01-21 10:32:37 +1100, Haribabu Kommi wrote:
> > I am not able to remove the complete t_tableOid from HeapTuple,
> > because of its use in triggers, as the slot is not available in triggers
> > and I need to store the tableOid also as part of the tuple.
>
> What precisely do you man by "use in triggers"? You mean that a trigger
> might access a HeapTuple's t_tableOid directly, even though all of the
> information is available in the trigger context?
>

I forgot the exact scenario, but during trigger function execution, the
pl/pgsql executor accesses TableOidAttributeNumber from the stored tuple
using the heap_get* functions. Because of the lack of slot support in
triggers, we still need to maintain t_tableOid with the proper OID. The
heap tuple's t_tableOid member is updated whenever the heap tuple is
generated from a slot.
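In other words, whenever a heap tuple is materialized from a slot for the
trigger machinery, something along these lines has to happen (a sketch;
names approximate, not the exact patch code):

    /* keep t_tableOid usable for code that still reads it directly */
    HeapTuple   tuple = ExecCopySlotHeapTuple(slot);    /* or equivalent copy */

    tuple->t_tableOid = RelationGetRelid(relinfo->ri_RelationDesc);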

Regards,
Haribabu Kommi
Fujitsu Australia


RE: Timeout parameters

2019-02-26 Thread Jamison, Kirk
Hi Nagaura-san,

Your socket_timeout patch still does not apply either with
git or patch command. It says it's still corrupted.
I'm not sure about the workaround, because the --ignore-space-change
and --ignore-whitespace did not work for me.
Maybe it might have something to do with your editor when creating
the patch. Could you confirm?
The CFbot is also another way to check if your latest patches work.
http://commitfest.cputube.org/

On Tuesday, February 26, 2019 5:16PM (GMT+9), Nagaura, Ryohei wrote:
> "communication inactivity" seems to be a little extreme.
> If the communication layer is truly dead you will use keepalive.
> This use case is when socket option is not available for some reason.
> So it would be better "terminating the connection" in my thought.

About the doc. Alright. 
- There's a typo in the doc: connectoin --> connection
- In the code, there's a part where a value between 0 and 2 seconds
is interpreted as 2 because that's the minimum time. I think
the comment should be improved, using insist (force) instead
of "need". This should also be added in the documentation.

From:
/*
 * Rounding could cause communication to fail; need at least 2 secs
 */

To:
/*
 * Rounding could cause communication to fail;
 * insist on at least two seconds.
 */

Docs
- added how it should be written as decimal integer, similar
to connect_timeout
- used closing instead of terminating, to be consistent with
other params
- added the minimum allowed timeout

socket_timeout

Controls the number of seconds (write as a decimal integer, e.g. 10) that
the client waits for individual socket read/write operations before
closing the connection. This can be used both as a forced global query
timeout and as a network problem detector. A value of zero (the default)
turns this off, which means wait indefinitely. The minimum allowed
timeout is 2 seconds, so a value of 1 is interpreted as 2.
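As a usage illustration to go with that text (assuming the parameter goes in
under the proposed name), a client would set it like this:

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* give up on any single socket read/write after 10 seconds */
        const char *keywords[] = {"host", "dbname", "socket_timeout", NULL};
        const char *values[] = {"db.example", "postgres", "10", NULL};
        PGconn     *conn = PQconnectdbParams(keywords, values, 0);

        if (PQstatus(conn) != CONNECTION_OK)
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

        PQfinish(conn);
        return 0;
    }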

> About TCP_USER_TIMEOUT patches, there are only miscellaneous changes:
> removing trailing spaces and making comments of parameters lower case
> as you pointed out.
Thanks for the update.
Confirmed the change.


Regards,
Kirk Jamison



Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-02-26 Thread Jeff Janes
On Thu, Feb 21, 2019 at 1:05 AM Pavan Deolasee 
wrote:

> Hi,
>
> Jeff Janes raised an issue [1] about PD_ALL_VISIBLE not being set
> correctly while loading data via COPY FREEZE and had also posted a draft
> patch.
>
> I now have what I think is a more complete patch. I took a slightly
> different approach and instead of setting PD_ALL_VISIBLE bit initially and
> then not clearing it during insertion, we now recheck the page for
> all-frozen, all-visible tuples just before switching to a new page. This
> allows us to then also mark set the visibility map bit, like we do in
> vacuumlazy.c
>
> Some special treatment is required to handle the last page before bulk
> insert it shutdown. We could have chosen not to do anything special for the
> last page and let it remain unfrozen, but I thought it makes sense to take
> that extra effort so that we can completely freeze the table and set all VM
> bits at the end of COPY FREEZE.
>
> Let me know what you think.
>

Hi Pavan, thanks for picking this up.

After doing a truncation and '\copy ... with (freeze)' of a table with long
data, I find that the associated toast table has a handful of unfrozen
blocks.  I don't know if that is an actual problem, but it does seem a bit
odd, and thus suspicious.

perl -le 'print join "", map rand(), 1..500 foreach 1..100' > foo

create table foobar1 (x text);
begin;
truncate foobar1;
\copy foobar1 from foo with (freeze)
commit;
select all_visible,all_frozen,pd_all_visible, count(*) from
pg_visibility('pg_toast.pg_toast_25085') group by 1,2,3;
 all_visible | all_frozen | pd_all_visible |  count
-------------+------------+----------------+---------
 f   | f  | f  |  18
 t   | t  | t  | 530,361
(2 rows)

Cheers,

Jeff

>


Re: Early WIP/PoC for inlining CTEs

2019-02-26 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> Also, I thought of a somewhat-related scenario that the code isn't
 Tom> accounting for: you can break the restrictions about single
 Tom> evaluation with nested WITHs, like

I also thought about that. But what I thought about it on reflection
was: if the user explicitly wrote NOT MATERIALIZED, then we should
assume they mean it.

-- 
Andrew (irc:RhodiumToad)



Re: New vacuum option to do only freezing

2019-02-26 Thread Bossart, Nathan
Sorry for the delay.  I finally got a chance to look through the
latest patches.

On 2/3/19, 1:48 PM, "Masahiko Sawada"  wrote:
> On Fri, Feb 1, 2019 at 11:43 PM Bossart, Nathan  wrote:
>>
>> +   if (skip_index_vacuum)
>> +       ereport(elevel,
>> +               (errmsg("\"%s\": marked %.0f row versions as dead in %u pages",
>> +                       RelationGetRelationName(onerel),
>> +                       tups_vacuumed, vacuumed_pages)));
>>
>> IIUC tups_vacuumed will include tuples removed during HOT pruning, so
>> it could be inaccurate to say all of these tuples have only been
>> marked "dead."  Perhaps we could keep a separate count of tuples
>> removed via HOT pruning in case we are using DISABLE_INDEX_CLEANUP.
>> There might be similar problems with the values stored in vacrelstats
>> that are used at the end of heap_vacuum_rel() (e.g. tuples_deleted).
>
> Yeah, tups_vacuumed includes such tuples so the message is inaccurate.
> So I think that we should not change the message but we can report the
> dead item pointers we left in errdetail. That's more accurate and
> would help the user.
>
> postgres(1:1130)=# vacuum (verbose, disable_index_cleanup) test;
> INFO:  vacuuming "public.test"
> INFO:  "test": removed 49 row versions in 1 pages
> INFO:  "test": found 49 removable, 51 nonremovable row versions in 1
> out of 1 pages
> DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 509
> There were 0 unused item pointers.
> Skipped 0 pages due to buffer pins, 0 frozen pages.
> 0 pages are entirely empty.
> 49 tuples are left as dead.
> CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
> VACUUM

This seems reasonable to me.

The current version of the patches builds cleanly, passes 'make
check-world', and seems to work well in my own manual tests.  I have a
number of small suggestions, but I think this will be ready-for-
committer soon.

+   Assert(!skip_index_vacuum);

There are two places in lazy_scan_heap() that I see this without any
comment.  Can we add a short comment explaining why this should be
true at these points?

+   if (skip_index_vacuum)
+   appendStringInfo(, ngettext("%.0f tuple is left as dead.\n",
+   "%.0f tuples are left as dead.\n",
+   nleft),
+nleft);

I think we could emit this metric for all cases, not only when
DISABLE_INDEX_CLEANUP is used.

+   /*
+* Remove tuples from heap if the table has no index.  If the table
+* has index but index vacuum is disabled, we don't vacuum and forget
+* them. The vacrelstats->dead_tuples could have tuples which became
+* dead after checked at HOT-pruning time which are handled by
+* lazy_vacuum_page() but we don't worry about handling those because
+* it's a very rare condition and these would not be a large number.
+*/

Based on this, it sounds like nleft could be inaccurate.  Do you think
it is worth adjusting the log message to reflect that, or is this
discrepancy something we should just live with?  I think adding
something like "at least N tuples left marked dead" is arguably a bit
misleading, since the only time the number is actually higher is when
this "very rare condition" occurs.

+   /*
+* Skip index vacuum if it's requested for table with indexes. In this
+* case we use the one-pass strategy and don't remove tuple storage.
+*/
+   skip_index_vacuum =
+   (options & VACOPT_DISABLE_INDEX_CLEANUP) != 0 && vacrelstats->hasindex;

AFAICT we don't actually need to adjust this based on
vacrelstats->hasindex because we are already checking for indexes
everywhere we check for this option.  What do you think about leaving
that part out?

+   if (vacopts->disable_index_cleanup)
+   {
+   /* DISABLE_PAGE_SKIPPING is supported since 12 */
+   Assert(serverVersion >= 12);
+   appendPQExpBuffer(sql, "%sDISABLE_INDEX_CLEANUP", sep);
+   sep = comma;
+   }

s/DISABLE_PAGE_SKIPPING/DISABLE_INDEX_CLEANUP/

+   printf(_("  --disable-index-cleanup disable index vacuuming and index clenaup\n"));

s/clenaup/cleanup/

Nathan



Re: Index INCLUDE vs. Bitmap Index Scan

2019-02-26 Thread Justin Pryzby
On Tue, Feb 26, 2019 at 09:07:01PM +0100, Markus Winand wrote:
> CREATE INDEX idx ON tbl (a, b, c);
>  Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.021..0.021 rows=1 loops=1)
>Recheck Cond: ((a = 1) AND (c = 1))
>->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)
>  Index Cond: ((a = 1) AND (c = 1))
> 
> (As a side node: I also dislike it how Bitmap Index Scan mixes search 
> conditions and filters in “Index Cond”)

I don't think it's mixing them;  it's using index scan on leading *and*
nonleading column.  That's possible even if frequently not efficient.

Justin



Re: NOT IN subquery optimization

2019-02-26 Thread David Rowley
On Wed, 27 Feb 2019 at 13:41, Li, Zheng  wrote:
> We still check for inner side's nullability, when it is nullable we
> append a "var is NULL" to the anti join condition. So every outer
> tuple is going to evaluate to true on the join condition when there
> is indeed a null entry in the inner.

That's possible, at least providing the var is NULL is an OR
condition, but the problem there is that you force the plan into a
nested loop join. That's unfortunately not going to perform very well
when the number of rows to process is large.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: NOT IN subquery optimization

2019-02-26 Thread Li, Zheng
I'm totally fine with setting the target to PG13.

--
I'm interested to know how this works without testing for inner
nullability.  If any of the inner side's join exprs are NULL then no
records can match. What do you propose to work around that?
--

We still check for the inner side's nullability; when it is nullable we
append a "var is NULL" to the anti join condition. So every outer
tuple is going to evaluate to true on the join condition when there
is indeed a null entry in the inner.
Actually I think the nested loop anti join can end early in this case,
but I haven't found a way to do it properly; this may be one other reason
why we need a new join type for NOT IN.

e.g.
explain select count(*) from s where u not in (select n from l);
 QUERY PLAN

 Aggregate  (cost=2892.88..2892.89 rows=1 width=8)
   ->  Nested Loop Anti Join  (cost=258.87..2892.88 rows=1 width=0)
 ->  Seq Scan on s  (cost=0.00..1.11 rows=11 width=4)
 ->  Bitmap Heap Scan on l  (cost=258.87..262.88 rows=1 width=4)
   Recheck Cond: ((s.u = n) OR (n IS NULL))
   ->  BitmapOr  (cost=258.87..258.87 rows=1 width=0)
 ->  Bitmap Index Scan on l_n  (cost=0.00..4.43 rows=1 width=0)
   Index Cond: (s.u = n)
 ->  Bitmap Index Scan on l_n  (cost=0.00..4.43 rows=1 width=0)
   Index Cond: (n IS NULL)

Zheng



Re: NOT IN subquery optimization

2019-02-26 Thread David Rowley
On Wed, 27 Feb 2019 at 13:13, Tom Lane  wrote:
>
> "Li, Zheng"  writes:
> > However, given that the March CommitFest is imminent and the runtime smarts 
> > patent concerns David had pointed out (which I was not aware of before), we 
> > would not move that direction at the moment.
>
> > I propose that we collaborate to build one patch from the two patches 
> > submitted in this thread for the CF.
>
> TBH, I think it's very unlikely that any patch for this will be seriously
> considered for commit in v12.  It would be against project policy, and
> spending a lot of time reviewing the patch would be quite unfair to other
> patches that have been in the queue longer.  Therefore, I'd suggest that
> you not bend things out of shape just to have some patch to submit before
> March 1.  It'd be better to work with the goal of having a coherent patch
> ready for the first v13 CF, probably July-ish.

FWIW, I did add this to the March CF, but I set the target version to
13.  I wasn't considering this for PG12. I see Zheng was, but I agree
with you on PG13 being the target for this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: NOT IN subquery optimization

2019-02-26 Thread David Rowley
On Wed, 27 Feb 2019 at 13:05, Li, Zheng  wrote:
> -With the latest fix (for the empty table case), our patch does the 
> transformation as long as the outer is non-nullable regardless of the inner 
> nullability, experiments show that the results are always faster.

Hi Zheng,

I'm interested to know how this works without testing for inner
nullability.  If any of the inner side's join exprs are NULL then no
records can match. What do you propose to work around that?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Autovaccuum vs temp tables crash

2019-02-26 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Feb-23, Tom Lane wrote:
>> However, if someone held a gun to my head and said fix it, I'd be inclined
>> to do so by having temp-namespace creation insert a "pin" dependency into
>> pg_depend.  Arguably, the only reason we don't create all the temp
>> namespaces during bootstrap is because we aren't sure how many we'd need
>> --- but if we did do that, they'd presumably end up pinned.

> Is there a problem if we start with very high max_backends and this pins
> a few thousands schemas that are later no longer needed?  There's no
> decent way to drop them ... (I'm not sure it matters all that much,
> except for bloat in pg_namespace.)

> How about hardcoding a pin for any schema that's within the current
> max_backends?

I remain skeptical that there's a problem here that so badly needs
fixed as to justify half-baked hacks in the dependency system.  We'd
be more likely to create problems than fix them.

The existing state of affairs is that a superuser who really needs to drop
a temp schema can do so, if she's careful that it's not active.  Pinning
things would break that, or at least add an additional roadblock.  If it's
some sort of virtual pin rather than a regular pg_depend entry, then it
*would* be impossible to get around (mumble ... DELETE FROM pg_namespace
... mumble).  As against that, what problem are we fixing by preventing
superusers from doing that?  A careless superuser can screw things up
arbitrarily badly in any case, so I'm not that fussed about the hazard
that the namespace isn't idle.

regards, tom lane



Re: NOT IN subquery optimization

2019-02-26 Thread Tom Lane
"Li, Zheng"  writes:
> However, given that the March CommitFest is imminent and the runtime smarts 
> patent concerns David had pointed out (which I was not aware of before), we 
> would not move that direction at the moment.

> I propose that we collaborate to build one patch from the two patches 
> submitted in this thread for the CF.

TBH, I think it's very unlikely that any patch for this will be seriously
considered for commit in v12.  It would be against project policy, and
spending a lot of time reviewing the patch would be quite unfair to other
patches that have been in the queue longer.  Therefore, I'd suggest that
you not bend things out of shape just to have some patch to submit before
March 1.  It'd be better to work with the goal of having a coherent patch
ready for the first v13 CF, probably July-ish.

regards, tom lane



Re: Pluggable Storage - Andres's take

2019-02-26 Thread Andres Freund
Hi,

On 2019-01-21 10:32:37 +1100, Haribabu Kommi wrote:
> I am not able to remove the complete t_tableOid from HeapTuple,
> because of its use in triggers, as the slot is not available in triggers
> and I need to store the tableOid also as part of the tuple.

What precisely do you mean by "use in triggers"? You mean that a trigger
might access a HeapTuple's t_tableOid directly, even though all of the
information is available in the trigger context?

Greetings,

Andres Freund



Re: Allowing extensions to supply operator-/function-specific info

2019-02-26 Thread Tom Lane
Paul Ramsey  writes:
>> On Feb 26, 2019, at 2:19 PM, Tom Lane  wrote:
>> In most cases, multiple matching arguments are going to lead to
>> failure to construct any useful index condition, because your
>> comparison value has to be a pseudoconstant (ie, not a variable
>> from the same table, so in both of the above examples there's
>> no function argument you could compare to).  

> This term “pseudoconstant” has been causing me some worry as it crops up
> in your explanations a fair amount.

It is defined in the documentation, but what it boils down to is that
your comparison value can't contain either (1) variables from the same
table the index is on or (2) volatile functions.  There is a function
defined in optimizer.h that can check that for you, so you don't have
to worry too much about the details.

> I expect to have queries of the form

>   SELECT a.*, b.*  
>   FROM a
>   JOIN b
>   ON ST_Intersects(a.geom, b.geom)

Sure, that's fine.  If there are indexes on both a.geom and b.geom,
you'll get separate opportunities to match to each of those, and
what you'd be constructing in each case is an indexqual that has to be
used on the inner side of a nestloop join (so that the outer side can
provide the comparison value).  What's not fine is "WHERE
ST_Intersects(a.geom, a.othergeom)" ... you can't make an indexscan
out of that, at least not with the && operator.
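
To make the distinction concrete, two illustrative queries using the same
table and column names as above (assuming && is the indexable operator the
support function would construct):

-- Indexable: for an index on a.geom, b.geom supplies the comparison value
-- on the inner side of a nestloop (and vice versa for an index on b.geom).
SELECT a.*, b.* FROM a JOIN b ON ST_Intersects(a.geom, b.geom);

-- Not indexable via &&: both arguments come from the same table, so there
-- is no pseudoconstant to compare the indexed column against.
SELECT * FROM a WHERE ST_Intersects(a.geom, a.othergeom);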

regards, tom lane



Re: NOT IN subquery optimization

2019-02-26 Thread Li, Zheng
I agree we will need some runtime smarts (such as a new anti join type as 
pointed out by Richard) to "ultimately" account for all the cases of NOT IN 
queries.

However, given that the March CommitFest is imminent and the runtime smarts 
patent concerns David had pointed out (which I was not aware of before), we 
would not move that direction at the moment.

I propose that we collaborate to build one patch from the two patches submitted 
in this thread for the CF. The two patches are for the same purpose and 
similar. However, they differ in the following ways as far as I can tell:

Nullability Test:
-David's patch uses strict predicates for nullability test.
-Our patch doesn't use strict predicates, but it accounts for COALESCE and 
null-padded rows from outer join. In addition, we made reduce_outer_joins() 
work before the transformation, which makes the nullability test more accurate.

Anti Join Transformation:
-David's patch does the transformation when both inner and outer outputs are 
non-nullable.
-With the latest fix (for the empty table case), our patch does the 
transformation as long as the outer is non-nullable regardless of the inner 
nullability; experiments show that the results are always faster.

David, please let me know what you think. If you would like to collaborate, 
I'll start merging with your code on using strict predicates to make a better 
Nullability Test.

Thanks,
Zheng



Re: Allowing extensions to supply operator-/function-specific info

2019-02-26 Thread Paul Ramsey


> On Feb 26, 2019, at 2:19 PM, Tom Lane  wrote:
> 
> In most cases, multiple matching arguments are going to lead to
> failure to construct any useful index condition, because your
> comparison value has to be a pseudoconstant (ie, not a variable
> from the same table, so in both of the above examples there's
> no function argument you could compare to).  

This term “pseudoconstant” has been causing me some worry as it crops up in 
your explanations a fair amount. I expect to have queries of the form

  SELECT a.*, b.*  
  FROM a
  JOIN b
  ON ST_Intersects(a.geom, b.geom)

And I expect to be able to rewrite that in terms of having an additional call 
to the index operator (&&) and there won’t be a constant on either side of the 
operator. Am I mis-understanding the term, or are there issues with using this 
API in a join context?

P.

> But we don't prejudge
> that, because it's possible that a function with 3 or more arguments
> could produce something useful anyway.  For instance, if what we've
> got is "f(x, y, constant)" then it's possible that the semantics of
> the function are such that y can be ignored and we can make something
> indexable like "x && constant".  All this is the support function's
> job to know.

>   regards, tom lane




Re: Segfault when restoring -Fd dump on current HEAD

2019-02-26 Thread Tom Lane
Alvaro Herrera  writes:
> I think it would be better to just put back the .defn = "" (etc) to the
> ArchiveEntry calls.

Yeah, that was what I was imagining --- just make the ArchiveEntry calls
act exactly like they did before in terms of what gets filled into the
TOC fields.  This episode is a fine reminder of why premature optimization
is the root of all evil...

regards, tom lane



Re: POC: converting Lists into arrays

2019-02-26 Thread Tom Lane
I wrote:
> I had an idea that perhaps is worth considering --- upthread I rejected
> the idea of deleting lnext() entirely, but what if we did so?  We could
> redefine foreach() to not use it:

> #define foreach(cell, l) \
> for (int cell##__index = 0; \
>  (cell = list_nth_cell(l, cell##__index)) != NULL; \
>  cell##__index++)

> I think this would go a very long way towards eliminating the hazards
> associated with iterating around a list-modification operation.

I spent some time trying to modify my patch to work that way, but
abandoned the idea before completing it, because it became pretty
clear that it is a bad idea.  There are at least two issues:

1. In many places, the patch as I had it only required adding an
additional parameter to lnext() calls.  Removing lnext() altogether is
far more invasive, requiring restructuring loop logic in many places
where we otherwise wouldn't need to.  Since most of the point of this
proposal is to not have a larger patch footprint than we have to, that
seemed like a step in the wrong direction.

2. While making foreach() work this way might possibly help in avoiding
writing bad code in the first place, a loop of this form is really
just about as vulnerable to being broken by list insertions/deletions
as what I had before.  If you don't make suitable adjustments to the
integer index after an insertion/deletion then you're liable to skip
over, or double-visit, some list entries; and there's nothing here to
help you notice that you need to do that.  Worse, doing things like
this utterly destroys our chance of detecting such errors, because
there's no state being kept that's in any way checkable.

I was driven to realize point 2 by noticing, while trying to get rid
of some lnext calls, that I'd mostly failed in the v1 patch to fix
loops that contain embedded list_delete() calls other than
list_delete_cell().  This is because the debug support I'd added failed
to force relocation of lists after a deletion (unlike the insertion
cases).  It won't take much to add such relocation, and I'll go do that;
but with an integer-index-based loop implementation we've got no chance
of having debug support that could catch failure to update the loop index.

So I think that idea is a failure, and going forward with the v1
approach has better chances.

I did find a number of places where getting rid of explicit lnext()
calls led to just plain cleaner code.  Most of these were places that
could be using forboth() or forthree() and just weren't.  There's
also several places that are crying out for a forfour() macro, so
I'm not sure why we've stubbornly refused to provide it.  I'm a bit
inclined to just fix those things in the name of code readability,
independent of this patch.
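
(For concreteness, such a macro could simply follow the existing
forboth()/forthree() pattern over today's pointer-based lists -- an
illustrative sketch only, not something taken from the posted patch:

#define forfour(cell1, list1, cell2, list2, cell3, list3, cell4, list4) \
    for (cell1 = list_head(list1), cell2 = list_head(list2), \
         cell3 = list_head(list3), cell4 = list_head(list4); \
         cell1 != NULL && cell2 != NULL && cell3 != NULL && cell4 != NULL; \
         cell1 = lnext(cell1), cell2 = lnext(cell2), \
         cell3 = lnext(cell3), cell4 = lnext(cell4))
)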

I also noticed that there's quite a lot of places that are testing
lnext(cell) for being NULL or not.  What that really means is whether
this cell is last in the list or not, so maybe readability would be
improved by defining a macro along the lines of list_cell_is_last().
Any thoughts about that?
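
(Again purely as an illustrative sketch: with the current pointer-based cells
such a macro could be as simple as

#define list_cell_is_last(cell)  (lnext(cell) == NULL)

while the array representation proposed here would instead compare the cell's
position against the list's length.)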

regards, tom lane



Re: [Patch][WiP] Tweaked LRU for shared buffers

2019-02-26 Thread Benjamin Manes
Hi Tomas,

If you are on a benchmarking binge and feel like generating some trace
files (as mentioned earlier), I'd be happy to help in regards to running
them through simulations to show how different policies behave. We can add
more types to match this patch / Postgres' GClock as desired, too.

On Tue, Feb 26, 2019 at 3:04 PM Tomas Vondra 
wrote:

> On 2/17/19 2:53 PM, Tomas Vondra wrote:
> > On 2/17/19 2:14 PM, Едигарьев, Иван Григорьевич wrote:
> >> Hi there. I was responsible for the benchmarks, and I would be glad to
> >> make clear that part for you.
> >>
> >> On Sat, 16 Feb 2019 at 02:30, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
> >>> Interesting. Where do these numbers (5/8 and 1/8) come from?
> >>
> >> The first number came from MySQL realization of LRU algorithm
> >> 
> >> and the second from simple tuning, we've tried to change 1/8 a little,
> >> but it didn't change metrics significantly.
> >>
> >>> That TPS chart looks a bit ... wild. How come the master jumps so much
> >>> up and down? That's a bit suspicious, IMHO.
> >>
> >> Yes, it is. It would be great if someone will try to reproduce those
> results.
> >>
> >
> > I'll try.
> >
>
> I've tried to reproduce this behavior, and I've done a quite extensive
> set of tests on two different (quite different) machines, but so far I
> have not observed anything like that. The results are attached, along
> with the test scripts used.
>
> I wonder if this might be due to pg_ycsb using random_zipfian, which has
> somewhat annoying behavior for some parameters (as I've mentioned in a
> separate thread). But that should affect all the runs, not just some
> shared_buffers sizes.
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Segfault when restoring -Fd dump on current HEAD

2019-02-26 Thread Alvaro Herrera
On 2019-Feb-26, Dmitry Dolgov wrote:

> > On Tue, Feb 26, 2019 at 6:38 AM Michael Paquier  wrote:
> >
> > Works for me.  With a quick read of the code, it seems to me that it
> > is possible to keep compatibility while keeping the simplifications
> > around ArchiveEntry()'s refactoring.
> 
> Yes, it should be rather simple, we can e.g. return to the old, less consistent
> NULL handling approach (something like in the attached patch), or replace a
> NULL value with an empty string in WriteToc. Give me a moment, I'll check it
> out. At the same time I would suggest to keep replace_line_endings ->
> sanitize_line, since it doesn't break compatibility.

I think it would be better to just put back the .defn = "" (etc) to the
ArchiveEntry calls.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Index INCLUDE vs. Bitmap Index Scan

2019-02-26 Thread Andres Freund
Hi,

On 2019-02-26 18:22:41 -0500, Tom Lane wrote:
> Markus Winand  writes:
> > I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, 
> > which it doesn’t at the moment (tested on master as of yesterday).
> 
> Regular index scans don't do what you're imagining either (i.e., check
> filter conditions in advance of visiting the heap).  There's a roadblock
> to implementing such behavior, which is that we might end up applying
> filter expressions to dead rows.  That could make users unhappy.
> For example, given a filter condition like "1.0/c > 0.1", people
> would complain if it still got zero-divide failures even after they'd
> deleted all rows with c=0 from their table.
> 
> Generally speaking, we expect indexable operators not to throw
> errors on any input values, which is why this problem isn't serious
> for the index conditions proper.  But we can't make that assumption
> for arbitrary filter conditions.

We could probably work around that, by doing such pre-checks on
index tuples only for tuples known to be visible according to the
VM. Would quite possibly be too hard to understand for users
though. Also not sure if it'd actually perform that well due to the
added random IO, as we'd have to do such checks before entering a tuple
into the bitmap (as obviously we don't have a tuple afterwards).

Greetings,

Andres Freund



Re: Autovaccuum vs temp tables crash

2019-02-26 Thread Alvaro Herrera
On 2019-Feb-23, Tom Lane wrote:

> However, if someone held a gun to my head and said fix it, I'd be inclined
> to do so by having temp-namespace creation insert a "pin" dependency into
> pg_depend.  Arguably, the only reason we don't create all the temp
> namespaces during bootstrap is because we aren't sure how many we'd need
> --- but if we did do that, they'd presumably end up pinned.

Is there a problem if we start with very high max_backends and this pins
a few thousands schemas that are later no longer needed?  There's no
decent way to drop them ... (I'm not sure it matters all that much,
except for bloat in pg_namespace.)

How about hardcoding a pin for any schema that's within the current
max_backends?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Index INCLUDE vs. Bitmap Index Scan

2019-02-26 Thread Tom Lane
Markus Winand  writes:
> I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, 
> which it doesn’t at the moment (tested on master as of yesterday).

Regular index scans don't do what you're imagining either (i.e., check
filter conditions in advance of visiting the heap).  There's a roadblock
to implementing such behavior, which is that we might end up applying
filter expressions to dead rows.  That could make users unhappy.
For example, given a filter condition like "1.0/c > 0.1", people
would complain if it still got zero-divide failures even after they'd
deleted all rows with c=0 from their table.

Generally speaking, we expect indexable operators not to throw
errors on any input values, which is why this problem isn't serious
for the index conditions proper.  But we can't make that assumption
for arbitrary filter conditions.

> (As a side node: I also dislike it how Bitmap Index Scan mixes search 
> conditions and filters in “Index Cond”)

What do you think is being mixed exactly?

regards, tom lane



Re: Segfault when restoring -Fd dump on current HEAD

2019-02-26 Thread Tom Lane
Alvaro Herrera  writes:
> Hmm, shouldn't we modify sanitize_line so that it returns strdup(hyphen)
> when input is empty and want_hyphen, too?

If this patch is touching the behavior of functions like that, then
it's going in the wrong direction; the need for any such change suggests
strongly that you've failed to restore the old behavior as to which TOC
fields can be null or not.

There might be reason to make such cleanups/improvements separately,
but let's *not* fuzz things up by doing them in the corrective patch.

regards, tom lane



Re: Segfault when restoring -Fd dump on current HEAD

2019-02-26 Thread Alvaro Herrera
On 2019-Feb-26, Dmitry Dolgov wrote:

> Yes, it should be rather simple, we can e.g. return to the old, less consistent
> NULL handling approach (something like in the attached patch), or replace a
> NULL value with an empty string in WriteToc. Give me a moment, I'll check it
> out. At the same time I would suggest to keep replace_line_endings ->
> sanitize_line, since it doesn't break compatibility.

Hmm, shouldn't we modify sanitize_line so that it returns strdup(hyphen)
when input is empty and want_hyphen, too?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Segfault when restoring -Fd dump on current HEAD

2019-02-26 Thread Alvaro Herrera
On 2019-Feb-26, Michael Paquier wrote:

> On Tue, Feb 26, 2019 at 12:16:35AM -0500, Tom Lane wrote:
> > Well, if we didn't want to fix this, a reasonable way to go about
> > it would be to bump the archive version number in pg_dump output,
> > so that old versions would issue a useful complaint instead of crashing.
> > However, I repeat that this patch was sold as a notational improvement,
> > not something that was going to break format compatibility.  I think if
> > anyone had mentioned the latter, there would have been push-back against
> > its being committed at all.  I am providing such push-back right now,
> > because I don't think we should break file compatibility for this.
> 
> While I agree that the patch makes handling of the different fields in
> archive entries cleaner, I agree as well that this is not enough to
> justify a dump version bump.

Yeah, it was a nice thing to have, but I didn't keep in mind that we
ought to be able to provide such upwards compatibility.  (Is this
upwards or downwards or backwards or forwards compatibility, now,
actually?  I can't quite figure it out which direction it goes.)


> > I think this patch needs to be worked over so that what it writes
> > is exactly what was written before.  If the author is unwilling
> > to do that PDQ, it should be reverted.
> 
> Works for me.  With a quick read of the code, it seems to me that it
> is possible to keep compatibility while keeping the simplifications
> around ArchiveEntry()'s refactoring.  Alvaro?

Yeah, let me review the patch Dmitry just sent.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel query vs smart shutdown and Postmaster death

2019-02-26 Thread Thomas Munro
On Mon, Feb 25, 2019 at 2:13 PM Thomas Munro  wrote:
> 1.  In a nearby thread, I misdiagnosed a problem reported[1] by Justin
> Pryzby (though my misdiagnosis is probably still a thing to be fixed;
> see next).  I think I just spotted the real problem he saw: if you
> execute a parallel query after a smart shutdown has been initiated,
> you wait forever in gather_readnext()!  Maybe parallel workers can't
> be launched in this state, but we lack code to detect this case?  I
> haven't dug into the exact mechanism or figured out what to do about
> it yet, and I'm tied up with something else for a bit, but I will come
> back to this later if nobody beats me to it.

Given smart shutdown's stated goal, namely that it "lets existing
sessions end their work normally", my questions are:

1.  Why does pmdie()'s SIGTERM case terminate parallel workers
immediately?  That aborts running parallel queries, so they
don't get to end their work normally.
2.  Why are new parallel workers not allowed to be started while in
this state?  That hangs future parallel queries forever, so they don't
get to end their work normally.
3.  Suppose we fix the above cases; should we do it for parallel
workers only (somehow), or for all bgworkers?  It's hard to say since
I don't know what all bgworkers do.

In the meantime, perhaps we should teach the postmaster to report this
case as a failure to launch in back-branches, so that at least
parallel queries don't hang forever?  Here's an initial sketch of a
patch like that: it gives you "ERROR:  parallel worker failed to
initialize" and "HINT:  More details may be available in the server
log." if you try to run a parallel query.  The HINT is right, the
server logs say that a smart shutdown is in progress.  If that seems a
bit hostile, consider that any parallel queries that were running at
the moment the smart shutdown was requested have already been ordered
to quit; why should new queries started after that get a better deal?
Then perhaps we could do some more involved surgery on master that
achieves smart shutdown's stated goal here, and lets parallel queries
actually run?  Better ideas welcome.

-- 
Thomas Munro
https://enterprisedb.com


0001-Report-bgworker-launch-failure-during-smart-shutdown.patch
Description: Binary data


Re: Allowing extensions to supply operator-/function-specific info

2019-02-26 Thread Tom Lane
Paul Ramsey  writes:
> On Feb 26, 2019, at 2:19 PM, Tom Lane  wrote:
>> What's the query look like exactly?  The other two calls will occur
>> anyway, but SupportRequestIndexCondition depends on the function
>> call's placement.

> select geos_intersects_new(g, 'POINT(0 0)') from foo;

Right, so that's not useful for an index scan.  Try

select * from foo where geos_intersects_new(g, 'POINT(0 0)').

regards, tom lane



Re: Allowing extensions to supply operator-/function-specific info

2019-02-26 Thread Paul Ramsey


> On Feb 26, 2019, at 2:19 PM, Tom Lane  wrote:
> 
>> I have
>> created a table (foo) a geometry column (g) and an index (GIST on
>> foo(g)) and am running a query against foo using a noop function with
>> a support function bound to it.
> 
>> The support function is called, twice, once in
>> T_SupportRequestSimplify mode and once in T_SupportRequestCost mode.
> 
> What's the query look like exactly?  The other two calls will occur
> anyway, but SupportRequestIndexCondition depends on the function
> call's placement.

select geos_intersects_new(g, 'POINT(0 0)') from foo;

> 
>   regards, tom lane




Re: psql display of foreign keys

2019-02-26 Thread Alvaro Herrera
On 2019-Feb-04, Michael Paquier wrote:

> pg_partition_root() has not made it to the finish line yet, still it
> would have been nice to see a rebase, and the patch has been waiting
> for input for 4 weeks now.  So I am marking it as returned with
> feedback.

Thanks for committing pg_partition_root ... but it turns out to be
useless for this purpose.  It turns out that we need to obtain the list
of *ancestors* of the table being displayed, which pg_partition_tree
does not easily give you.  So I ended up adding yet another auxiliary
function, pg_partition_ancestors, which in its current formulation
returns just a list of OIDs of ancestor tables.  This seems generally
useful, and can be used in conjunction with pg_partition_tree():

alvherre=# select t.* from pg_partition_tree(pg_partition_root('pk11')) t
   join pg_partition_ancestors('pk11', true) a on (t.relid = a.relid);
 relid | parentrelid | isleaf | level 
-------+-------------+--------+-------
 pk    |             | f      |     0
 pk1   | pk          | f      |     1
 pk11  | pk1         | t      |     2
(3 rows)

(A small tweak is to change the return type from OID to regclass.
Docbook additions missing also.)
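
(Illustration only: until that tweak happens, an explicit cast gives the
friendlier output, using the same partition names as above:

select relid::regclass from pg_partition_ancestors('pk11', true);
)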

Anyway, given this function, it's possible to fix the psql display to be
as I showed previously.  Patches attached.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 4d1c83c5226742b720b5b0dead707156f490a2bc Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 26 Feb 2019 17:05:24 -0300
Subject: [PATCH v3 1/2] pg_partition_ancestors

---
 src/backend/utils/adt/partitionfuncs.c | 51 ++
 src/include/catalog/pg_proc.dat|  5 +++
 2 files changed, 56 insertions(+)

diff --git a/src/backend/utils/adt/partitionfuncs.c b/src/backend/utils/adt/partitionfuncs.c
index ffd66b64394..b606776e015 100644
--- a/src/backend/utils/adt/partitionfuncs.c
+++ b/src/backend/utils/adt/partitionfuncs.c
@@ -208,3 +208,54 @@ pg_partition_root(PG_FUNCTION_ARGS)
 	Assert(OidIsValid(rootrelid));
 	PG_RETURN_OID(rootrelid);
 }
+
+/*
+ * pg_partition_ancestors
+ *
+ * Produces view with one row per ancestor of the given partition.
+ * If 'includeself', the partition itself is included in the output.
+ */
+Datum
+pg_partition_ancestors(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	bool		includeself = PG_GETARG_BOOL(1);
+	FuncCallContext *funcctx;
+	ListCell  **next;
+
+	if (!check_rel_can_be_partition(relid))
+		PG_RETURN_NULL();
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		MemoryContext oldcxt;
+		List	   *ancestors;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		oldcxt = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		ancestors = get_partition_ancestors(relid);
+		if (includeself)
+			ancestors = lappend_oid(ancestors, relid);
+
+		next = (ListCell **) palloc(sizeof(ListCell *));
+		*next = list_head(ancestors);
+		funcctx->user_fctx = (void *) next;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	next = (ListCell **) funcctx->user_fctx;
+
+	if (*next != NULL)
+	{
+		Oid			relid = lfirst_oid(*next);
+
+		*next = lnext(*next);
+		SRF_RETURN_NEXT(funcctx, ObjectIdGetDatum(relid));
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a4e173b4846..f433e8ca670 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10528,6 +10528,11 @@
   proargmodes => '{i,o,o,o,o}',
   proargnames => '{rootrelid,relid,parentrelid,isleaf,level}',
   prosrc => 'pg_partition_tree' },
+{ oid => '3425', descr => 'view ancestors of the partition',
+  proname => 'pg_partition_ancestors', prorows => '10', proretset => 't',
+  provolatile => 'v', prorettype => 'oid', proargtypes => 'regclass bool',
+  proallargtypes => '{regclass,bool,regclass}', proargmodes => '{i,i,o}',
+  proargnames => '{partitionid,includeself,relid}', prosrc => 'pg_partition_ancestors' },
 
 # function to get the top-most partition root parent
 { oid => '3424', descr => 'get top-most partition root parent',
-- 
2.17.1

>From fe2b3ccc87f80a43029143f268ac8dc48c3cd1c1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 26 Feb 2019 18:57:25 -0300
Subject: [PATCH v3 2/2] fix psql display of FKs

---
 src/bin/psql/describe.c   | 134 --
 src/test/regress/expected/foreign_key.out |  26 ++---
 2 files changed, 114 insertions(+), 46 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 4da6719ce71..ecc597b94eb 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1479,6 +1479,7 @@ describeOneTableDetails(const char *schemaname,
 		bool		rowsecurity;
 		bool		forcerowsecurity;
 		bool		hasoids;
+		bool		ispartition;
 		Oid			tablespace;
 		char	   *reloptions;
 		char	   *reloftype;
@@ -1501,7 +1502,24 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(,
 		  

Re: Allowing extensions to supply operator-/function-specific info

2019-02-26 Thread Tom Lane
Paul Ramsey  writes:
> New line of questioning: under what conditions will the support
> function be called in a T_SupportRequestIndexCondition mode?

It'll be called if the target function appears at top level of a
WHERE or JOIN condition and any one of the function's arguments
syntactically matches some column of an index.

If there's multiple arguments matching the same index column, say
index on "x" and we have "f(z, x, x)", you'll get one call and
it will tell you about the first match (req->indexarg == 1 in
this example).  Sorting out what to do in such a case is your
responsibility.

If there's arguments matching more than one index column, say
index declared on (x, y) and we have "f(x, y)", you'll get a
separate call for each index column.  Again, sorting out what
to do for each one is your responsibility.

In most cases, multiple matching arguments are going to lead to
failure to construct any useful index condition, because your
comparison value has to be a pseudoconstant (ie, not a variable
from the same table, so in both of the above examples there's
no function argument you could compare to).  But we don't prejudge
that, because it's possible that a function with 3 or more arguments
could produce something useful anyway.  For instance, if what we've
got is "f(x, y, constant)" then it's possible that the semantics of
the function are such that y can be ignored and we can make something
indexable like "x && constant".  All this is the support function's
job to know.
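
For readers following along, here is a bare-bones skeleton of the relevant
branch of a support function (the function name is invented and the body
deliberately generates nothing; a real implementation would build and return
the indexable clause as described above):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/supportnodes.h"

PG_FUNCTION_INFO_V1(my_intersects_support);

Datum
my_intersects_support(PG_FUNCTION_ARGS)
{
    Node       *rawreq = (Node *) PG_GETARG_POINTER(0);

    if (IsA(rawreq, SupportRequestIndexCondition))
    {
        SupportRequestIndexCondition *req =
            (SupportRequestIndexCondition *) rawreq;

        /*
         * req->indexarg says which argument of the call matched the index
         * column (0-based); req->index, req->indexcol and req->opfamily
         * describe the index column under consideration.  A real support
         * function would check that another argument is pseudoconstant,
         * build an indexable clause (e.g. "indexed_arg && other_arg") using
         * an operator found in req->opfamily, set req->lossy appropriately,
         * and return a one-element List containing that clause.
         */
        (void) req;             /* stub: no condition generated here */
    }

    /* Returning NULL means "no index condition could be constructed". */
    PG_RETURN_POINTER(NULL);
}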

> I have
> created a table (foo) a geometry column (g) and an index (GIST on
> foo(g)) and am running a query against foo using a noop function with
> a support function bound to it.

> The support function is called, twice, once in
> T_SupportRequestSimplify mode and once in T_SupportRequestCost mode.

What's the query look like exactly?  The other two calls will occur
anyway, but SupportRequestIndexCondition depends on the function
call's placement.

regards, tom lane



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2019-02-26 Thread Robert Haas
On Thu, Jan 31, 2019 at 1:02 PM Robert Haas  wrote:
> New patch series attached.

And here's yet another new patch series, rebased over today's commit
and with a couple of other fixes:

1. I realized that the PartitionDirectory for the planner ought to be
attached to the PlannerGlobal, not the PlannerInfo; we don't want to
create more than one partition directory per query planning cycle, and
we do want our notion of the PartitionDesc for a given relation to be
stable between the outer query and any subqueries.

2. I discovered - via CLOBBER_CACHE_ALWAYS testing - that the
PartitionDirectory has to hold a reference count on the relcache
entry.  In hindsight, this should have been obvious: the planner keeps
the locks when it closes a relation and later reopens it, but it
doesn't keep the relation open, which is what prevents recycling of
the old PartitionDesc.  Unfortunately these additional reference count
manipulations are probably not free.  I don't know how expensive they are,
though; maybe it's not too bad.

Aside from these problems, I think I have spotted a subtle problem in
0001. I'll think about that some more and post another update.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v3-0002-Ensure-that-repeated-PartitionDesc-lookups-return.patch
Description: Binary data


v3-0004-Reduce-the-lock-level-required-to-attach-a-partit.patch
Description: Binary data


v3-0003-Teach-runtime-partition-pruning-to-cope-with-conc.patch
Description: Binary data


v3-0001-Ensure-that-RelationBuildPartitionDesc-sees-a-con.patch
Description: Binary data


Re: Retrieving Alias Name

2019-02-26 Thread Julien Rouhaud
On Tue, Feb 26, 2019 at 10:48 PM Walter Cai  wrote:
>
> I'm currently using a (very rough) scheme to retrieve relation names from a 
> PlannerInfo, and a RelOptInfo struct:
>
> PlannerInfo *root
> RelOptInfo *inner_rel
>
> //...
>
> RangeTblEntry *rte;
> int x = -1;
> while ((x = bms_next_member(inner_rel->relids, x)) >= 0)
> {
> rte = root->simple_rte_array[x];
> if (rte->rtekind == RTE_RELATION)
> {
> char *rel_name = get_rel_name(rte->relid);
> // do stuff...
> }
> }
>
> However, I now realize it would be better to access aliases as they appear in 
> the SQL query. For instance, if the query contains "... FROM rel_name AS 
> rel_alias ..." I would like to retrieve `rel_alias` instead of `rel_name`.
>
> Is it possible to derive the alias in a similar way?

You can use rte->eref->aliasname, which will contain the alias if one
was provided, otherwise the original name.  You can see RangeTblEntry
comment in include/nodes/parsenodes.h for more details.
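
For example, a minimal variant of the loop you posted (illustration only,
assuming the same root/inner_rel variables) that reports the name exactly as
written in the query:

    RangeTblEntry *rte;
    int         x = -1;

    while ((x = bms_next_member(inner_rel->relids, x)) >= 0)
    {
        rte = root->simple_rte_array[x];
        if (rte->rtekind == RTE_RELATION)
        {
            /* alias if one was given in the query, else the relation name */
            const char *name = rte->eref->aliasname;

            elog(DEBUG1, "relation as written in query: %s", name);
        }
    }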



Retrieving Alias Name

2019-02-26 Thread Walter Cai
Hi,

I'm currently using a (very rough) scheme to retrieve relation names from a
PlannerInfo, and a RelOptInfo struct:

PlannerInfo *root
RelOptInfo *inner_rel

//...

RangeTblEntry *rte;
int x = -1;
while ((x = bms_next_member(inner_rel->relids, x)) >= 0)
{
rte = root->simple_rte_array[x];
if (rte->rtekind == RTE_RELATION)
{
char *rel_name = get_rel_name(rte->relid);
// do stuff...
}
}

However, I now realize it would be better to access aliases as they appear
in the SQL query. For instance, if the query contains "... FROM rel_name AS
rel_alias ..." I would like to retrieve `rel_alias` instead of `rel_name`.

Is it possible to derive the alias in a similar way?

For context: this code is being inserted into
src/backend/optimizer/path/costsize.c and specifically in the
calc_joinrel_size_estimate method.

Thanks in advance,
Walter


Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Chapman Flack
On 2/26/19 2:46 PM, Andres Freund wrote:

> Also, there's so much wrong stuff on the wiki that people that know to
> look there just don't believe what they read.

Should there be a wiki errata page on the wiki?

I'm fairly serious ... for those times when you're reading the Foo
page on the wiki, and realize it still describes the way Foo worked
before 0abcde, and you don't feel you have time to write a polished
update ... go to the Errata page and add a link to Foo, the date, your
name, and "noticed this still describes how Foo worked before 0abcde".

Or is there a MediaWiki template already for that? You could add the
template right at the top of the page, and it would automagically be
added to the category "pages with errata".

That category could be a starting point for anybody who has a bit of
time to improve some wiki pages, and is looking for likely candidates.

Regards,
-Chap



Re: Allowing extensions to supply operator-/function-specific info

2019-02-26 Thread Paul Ramsey
On Mon, Feb 25, 2019 at 4:09 PM Tom Lane  wrote:
>
> Paul Ramsey  writes:
> > On Mon, Feb 25, 2019 at 3:01 PM Tom Lane  wrote:
> >> It's whichever one the index column's opclass belongs to.  Basically what
> >> you're trying to do here is verify whether the index will support the
> >> optimization you want to perform.
>
> > * If I have tbl1.geom
> > * and I have built two indexes on it, a btree_geometry_ops and a
> > gist_geometry_ops_2d, and
> > * and SupportRequestIndexCondition.opfamily returns me the btree family
> > * and I look and see, "damn there is no && operator in there"
> > * am I SOL, even though an appropriate index does exist?
>
> No.  If there are two indexes matching your function's argument, you'll
> get a separate call for each index.  The support function is only
> responsible for thinking about one index at a time and seeing if it
> can be used.  If more than one can be used, figuring out which
> one is better is done by later cost comparisons.

Ah, wonderful!

New line of questioning: under what conditions will the support
function be called in a T_SupportRequestIndexCondition mode? I have
created a table (foo) a geometry column (g) and an index (GIST on
foo(g)) and am running a query against foo using a noop function with
a support function bound to it.

The support function is called, twice, once in
T_SupportRequestSimplify mode and once in T_SupportRequestCost mode.

What triggers T_SupportRequestIndexCondition mode?

Thanks!

P



Re: NOT IN subquery optimization

2019-02-26 Thread David Rowley
On Wed, 27 Feb 2019 at 03:07, Jim Finnerty  wrote:
>
> The problem is that the special optimization that was proposed for the case
> where the subquery has no WHERE clause isn't valid when the subquery returns
> no rows.  That additional optimization needs to be removed, and preferably
> replaced with some sort of efficient run-time test.

That is one option; however, the join type that Richard mentions, to
satisfy point #3, surely can only work for Hash Joins and perhaps
Merge Joins that require a sort, assuming there's some way for the
sort to communicate whether it found NULLs or not. Either way, we
need to have looked at the entire inner side to ensure there are no
nulls. Probably it would be possible to somehow team that up with a
planner check to see if the inner exprs could be NULL, and then just
implement points #1 and #2 for other join methods.

If you're proposing to do that for this thread then I can take my
planner only patch somewhere else.  I only posted my patch as I pretty
much already had what I thought you were originally talking about.
However, please be aware there are current patents around adding
execution time smarts in this area, so it's probably unlikely you'll
find a way to do this in the executor that does not infringe on those.
Probably no committer would want to touch it.  I think my patch covers
a good number of use cases and as far as I understand, does not go
near any current patents.

Please let me know if I should move my patch to another thread. I
don't want to hi-jack this if you're going in another direction.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Peter Geoghegan
On Mon, Feb 25, 2019 at 10:49 PM David Steele  wrote:
> Worse, they have scripted the deletion of backup_label so that the
> cluster will restart on crash.  This is the recommendation from our
> documentation after all.  If that script runs after a restore instead of
> a crash, then the cluster will be corrupt -- silently.

I've always thought that the biggest problem is the documentation.
While the "Continuous Archiving and Point-in-Time Recovery (PITR)"
section of the user docs isn't quite as bad as it used to be, it's
still fundamentally structured "bottom up". That's bad.

I think that it would improve matters enormously if that section of
the documentation, and possibly the entire chapter was restructured to
be "top down". This should include prominent links to specific third
party projects, including Barman, pgBackRest, and others. This whole
area has become much more contentious than it needs to be, and I fear
that we cannot see the forest for the trees.

That said, I don't think that we should be in a hurry to fully remove
the exclusive backup mode. Home-spun backup solutions are a bad idea,
but removing exclusive mode strikes me as corrective over-reaction,
and a technical solution to an organizational problem. There are
definitely people that use the interface in a sensible way, despite
its shortcomings.

-- 
Peter Geoghegan



Re: No-rewrite timestamp<->timestamptz conversions

2019-02-26 Thread Noah Misch
On Tue, Feb 26, 2019 at 02:29:01PM +, Simon Riggs wrote:
> Looks good, would need docs.

The ALTER TABLE page just says "old type is either binary coercible to the new
type or an unconstrained domain over the new type."  Avoiding rewrites by way
of a prosupport function or the at-timestamp-v2.patch approach is essentially
another way to achieve binary coercion.  So far, we haven't documented the
individual data types affected.  Since we don't mention e.g. varchar(x) ->
varchar(x+k) explicitly, I plan not to mention timestamp explicitly.



Index INCLUDE vs. Bitmap Index Scan

2019-02-26 Thread Markus Winand
Hi!

I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, 
which it doesn’t at the moment (tested on master as of yesterday).

Consider this (find the setup at the bottom of this mail).

CREATE INDEX idx ON tbl (a, b) INCLUDE (c);

EXPLAIN (analyze, buffers)
SELECT *
  FROM tbl
 WHERE a = 1
   AND c = 1;

The following plan could be produced:

 QUERY PLAN

 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.027..0.029 rows=1 loops=1)
   Recheck Cond: (a = 1)
   Filter: (c = 1)
   Rows Removed by Filter: 7
   Heap Blocks: exact=2
   Buffers: shared hit=2 read=1
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.018..0.018 rows=8 loops=1)
 Index Cond: (a = 1)
 Buffers: shared read=1
 Planning Time: 0.615 ms
 Execution Time: 0.060 ms


The Bitmap Index Scan does not filter on column C, which is INCLUDEd in the 
index leaf nodes.

Instead, the Bitmap Heap Scan fetches unnecessary blocks and then applies the 
filter.

I would expect a similar execution as with this index:

DROP INDEX idx;
CREATE INDEX idx ON tbl (a, b, c);

 QUERY PLAN

 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.021..0.021 rows=1 loops=1)
   Recheck Cond: ((a = 1) AND (c = 1))
   Heap Blocks: exact=1
   Buffers: shared hit=1 read=1
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)
 Index Cond: ((a = 1) AND (c = 1))
 Buffers: shared read=1
 Planning Time: 0.123 ms
 Execution Time: 0.037 ms

(As a side node: I also dislike it how Bitmap Index Scan mixes search 
conditions and filters in “Index Cond”)

Due to the not-filtered column B in the index, the use of column C is here 
pretty much the same as it is for the first index, which has C in INCLUDE.

Am I missing anything or is it just an oversight because INCLUDE was primarily 
done for Index Only Scan?

As a background, here is the use case I see for this scenario:
Filters that cannot be used for searching the tree can be put in INCLUDE 
instead of the key part of the index even if there is no intent to run the 
query as an Index Only Scan (e.g. SELECT *). Columns in INCLUDE can be used for 
filtering (works fine for Index Only Scan, btw).

The most prominent examples are postfix or infix LIKE ‘%term%’ filters. The 
benefit of putting such columns in INCLUDE is that it is clearly documented 
that the index column is used neither for searching in the tree nor for 
ordering. It is used either for filtering or for fetching. Knowing this makes 
it easier to extend an existing index.

Imagine you have this query to optimize:

SELECT *
  FROM …
 WHERE a = ?
   AND b = ?
 ORDER BY ts DESC
 LIMIT 10

And find this existing index on that table:

CREATE INDEX … ON … (a, b) INCLUDE (c);

In this case it is a rather safe option to replace the index with the following:

CREATE INDEX … ON … (a, b, ts) INCLUDE (c);

On the other hand, if the original index had all columns in the key part, 
changing it to (A, B, TS, C) is a rather dangerous option.

-markus

— Setup for the demo

-- demo table: 4 rows per block
-- the row of interest is the first one; the others spill to a second block so 
-- the problem can also be seen in "Buffers"

CREATE TABLE tbl (a INTEGER, b INTEGER, c INTEGER, junk CHAR(1900));

INSERT INTO tbl VALUES (1, 1, 1, 'junk'),
   (1, 1, 9, 'junk'),
   (1, 1, 9, 'junk'),
   (1, 1, 9, 'junk'),
   (1, 1, 9, 'junk'),
   (1, 1, 9, 'junk'),
   (1, 1, 9, 'junk'),
   (1, 1, 9, 'junk');

-- prevent unwanted plans for demo of Bitmap Index+Heap Scan
SET ENABLE_INDEXSCAN = OFF;
SET ENABLE_SEQSCAN = OFF;



Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Andres Freund
Hi,

On 2019-02-26 20:38:17 +0100, Magnus Hagander wrote:
> That should not be a wiki page, really, that should be part of the main
> documentation.

+1

> It can be drafted on the wiki of course, but we *really* should get
> something like that into the docs. Because that's what people are going to
> read. They'd only go look for the information on the wiki if they knew
> there was a problem and needed the details, and the big issue is that
> people don't know there are issues in the first place.

Also, there's so much wrong stuff on the wiki that people that know to
look there just don't believe what they read.

Greetings,

Andres Freund



Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Magnus Hagander
On Tue, Feb 26, 2019 at 6:31 PM Christophe Pettus  wrote:

>
>
> > On Feb 26, 2019, at 09:26, Robert Haas  wrote:
> >
> > On Tue, Feb 26, 2019 at 12:20 PM Fujii Masao 
> wrote:
> >> So, let me clarify the situations;
> >>
> >> (1) If backup_label and recovery.signal exist, the recovery starts
> safely.
> >>   This is the normal case of recovery from the base backup.
> >>
> >> (2)If backup_label.pending and recovery.signal exist, as described
> above,
> >>   PANIC error happens at the start of recovery. This case can happen
> >>   if the operator forgets to rename backup_label.pending, i.e.,
> >>   operation mistake. So, after PANIC, the operator needs to fix her
> or
> >>   his mistake (i.e., rename backup_label.pending) and restart
> >>   the recovery.
> >>
> >> (3) If backup_label.pending exists but recovery.signal doesn't, the
> server
> >>   ignores (or removes) backup_label.pending and do the recovery
> >>   starting the pg_control's REDO location. This case can happen if
> >>   the server crashes while an exclusive backup is in progress.
> >>   So crash-in-the-middle-of-backup doesn't prevent the recovery from
> >>   starting in this idea.
> >
> > The if-conditions for 1 and 2 appear to be the same, which is confusing.
>
> I believe #1 is when backup_label (no .pending) exists, #2 is when
> backup_label.pending (with .pending) exists.
>
> At the absolute minimum, this discussion has convinced me that we need to
> create a wiki page to accurately describe the failure scenarios for both
> exclusive and non-exclusive backups, and the recovery actions for them.  If
> it exists already, my search attempts weren't successful.  If it doesn't,
> I'm happy to start one.
>

That should not be a wiki page, really, that should be part of the main
documentation.

It can be drafted on the wiki of course, but we *really* should get
something like that into the docs. Because that's what people are going to
read. They'd only go look for the information on the wiki if they knew
there was a problem and needed the details, and the big issue is that
people don't know there are issues in the first place.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: A note about recent ecpg buildfarm failures

2019-02-26 Thread Robert Haas
On Tue, Feb 26, 2019 at 1:25 PM Tom Lane  wrote:
> Since my commits 9e138a401 et al on Saturday, buildfarm members
> blobfish, brotula, and wunderpus have been showing core dumps
> in the ecpg preprocessor.  This seemed inexplicable given what
> the commits changed, and even more so seeing that only HEAD is
> failing, while the change was back-patched into all branches.
>
> Mark Wong and I poked into this off-list, and what we find is that
> this seems to be a compiler bug.  Those animals are all running
> nearly the same version of clang (3.8.x / ppc64le).  Looking into
> the assembly code for preproc.y, the crash is occurring at a branch
> that is supposed to jump forward exactly 32768 bytes, but according
> to gdb's disassembler it's jumping backwards exactly -32768 bytes,
> into invalid memory.  It will come as no surprise to hear that the
> branch displacement field in PPC conditional branches is 16 bits
> wide, so that positive 32768 doesn't fit but negative 32768 does.
> Evidently what is happening is that either the compiler or the
> assembler is failing to detect the edge-case field overflow and
> switch to different coding.  So the apparent dependency on 9e138a401
> is because that happened to insert exactly the right number of
> instructions in-between to trigger this scenario.  It's pure luck we
> didn't trip over it before, although none of those buildfarm animals
> have been around for all that long.
>
> Moral: don't use clang 3.8.x on ppc64.  I think Mark is going
> to upgrade those animals to some more recent compiler version.

Wow, that's some pretty impressive debugging!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



A note about recent ecpg buildfarm failures

2019-02-26 Thread Tom Lane
Since my commits 9e138a401 et al on Saturday, buildfarm members
blobfish, brotula, and wunderpus have been showing core dumps
in the ecpg preprocessor.  This seemed inexplicable given what
the commits changed, and even more so seeing that only HEAD is
failing, while the change was back-patched into all branches.

Mark Wong and I poked into this off-list, and what we find is that
this seems to be a compiler bug.  Those animals are all running
nearly the same version of clang (3.8.x / ppc64le).  Looking into
the assembly code for preproc.y, the crash is occurring at a branch
that is supposed to jump forward exactly 32768 bytes, but according
to gdb's disassembler it's jumping backwards exactly -32768 bytes,
into invalid memory.  It will come as no surprise to hear that the
branch displacement field in PPC conditional branches is 16 bits
wide, so that positive 32768 doesn't fit but negative 32768 does.
Evidently what is happening is that either the compiler or the
assembler is failing to detect the edge-case field overflow and
switch to different coding.  So the apparent dependency on 9e138a401
is because that happened to insert exactly the right number of
instructions in-between to trigger this scenario.  It's pure luck we
didn't trip over it before, although none of those buildfarm animals
have been around for all that long.

Moral: don't use clang 3.8.x on ppc64.  I think Mark is going
to upgrade those animals to some more recent compiler version.

regards, tom lane



Re: [RFC] [PATCH] Flexible "partition pruning" hook

2019-02-26 Thread Mike Palmiotto
On Tue, Feb 26, 2019 at 1:55 AM Tsunakawa, Takayuki
 wrote:
>
> From: Mike Palmiotto [mailto:mike.palmio...@crunchydata.com]
> > Attached is a patch which attempts to solve a few problems:
> >
> > 1) Filtering out partitions flexibly based on the results of an external
> > function call (supplied by an extension).
> > 2) Filtering out partitions from pg_inherits based on the same function
> > call.
> > 3) Filtering out partitions from a partitioned table BEFORE the partition
> > is actually opened on-disk.
>
> What concrete problems would you expect this patch to solve?  What kind of 
> extensions do you imagine?  I'd like to hear about the examples.  For 
> example, "PostgreSQL 12 will not be able to filter out enough partitions when 
> planning/executing SELECT ... WHERE ... statement.  But an extension like 
> this can extract just one partition early."

My only application of the patch thus far has been to apply an RLS
policy driven by the extension's results. For example:

CREATE TABLE test.partpar
(
  a int,
  b text DEFAULT (extension_default_bfield('test.partpar'::regclass::oid))
)  PARTITION BY LIST (extension_translate_bfield(b));

CREATE POLICY filter_select on test.partpar for SELECT
USING (extension_filter_by_bfield(b));

CREATE POLICY filter_insert on test.partpar for INSERT
WITH CHECK (extension_generate_insert_bfield('test.partpar'::regclass::oid) = b);

CREATE POLICY filter_update on test.partpar for UPDATE
USING (extension_filter_by_bfield(b))
WITH CHECK (extension_filter_by_bfield(b));

CREATE POLICY filter_delete on test.partpar for DELETE
USING (extension_filter_by_bfield(b));

The function would filter based on some external criteria relating to
the username and the contents of the b column.

The desired effect would be for `SELECT * from test.partpar;` to check
only the partitions in which the current user can see at least one row,
based on column b. This is applicable, for instance, when a partition of
test.partpar (say test.partpar_b2) is given a label with
`SECURITY LABEL on TABLE test.partpar_b2 IS 'foo';` that is exactly
the same as the b column of every row in said partition. Using this
hook, we can simply check the table label and kick the entire
partition out early on. This should greatly improve performance for
the case where you can enforce that the partition's SECURITY LABEL and
the b column are the same.
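
To make that setup concrete, here is a minimal sketch using the names from
the example above; the partition name and the 'foo' label are illustrative,
and the extension_* functions are the hypothetical extension callbacks
already shown:

-- illustrative partition whose translated b value is 'foo'
CREATE TABLE test.partpar_b2 PARTITION OF test.partpar
    FOR VALUES IN ('foo');

-- label the partition to match the b column of every row it holds, so the
-- proposed hook could prune it by label alone, before it is ever opened
SECURITY LABEL ON TABLE test.partpar_b2 IS 'foo';

-- with the hook installed, this would skip any partition whose label the
-- current user is not allowed to see
SELECT * FROM test.partpar;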

>
> Would this help the following issues with PostgreSQL 12?
>
> * UPDATE/DELETE planning takes time in proportion to the number of 
> partitions, even when the actually accessed partition during query execution 
> is only one.
>
> * Making a generic plan takes prohibitably long time (IIRC, about 12 seconds 
> when the number of partitoons is 1,000 or 8,000.)

In theory, we'd be checking fewer items (the labels of the partitions,
instead of the b column for every row), so it may indeed help with
performance in these cases.

Admittedly, I haven't looked at either of these very closely. Do you
have any specific test cases I can try out on my end to verify?
--
Mike Palmiotto
Software Engineer
Crunchy Data Solutions
https://crunchydata.com



Re: pgbench MAX_ARGS

2019-02-26 Thread Andres Freund
On 2019-02-26 12:51:23 -0500, Tom Lane wrote:
> Simon Riggs  writes:
> > On Tue, 26 Feb 2019 at 17:38, Andres Freund  wrote:
> >> Why not just allocate it dynamically? Seems weird to have all these
> >> MAX_ARGS, MAX_SCRIPTS ... commands.
> 
> > For me, its a few minutes work to correct a problem and report to the
> > community.
> > Dynamic allocation, run-time errors is all getting too time consuming for a
> > small thing.
> 
> FWIW, I agree --- that's moving the goalposts further than seems
> justified.

I'm fine with applying a patch to just adjust the constant, but I'd also
appreciate somebody just being motivated by my message to remove the
constants ;)



Re: pgbench MAX_ARGS

2019-02-26 Thread Tom Lane
Simon Riggs  writes:
> On Tue, 26 Feb 2019 at 17:38, Andres Freund  wrote:
>> Why not just allocate it dynamically? Seems weird to have all these
>> MAX_ARGS, MAX_SCRIPTS ... commands.

> For me, its a few minutes work to correct a problem and report to the
> community.
> Dynamic allocation, run-time errors is all getting too time consuming for a
> small thing.

FWIW, I agree --- that's moving the goalposts further than seems
justified.

regards, tom lane



Re: pgbench MAX_ARGS

2019-02-26 Thread Simon Riggs
On Tue, 26 Feb 2019 at 17:38, Andres Freund  wrote:

> Hi,
>
> On 2019-02-26 12:57:14 +, Simon Riggs wrote:
> > On Tue, 26 Feb 2019 at 12:19, Fabien COELHO  wrote:
> > I've put it as 256 args now.
> >
> > The overhead of that is about 2kB, so not really an issue.
>
> Why not just allocate it dynamically? Seems weird to have all these
> MAX_ARGS, MAX_SCRIPTS ... commands.


For me, it's a few minutes' work to correct a problem and report to the
community.

Dynamic allocation, run-time errors --- it's all getting too time-consuming
for a small thing.


> The eighties want their constants back ;)
>

Made me smile, thanks. ;-)

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Fujii Masao
On Tue, Feb 26, 2019 at 3:49 PM David Steele  wrote:
>
> On 2/26/19 6:51 AM, Michael Paquier wrote:
> > On Mon, Feb 25, 2019 at 08:17:27PM +0200, David Steele wrote:
> >> Here's the really obvious bad thing: if users do not update their 
> >> procedures
> >> and we ignore backup_label.pending on startup then they will end up with a
> >> corrupt database because it will not replay from the correct checkpoint.  
> >> If
> >> we error on the presence of backup_label.pending then we are right back to
> >> where we started.
> >
> > Not really.  If we error on backup_label.pending, we can make the
> > difference between a backend which has crashed in the middle of an
> > exclusive backup without replaying anything and a backend which is
> > started based on a base backup, so an operator can take some action to
> > see what's wrong with the server.  If you issue an error, users can
> > also see that their custom backup script is wrong because they forgot
> > to rename the flag after taking a backup of the data folder(s).
>
> The operator still has a decision to make, manually, just as they do
> now.  The wrong decision may mean a corrupt database.

Even in non-exclusive backup mode, the wrong decision may mean
a corrupt database. For example, what if the user forgets to save
the backup_label into the backup taken using the non-exclusive backup
method? So I'm not sure that this is an issue only for an exclusive backup.
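
To illustrate, here is a rough sketch of the non-exclusive flow (the label
is illustrative): the backup_label contents are only handed back to the
client, so forgetting to write them into the backup is just as dangerous
there:

-- the session must stay connected for the whole backup
SELECT pg_start_backup('nightly', false, false);

-- ... copy the data directory with an external tool ...

-- returns the stop LSN plus the backup_label and tablespace_map contents;
-- the caller must write labelfile into the backup itself
SELECT lsn, labelfile, spcmapfile FROM pg_stop_backup(false);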

Regards,

-- 
Fujii Masao



Re: No-rewrite timestamp<->timestamptz conversions

2019-02-26 Thread Tom Lane
Noah Misch  writes:
> On Tue, Feb 26, 2019 at 10:46:29AM -0500, Tom Lane wrote:
>> It'd be nice to get the SupportRequestSimplify API correct from the first
>> release, so if there's even a slightly plausible reason for it to support
>> this, I'd be inclined to err in the direction of doing so.

> Is it problematic to add a volatility field later?

Not hugely so, no.  I'm thinking more in terms of support functions
having to pay attention to which version they're being compiled for
to know what they can do.  But I suppose that's little worse than
any other feature addition we make at the C API level.

>> ... is it really impossible for the timezone GUC to change during
>> the execution of the ALTER TABLE command?  You could probably break that
>> if you tried hard enough, though it seems unlikely that anyone would do so
>> accidentally.

> No, one certainly can change a GUC during ALTER TABLE execution.  The STABLE
> marking on current_setting is a fiction.  I interpret that fiction as a signal
> that a query has undefined behavior if you change GUCs mid-query, which seems
> like a prudent level of effort toward such queries.  Adding a volatility field
> to SupportRequestSimplify does invite C-language extension code to participate
> in deciding this undefined behavior, which would make it harder to verify that
> we like the undefined behavior of the moment (e.g. doesn't crash).  Perhaps
> best not to overturn that rock?

Probably not, unless we can come up with more convincing use-cases for
it.

For the moment, I'm content with the approach in the patch you posted.

regards, tom lane



Re: pgbench MAX_ARGS

2019-02-26 Thread Andres Freund
Hi,

On 2019-02-26 12:57:14 +, Simon Riggs wrote:
> On Tue, 26 Feb 2019 at 12:19, Fabien COELHO  wrote:
> I've put it as 256 args now.
> 
> The overhead of that is about 2kB, so not really an issue.

Why not just allocate it dynamically? Seems weird to have all these
MAX_ARGS, MAX_SCRIPTS ... commands. The eighties want their constants
back ;)

Greetings,

Andres Freund



Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Robert Haas
On Tue, Feb 26, 2019 at 12:31 PM Christophe Pettus  wrote:
> I believe #1 is when backup_label (no .pending) exists, #2 is when 
> backup_label.pending (with .pending) exists.

Oh, whoops.  I get it now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Fujii Masao
On Wed, Feb 27, 2019 at 2:27 AM Robert Haas  wrote:
>
> On Tue, Feb 26, 2019 at 12:20 PM Fujii Masao  wrote:
> > So, let me clarify the situations;
> >
> > (1) If backup_label and recovery.signal exist, the recovery starts safely.
> >This is the normal case of recovery from the base backup.
> >
> > (2)If backup_label.pending and recovery.signal exist, as described above,
> >PANIC error happens at the start of recovery. This case can happen
> >if the operator forgets to rename backup_label.pending, i.e.,
> >operation mistake. So, after PANIC, the operator needs to fix her or
> >his mistake (i.e., rename backup_label.pending) and restart
> >the recovery.
> >
> > (3) If backup_label.pending exists but recovery.signal doesn't, the server
> >ignores (or removes) backup_label.pending and do the recovery
> >starting the pg_control's REDO location. This case can happen if
> >the server crashes while an exclusive backup is in progress.
> >So crash-in-the-middle-of-backup doesn't prevent the recovery from
> >starting in this idea.
>
> The if-conditions for 1 and 2 appear to be the same, which is confusing.

Would using another name like backup_in_progress instead of
backup_label.pending help a bit with your concern? That is,

(1) backup_label and recovery.signal
(2) backup_in_progress and recovery.signal

Regards,

-- 
Fujii Masao



Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Christophe Pettus



> On Feb 26, 2019, at 09:26, Robert Haas  wrote:
> 
> On Tue, Feb 26, 2019 at 12:20 PM Fujii Masao  wrote:
>> So, let me clarify the situations;
>> 
>> (1) If backup_label and recovery.signal exist, the recovery starts safely.
>>   This is the normal case of recovery from the base backup.
>> 
>> (2)If backup_label.pending and recovery.signal exist, as described above,
>>   PANIC error happens at the start of recovery. This case can happen
>>   if the operator forgets to rename backup_label.pending, i.e.,
>>   operation mistake. So, after PANIC, the operator needs to fix her or
>>   his mistake (i.e., rename backup_label.pending) and restart
>>   the recovery.
>> 
>> (3) If backup_label.pending exists but recovery.signal doesn't, the server
>>   ignores (or removes) backup_label.pending and do the recovery
>>   starting the pg_control's REDO location. This case can happen if
>>   the server crashes while an exclusive backup is in progress.
>>   So crash-in-the-middle-of-backup doesn't prevent the recovery from
>>   starting in this idea.
> 
> The if-conditions for 1 and 2 appear to be the same, which is confusing.

I believe #1 is when backup_label (no .pending) exists, #2 is when 
backup_label.pending (with .pending) exists.

At the absolute minimum, this discussion has convinced me that we need to 
create a wiki page to accurately describe the failure scenarios for both 
exclusive and non-exclusive backups, and the recovery actions for them.  If it 
exists already, my search attempts weren't successful.  If it doesn't, I'm 
happy to start one.
--
-- Christophe Pettus
   x...@thebuild.com




Re: No-rewrite timestamp<->timestamptz conversions

2019-02-26 Thread Noah Misch
On Tue, Feb 26, 2019 at 10:46:29AM -0500, Tom Lane wrote:
> Noah Misch  writes:
> > Stepping back a bit, commit b8a18ad didn't provide a great UI.  I doubt 
> > folks
> > write queries this way spontaneously; to do so, they would have needed to
> > learn that such syntax enables this optimization.  If I'm going to do
> > something more invasive, it should optimize the idiomatic "alter table t 
> > alter
> > timestamptzcol type timestamp".  One could do that with a facility like
> > SupportRequestSimplify except permitted to consider STABLE facts.  I 
> > suppose I
> > could add a volatility field to SupportRequestSimplify.  So far, I can't 
> > think
> > of a second use case for such a facility, so instead I think
> > ATColumnChangeRequiresRewrite() should have a hard-wired call for
> > F_TIMESTAMPTZ_TIMESTAMP and F_TIMESTAMP_TIMESTAMPTZ.  Patch attached.  If we
> > find more applications of this concept, it shouldn't be hard to migrate this
> > logic into SupportRequestSimplify.  Does anyone think that's better to do 
> > from
> > the start?
> 
> It'd be nice to get the SupportRequestSimplify API correct from the first
> release, so if there's even a slightly plausible reason for it to support
> this, I'd be inclined to err in the direction of doing so.  On the other
> hand, if we really can't think of another use-case then a hard-wired fix
> might be the best way.

Is it problematic to add a volatility field later?  Older support functions
will have needed to assume IMMUTABLE, and a simplification valid for IMMUTABLE
is valid for all volatility levels.  Still, yes, it would be nice to have from
the start if we will indeed need it.

> One thing that we'd have to nail down a bit harder, if we're to add
> something to the SupportRequestSimplify API, is exactly what the semantics
> of the weaker check should be.  The notion of "stable" has always been a
> bit squishy, in that it's not totally clear what span of time stability
> of the function result is being promised over.  In the case at hand, for
> instance, is it really impossible for the timezone GUC to change during
> the execution of the ALTER TABLE command?  You could probably break that
> if you tried hard enough, though it seems unlikely that anyone would do so
> accidentally.

No, one certainly can change a GUC during ALTER TABLE execution.  The STABLE
marking on current_setting is a fiction.  I interpret that fiction as a signal
that a query has undefined behavior if you change GUCs mid-query, which seems
like a prudent level of effort toward such queries.  Adding a volatility field
to SupportRequestSimplify does invite C-language extension code to participate
in deciding this undefined behavior, which would make it harder to verify that
we like the undefined behavior of the moment (e.g. doesn't crash).  Perhaps
best not to overturn that rock?



Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Robert Haas
On Tue, Feb 26, 2019 at 12:20 PM Fujii Masao  wrote:
> So, let me clarify the situations;
>
> (1) If backup_label and recovery.signal exist, the recovery starts safely.
>This is the normal case of recovery from the base backup.
>
> (2)If backup_label.pending and recovery.signal exist, as described above,
>PANIC error happens at the start of recovery. This case can happen
>if the operator forgets to rename backup_label.pending, i.e.,
>operation mistake. So, after PANIC, the operator needs to fix her or
>his mistake (i.e., rename backup_label.pending) and restart
>the recovery.
>
> (3) If backup_label.pending exists but recovery.signal doesn't, the server
>ignores (or removes) backup_label.pending and do the recovery
>starting the pg_control's REDO location. This case can happen if
>the server crashes while an exclusive backup is in progress.
>So crash-in-the-middle-of-backup doesn't prevent the recovery from
>starting in this idea.

The if-conditions for 1 and 2 appear to be the same, which is confusing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2019-02-26 Thread Robert Haas
On Fri, Dec 21, 2018 at 6:04 PM David Rowley
 wrote:
> On Fri, 21 Dec 2018 at 09:43, Robert Haas  wrote:
> > - I refactored expand_inherited_rtentry() to drive partition expansion
> > entirely off of PartitionDescs. The reason why this is necessary is
> > that it clearly will not work to have find_all_inheritors() use a
> > current snapshot to decide what children we have and lock them, and
> > then consult a different source of truth to decide which relations to
> > open with NoLock.  There's nothing to keep the lists of partitions
> > from being different in the two cases, and that demonstrably causes
> > assertion failures if you SELECT with an ATTACH/DETACH loop running in
> > the background. However, it also changes the order in which tables get
> > locked.  Possibly that could be fixed by teaching
> > expand_partitioned_rtentry() to qsort() the OIDs the way
> > find_inheritance_children() does.  It also loses the infinite-loop
> > protection which find_all_inheritors() has.  Not sure what to do about
> > that.
>
> I don't think you need to qsort() the Oids before locking. What the
> qsort() does today is ensure we get a consistent locking order. Any
> other order would surely do, providing we stick to it consistently. I
> think PartitionDesc order is fine, as it's consistent.  Having it
> locked in PartitionDesc order I think is what's needed for [1] anyway.
> [2] proposes to relax the locking order taken during execution.
>
> [1] https://commitfest.postgresql.org/21/1778/
> [2] https://commitfest.postgresql.org/21/1887/

Based on this feedback, I went ahead and committed the part of the
previously-posted patch set that makes this change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Fujii Masao
On Tue, Feb 26, 2019 at 3:17 AM David Steele  wrote:
>
> On 2/25/19 7:50 PM, Fujii Masao wrote:
> > On Mon, Feb 25, 2019 at 10:49 PM Laurenz Albe  
> > wrote:
> >>
> >> I'm not playing devil's advocate here to annoy you.  I see the problems
> >> with the exclusive backup, and I see how it can hurt people.
> >> I just think that removing exclusive backup without some kind of help
> >> like Andres sketched above will make people unhappy.
> >
> > +1
> >
> > Another idea is to improve an exclusive backup method so that it will never
> > cause such issue. What about changing an exclusive backup mode of
> > pg_start_backup() so that it creates something like backup_label.pending 
> > file
> > instead of backup_label? Then if the database cluster has 
> > backup_label.pending
> > file but not recovery.signal (this is the case where the database is 
> > recovered
> > just after the server crashes while an exclusive backup is in progress),
> > in this idea, the recovery using that database cluster always ignores
> > (or removes) backup_label.pending file and start replaying WAL from
> > the REDO location that pg_control file indicates. So this idea enables us to
> > work around the issue that an exclusive backup could cause.
>
> It's an interesting idea.
>
> > On the other hand, the downside of this idea is that the users need to 
> > change
> > the recovery procedure. When they want to do PITR using the backup having
> > backup_label.pending, they need to not only create recovery.signal but also
> > rename backup_label.pending to backup_label. Rename of backup_label file
> > is brand-new step for their recovery procedure, and changing the recovery
> > procedure might be painful for some users. But IMO it's less painful than
> > removing an exclusive backup API at all.
>
> Well, given that we have invalidated all prior recovery procedures in
> PG12 I'm not sure how big a deal that is.  Of course, it's too late make
> a change like this for PG12.
>
> > Thought?
>
> Here's the really obvious bad thing: if users do not update their
> procedures and we ignore backup_label.pending on startup then they will
> end up with a corrupt database because it will not replay from the
> correct checkpoint.  If we error on the presence of backup_label.pending
> then we are right back to where we started.

No. In this case, since backup_label.pending and recovery.signal exist,
as I described in my previous post, the server stops the recovery with
a PANIC error before corrupting the database. Then the operator can
rename backup_label.pending to backup_label and restart the recovery
safely.

So, let me clarify the situations;

(1) If backup_label and recovery.signal exist, the recovery starts safely.
   This is the normal case of recovery from the base backup.

(2) If backup_label.pending and recovery.signal exist, as described above,
   a PANIC error happens at the start of recovery. This case can happen
   if the operator forgets to rename backup_label.pending, i.e., an
   operation mistake. So, after the PANIC, the operator needs to fix her or
   his mistake (i.e., rename backup_label.pending) and restart
   the recovery.

(3) If backup_label.pending exists but recovery.signal doesn't, the server
   ignores (or removes) backup_label.pending and does the recovery
   starting from the pg_control's REDO location. This case can happen if
   the server crashes while an exclusive backup is in progress.
   So a crash in the middle of a backup doesn't prevent the recovery from
   starting under this idea.
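
For reference, a sketch of the exclusive-backup calls this idea would
affect (the label is illustrative); under the proposal the first call
would write backup_label.pending instead of backup_label:

-- start an exclusive backup (the third, 'exclusive' argument defaults to true)
SELECT pg_start_backup('nightly', false, true);

-- ... copy the data directory with an external tool ...

SELECT pg_stop_backup();

-- when restoring for PITR, the operator would then rename
-- backup_label.pending to backup_label and create recovery.signal
-- before starting the server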

Regards,

-- 
Fujii Masao



Re: crosstab/repivot...any interest?

2019-02-26 Thread Merlin Moncure
On Tue, Feb 26, 2019 at 8:31 AM Joe Conway  wrote:
> On 2/25/19 8:34 PM, Merlin Moncure wrote:
> > The interface I'm looking at is:
> > SELECT repivot(
> >   query TEXT,
> >   static_attributes INT,  /* number of static attributes that are
> > unchanged around key; we need this in our usages */
> >   attribute_query  TEXT); /* query that draws up the pivoted attribute
> > list */
> >
> > The driving query is expected to return 0+ static attributes which are
> > essentially simply pasted to the output. The next two columns are the
> > key column and the pivoting column.   So if you had three attributes,
> > the input set would be:
> >
> > a1, a2, a3, k1, p, v1...vn
> >
> > Where the coordinates v and p would exchange.  I need to get this done
> > quickly and so am trying to avoid more abstracted designs (maybe multi
> > part keys should be supported through...this is big limitation of
> > crosstab albeit with some obnoxious work arounds).
>
> Perhaps not enough coffee yet, but I am not sure I fully grok the design
> here. A fully fleshed out example would be useful.

Thanks for your interest.

A lot of our application data is organized in columnar-type formats.
Typically, the analytical reports on the back of the database are
organized with the output columns being a KPI or a unit of time.

KPI Columns:
Country,Make,Model,TimeSlice,Sales,Units
US,Ford,Fusion,Y2018Q1,165MM$,250k
US,Ford,Fusion,Y2018Q2,172MM$,261k
US,Ford,Fusion,Y2018Q3,183MM$,268k
...

Time Columns:
Country,Make,Model,KPI,Y2018Q1,Y2018Q2,Y2018Q3
US,Ford,Fusion,Sales,165MM$,172MM$,183MM$
US,Ford,Fusion,Units,250k,261k,268k

SELECT repivot(
  <query>,
  1, /* only one static attribute */
  2, /* number of key columns, only needed if multi part keys are supported */
  <attribute query>);

In this example, which supports multi-part keys, the pivot column is the
4th: the layout is assumed to be attributes first, keys second, pivot
column third, data block last.  Multi-column pivots might be considered,
but I don't need them and that's a significant expansion in scope, so I'm
avoiding that consideration.

What we need to do is convert from the first format above (which is
how the data is materialized on the table) to the second format.  Just as
within crosstab, if the key column(s) come into the function ordered, we
can exploit that structure for an efficient implementation.

'Ford' and 'Fusion' are key columns; 'US' is an uninteresting attribute
column (it is unambiguously represented by 'Ford', 'Fusion') and so
would simply be copied from the input to the output set.

Our data is materialized in KPI style (which is pretty typical) but we
have large analytical reports that want it with columns representing
units of time.  This is fairly common in the industry IMO.

There are various pure SQL approaches to do this, but they tend to
force you to build out the X,Y columns and then join it all back
together; this can spiral out of control quite quickly from a
performance standpoint.  A crosstab-style crawling cursor over an
ordered set ought to get the job done very efficiently.  tablefunc
doesn't support multi-part keys today, and the workaround is that you
have to stuff a key into a formatted string and then parse it out for
downstream joining; some of the need for that joining might hopefully
be eliminated by allowing the attribute columns to be copied through.
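
For comparison, here is a rough sketch of the kind of pure-SQL approach
alluded to above, using the columns from the example (kpi_data is a
hypothetical table holding the first, KPI-column layout); every additional
time slice needs another hand-written column, which is part of why it gets
unwieldy:

-- kpi_data: hypothetical table in the KPI-column layout shown above
SELECT country, make, model, kpi,
       max(val) FILTER (WHERE timeslice = 'Y2018Q1') AS "Y2018Q1",
       max(val) FILTER (WHERE timeslice = 'Y2018Q2') AS "Y2018Q2",
       max(val) FILTER (WHERE timeslice = 'Y2018Q3') AS "Y2018Q3"
FROM (
    SELECT country, make, model, timeslice, 'Sales' AS kpi, sales::text AS val
    FROM kpi_data
    UNION ALL
    SELECT country, make, model, timeslice, 'Units', units::text
    FROM kpi_data
) unpivoted
GROUP BY country, make, model, kpi
ORDER BY country, make, model, kpi;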

merlin



Re: Protect syscache from bloating with negative cache entries

2019-02-26 Thread Robert Haas
On Mon, Feb 25, 2019 at 1:27 AM Kyotaro HORIGUCHI
 wrote:
> > I'd like to see some evidence that catalog_cache_memory_target has any
> > value, vs. just always setting it to zero.
>
> It is artificial (or acutually wont't be repeatedly executed in a
> session) but anyway what can get benefit from
> catalog_cache_memory_target would be a kind of extreme.

I agree.  So then let's not have it.

We shouldn't add more mechanism here than actually has value.  It
seems pretty clear that keeping cache entries that go unused for long
periods can't be that important; even if we need them again
eventually, reloading them every 5 or 10 minutes can't hurt that much.
On the other hand, I think it's also pretty clear that evicting cache
entries that are being used frequently will have disastrous effects on
performance; as I noted in the other email I just sent, consider the
effects of CLOBBER_CACHE_ALWAYS.  No reasonable user is going to want
to incur a massive slowdown to save a little bit of memory.

I see that *in theory* there is a value to
catalog_cache_memory_target, because *maybe* there is a workload where
tuning that GUC will lead to better performance at lower memory usage
than any competing proposal.  But unless we can actually see an
example of such a workload, which so far I don't, we're adding a knob
that everybody has to think about how to tune when in fact we have no
idea how to tune it or whether it even needs to be tuned.  That
doesn't make sense.  We have to be able to document the parameters we
have and explain to users how they should be used.  And as far as this
parameter is concerned I think we are not at that point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Protect syscache from bloating with negative cache entries

2019-02-26 Thread Robert Haas
On Mon, Feb 25, 2019 at 3:50 AM Tsunakawa, Takayuki
 wrote:
> How can I make sure that this context won't exceed, say, 10 MB to avoid OOM?

As Tom has said before and will probably say again, I don't think you
actually want that.  We know that PostgreSQL gets roughly 100x slower
with the system caches disabled - try running with
CLOBBER_CACHE_ALWAYS.  If you are accessing the same system cache
entries repeatedly in a loop - which is not at all an unlikely
scenario, just run the same query or sequence of queries in a loop -
and if the number of entries exceeds 10MB even, perhaps especially, by
just a tiny bit, you are going to see a massive performance hit.
Maybe it won't be 100x because some more-commonly-used entries will
always stay cached, but it's going to be really big, I think.

Now you could say - well it's still better than running out of memory.
However, memory usage is quite unpredictable.  It depends on how many
backends are active and how many copies of work_mem and/or
maintenance_work_mem are in use, among other things.  I don't think we
can say that just imposing a limit on the size of the system caches is
going to be enough to reliably prevent an out of memory condition
unless the other use of memory on the machine happens to be extremely
stable.

So I think what's going to happen if you try to impose a hard-limit on
the size of the system cache is that you will cause some workloads to
slow down by 3x or more without actually preventing out of memory
conditions.  What you need to do is accept that system caches need to
grow as big as they need to grow, and if that causes you to run out of
memory, either buy more memory or reduce the number of concurrent
sessions you allow.  It would be fine to instead limit the cache
memory if those cache entries only had a mild effect on performance,
but I don't think that's the case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: No-rewrite timestamp<->timestamptz conversions

2019-02-26 Thread Tom Lane
Noah Misch  writes:
> Stepping back a bit, commit b8a18ad didn't provide a great UI.  I doubt folks
> write queries this way spontaneously; to do so, they would have needed to
> learn that such syntax enables this optimization.  If I'm going to do
> something more invasive, it should optimize the idiomatic "alter table t alter
> timestamptzcol type timestamp".  One could do that with a facility like
> SupportRequestSimplify except permitted to consider STABLE facts.  I suppose I
> could add a volatility field to SupportRequestSimplify.  So far, I can't think
> of a second use case for such a facility, so instead I think
> ATColumnChangeRequiresRewrite() should have a hard-wired call for
> F_TIMESTAMPTZ_TIMESTAMP and F_TIMESTAMP_TIMESTAMPTZ.  Patch attached.  If we
> find more applications of this concept, it shouldn't be hard to migrate this
> logic into SupportRequestSimplify.  Does anyone think that's better to do from
> the start?

It'd be nice to get the SupportRequestSimplify API correct from the first
release, so if there's even a slightly plausible reason for it to support
this, I'd be inclined to err in the direction of doing so.  On the other
hand, if we really can't think of another use-case then a hard-wired fix
might be the best way.

One thing that we'd have to nail down a bit harder, if we're to add
something to the SupportRequestSimplify API, is exactly what the semantics
of the weaker check should be.  The notion of "stable" has always been a
bit squishy, in that it's not totally clear what span of time stability
of the function result is being promised over.  In the case at hand, for
instance, is it really impossible for the timezone GUC to change during
the execution of the ALTER TABLE command?  You could probably break that
if you tried hard enough, though it seems unlikely that anyone would do so
accidentally.

I also kind of wonder whether this case arises often enough for us
to be expending so much effort optimizing it in the first place.
No doubt, where one lives colors one's opinion of how likely it is
that the timezone GUC is set to UTC ... but my guess is that that's
not true in very many installations.
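
For concreteness, the idiomatic case under discussion looks roughly like
this (a sketch; with the proposed hard-wired check, the rewrite would be
skipped only when the session time zone makes the conversion an identity,
e.g. UTC):

SET timezone = 'UTC';

-- under UTC the stored values are unchanged, so with the proposed check
-- this would not need a table rewrite
ALTER TABLE t ALTER COLUMN timestamptzcol TYPE timestamp;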

regards, tom lane



Re: Offline enabling/disabling of data checksums

2019-02-26 Thread Michael Banck
Hi,

Am Dienstag, den 19.02.2019, 14:02 +0900 schrieb Michael Paquier:
> On Sun, Feb 17, 2019 at 07:31:38PM +0100, Michael Banck wrote:
> > New patch attached.
> 
> - * src/bin/pg_verify_checksums/pg_verify_checksums.c
> + * src/bin/pg_checksums/pg_checksums.c
> That's lacking a rename, or this comment is incorrect.

Right, I started the rename, but then backed off pending further
discussion of whether I should submit that or whether the committer will
just do it.

I've backed those 4 in-file renames out for now.

> +#if PG_VERSION_NUM >= 10
> +   StaticAssertStmt(sizeof(ControlFileData) <= PG_CONTROL_MAX_SAFE_SIZE,
> +"pg_control is too large for atomic disk writes");
> +#endif
> This is compiled with only one version of the control file data, so
> you don't need that.

Oops, yeah.

> Any reason why we don't refactor updateControlFile() into
> controldata_utils.c?  This duplicates the code, at the exception of
> some details.

Ok, I've done that now, and migrated pg_rewind as well.  Do you know of
any other programs that might benefit here?

This could/should probably be committed separately beforehand.

New patch attached.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing Directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_verify_checksums.sgml b/doc/src/sgml/ref/pg_verify_checksums.sgml
index 905b8f1222..a565cb52ae 100644
--- a/doc/src/sgml/ref/pg_verify_checksums.sgml
+++ b/doc/src/sgml/ref/pg_verify_checksums.sgml
@@ -16,7 +16,7 @@ PostgreSQL documentation
 
  
   pg_verify_checksums
-  verify data checksums in a PostgreSQL database cluster
+  enable, disable or verify data checksums in a PostgreSQL database cluster
  
 
  
@@ -25,6 +25,11 @@ PostgreSQL documentation
option

 
+ --check
+ --disable
+ --enable
+
+
  -D
  --pgdata
 
@@ -36,10 +41,18 @@ PostgreSQL documentation
  
   Description
   
-   pg_verify_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_verify_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   pg_verify_checksums enables, disables or verifies data
+   checksums in a PostgreSQL cluster.  The server
+   must be shut down cleanly before running
+   pg_verify_checksums.
+   The exit status is zero if there are no checksum errors or checksum
+   enabling/disabling was successful, otherwise nonzero.
+  
+
+  
+   While checking or enabling checksums needs to scan or write every file in
+   the cluster, disabling will only update the pg_control
+   file.  
   
  
 
@@ -61,6 +74,36 @@ PostgreSQL documentation
  
 
  
+  -c
+  --check
+  
+   
+Verify checksums.
+   
+  
+ 
+
+ 
+  -d
+  --disable
+  
+   
+Disable checksums.
+   
+  
+ 
+
+ 
+  -e
+  --enable
+  
+   
+Enable checksums.
+   
+  
+ 
+
+ 
   -v
   --verbose
   
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index aa753bb315..2420aef870 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -37,7 +37,6 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 
 static void digestControlFile(ControlFileData *ControlFile, char *source,
   size_t size);
-static void updateControlFile(ControlFileData *ControlFile);
 static void syncTargetDirectory(void);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
@@ -377,7 +376,7 @@ main(int argc, char **argv)
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
-	updateControlFile(&ControlFile_new);
+	update_controlfile(datadir_target, progname, &ControlFile_new);
 
 	pg_log(PG_PROGRESS, "syncing target data directory\n");
 	syncTargetDirectory();
@@ -667,45 +666,6 @@ digestControlFile(ControlFileData *ControlFile, char *src, size_t size)
 }
 
 /*
- * Update the target's control file.
- */
-static void
-updateControlFile(ControlFileData *ControlFile)
-{
-	char		buffer[PG_CONTROL_FILE_SIZE];
-
-	/*
-	 * For good luck, apply the same static assertions as in backend's
-	 * WriteControlFile().
-	 */
-	StaticAssertStmt(sizeof(ControlFileData) <= PG_CONTROL_MAX_SAFE_SIZE,
-	 "pg_control is too large for atomic disk writes");
-	StaticAssertStmt(sizeof(ControlFileData) <= PG_CONTROL_FILE_SIZE,
-	 "sizeof(ControlFileData) exceeds PG_CONTROL_FILE_SIZE");
-
-	/* Recalculate CRC of control file */
-	

Re: WIP: Avoid creation of the free space map for small tables

2019-02-26 Thread Petr Jelinek
On 26/02/2019 16:20, Alvaro Herrera wrote:
> On 2019-Feb-23, John Naylor wrote:
> 
>> On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila  wrote:
>>> The reason for not pushing much on making the test pass for
>>> nonstandard block sizes is that when I tried existing tests, there
>>> were already some failures.
>>
>> FWIW, I currently see 8 failures (attached).
> 
> Hmm, not great -- even the strings test fails, which seems to try to handle
> the case explicitly.  I did expect the plan shape ones to fail, but I'm
> surprised about the tablesample one.
> 

The SYSTEM table sampling is basically per-page sampling so it depends
heavily on which rows are on which page.
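
For illustration, a generic sketch (some_table is a placeholder, not one of
the failing regression queries): SYSTEM sampling picks whole pages and
returns every row on them, so its output shifts whenever a different block
size repacks rows onto pages:

-- roughly 10% of the table's pages, with all rows on each chosen page
SELECT * FROM some_table TABLESAMPLE SYSTEM (10);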

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP: Avoid creation of the free space map for small tables

2019-02-26 Thread Alvaro Herrera
On 2019-Feb-23, John Naylor wrote:

> On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila  wrote:
> > The reason for not pushing much on making the test pass for
> > nonstandard block sizes is that when I tried existing tests, there
> > were already some failures.
> 
> FWIW, I currently see 8 failures (attached).

Hmm, not great -- even the strings test fails, which seems to try to handle
the case explicitly.  I did expect the plan shape ones to fail, but I'm
surprised about the tablesample one.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Early WIP/PoC for inlining CTEs

2019-02-26 Thread Tom Lane
Andrew Gierth  writes:
> Here, uncommenting that NOT actually changes the result, from 22 rows to
> 4 rows, because we end up generating multiple worktable scans and the
> recursion logic is not set up to handle that.

Ugh.

> So what I think we need to do here is to forbid inlining if (a) the
> refcount is greater than 1 and (b) the CTE in question contains,
> recursively anywhere inside its rtable or the rtables of any of its
> nested CTEs, a "self_reference" RTE.

That's kind of "ugh" too: it sounds expensive, and doing it in a way
that doesn't produce false positives would be even more complicated.

Idle uncaffeinated speculation: is it practical to fix the restriction
about multiple worktable scans?

Also, I thought of a somewhat-related scenario that the code isn't
accounting for: you can break the restrictions about single evaluation
with nested WITHs, like

with x as not materialized (with y as materialized (select random() r) select * from y)
select * from x, x x1;

In this particular example, we're saved from computing random() twice
by the checks for volatile functions.  But without that, y is inlined
and computed twice, e.g.

explain verbose with x as not materialized (with y as (select now() r) select * from y)
select * from x, x x1;
                   QUERY PLAN                   
------------------------------------------------
 Nested Loop  (cost=0.00..0.06 rows=1 width=16)
   Output: (now()), (now())
   ->  Result  (cost=0.00..0.01 rows=1 width=8)
         Output: now()
   ->  Result  (cost=0.00..0.01 rows=1 width=8)
         Output: now()
(6 rows)

As a user I think I'd find that surprising, and bad if y were expensive.

Is it practical to inline the outer "x" level and still compute "y"
only once?  If not, I think we need to disallow inlining anything
that contains a "with".

regards, tom lane



Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-02-26 Thread Kuntal Ghosh
On Tue, Feb 26, 2019 at 6:46 PM Simon Riggs  wrote:
>
> On Thu, 21 Feb 2019 at 15:38, Kuntal Ghosh  wrote:
>
>>
>> Thank you for the patch. It seems to me that while performing COPY
>> FREEZE, if we've copied tuples in a previously emptied page
>
>
> There won't be any previously emptied pages because of the pre-conditions 
> required for FREEZE.
>
Right, I missed that part. Thanks for pointing that out. But this
optimization is still possible when copying frozen tuples into a newly
created page, right? If the current backend allocates a new page and copies
a bunch of frozen tuples into that page, it can set PD_ALL_VISIBLE in
the same operation.
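
For context, a minimal sketch of the precondition in question (table and
source names are illustrative): FREEZE only applies when the target was
created or truncated in the same transaction, so only pages the backend
itself extends are involved:

BEGIN;
CREATE TABLE copy_target (LIKE source_table);  -- or TRUNCATE an existing table
COPY copy_target FROM STDIN WITH (FREEZE, FORMAT csv);
COMMIT;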



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: crosstab/repivot...any interest?

2019-02-26 Thread Joe Conway
On 2/25/19 8:34 PM, Merlin Moncure wrote:
> No worries, sir! Apologies on the late reply.  I've made some headway on
> this item.  Waiting for postgres to implement the SQL standard pivoting
> (regardless if it implements the cases I need) is not an option for my
> personal case. I can't use the SQL approach either as it's very slow and
> imposing some scaling limits that need to work around in the short run.
> 
> My strategy is to borrow [steal] from crosstab_hash and make a new
> version called repivot which takes an arleady pivoted data set and
> repivots it against an identified column.   Most of the code can be
> shared with tablefunc so ideally this could live as an amendment to that
> extension.  That might not happen though, so I'm going to package it as
> a separate extension (removing the majority of tablefunc that I don't
> need) and submit it to this group for consideration.


I can't promise there will be consensus to add to tablefunc, but I am
not opposed and will be happy to try to help you make that happen to the
extent I can find the spare cycles.

> If we punt, it'll end up as a private extension or live the life of an
> Orphan in pgxn.  If there's some interest here, we can consider a new
> contrib extension (which I personally rate very unlikely) or recasting
> as an extra routine to tablefunc.  Any way we slice it, huge thanks to
> Joe Conway for giving us such an awesome function to work with all
> these years (not to mention the strategic plr language).  SRF crosstab()
> is still somewhat baroque, but it still fills a niche that nothing else
> implements.
> 
> The interface I'm looking at is:
> SELECT repivot(
>   query TEXT,
>   static_attributes INT,  /* number of static attributes that are
> unchanged around key; we need this in our usages */
>   attribute_query  TEXT); /* query that draws up the pivoted attribute
> list */
> 
> The driving query is expected to return 0+ static attributes which are
> essentially simply pasted to the output. The next two columns are the
> key column and the pivoting column.   So if you had three attributes,
> the input set would be:
> 
> a1, a2, a3, k1, p, v1...vn
> 
> Where the coordinates v and p would exchange.  I need to get this done
> quickly and so am trying to avoid more abstracted designs (maybe multi
> part keys should be supported through...this is big limitation of
> crosstab albeit with some obnoxious work arounds).

Perhaps not enough coffee yet, but I am not sure I fully grok the design
here. A fully fleshed out example would be useful.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature

