Re: Slow planning time for simple query

2018-06-16 Thread Amit Kapila
On Thu, Jun 14, 2018 at 4:34 AM, Maksim Milyutin  wrote:
> 13.06.2018 12:40, Maksim Milyutin wrote:
>
> On 09.06.2018 22:49, Tom Lane wrote:
>
> Maksim Milyutin  writes:
>
> On hot standby I faced a similar problem.
> ...
> is planned in 4.940 ms on master and *254.741* ms on standby.
>
> (I wonder though why, if you executed the same query on the master,
> its setting of the index-entry-is-dead bits didn't propagate to the
> standby.)
>
>
> I have verified the number of dead item pointers (through the pageinspect
> extension) in the first leaf page of the index participating in the query
> ('main.message_instance_pkey') on the master and standby nodes and have
> noticed a big difference.
>
> SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);
>
> On master:
>
>  blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
> -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
>   3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65
>
> On standby:
>
>  blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
> -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
>   3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1
>
>
>
> At this point I want to highlight the issue that the changes in lp_flags
> bits (namely, setting items as dead) for index item pointers don't propagate
> from master to replica in my case. As a consequence, on the standby I have
> live index items, most of which are marked as dead on the master. And at the
> planning stage my queries are forced to descend to heap pages under
> get_actual_variable_range() execution, which considerably slows down planning.
>
> Is it a bug, an implementation restriction, or a misconfiguration of
> WAL/replication?
>

It is not a misconfiguration issue.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Slow planning time for simple query

2018-06-16 Thread Amit Kapila
On Sun, Jun 10, 2018 at 1:19 AM, Tom Lane  wrote:
> Maksim Milyutin  writes:
>> On hot standby I faced a similar problem.
>> ...
>> is planned in 4.940 ms on master and *254.741* ms on standby.
>
> Presumably the problem is that the standby isn't authorized to change
> the btree index's "entry is dead" bits,
>

I don't see anything like that in the code.  We use _bt_killitems to
mark the items as dead, and neither that function nor any of its callers
has any such assumption.

> so it's unable to prune index
> entries previously detected as dead, and thus the logic that intends
> to improve this situation doesn't work on the standby.
>

If my above understanding is correct, then one thing that could lead
to such behavior is the computation of RecentGlobalXmin on the standby.
Basically, if RecentGlobalXmin has a different value on the standby,
then it is possible that the decision whether a particular item is
dead differs between master and standby.

> (I wonder though why, if you executed the same query on the master,
> its setting of the index-entry-is-dead bits didn't propagate to the
> standby.)
>

Because we don't WAL log it.  See _bt_killitems.
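
To make that concrete, the tail of _bt_killitems() follows roughly this
pattern (a paraphrased sketch of nbtutils.c, not the verbatim source):

/* paraphrased from _bt_killitems(); see nbtutils.c for the real loop */
ItemId      iid = PageGetItemId(page, offnum);

ItemIdMarkDead(iid);                    /* set LP_DEAD on the index item */
opaque->btpo_flags |= BTP_HAS_GARBAGE;  /* page now has known-dead items */

/*
 * Hint-bit-style dirtying: no XLogInsert() is issued, so the change is
 * never streamed to standbys.  (With checksums or wal_log_hints enabled
 * it may ride along in a full-page image, but that is incidental.)
 */
MarkBufferDirtyHint(so->currPos.buf, true);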

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Removing "Included attributes in B-tree indexes" section from docs

2018-06-16 Thread Alvaro Herrera
On 2018-Jun-15, Peter Geoghegan wrote:

> I propose removing the "Included attributes in B-tree indexes"
> top-level section of chapter 63 from the user facing documentation.

Hi Peter,

I don't necessarily object to the proposed change, but I think you
should generally wait a bit longer for others to react.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-16 Thread Andres Freund
On 2018-06-16 23:31:49 +0300, Konstantin Knizhnik wrote:
> 
> 
> On 16.06.2018 22:23, Andres Freund wrote:
> > Hi,
> > 
> > On 2018-06-13 16:09:45 +0300, Konstantin Knizhnik wrote:
> > > Usage:
> > > 1. At master: create extension wal_prefetch
> > > 2. At replica: Call pg_wal_prefetch() function: it will not return until 
> > > you
> > > interrupt it.
> > FWIW, I think the proper design would rather be a background worker that
> > does this work. Forcing the user to somehow coordinate starting a
> > permanently running script whenever the database restarts isn't
> > great. There's also some issues around snapshots preventing vacuum
> > (which could be solved, but not nicely).
> 
> As I already wrote, my current approach with an extension and a
> pg_wal_prefetch function called by the user should be treated only as a
> prototype implementation, used to estimate the efficiency of prefetch. But in
> the case of prefetching into shared buffers, one background worker will not
> be enough. Prefetch can speed up the recovery process if it performs reads in
> parallel or in the background. So more than one background worker will be
> needed for prefetch if we read data into Postgres shared buffers rather than
> using posix_fadvise to load pages into the OS cache.

Sure, we'd need more than one to get the full benefit, but that's not
really hard.  You'd see benefit even with a single process, because WAL
replay often has a lot of other bottlenecks too. But there's no reason not
to have multiple ones.
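
For illustration, registering such a pool from an extension's _PG_init()
could look roughly like this (a sketch only: the worker body
wal_prefetch_worker_main, the library name, and the count of four workers
are all invented):

#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"

PG_MODULE_MAGIC;

void _PG_init(void);

/* hypothetical worker body, implemented elsewhere in the extension */
extern PGDLLEXPORT void wal_prefetch_worker_main(Datum main_arg);

void
_PG_init(void)
{
    BackgroundWorker worker;
    int         i;

    /*
     * Register a small pool of prefetch workers.  The postmaster starts and
     * restarts them automatically, so the user no longer has to launch a
     * script after every database restart.
     */
    for (i = 0; i < 4; i++)
    {
        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
        /* start once the standby has reached a consistent state */
        worker.bgw_start_time = BgWorkerStart_ConsistentState;
        worker.bgw_restart_time = 5;    /* seconds between restarts */
        snprintf(worker.bgw_name, BGW_MAXLEN, "wal prefetch worker %d", i);
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "wal_prefetch");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "wal_prefetch_worker_main");
        worker.bgw_main_arg = Int32GetDatum(i);
        RegisterBackgroundWorker(&worker);
    }
}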

Greetings,

Andres Freund



Re: WAL prefetch

2018-06-16 Thread Andres Freund
On 2018-06-16 23:25:34 +0300, Konstantin Knizhnik wrote:
> 
> 
> On 16.06.2018 22:02, Andres Freund wrote:
> > On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
> > > 
> > > On 06/15/2018 08:01 PM, Andres Freund wrote:
> > > > On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
> > > > > 
> > > > > On 14.06.2018 09:52, Thomas Munro wrote:
> > > > > > On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
> > > > > >  wrote:
> > > > > > > pg_wal_prefetch function will infinitely traverse WAL and 
> > > > > > > prefetch block
> > > > > > > references in WAL records
> > > > > > > using posix_fadvise(WILLNEED) system call.
> > > > > > Hi Konstantin,
> > > > > > 
> > > > > > Why stop at the page cache...  what about shared buffers?
> > > > > > 
> > > > > It is good question. I thought a lot about prefetching directly to 
> > > > > shared
> > > > > buffers.
> > > > I think that's definitely how this should work.  I'm pretty strongly
> > > > opposed to a prefetching implementation that doesn't read into s_b.
> > > > 
> > > Could you elaborate why prefetching into s_b is so much better (I'm sure 
> > > it
> > > has advantages, but I suppose prefetching into page cache would be much
> > > easier to implement).
> > I think there's a number of issues with just issuing prefetch requests
> > via fadvise etc:
> > 
> > - it leads to guaranteed double buffering, in a way that's just about
> >guaranteed to *never* be useful. Because we'd only prefetch whenever
> >there's an upcoming write, there's simply no benefit in the page
> >staying in the page cache - we'll write out the whole page back to the
> >OS.
> 
> Sorry, I do not completely understand this.

> Prefetching is only needed for a partial update of a page - in this case we
> need to first read the page from disk

Yes.


> before being able to perform the update. So before "we'll write out the whole
> page back to the OS" we have to read this page.
> And if the page is in the OS cache (prefetched), then this can be done much faster.

Yes.


> Please notice that at the moment of prefetch there is no double
> buffering.

Sure, but as soon as it's read there is.


> Since the page has not been accessed before, it is not present in shared
> buffers. And once the page is updated, there is really no need to keep it in
> shared buffers. We can use cyclic buffers (as in the case of sequential scans
> or bulk updates) to prevent the redo process from throwing away useful pages
> from shared buffers. So once again there will be no double buffering.

That's a terrible idea. There's a *lot* of spatial locality of further
WAL records arriving for the same blocks.


> I am not so familiar with the current implementation of the full-page-write
> mechanism in Postgres,
> so maybe my idea explained below is stupid or already implemented (but I
> failed to find any traces of this).
> Prefetching is needed only for WAL records performing partial updates; a full
> page write doesn't require prefetch.
> A full page write has to be performed when the page is updated for the first
> time after a checkpoint.
> But what if we slightly extend this rule and perform a full page write also
> when the distance from the previous full page write exceeds some delta
> (somehow related to the size of the OS cache)?
>
> In this case, even if the checkpoint interval is larger than the OS cache
> size, we can still expect that updated pages are present in the OS cache.
> And then no WAL prefetch is needed at all!

We could do so, but I suspect the WAL volume penalty would be
prohibitive in many cases. Worthwhile to try though.

Greetings,

Andres Freund



Re: WAL prefetch

2018-06-16 Thread Konstantin Knizhnik




On 16.06.2018 22:23, Andres Freund wrote:
> Hi,
>
> On 2018-06-13 16:09:45 +0300, Konstantin Knizhnik wrote:
>> Usage:
>> 1. At master: create extension wal_prefetch
>> 2. At replica: Call pg_wal_prefetch() function: it will not return until you
>> interrupt it.
>
> FWIW, I think the proper design would rather be a background worker that
> does this work. Forcing the user to somehow coordinate starting a
> permanently running script whenever the database restarts isn't
> great. There's also some issues around snapshots preventing vacuum
> (which could be solved, but not nicely).


As I already wrote, my current approach with an extension and a
pg_wal_prefetch function called by the user should be treated only as a
prototype implementation, used to estimate the efficiency of prefetch. But
in the case of prefetching into shared buffers, one background worker will
not be enough. Prefetch can speed up the recovery process if it performs
reads in parallel or in the background. So more than one background worker
will be needed for prefetch if we read data into Postgres shared buffers
rather than using posix_fadvise to load pages into the OS cache.




Greetings,

Andres Freund





Re: WAL prefetch

2018-06-16 Thread Konstantin Knizhnik




On 16.06.2018 22:02, Andres Freund wrote:
> On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
>> On 06/15/2018 08:01 PM, Andres Freund wrote:
>>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>>> On 14.06.2018 09:52, Thomas Munro wrote:
>>>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>>>>>  wrote:
>>>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>>>>>> references in WAL records
>>>>>> using posix_fadvise(WILLNEED) system call.
>>>>>
>>>>> Hi Konstantin,
>>>>>
>>>>> Why stop at the page cache...  what about shared buffers?
>>>>
>>>> It is good question. I thought a lot about prefetching directly to shared
>>>> buffers.
>>>
>>> I think that's definitely how this should work.  I'm pretty strongly
>>> opposed to a prefetching implementation that doesn't read into s_b.
>>
>> Could you elaborate why prefetching into s_b is so much better (I'm sure it
>> has advantages, but I suppose prefetching into page cache would be much
>> easier to implement).
>
> I think there's a number of issues with just issuing prefetch requests
> via fadvise etc:
>
> - it leads to guaranteed double buffering, in a way that's just about
>   guaranteed to *never* be useful. Because we'd only prefetch whenever
>   there's an upcoming write, there's simply no benefit in the page
>   staying in the page cache - we'll write out the whole page back to the
>   OS.


Sorry, I do not completely understand this.
Prefetching is only needed for a partial update of a page - in this case
we need to first read the page from disk before being able to perform the
update. So before "we'll write out the whole page back to the OS" we have
to read this page.

And if the page is in the OS cache (prefetched), then this can be done
much faster.

Please notice that at the moment of prefetch there is no double buffering.
Since the page has not been accessed before, it is not present in shared
buffers. And once the page is updated, there is really no need to keep it
in shared buffers. We can use cyclic buffers (as in the case of sequential
scans or bulk updates) to prevent the redo process from throwing away
useful pages from shared buffers. So once again there will be no double
buffering.

> - reading from the page cache is far from free - so you add costs to the
>   replay process that it doesn't need to do.
> - you don't have any sort of completion notification, so you basically
>   just have to guess how far ahead you want to read. If you read a bit
>   too much you suddenly get into synchronous blocking land.
> - The OS page cache is actually not particularly scalable to large amounts
>   of data either. Nor are the decisions what to keep cached likely to be
>   particularly useful.
> - We imo need to add support for direct IO before long, and adding more
>   and more work to reach feature parity strikes me as a bad move.


I am not so familiar with the current implementation of the full-page-write
mechanism in Postgres, so maybe my idea explained below is stupid or
already implemented (but I failed to find any traces of this).

Prefetching is needed only for WAL records performing partial updates; a
full page write doesn't require prefetch. A full page write has to be
performed when the page is updated for the first time after a checkpoint.
But what if we slightly extend this rule and perform a full page write
also when the distance from the previous full page write exceeds some
delta (somehow related to the size of the OS cache)?

In this case, even if the checkpoint interval is larger than the OS cache
size, we can still expect that updated pages are present in the OS cache.
And then no WAL prefetch is needed at all!
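
A hypothetical sketch of that extended rule (all names here are invented;
the real decision is made in the WAL insertion code, which today tests
only the first condition):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Hypothetical: emit a full page image either on the first modification
 * after a checkpoint (the existing rule), or when more than fpw_distance
 * bytes of WAL have been generated since this page's last full image.
 */
static bool
need_full_page_image(XLogRecPtr page_lsn, XLogRecPtr redo_ptr,
                     XLogRecPtr last_fpw_lsn, XLogRecPtr insert_lsn,
                     uint64_t fpw_distance)
{
    if (page_lsn <= redo_ptr)
        return true;            /* first touch after checkpoint */
    if (insert_lsn - last_fpw_lsn > fpw_distance)
        return true;            /* proposed extension */
    return false;
}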




Re: [HACKERS] GUC for cleanup indexes threshold.

2018-06-16 Thread Komяpa
Hi!

It is cool to see this in Postgres 11. However:


> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
> Default value is 0.1.  So, by default a cleanup scan is triggered after the
> table size increases by 10%.
>

vacuum_cleanup_index_scale_factor can be set to a maximum of 100.
I imagine that on a large append-only table with a constrained storage IOPS
budget, I might want to never perform a full scan of the index.
Roughly, with the parameter set to 100, if we vacuum the table for the first
time with 1 tuple and 130-byte-wide rows, we'll have a full scan at 130
bytes, 12 kbytes, 1.2MB, 123MB, 12 GB, 1.2TB.

If we happen to perform the first vacuum when there are 4 tuples in the
table, it becomes 52kb, 5MB, 495MB, 48GB - and both 12GB and 48GB will
exhaust any storage spike IOPS budget, slowing everything down rather
suddenly.
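
The arithmetic behind these numbers - each cleanup scan permits roughly a
(1 + scale_factor)-fold growth before the next one - is easy to reproduce
(the 130-byte starting size is taken from the example above):

#include <stdio.h>

int
main(void)
{
    double  scale = 100.0;  /* vacuum_cleanup_index_scale_factor maximum */
    double  size = 130.0;   /* table size at first vacuum: one 130-byte row */
    int     i;

    /* each cleanup scan allows ~100x growth before the next one */
    for (i = 0; i < 6; i++)
    {
        printf("full index scan #%d at roughly %.0f bytes\n", i + 1, size);
        size *= 1.0 + scale;
    }
    return 0;
}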

Can the upper limit for this GUC be lifted, or have a value for "never"?


Re: WAL prefetch

2018-06-16 Thread Andres Freund
Hi,

On 2018-06-16 21:34:30 +0200, Tomas Vondra wrote:
> > - it leads to guaranteed double buffering, in a way that's just about
> >   guaranteed to *never* be useful. Because we'd only prefetch whenever
> >   there's an upcoming write, there's simply no benefit in the page
> >   staying in the page cache - we'll write out the whole page back to the
> >   OS.
> 
> How does reading directly into shared buffers substantially change the
> behavior? The only difference is that we end up with the double
> buffering after performing the write. Which is expected to happen pretty
> quick after the read request.

Random reads performed directly in response to a read() request can be
cached differently - and we could trivially force that with another
fadvise() - than pages prefetched with posix_fadvise(WILLNEED).  There's
pretty much no other case - so far - where we know as clearly as here
that we won't re-read the page until we write it.
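
For illustration, the "another fadvise()" could look roughly like this (a
hypothetical sketch, not something the patch currently does):

#define _XOPEN_SOURCE 600       /* expose posix_fadvise() and pread() */
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical: read a block we need exactly once (to apply a WAL update)
 * and immediately tell the kernel the cached copy will not be re-read.
 */
ssize_t
read_block_once(int fd, void *buf, off_t offset)
{
    ssize_t nread = pread(fd, buf, 8192, offset);

    if (nread == 8192)
        (void) posix_fadvise(fd, offset, 8192, POSIX_FADV_DONTNEED);
    return nread;
}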


> > - you don't have any sort of completion notification, so you basically
> >   just have to guess how far ahead you want to read. If you read a bit
> >   too much you suddenly get into synchronous blocking land.
> > - The OS page cache is actually not particularly scalable to large amounts
> >   of data either. Nor are the decisions what to keep cached likely to be
> >   particularly useful.
> 
> The posix_fadvise approach is not perfect, no doubt about that. But it
> works pretty well for bitmap heap scans, and it's about 13249x better
> (rough estimate) than the current solution (no prefetching).

Sure, but investing in an architecture we know might not live long also
has its cost.  Especially if it's not that complicated to do better.


> My point was that I don't think this actually adds a significant amount
> of work to the direct IO patch, as we already do prefetch for bitmap
> heap scans. So this needs to be written anyway, and I'd expect those two
> places to share most of the code. So where's the additional work?

I think it's largely separate from what we'd do for bitmap
index scans.

Greetings,

Andres Freund



Re: WAL prefetch

2018-06-16 Thread Tomas Vondra
On 06/16/2018 09:02 PM, Andres Freund wrote:
> On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
>>
>>
>> On 06/15/2018 08:01 PM, Andres Freund wrote:
>>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>>>
>>>> On 14.06.2018 09:52, Thomas Munro wrote:
>>>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>>>>>  wrote:
>>>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>>>>>> references in WAL records
>>>>>> using posix_fadvise(WILLNEED) system call.
>>>>> Hi Konstantin,
>>>>>
>>>>> Why stop at the page cache...  what about shared buffers?
>>>>>
>>>> It is good question. I thought a lot about prefetching directly to shared
>>>> buffers.
>>>
>>> I think that's definitely how this should work.  I'm pretty strongly
>>> opposed to a prefetching implementation that doesn't read into s_b.
>>>
>>
>> Could you elaborate why prefetching into s_b is so much better (I'm sure it
>> has advantages, but I suppose prefetching into page cache would be much
>> easier to implement).
> 
> I think there's a number of issues with just issuing prefetch requests
> via fadvise etc:
> 
> - it leads to guaranteed double buffering, in a way that's just about
>   guaranteed to *never* be useful. Because we'd only prefetch whenever
>   there's an upcoming write, there's simply no benefit in the page
>   staying in the page cache - we'll write out the whole page back to the
>   OS.

How does reading directly into shared buffers substantially change the
behavior? The only difference is that we end up with the double
buffering after performing the write. Which is expected to happen pretty
quick after the read request.

> - reading from the page cache is far from free - so you add costs to the
>   replay process that it doesn't need to do.
> - you don't have any sort of completion notification, so you basically
>   just have to guess how far ahead you want to read. If you read a bit
>   too much you suddenly get into synchronous blocking land.
> - The OS page cache is actually not particularly scalable to large amounts
>   of data either. Nor are the decisions what to keep cached likely to be
>   particularly useful.

The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).

> - We imo need to add support for direct IO before long, and adding more
>   and more work to reach feature parity strikes me as a bad move.
> 

IMHO it's unlikely to happen in PG12, but I might be over-estimating the
invasiveness and complexity of the direct I/O change. This patch, on the
other hand, seems pretty doable, and the improvements are pretty significant.

My point was that I don't think this actually adds a significant amount
of work to the direct IO patch, as we already do prefetch for bitmap
heap scans. So this needs to be written anyway, and I'd expect those two
places to share most of the code. So where's the additional work?

I don't think we should reject patches just because it might add a bit
of work to some not-yet-written future patch ... (which I however don't
think is this case).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-16 Thread Andres Freund
Hi,

On 2018-06-13 16:09:45 +0300, Konstantin Knizhnik wrote:
> Usage:
> 1. At master: create extension wal_prefetch
> 2. At replica: Call pg_wal_prefetch() function: it will not return until you
> interrupt it.

FWIW, I think the proper design would rather be a background worker that
does this work. Forcing the user to somehow coordinate starting a
permanently running script whenever the database restarts isn't
great. There's also some issues around snapshots preventing vacuum
(which could be solved, but not nicely).

Greetings,

Andres Freund



Re: GCC 8 warnings

2018-06-16 Thread Tom Lane
Andres Freund  writes:
> On 2018-06-16 13:29:55 -0400, Tom Lane wrote:
>> +  # Similarly disable useless truncation warnings from gcc 8+
>> +  PGAC_PROG_CC_VAR_OPT(NOT_THE_CFLAGS, [-Wformat-truncation])
>> +  if test -n "$NOT_THE_CFLAGS"; then
>> +    CFLAGS="$CFLAGS -Wno-format-truncation"
>> +  fi
>> +  PGAC_PROG_CC_VAR_OPT(NOT_THE_CFLAGS, [-Wstringop-truncation])
>> +  if test -n "$NOT_THE_CFLAGS"; then
>> +    CFLAGS="$CFLAGS -Wno-stringop-truncation"
>> +  fi

> I've not had a lot of coffee yet this morning, but couldn't there be an
> issue where a compiler supported one of these flags but not the other?
> Because NOT_THE_CFLAGS is reused, it'd trigger later on as well, right?
> Seems to me we'd need to reset it.

Yeah, I found that out as soon as I checked this on another platform ;-).
Will fix.

regards, tom lane



Re: GCC 8 warnings

2018-06-16 Thread Andres Freund
Hi,

On 2018-06-16 13:29:55 -0400, Tom Lane wrote:
> I propose the attached patch to disable these warnings if the compiler
> knows the switch for them.  I did not turn them off for CXX though;
> anyone think there's a need to?

No, not for now.  I don't think it's likely that the amount of C++ will
grow significantly anytime soon. And it seems unlikely that the llvm
interfacing code will use C stringops.  Not that I think it'd hurt much
to add it.


> Probably all of this ought to be back-patched as applicable, since
> people will doubtless try to compile back branches with gcc 8.

Yea, it's already pretty annoying.


> diff --git a/configure.in b/configure.in
> index 862d8b128d..dae29a4ab1 100644
> --- a/configure.in
> +++ b/configure.in
> @@ -502,6 +502,15 @@ if test "$GCC" = yes -a "$ICC" = no; then
>if test -n "$NOT_THE_CFLAGS"; then
>      CFLAGS="$CFLAGS -Wno-unused-command-line-argument"
>fi
> +  # Similarly disable useless truncation warnings from gcc 8+
> +  PGAC_PROG_CC_VAR_OPT(NOT_THE_CFLAGS, [-Wformat-truncation])
> +  if test -n "$NOT_THE_CFLAGS"; then
> +    CFLAGS="$CFLAGS -Wno-format-truncation"
> +  fi
> +  PGAC_PROG_CC_VAR_OPT(NOT_THE_CFLAGS, [-Wstringop-truncation])
> +  if test -n "$NOT_THE_CFLAGS"; then
> +    CFLAGS="$CFLAGS -Wno-stringop-truncation"
> +  fi

I've not had a lot of coffee yet this morning, but couldn't there be an
issue where a compiler supported one of these flags but not the other?
Because NOT_THE_CFLAGS is reused, it'd trigger later on as well, right?
Seems to me we'd need to reset it.

Greetings,

Andres Freund



Re: GCC 8 warnings

2018-06-16 Thread Tom Lane
Andres Freund  writes:
> On 2018-04-28 12:16:39 -0400, Tom Lane wrote:
>> In the meantime, I think our response to GCC 8 should just be to
>> enable -Wno-format-truncation.  Certainly so in the back branches.
>> There isn't one of these changes that is actually fixing a bug,
>> which to me says that that warning is unhelpful.

> Agreed. Or at least turn down its aggressiveness to the cases where it's
> actually sure truncation happens.

I got around to poking into this today.  Unfortunately, it doesn't seem
that there's any more-conservative level of -Wformat-truncation available.
Likewise for -Wstringop-truncation, which is the other rich new source
of useless warnings (it appears to be predicated on the assumption that
you never actually want the defined semantics of strncpy).  Hence,
I propose the attached patch to disable these warnings if the compiler
knows the switch for them.  I did not turn them off for CXX though;
anyone think there's a need to?

Testing with Fedora 28's gcc (currently 8.1.1), this leaves three new
warnings, which seem worth fixing.  Two of them are gratuitous use of
strncpy where memcpy would serve, in ecpg, and one is an sprintf in
pg_waldump that should be snprintf for safety's sake:

common.c: In function 'pgtypes_fmt_replace':
common.c:48:5: warning: 'strncpy' specified bound depends on the length of the 
source argument [-Wstringop-overflow=]
 strncpy(*output, replace_val.str_val, i + 1);
 ^~~~
common.c:42:8: note: length computed here
i = strlen(replace_val.str_val);
^~~
In function 'get_char_item',
inlined from 'ECPGget_desc' at descriptor.c:334:10:
descriptor.c:221:6: warning: 'strncpy' specified bound depends on the length of 
the source argument [-Wstringop-overflow=]
  strncpy(variable->arr, value, strlen(value));
  ^~~~
compat.c: In function 'timestamptz_to_str':
compat.c:61:19: warning: '%06d' directive writing between 6 and 7 bytes into a 
region of size between 0 and 128 [-Wformat-overflow=]
  sprintf(buf, "%s.%06d %s", ts, (int) (dt % USECS_PER_SEC), zone);
   ^~~~
compat.c:61:15: note: directive argument in the range [-99, 99]
  sprintf(buf, "%s.%06d %s", ts, (int) (dt % USECS_PER_SEC), zone);
   ^~~~
compat.c:61:2: note: 'sprintf' output between 9 and 266 bytes into a 
destination of size 129
  sprintf(buf, "%s.%06d %s", ts, (int) (dt % USECS_PER_SEC), zone);
  ^~~~
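
The two ecpg fixes are mechanical: when the copy bound is already derived
from strlen() of the source, strncpy() buys nothing over memcpy().  A toy
illustration of the warned-about pattern and the fix (not the actual ecpg
code):

#include <stdlib.h>
#include <string.h>

static char *
copy_value(const char *src)
{
    size_t  len = strlen(src);
    char   *out = malloc(len + 1);

    if (out == NULL)
        return NULL;
    /* was: strncpy(out, src, len + 1);  -- the bound depends on strlen(src),
     * which is exactly what gcc 8 complains about */
    memcpy(out, src, len + 1);  /* copies the terminating NUL as well */
    return out;
}

int
main(void)
{
    free(copy_value("hello"));
    return 0;
}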

Probably all of this ought to be back-patched as applicable, since
people will doubtless try to compile back branches with gcc 8.

regards, tom lane

diff --git a/configure.in b/configure.in
index 862d8b128d..dae29a4ab1 100644
--- a/configure.in
+++ b/configure.in
@@ -502,6 +502,15 @@ if test "$GCC" = yes -a "$ICC" = no; then
   if test -n "$NOT_THE_CFLAGS"; then
     CFLAGS="$CFLAGS -Wno-unused-command-line-argument"
   fi
+  # Similarly disable useless truncation warnings from gcc 8+
+  PGAC_PROG_CC_VAR_OPT(NOT_THE_CFLAGS, [-Wformat-truncation])
+  if test -n "$NOT_THE_CFLAGS"; then
+    CFLAGS="$CFLAGS -Wno-format-truncation"
+  fi
+  PGAC_PROG_CC_VAR_OPT(NOT_THE_CFLAGS, [-Wstringop-truncation])
+  if test -n "$NOT_THE_CFLAGS"; then
+    CFLAGS="$CFLAGS -Wno-stringop-truncation"
+  fi
 elif test "$ICC" = yes; then
   # Intel's compiler has a bug/misoptimization in checking for
   # division by NAN (NaN == 0), -mp1 fixes it, so add it to the CFLAGS.


Re: POC: GROUP BY optimization

2018-06-16 Thread Tomas Vondra



On 06/13/2018 06:56 PM, Teodor Sigaev wrote:
>> I.e. we already do reorder the group clauses to match ORDER BY, to only
>> require a single sort. This happens in preprocess_groupclause(), which
>> also explains the reasoning behind that.
> Huh. I missed that. That means the group_keys_reorder_by_pathkeys() call
> inside get_cheapest_group_keys_order() duplicates the work of
> preprocess_groupclause().
> And so it's possible to replace it with simply counting the number of
> matching pathkeys at the beginning of the lists. Or remove most of
> preprocess_groupclause() - but it seems to me we should use the first
> option: preprocess_groupclause() is simpler, since it operates not on
> pathkeys but on SortGroupClause, which is simpler.
> 
> BTW, incremental sort path provides pathkeys_common(), exactly what we
> need.
> 
>> I wonder if some of the new code reordering group pathkeys could/should
>> be moved here (not sure, maybe it's too early for those decisions). In
>> any case, it might be appropriate to update some of the comments before
>> preprocess_groupclause() which claim we don't do certain things added by
>> the proposed patches.
> 
> preprocess_groupclause() is too early a step to use something like
> group_keys_reorder_by_pathkeys(), because the pathkeys are not yet known there.
> 
> 
>> This probably also somewhat refutes my claim that the order of
>> grouping keys is currently fully determined by users (and so they
>> may pick the most efficient order), while the reorder-by-ndistinct
>> patch would make that impossible. Apparently when there's ORDER BY,
>> we already mess with the order of group clauses - there are ways to
>> get around it (subquery with OFFSET 0) but it's much less clear.
> 
> I like a moment when objections go away :)
> 

I'm afraid that's a misunderstanding - my objections did not really go
away. I'm just saying that my claim that we don't mess with the order of
grouping keys is not entirely accurate, because we do so in one particular
case.

I still think we need to be careful when introducing new optimizations
in this area - reordering the grouping keys by ndistinct, ORDER BY or
whatever. In particular I don't think we should commit these patches
that may quite easily cause regressions, and then hope some hypothetical
future patch fixes the costing. Those objections still stand.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: row_to_json(), NULL values, and AS

2018-06-16 Thread Tom Lane
I wrote:
> Neil Conway  writes:
>> On Fri, Jun 15, 2018 at 7:53 AM Tom Lane  wrote:
>>> I'm a bit hesitant to muck with this behavior, given that it's stood
>>> for ~20 years.  However, if we did want to touch it, maybe the right
>>> thing would be to give up the absolutist position that f(x) and x.f
>>> are exactly interchangeable.  We could say instead that we prefer the
>>> function interpretation if function syntax is used, and the column
>>> interpretation if column syntax is used.

>> Interesting! Your proposed change seems quite reasonable to me.

> Here's a proposed patch for that.  (It needs to be applied over the
> fixes in <14497.1529089...@sss.pgh.pa.us>, which are unrelated but
> touch some of the same code.)  I didn't add any test cases yet,
> but probably it's desirable to have one.

It occurred to me this morning that there's actually a dump/reload
hazard in the current behavior.  Consider

regression=# create table t1 (f1 int, f2 text);
CREATE TABLE
regression=# create view v1 as select row_to_json(t1) as j from t1;
CREATE VIEW
regression=# alter table t1 add column row_to_json text;
ALTER TABLE
regression=# \d+ v1
                           View "public.v1"
 Column | Type | Collation | Nullable | Default | Storage  | Description 
--------+------+-----------+----------+---------+----------+-------------
 j      | json |           |          |         | extended | 
View definition:
 SELECT row_to_json(t1.*) AS j
   FROM t1;

At this point, pg_dump will naively dump the view as

CREATE VIEW public.v1 AS
 SELECT row_to_json(t1.*) AS j
   FROM public.t1;

which, with the historical behavior, will be read as a reference to
the subsequently-added column, giving a view definition that means
something else entirely:

regression=# \d+ v1
                           View "public.v1"
 Column | Type | Collation | Nullable | Default | Storage  | Description 
--------+------+-----------+----------+---------+----------+-------------
 j      | text |           |          |         | extended | 
View definition:
 SELECT t1.row_to_json AS j
   FROM t1;

The proposed change fixes this on the parsing side, with no need
for any hacking in ruleutils.c (hence it will fix it for pre-existing
dump files too).

So I'm now pretty well convinced that this is a good change and we
should slip it into v11.  I'm still afraid to back-patch though.
I vaguely recall one or two prior complaints in this area, but not
so many that it's worth taking a chance of breaking applications
in minor releases.

regards, tom lane



Query Rewrite for Materialized Views (Postgres Extension)

2018-06-16 Thread John Dent
Hi folks,

I thought I’d share an update to my pet project, which dynamically rewrites 
queries to target materialized views when they are available and can satisfy a 
query (or part of it) with a lower cost plan.

The extension is now a regular EXTENSION and no longer tied in to the FDW 
mechanism. As a result, it may now be more generally usable, and less 
complicated to integrate into an existing system. (The FDW approach was an easy 
way for me to get started, but it ultimately added complexity and was rather 
limiting.)

Same caution as before applies:

**NOTE: this is not "production ready" code — if it works for you, then great, 
but use it after thorough testing, and with appropriate caution.**

Built, and has rudimentary testing against Postgres 10.1..10.3.

Source code: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite
README: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite/blob/master/README.md

Hope it is useful or interesting for someone! Questions or comments are very 
welcome.

denty.

> Begin original message:
> 
> From: Dent John 
> Subject: Query Rewrite for Materialized Views (FDW Extension)
> Date: 5 April 2018 at 14:41:15 BST
> To: pgsql-hackers@lists.postgresql.org
> 
> Hi,
> 
> I wanted to share the project I've been working on which dynamically rewrites 
> queries to target materialized views when views are available that can 
> satisfy part of a query with lower cost plans.
> 
> I embarked upon it as an interesting side project. It took me a bit more time 
> than I anticipated, but the result works for my use case. Because of that, I 
> thought it worth sharing. However I would caution that my use case is not 
> exactly of a commercial scale... so please heed the following obligatory 
> warning:
> 
> **NOTE: this is not "production ready" code — if it works for you, then 
> great, but use it after thorough testing, and with appropriate caution.**
> 
> There are some limitations to the rewrite opportunities it takes up, and it 
> will almost certainly fail on complex materialized views composed of deeply 
> nested queries.
> 
> The extension does not have extensive (actually: any) documentation, but the 
> few test cases should make obvious to the inclined reader how it works. This 
> is deliberate at this early a stage: I don't want to encourage uninformed 
> adoption because of the possibility of data loss or incorrect query rewrites.
> 
> The extension is written against a Postgres 10.1 source tree.
> 
> Source code: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite
> 
> Questions or comments are very welcome.
> 
> denty.



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-16 Thread Amit Kapila
On Sat, Jun 16, 2018 at 10:44 AM, Amit Kapila  wrote:
> On Thu, Jun 14, 2018 at 9:54 PM, Tom Lane  wrote:
>> Rajkumar Raghuwanshi  writes:
>>> I am getting a server crash for below test case.
>>
>>> postgres=# select (select max((select t1.c2 from test t1 where t1.c1 =
>>> t2.c1))) from test t2;
>>> server closed the connection unexpectedly
>>
>> Reproduced here.  The assert seems to be happening because
>> add_paths_to_append_rel is trying to create a parallel path for
>> an appendrel that is marked consider_parallel = false.
>>
>> This appears to be the fault of commit ab7271677, whose authors I've cc'd:
>> the stanza starting at about allpaths.c:1672 is bullheadedly creating a
>> parallel path whether that's allowed or not.  Fixing it might be as simple
>> as adding "rel->consider_parallel" to the conditions there.
>>
>
> Yeah, or perhaps disallow creation of any partial paths in the first
> place, as in the attached.  That will save us some work as well.
>

The attached patch contains a test case as well.  I have tried to come up
with a simpler test, but couldn't find anything much simpler than the case
reported by Rajkumar, so I decided to use the test case provided by him.

Added to Open Items list.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_pa_path_generation_v2.patch
Description: Binary data


Re: WAL prefetch

2018-06-16 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 06/16/2018 12:06 PM, Thomas Munro wrote:
> > On Sat, Jun 16, 2018 at 9:38 PM, Tomas Vondra
> >  wrote:
> > > On 06/15/2018 08:01 PM, Andres Freund wrote:
> > > > On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
> > > > > On 14.06.2018 09:52, Thomas Munro wrote:
> > > > > > Why stop at the page cache...  what about shared buffers?
> > > > >
> > > > > It is good question. I thought a lot about prefetching directly to
> > > > > shared buffers.
> > > >
> > > > I think that's definitely how this should work.  I'm pretty strongly
> > > > opposed to a prefetching implementation that doesn't read into s_b.
> > >
> > > Could you elaborate why prefetching into s_b is so much better (I'm sure it
> > > has advantages, but I suppose prefetching into page cache would be much
> > > easier to implement).
> >
> > posix_fadvise(POSIX_FADV_WILLNEED) might already get most of the
> > speed-up available here in the short term for this immediate
> > application, but in the long term a shared buffers prefetch system is
> > one of the components we'll need to support direct IO.
>
> Sure. Assuming the switch to direct I/O will happen (it probably will,
> sooner or later), my question is whether this patch should be required to
> introduce the prefetching into s_b. Or should we use posix_fadvise for now,
> get most of the benefit, and leave the prefetch into s_b as an improvement
> for later?
>
> The thing is - we're already doing posix_fadvise prefetching in bitmap heap
> scans, it would not be putting additional burden on the direct I/O patch
> (hypothetical, so far).

This was my take on it also.  Prefetching is something we've come to
accept in other parts of the system, and if it's beneficial to add it
here then we should certainly do so.  It also seems like it'd keep the
patch nice, simple, and small.

Thanks!

Stephen




Re: [HACKERS] Statement-level rollback

2018-06-16 Thread Simon Riggs
On 15 June 2018 at 21:23, Alvaro Herrera  wrote:

> I think the main objectionable point is that of making servers behave in
> a way that could lose data, if applications assume that transactions
> behave in the way they do today.  I propose that we solve this by
> allowing this feature to be enabled only via one of:
>
> * a PGOPTIONS connection-time option
> * ALTER USER SET (transaction_rollback_scope)
>
> but it can be *disabled* normally via SET.  In other words, changing the
> scope from transaction to statement in a running session is forbidden,
> but changing it the other way around is allowed (if app is unsure
> whether env is unsafe, it can set the scope to "transaction" to ensure
> it's safe from that point onwards).

If that allows us to annotate a function with SET
transaction_rollback_scope = whatever
then it works. We probably need to be able to identify code that
will/will not work in the new mode.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-16 Thread Tomas Vondra




On 06/16/2018 12:06 PM, Thomas Munro wrote:
> On Sat, Jun 16, 2018 at 9:38 PM, Tomas Vondra
>  wrote:
>> On 06/15/2018 08:01 PM, Andres Freund wrote:
>>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>>> On 14.06.2018 09:52, Thomas Munro wrote:
>>>>> Why stop at the page cache...  what about shared buffers?
>>>>
>>>> It is good question. I thought a lot about prefetching directly to shared
>>>> buffers.
>>>
>>> I think that's definitely how this should work.  I'm pretty strongly
>>> opposed to a prefetching implementation that doesn't read into s_b.
>>
>> Could you elaborate why prefetching into s_b is so much better (I'm sure it
>> has advantages, but I suppose prefetching into page cache would be much
>> easier to implement).
>
> posix_fadvise(POSIX_FADV_WILLNEED) might already get most of the
> speed-up available here in the short term for this immediate
> application, but in the long term a shared buffers prefetch system is
> one of the components we'll need to support direct IO.

Sure. Assuming the switch to direct I/O will happen (it probably will,
sooner or later), my question is whether this patch should be required
to introduce the prefetching into s_b. Or should we use posix_fadvise
for now, get most of the benefit, and leave the prefetch into s_b as an
improvement for later?

The thing is - we're already doing posix_fadvise prefetching in bitmap
heap scans, it would not be putting additional burden on the direct I/O
patch (hypothetical, so far).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL prefetch

2018-06-16 Thread Thomas Munro
On Sat, Jun 16, 2018 at 9:38 PM, Tomas Vondra
 wrote:
> On 06/15/2018 08:01 PM, Andres Freund wrote:
>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>> On 14.06.2018 09:52, Thomas Munro wrote:
 Why stop at the page cache...  what about shared buffers?
>>>
>>> It is good question. I thought a lot about prefetching directly to shared
>>> buffers.
>>
>> I think that's definitely how this should work.  I'm pretty strongly
>> opposed to a prefetching implementation that doesn't read into s_b.
>
> Could you elaborate why prefetching into s_b is so much better (I'm sure it 
> has advantages, but I suppose prefetching into page cache would be much 
> easier to implement).

posix_fadvise(POSIX_FADV_WILLNEED) might already get most of the
speed-up available here in the short term for this immediate
application, but in the long term a shared buffers prefetch system is
one of the components we'll need to support direct IO.
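
For anyone unfamiliar with the call, a minimal standalone example of that
kind of prefetch hint (the relation file path is hypothetical; any
readable file would do):

#define _XOPEN_SOURCE 600       /* expose posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    int     fd = open("base/16384/16385", O_RDONLY);
    int     rc;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * Ask the kernel to start reading one 8kB block in the background; the
     * call returns immediately and a later read() will likely hit the page
     * cache.  Note posix_fadvise() returns an errno value directly.
     */
    rc = posix_fadvise(fd, 0, 8192, POSIX_FADV_WILLNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
    return 0;
}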

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WAL prefetch

2018-06-16 Thread Tomas Vondra




On 06/15/2018 08:01 PM, Andres Freund wrote:
> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>> On 14.06.2018 09:52, Thomas Munro wrote:
>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>>>  wrote:
>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>>>> references in WAL records
>>>> using posix_fadvise(WILLNEED) system call.
>>>
>>> Hi Konstantin,
>>>
>>> Why stop at the page cache...  what about shared buffers?
>>
>> It is good question. I thought a lot about prefetching directly to shared
>> buffers.
>
> I think that's definitely how this should work.  I'm pretty strongly
> opposed to a prefetching implementation that doesn't read into s_b.

Could you elaborate why prefetching into s_b is so much better (I'm sure
it has advantages, but I suppose prefetching into page cache would be
much easier to implement).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ON CONFLICT DO NOTHING on pg_dump

2018-06-16 Thread Dilip Kumar
On Thu, Jun 14, 2018 at 4:09 PM, Surafel Temesgen  wrote:
>
>
> Thank you for pointing that out. I have added a basic test, and it seems to
> me the rest of the testing is covered by the column_inserts test.

@@ -172,6 +172,7 @@ typedef struct _dumpOptions
  char*outputSuperuser;

  int sequence_data; /* dump sequence data even in schema-only mode */
+ int do_nothing;
 } DumpOptions;

The new structure member appears out of place; can you move it up along
with the other "command-line long options"?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com