Re: Speed up the removal of WAL files

2018-02-20 Thread Michael Paquier
On Wed, Feb 21, 2018 at 07:20:00AM +, Tsunakawa, Takayuki wrote:
> Right.  Then I thought of doing the following to avoid making a new
> function only for RemoveNonParentXlogFiles() which is similar to
> RemoveXlogFile(). 
> 
> * Add an argument "bool durable" to RemoveXlogFile().  Based on its
> value, RemoveXlogFile() calls either durable_xx() or xx(). 
> 
> * Likewise, add an argument "bool durable" to InstallXLogFileSegment()
> and do the same.
>
> * RemoveNonParentXlogFiles() calls RemoveXlogFile() with
> durable=false.  At the end of the function, sync the pg_wal directory
> with fsync_fname(). 
> 
> * RemoveOldXlogFiles() does the same thing.  One difference is that it
> passes false to RemoveXlogFile() during recovery (InRecovery == true)
> and true otherwise. 

It seems to me that you would partially reintroduce the problems that
1d4a0ab1 has fixed.  In short, if a crash happens in the code paths
calling RemoveXlogFile with durable = false before fsync'ing pg_wal,
then the rename has no guarantee of being durable, so you could again
finish with a file that has an old name but new contents.  A crucial
point for a rename to be durable is that the old file is sync'ed as
well.
--
Michael




Re: pg_get_functiondef forgets about most GUC_LIST_INPUT GUCs

2018-02-20 Thread Michael Paquier
On Tue, Feb 20, 2018 at 06:46:57PM +0300, Arthur Zakirov wrote:
> Just 2 cents from me. It seems that there is a problem with extensions
> GUCs.
>
> [...]
>
> =# SELECT pg_get_functiondef('func_with_set_params'::regproc);
> ERROR:  unrecognized configuration parameter "plpgsql.extra_errors"

You have an excellent point here, and I did not consider it.
For pg_dump and pg_dumpall, something like what I described in
https://www.postgresql.org/message-id/20180112012440.ga2...@paquier.xyz
to extend pg_settings so that GUC types are visible via SQL could make
sense, though it is really a lot of work just to be able to quote
parameters correctly.  On top of that, what I suggested previously would
not work reliably unless we have pg_get_functiondef load the library
associated with a given parameter.  Even in that case the parameter
could perfectly well be a custom parameter from another plugin rather
than one associated with the function's language itself.

It seems to me that this brings us back to a best effort for the backend
and the frontend: maintain a list of hardcoded parameter names, which
could at best be refined by adding lists for in-core extensions and
perhaps the most popular extensions around :(

Thoughts?
--
Michael




RE: Speed up the removal of WAL files

2018-02-20 Thread Tsunakawa, Takayuki
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
>  wrote:
> > Yes, I noticed it after submitting the patch and was wondering what to
> do.  Thinking simply, I think it would be just enough to replace
> durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and
> sync the pg_wal directory once in RemoveNonParentXlogFiles() and
> RemoveOldXlogFiles().  This will benefit the failover time when fast
> promotion is not performed.  What do you think?
> 
> It does not seem like a good idea to just replace durable_rename() with rename() in
> RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
> Because that change could cause the following problem.
> 
> 1. Checkpoint calls RemoveOldXlogFiles().
> 2. It recycles the WAL file AAA to BBB. pg_wal directory has not fsync'd
> yet.
> 3. Another transaction (TX1) writes its WAL data into the (recycled) file
> BBB.
> 4. CRASH and RESTART
> 5. The WAL file BBB disappears and you can see AAA,
> but AAA is not used in recovery. This causes data loss of transaction
> by Tx1.

Right.  Then I thought of doing the following to avoid making a new function 
only for RemoveNonParentXlogFiles() which is similar to RemoveXlogFile().

* Add an argument "bool durable" to RemoveXlogFile().  Based on its value, 
RemoveXlogFile() calls either durable_xx() or xx().

* Likewise, add an argument "bool durable" to InstallXLogFileSegment() and do 
the same.

* RemoveNonParentXlogFiles() calls RemoveXlogFile() with durable=false.  At the 
end of the function, sync the pg_wal directory with fsync_fname().

* RemoveOldXlogFiles() does the same thing.  One difference is that it passes 
false to RemoveXlogFile() during recovery (InRecovery == true) and true 
otherwise.

But this requires InstallXLogFileSegment() to also have an argument "bool 
durable."  That means InstallXLogFileSegment() and RemoveXlogFile() have to 
include something like below in several places, which is dirty:

    if (durable)
        rc = durable_xx(path, elevel);
    else
        rc = xx(path);
    if (rc != 0)
    {
        /* durable_xx() output the message */
        if (!durable)
            ereport(elevel, ...);
    }

Do you think of a cleaner way?


From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> The original code recycles some of the to-be-removed files, but the
> patch removes all the victims.  This may impact performance.

At most only 10 WAL files are recycled.  If there's no good solution to the 
above matter, can we compromise with the current patch?

    if (PriorRedoPtr == InvalidXLogRecPtr)
        recycleSegNo = endlogSegNo + 10;

> Likewise the original code is using durable_unlink to actually
> remove a file so separating unlink and fsync might resurrect the
> problem that should have been fixed by
> 1b02be21f271db6bd3cd43abb23fa596fcb6bac3 (I'm not sure what it
> was but you are one of the reviewers of it). I suppose that you
> need to explain the reason why this change doesn't risk anything.

If the host crashes while running RemoveNonParentXlogFiles() and some effects 
of the function are lost, the next recovery will do them again.

Regards
Takayuki Tsunakawa





Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Thomas Munro
On Wed, Feb 21, 2018 at 10:23 AM, Tomas Vondra
 wrote:
> In 2015/2016 I've been exploring if we could improve hash joins by
> leveraging bloom filters [1], and I was reminded about this idea in a
> thread about amcheck [2]. I also see that bloom filters were briefly
> mentioned in the thread about parallel hash [3].
>
> So I've decided to revive the old patch, rebase it to current master,
> and see if we can resolve the issues that killed it in 2016.

Nice!

> Opinions?

I'm definitely following this and interested in helping in some way if
I can.  I have wondered about this subject and discussed it a bit with
Peter Geoghegan off-list.

Some assorted thoughts:

In the old thread, Peter pointed at a curious undergrad student
project from 2008[1] evaluating Bloom filters for hash joins in
PostgreSQL 8.3, inspired by a couple of older papers[2][3].  While
your patch uses a Bloom filter to short-circuit the regular bucket
probe in ExecHashJoinImpl(), those approaches push the Bloom filter down
into the outer relation scan.  I suspect you're right about the fixed
sizing being a problem, but the general idea seems pretty interesting
to me and there seems to be no reason you couldn't make the filter
size dynamic as you have it and then share it via a parameter or
something.  But is there any point?

On the one hand, pushing down Bloom filters requires the hash value to
be computed by the lower scan, and then computed again if the tuple
survives the filter and makes it into the Hash Join node (unless there
is some way to attach it to the tuple...).  On the other hand,
throwing away tuples sooner can avoid more work, particularly in the
case of multi-joins.

Based on a hint from Jim Finnerty at PGCon 2017 and also some light
reading of ancient stuff on join pipelining, I've been wondering how
the potential effectiveness of Bloom filters is affected by the fact
that we prefer left-deep nested hash joins and for us build relation =
right/inner relation.  That policy means that we build hash tables for
all the joins up front, which means we could potentially push all
their Bloom filters down to the ultimate outer scan (or as far down as
possible depending on the qual) as I now find described in [2].  So:

          H1
         /  \
        H2   u
       /  \
      H3   t
     /  \
    r    s

You might be able to push filters B1, B2, B3 constructed from H1, H2,
H3 all the way down to the scan of r and then probe them in order of
selectivity (a bit like order_qual_clauses does with any other kind of
qual).  (Because of the empty-outer optimisation's preliminary attempt
to pull a tuple on the left I guess you might need to push down a
placeholder filter first and then later replace it with the real
filter if/when you eventually build it.)

In contrast to that situation, suppose we introduced whole-query
memory budgets as Peter Geoghegan and several others have proposed.
Then we'd suddenly have a potential reason to prefer right-deep plans:

    H1
   /  \
  r    H2
      /  \
     s    H3
         /  \
        t    u

Many other RDBMSs would consider this (modulo some confusion about
hash join polarity: other systems and literature have build = inner =
left, probe = outer = right so they'd draw that the other way around
and call it left deep, somewhat distractingly...) because it only
requires two hash tables to be loaded into memory at a time, a useful
property if you want to stay inside a whole-query memory budget.
That's because the output of each hash join is the input to the hash
table above it, so in theory we only need the one we're building and
the one we're probing at each moment.  Whatever other pros and cons
might exist with this plan shape compared to the left-deep variant, my
point is that a right-deep join could only push each filter down to
the immediate scan of r, s, t, which wouldn't be much of a potential
improvement over the way your current patch does it (ie entirely
inside the hash join, no push down), because it's not able to move the
Bloom filter very far away from the hash join anyway.  At best it
could perhaps move it before some more expensive/less selective other
quals in the scan.

In other words, I wonder if our left-deep plans stand to gain more
from Bloom filter push down.

[1] 
http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
[2] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.8069=rep1=pdf
[3] 
https://pdfs.semanticscholar.org/ec99/6093805012b9e0ff06bb2050436091671091.pdf

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pgsql: Avoid valgrind complaint about write() of uninitalized bytes.

2018-02-20 Thread Andres Freund
On 2018-02-06 19:25:04 +, Robert Haas wrote:
> Avoid valgrind complaint about write() of uninitalized bytes.
> 
> LogicalTapeFreeze() may write out its first block when it is dirty but
> not full, and then immediately read the first block back in from its
> BufFile as a BLCKSZ-width block.  This can only occur in rare cases
> where very few tuples were written out, which is currently only
> possible with parallel external tuplesorts.  To avoid valgrind
> complaints, tell it to treat the tail of logtape.c's buffer as
> defined.
> 
> Commit 9da0cc35284bdbe8d442d732963303ff0e0a40bc exposed this problem
> but did not create it.  LogicalTapeFreeze() has always tended to write
> out some amount of garbage bytes, but previously never wrote less than
> one block of data in total, so the problem was masked.
> 
> Per buildfarm members lousyjack and skink.
> 
> Peter Geoghegan, based on a suggestion from Tom Lane and me.  Some
> comment revisions by me.

Doesn't appear to have fixed the problem entirely:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2018-02-20%2017%3A10%3A01

relevant excerpt:
==12452== Syscall param write(buf) points to uninitialised byte(s)
==12452==at 0x4E49C64: write (write.c:26)
==12452==by 0x4BF8BF: FileWrite (fd.c:2017)
==12452==by 0x4C1B69: BufFileDumpBuffer (buffile.c:513)
==12452==by 0x4C1C61: BufFileFlush (buffile.c:657)
==12452==by 0x4C21D6: BufFileRead (buffile.c:561)
==12452==by 0x63ADA8: ltsReadBlock (logtape.c:274)
==12452==by 0x63AEF6: ltsReadFillBuffer (logtape.c:304)
==12452==by 0x63B560: LogicalTapeRewindForRead (logtape.c:771)
==12452==by 0x640567: mergeruns (tuplesort.c:2671)
==12452==by 0x645122: tuplesort_performsort (tuplesort.c:1866)
==12452==by 0x23357D: _bt_parallel_scan_and_sort (nbtsort.c:1626)
==12452==by 0x234F71: _bt_parallel_build_main (nbtsort.c:1527)
==12452==  Address 0xc3cd368 is 744 bytes inside a block of size 8,272 client-defined
==12452==at 0x6362D1: palloc (mcxt.c:858)
==12452==by 0x4C1DA6: BufFileCreateShared (buffile.c:249)
==12452==by 0x63B062: LogicalTapeSetCreate (logtape.c:571)
==12452==by 0x64489A: inittapes (tuplesort.c:2419)
==12452==by 0x644B1C: puttuple_common (tuplesort.c:1695)
==12452==by 0x644E01: tuplesort_putindextuplevalues (tuplesort.c:1545)
==12452==by 0x233391: _bt_spool (nbtsort.c:514)
==12452==by 0x2333CA: _bt_build_callback (nbtsort.c:574)
==12452==by 0x286004: IndexBuildHeapRangeScan (index.c:2879)
==12452==by 0x286366: IndexBuildHeapScan (index.c:2419)
==12452==by 0x23356F: _bt_parallel_scan_and_sort (nbtsort.c:1615)
==12452==by 0x234F71: _bt_parallel_build_main (nbtsort.c:1527)
==12452==  Uninitialised value was created by a heap allocation
==12452==at 0x6362D1: palloc (mcxt.c:858)
==12452==by 0x63B20E: LogicalTapeWrite (logtape.c:634)
==12452==by 0x63EA46: writetup_index (tuplesort.c:4206)
==12452==by 0x643769: dumptuples (tuplesort.c:2994)
==12452==by 0x64511A: tuplesort_performsort (tuplesort.c:1865)
==12452==by 0x23357D: _bt_parallel_scan_and_sort (nbtsort.c:1626)
==12452==by 0x234F71: _bt_parallel_build_main (nbtsort.c:1527)
==12452==by 0x252E2A: ParallelWorkerMain (parallel.c:1397)
==12452==by 0x4614DF: StartBackgroundWorker (bgworker.c:841)
==12452==by 0x46F400: do_start_bgworker (postmaster.c:5741)
==12452==by 0x46F541: maybe_start_bgworkers (postmaster.c:5954)
==12452==by 0x4700A6: sigusr1_handler (postmaster.c:5134)
==12452== 
==12452== VALGRINDERROR-END

Note that the above path doesn't appear to go through
LogicalTapeFreeze(), therefore not hitting the VALGRIND_MAKE_MEM_DEFINED
added in the above commit.

Greetings,

Andres Freund



Re: pgsql: Do execGrouping.c via expression eval machinery, take two.

2018-02-20 Thread Andres Freund
Hi,

On 2018-02-16 16:03:37 -0800, Andres Freund wrote:
> This triggered a failure on rhinoceros, in the sepgsql test:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros=2018-02-16%2023%3A45%3A02
> 
> The relevant diff is:
> + LOG:  SELinux: allowed { execute } 
> scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 
> tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure 
> name="pg_catalog.int4eq(integer,integer)"
> and that's because we now invoke the function access hook for grouping
> equal, which we previously didn't.
> 
> I personally think the new behaviour makes more sense, but if somebody
> wants to argue differently? The only argument against I can see is that
> there are some other cases where we also don't yet invoke it, but that seems
> weak.
> 
> I never fully grasped what the exact use-case for the function execute
> hook is, so maybe Kaigai and/or Robert could comment?

Fixed by adjusting the test output:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros=2018-02-21%2002%3A45%3A01

Greetings,

Andres Freund



Re: Duplicate Item Pointers in Gin index

2018-02-20 Thread Masahiko Sawada
On Wed, Feb 21, 2018 at 12:31 PM, Peter Geoghegan  wrote:
> On Tue, Feb 20, 2018 at 7:19 PM, Masahiko Sawada  
> wrote:
>> IIUC, ginInsertCleanup() holds ExclusiveLock on the metapage while
>> adding tuples from the pending list to the accumulator. And inserting
>> entries into the pending list also requires ExclusiveLock on the
>> metapage. This behavior is not related to that bug fix. So I don't
>> think that backend2 can insert a tuple while backend1 is processing
>> the pending list.
>
> You mean that you think that the problem that Siva described cannot
> happen once Postgres has commit
> 3b2787e1f8f1eeeb6bd9565288ab210262705b56?

It's true that the problem cannot happen once Postgres has that commit,
since commit 3b2787e prevents the #1 event that Siva described. But
what I meant is that I think events #3 and #4 cannot happen even
without that commit.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: SHA-2 functions

2018-02-20 Thread Michael Paquier
On Tue, Feb 20, 2018 at 05:09:48PM -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
>> On 2/19/18 21:07, Michael Paquier wrote:
>>> varlena.c is already large and messy.  I would suggest to split into a
>>> new file all the user-facing cryptographic functions, including md5 and
>>> hex functions, say in src/backend/utils/adt/crypt.c.
> 
>> I had originally started a new file called hash.c, but I figured that
>> would be quite confusing.  I can use crypt.c or a similar name.
>> Although crypt.c sounds a lot like crypt().
> 
> cryptohashes.c or some such?  I concur with Michael that dropping this
> into varlena.c isn't a great plan.

I think that crypto_hash.c or hash_crypt.c would be suitable as well.
crypt.c is too generic, so including both concepts in the name is
the way to go.  The name given by Tom here actually sounds nice.
--
Michael




Re: Duplicate Item Pointers in Gin index

2018-02-20 Thread Peter Geoghegan
On Tue, Feb 20, 2018 at 7:19 PM, Masahiko Sawada  wrote:
> IIUC, ginInsertCleanup() holds ExclusiveLock on the metapage while
> adding tuples from the pending list to the accumulator. And inserting
> entries into the pending list also requires ExclusiveLock on the
> metapage. This behavior is not related to that bug fix. So I don't
> think that backend2 can insert a tuple while backend1 is processing
> the pending list.

You mean that you think that the problem that Siva described cannot
happen once Postgres has commit
3b2787e1f8f1eeeb6bd9565288ab210262705b56?

-- 
Peter Geoghegan



Re: Duplicate Item Pointers in Gin index

2018-02-20 Thread Masahiko Sawada
On Wed, Feb 21, 2018 at 8:30 AM, R, Siva  wrote:
> User backend (let us call this backend 1) has released the lock on the
> metapage and is processing the pending list pages, adding the items to the
> build accumulator. Let us assume that there are currently 2 pages in the
> pending list and metadata->head is currently being processed (metadata->tail
> is not locked yet) and the old tuple id that was deleted is added to the
> accumulator.
> Another user backend (let us call this backend 2) uses the tid that was
> marked re-usable for a row that contains the same key datum as before, takes
> metapage lock and inserts into the pending list (metadata->tail) and returns
> (assume it does not do cleanup).

IIUC, ginInsertCleanup() holds ExclusiveLock on the metapage while
adding tuples from the pending list to the accumulator. And inserting
entries into the pending list also requires ExclusiveLock on the
metapage. This behavior is not related to that bug fix. So I don't
think that backend2 can insert a tuple while backend1 is processing
the pending list.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



file cloning in pg_upgrade and CREATE DATABASE

2018-02-20 Thread Peter Eisentraut
Here is another attempt at implementing file cloning for pg_upgrade and
CREATE DATABASE.  The idea is to take advantage of file systems that can
make copy-on-write clones, which would make the copy run much faster.
For pg_upgrade, this will give the performance of --link mode without
the associated drawbacks.

There have been patches proposed previously [0][1].  The concerns there
were mainly that they required a Linux-specific ioctl() call and only
worked for Btrfs.

Some new things have happened since then:

- XFS has (optional) reflink support.  This file system is probably more
widely used than Btrfs.

- Linux and glibc have a proper function to do this now.

- APFS on macOS supports file cloning.

So altogether this feature will be more widely usable and less ugly to
implement.  Note, however, that you will currently need literally the
latest glibc release, so it probably won't be accessible right now
unless you are using Fedora 28 for example.  (This is the
copy_file_range() function that had us recently rename the same function
in pg_rewind.)

Some example measurements:

6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
and APFS)

similar for a CREATE DATABASE from a large template

Even if you don't have a file system with cloning support, the special
library calls make copying faster.  For example, on APFS, in this
example, an unpatched CREATE DATABASE takes 30 seconds, with the library
call (but without cloning) it takes 10 seconds.

For amusement/bewilderment, without the recent flush optimization on
APFS, this takes 2 minutes 30 seconds.  I suppose this optimization will
now actually be obsolete, since macOS will no longer hit that code.
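
(For reference, a minimal sketch of what a copy loop built on the Linux
copy_file_range() call looks like -- this is not the patch's copydir.c
code, and clone_or_copy_file() plus its error handling are purely
illustrative.  The point is that no user-space buffer is involved, and
on Btrfs or XFS-with-reflinks the kernel can satisfy the request by
cloning extents instead of copying data.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void
clone_or_copy_file(const char *src, const char *dst)
{
    int         srcfd;
    int         dstfd;
    struct stat st;
    ssize_t     nbytes;

    srcfd = open(src, O_RDONLY);
    if (srcfd < 0 || fstat(srcfd, &st) < 0)
    {
        perror(src);
        exit(1);
    }
    dstfd = open(dst, O_WRONLY | O_CREAT | O_EXCL, st.st_mode);
    if (dstfd < 0)
    {
        perror(dst);
        exit(1);
    }

    /* Let the kernel copy (or clone) the bytes; no user-space buffer. */
    while (st.st_size > 0)
    {
        nbytes = copy_file_range(srcfd, NULL, dstfd, NULL,
                                 (size_t) st.st_size, 0);
        if (nbytes <= 0)
        {
            perror("copy_file_range");
            exit(1);
        }
        st.st_size -= nbytes;
    }

    close(srcfd);
    close(dstfd);
}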


[0]:
https://www.postgresql.org/message-id/flat/513C0E7C.5080606%40socialserve.com

[1]:
https://www.postgresql.org/message-id/flat/20140213030731.GE4831%40momjian.us
-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 56b5b574f6d900d5eb4932be499cf3bae0e7ba86 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Feb 2018 10:41:16 -0500
Subject: [PATCH] Use file cloning in pg_upgrade and CREATE DATABASE

For file copying in pg_upgrade and CREATE DATABASE, use special file
cloning calls if available.  This makes the copying faster and more
space efficient.  For pg_upgrade, this achieves speed similar to --link
mode without the associated drawbacks.

On Linux, use copy_file_range().  This supports file cloning
automatically on Btrfs and XFS (if formatted with reflink support).  On
macOS, use copyfile(), which supports file cloning on APFS.

Even on file systems without cloning/reflink support, this is faster
than the existing code, because it avoids copying the file contents out
of kernel space and allows the OS to apply other optimizations.
---
 configure  |  2 +-
 configure.in   |  2 +-
 doc/src/sgml/ref/pgupgrade.sgml| 11 
 src/backend/storage/file/copydir.c | 55 +-
 src/bin/pg_upgrade/file.c  | 37 -
 src/include/pg_config.h.in |  6 +
 6 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/configure b/configure
index 7dcca506f8..eb8b321723 100755
--- a/configure
+++ b/configure
@@ -13079,7 +13079,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copy_file_range copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 4d26034579..dfe3507b25 100644
--- a/configure.in
+++ b/configure.in
@@ -1425,7 +1425,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copy_file_range copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/ref/pgupgrade.sgml 

support parameters in CALL

2018-02-20 Thread Peter Eisentraut
The current CALL implementation doesn't support parameters, which was a
bad oversight.  This patch fixes that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From f3e78c7d9ef2fe21102cc8a08d244eda69aaa141 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Feb 2018 18:03:31 -0500
Subject: [PATCH] Support parameters in CALL

To support parameters in CALL, move the parse analysis of the procedure
and arguments into the global transformation phase, so that the parser
hooks can be applied.  And then at execution time pass the parameters
from ProcessUtility on to ExecuteCallStmt.
---
 src/backend/commands/functioncmds.c| 25 +++---
 src/backend/nodes/copyfuncs.c  |  1 +
 src/backend/nodes/equalfuncs.c |  1 +
 src/backend/parser/analyze.c   | 45 ++
 src/backend/tcop/utility.c |  2 +-
 src/include/commands/defrem.h  |  3 +-
 src/include/nodes/parsenodes.h |  3 +-
 src/pl/plpgsql/src/expected/plpgsql_call.out   | 19 +++
 src/pl/plpgsql/src/sql/plpgsql_call.sql| 18 +++
 src/test/regress/expected/create_procedure.out | 16 +
 src/test/regress/sql/create_procedure.sql  | 15 +
 11 files changed, 124 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/functioncmds.c 
b/src/backend/commands/functioncmds.c
index a027b19744..abdfa249c0 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -2212,11 +2212,9 @@ ExecuteDoStmt(DoStmt *stmt, bool atomic)
  * commits that might occur inside the procedure.
  */
 void
-ExecuteCallStmt(ParseState *pstate, CallStmt *stmt, bool atomic)
+ExecuteCallStmt(CallStmt *stmt, ParamListInfo params, bool atomic)
 {
-   List   *targs;
ListCell   *lc;
-   Node   *node;
FuncExpr   *fexpr;
int nargs;
int i;
@@ -2228,24 +2226,8 @@ ExecuteCallStmt(ParseState *pstate, CallStmt *stmt, bool 
atomic)
ExprContext *econtext;
HeapTuple   tp;
 
-   /* We need to do parse analysis on the procedure call and its arguments 
*/
-   targs = NIL;
-   foreach(lc, stmt->funccall->args)
-   {
-   targs = lappend(targs, transformExpr(pstate,
-   
 (Node *) lfirst(lc),
-   
 EXPR_KIND_CALL_ARGUMENT));
-   }
-
-   node = ParseFuncOrColumn(pstate,
-
stmt->funccall->funcname,
-targs,
-pstate->p_last_srf,
-stmt->funccall,
-true,
-
stmt->funccall->location);
-
-   fexpr = castNode(FuncExpr, node);
+   fexpr = stmt->funcexpr;
+   Assert(fexpr);
 
aclresult = pg_proc_aclcheck(fexpr->funcid, GetUserId(), ACL_EXECUTE);
if (aclresult != ACLCHECK_OK)
@@ -2289,6 +2271,7 @@ ExecuteCallStmt(ParseState *pstate, CallStmt *stmt, bool 
atomic)
 * we can't free this context till the procedure returns.
 */
estate = CreateExecutorState();
+   estate->es_param_list_info = params;
econtext = CreateExprContext(estate);
 
i = 0;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 82255b0d1d..266a3ef8ef 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3231,6 +3231,7 @@ _copyCallStmt(const CallStmt *from)
CallStmt   *newnode = makeNode(CallStmt);
 
COPY_NODE_FIELD(funccall);
+   COPY_NODE_FIELD(funcexpr);
 
return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index b9bc8e38d7..bbffc87842 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1206,6 +1206,7 @@ static bool
 _equalCallStmt(const CallStmt *a, const CallStmt *b)
 {
COMPARE_NODE_FIELD(funccall);
+   COMPARE_NODE_FIELD(funcexpr);
 
return true;
 }
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 5b3a610cf9..c3a9617f67 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -36,6 +36,8 @@
 #include "parser/parse_coerce.h"
 #include "parser/parse_collate.h"
 #include "parser/parse_cte.h"
+#include "parser/parse_expr.h"
+#include "parser/parse_func.h"
 #include "parser/parse_oper.h"
 #include "parser/parse_param.h"
 #include "parser/parse_relation.h"
@@ -74,6 +76,8 @@ static Query 

Re: Duplicate Item Pointers in Gin index

2018-02-20 Thread Peter Geoghegan
On Tue, Feb 20, 2018 at 3:30 PM, R, Siva  wrote:
> We are currently investigating an issue with a Gin index containing
> duplicate item pointers for one of its keys. While we are trying to
> reproduce this issue, we wanted to reach out to the community to validate
> our current working theory.

Please define "a GIN index containing duplicate item pointers". Are
the keys duplicated, the item pointers, or both? Where are the
duplicates contained (what part of the GIN structure)?

> Consider a tuple that is inserted and just updated/deleted and passes the
> dead-to-all horizon. At this point, the old tuple id (corresponding to the
> insert) is eligible for vacuum.
> Due to the above bug, vacuum can skip clean up of pending list if a user
> backend already has a lock on it before proceeding to vacuum the gin posting
> tree.
> At this point, it is possible that the user backend flushes the pending list
> containing the dead tuple to the main gin index structure after vacuum has
> already passed over. Vacuum has marked that tuple id slot as re-usable when
> there is actually an index tuple pointing to it.

It sounds like you're talking about a case where a TID value is
recycled by VACUUM, despite the fact that the GIN structure still
contains a TID intended to point to the original tuple -- an index
tuple that should have been killed before VACUUM recycled the
TID/killed the heap tuple. The TID points to a valid heap tuple, but
the wrong one. (This is a corruption scenario that we saw from time to
time in past cases where VACUUMing of indexes had bugs).

Is that general understanding of what you've said correct?

> Given the above, our question is to find out if the above is a valid case
> for having duplicate tids in the index tree structure without the fix for
> the above bug (in non-debug builds where "Assert" is disabled) with the
> following sequence of events:

I think that you'll be very lucky to get a useful answer to this
specific question. The recent bugfix was based on an observation that
Sawada-san made about the code not doing what it claimed to do, and
clearly needs to do. It's very hard to give you any assurance that
"duplicate TIDs" could be explained by the bug.

FWIW, I continue to have misgivings about pending list VACUUMing. I
plan to look at it again.

-- 
Peter Geoghegan



Re: [HACKERS] path toward faster partition pruning

2018-02-20 Thread David Rowley
v30-0004-Faster-partition-pruning.patch contains:

+create table coll_pruning_multi (a text) partition by range
(substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");

This'll likely work okay on Linux. Other collate tests seem to use
COLLATE "POSIX or "C" so that work cross-platform.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Peter Geoghegan
On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
 wrote:
>> I suspect that it could make sense to use a Bloom filter to
>> summarize the entire inner side of the join all at once, even when
>> there are multiple batches. I also suspect that this is particularly
>> beneficial with parallel hash joins, where IPC/synchronization
>> overhead can be a big problem.
>>
>
> But that's what the patch does, currently - the filter is built during
> the initial pass through the data, and then used for all batches.

I misunderstood. I would probably do something like double or triple
the original rows estimate instead, though. The estimate must be at
least slightly inaccurate when we get to this point, but I don't think
that that's a good enough reason to give up on the estimate
completely.

> Actually, now that I think about it - I think the patch should throw
> away the filter away after the initial pass over the outer relation,
> because at that point we've used all the information in the filter.

Makes sense.

> I'm not sure it would make sense to then build smaller bloom filters for
> individual batches, but maybe it would?

I doubt it.

> Yeah, I admit those are rather crude rules.

You have to start somewhere.

> The trouble is that when we start with a single batch and then find out
> the estimate was wrong and we need to start batching, all bets are off.
> At that point it seems reasonable to just say "Here is X MBs of RAM, do
> what you can".

As I said above, I wouldn't say all bets are off when this happens --
not at all. Estimates are likely to often be somewhat wrong. If
they're completely wrong, we can probably swallow the cost of giving
up on a Bloom filter relatively late.

As I said, X should not be a portion of work_mem, because that has
only a weak relationship to what really matters.

>> You should try to exploit the fact that a Bloom filter can summarize a
>> large set reasonably well with a very compact, simple representation.
>> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
>> for cases where Bloom probes will save a lot of work, it probably
>> doesn't make all that much difference -- the hash join is still much
>> faster.

> But the problem is that I don't know what is the total size of the hash
> table, because we're building the bloom filter for all the batches at
> once. And we don't know how many batches will be there - if we knew
> that, we could estimate the number of distinct values and we could use
> that to size the filter instead of doing this. (All of this only applies
> to the state where we start with a single batch and then find out we
> need to start batching, of course.)

I don't think that the number of batches should matter that much to
the costing/sizing/estimation logic, even if it's an interesting thing
to talk about when considering the upsides of a Bloom filter. My sense
is that we should focus on making sure that using a Bloom filter is
going to win, and not worry so much about whether that's going to be a
huge win or a modest win.

Suppose that you thought you were going to have a 10% false positive
rate with a 22.85MB Bloom filter for 40 million elements (my earlier
example). Further suppose that it turns out to be 80 million elements.
This means that you're going to get a false positive rate of 30%. This
could still be a big win, though, so it's not really a bad situation.
With 120 million elements, it's about 45%, but even then it could
still work out, especially because the hash join will be very slow
anyway. You also have to bear in mind that the 40 million estimate is
much more likely to be too high than too low, because you're assuming
distinct key values for the hash table.

You're taking a risk, in a sense, but a lot of things have to go wrong
for you to lose, and even then you can cut your losses before the
extra cost is too high.

Do you have a test query for this patch that gives us some idea of the upside?

-- 
Peter Geoghegan



Re: non-bulk inserts and tuple routing

2018-02-20 Thread Amit Langote
Fujita-san,

On 2018/02/20 19:40, Etsuro Fujita wrote:
> (2018/02/19 13:19), Amit Langote wrote:
>> Attached rebased patch.
> 
> Thanks for the rebased patch!
> 
> One thing I noticed while updating the
> tuple-routing-for-foreign-partitions patch on top of this is: we should
> switch into the per-query memory context in ExecInitPartitionInfo.

Good catch!

> Attached is an updated version for that.

Thanks for updating the patch.

Thanks,
Amit




Re: Typo in procarray.c

2018-02-20 Thread Masahiko Sawada
On Tue, Feb 20, 2018 at 8:04 PM, Magnus Hagander  wrote:
>
>
> On Tue, Feb 20, 2018 at 2:47 AM, Masahiko Sawada 
> wrote:
>>
>> Hi,
>>
>> Attached patch for fixing $subject.
>>
>> s/replicaton/replication/
>
>
> Applied, thanks.
>

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Runtime Partition Pruning

2018-02-20 Thread David Rowley
On 20 February 2018 at 23:46, Rajkumar Raghuwanshi
 wrote:
> I have applied the v10 patch on top of Amit's v27 over head ad7dbee36. With
> the patch I got "ERROR: partition missing from Append subplans". On head and
> with only Amit's patches the query works fine. Please find the test case below.

Thanks for the test case. I can recreate locally. This is down to the
fact that make_partition_pruneinfo() only makes sub-PartitionPruneInfo
for actual subpaths found in the Append. Your test case happens to
match both the part_p1 and part_p2 partitions on the first level of
iteration, but since no child of part_p2 is found in
make_partition_pruneinfo, that element in the subpartindex never gets
set.

The fix might be to just remove the error and silently ignore those
cases, but I was hoping to keep that around as it might catch other
bugs. I'm just not sure yet how to do both.

I'll rebase this on Amit's latest patch and have a think about it
while doing that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Peter Geoghegan
On Tue, Feb 20, 2018 at 3:48 PM, Claudio Freire  wrote:
>> Do we need to eliminate 99% of all hash join probes (that find nothing
>> to join on) to make this Bloom filter optimization worthwhile?
>> Personally, I doubt it.
>
> Even for 90% it's about 4.6 bits per element.

4.6 bits is vastly less than typical hash join hash table entry sizes,
which are often comprised of several attributes. Even the skinniest
possible hash table elements have HJTUPLE_OVERHEAD() overhead, as well
as MinimalTuple overhead. To say nothing of all the serialization and
synchronization overhead that you could also have.

Whatever precise per-element overhead you come up with for your Bloom
filter, the *ratio* of those two things is what makes using a Bloom
filter potentially very effective. It could end up being something
like 1:100, which is a huge win for a highly selective hash join that
still has a fairly large hash table (that cannot fit in L3
cache/work_mem).

> What I'm saying is that you can't assume you can fit the whole thing
> in work_mem. If it could be said that any set worth a hash join will
> fit, you can build a work_mem-sized bloom filter and just accept that
> if you exceed its capacity its filtering efficiency will drop
> benignly. But IMO that can't be taken for granted, so at some point
> you should just give up, the overhead of querying a low-quality BF
> won't be worth it.

It's probably true that we will need to bail out of using a Bloom
filter, and it is certainly true that the optimization won't always
work out. Still, once your Bloom filter proves useless, chances are
good that the plan isn't a very good one, with or without that extra
optimization. Even if the plan is truly the fastest possible plan, and
even if you avoid wasting extra effort on the Bloom filter, it will
still be very slow.

Join selectivity is what really matters with this optimization. Of
course the optimization might not work out, but you can say that about
almost anything that uses hashing -- that's why safety critical
realtime systems (e.g. avionics systems) often completely avoid using
hash tables. Bloom filters are fairly resilient to somewhat inaccurate
estimates, but nothing will save you from terrible estimates. I think
that it's worth risking a small, predictable regression when the join
turns out to not be selective, in order to get a potentially very
large improvement in performance when the planner estimates join
selectivity reasonably accurately (and it's a selective hash join that
won't simply have a small hash table to begin with).

-- 
Peter Geoghegan



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Tomas Vondra


On 02/21/2018 12:06 AM, Peter Geoghegan wrote:
> On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
>  wrote:
>> In 2015/2016 I've been exploring if we could improve hash joins by
>> leveraging bloom filters [1], and I was reminded about this idea in a
>> thread about amcheck [2]. I also see that bloom filters were briefly
>> mentioned in the thread about parallel hash [3].
>>
>> So I've decided to revive the old patch, rebase it to current master,
>> and see if we can resolve the issues that killed it in 2016.
> 
> Cool.
> 
>> The issues are essentially about predicting when the bloom filter can
>> improve anything, which is more a question of the hash join selectivity
>> than the bloom filter parameters.
> 
> Absolutely. A Bloom filter could be considered an opportunistic thing.
> The false positive rate that you estimate is going to be based on the
> planner's rowcount estimate for the inner side of the join, which is
> liable to be wrong anyway. It's important to be resilient to
> misestimations, which Bloom filters are good at.
> 

It's a bit more difficult than that, because rowcount is the total
number of rows, but it may not be the same as the number of distinct
values in the join key. But it's an estimate of the upper boundary.

But yeah, that's how the bloom filter is sized in the patch.

>> This is not the same as "false positive rate" which is a feature
>> of the bloom filter itself, and can be measured by simply counting 
>> bits set in the filter. That is important too, of course, but 
>> simple to determine.
> 
> Perhaps I'm being pedantic, but it's not exactly true that you can 
> measure the false positive rate by counting the bits. I do agree with
> what I think you meant here, though, which is that the precise false
> positive rate is not necessarily that important, while the 
> selectivity of the join is very important.
> 

My reasoning is that given a bloom filter with K out of M bits set, the
probability of the bloom filter saying "true" to a random value is

(K/M)^H

where H is the number of hash functions. I believe that's pretty much
the definition of false positive rate, but I have to admit I haven't
really checked the exact math.

The important takeaway here is that many bits set -> bad.
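
(As a concrete reading of that -- just a sketch with a hypothetical
helper name, not code from the patch -- counting the K set bits of an
M-bit filter probed with H hash functions gives the estimate directly:

#include <math.h>
#include <stdint.h>

/* estimated probability that a random non-member passes the filter */
static double
bloom_false_positive_rate(uint64_t bits_set, uint64_t nbits, int nhashes)
{
    return pow((double) bits_set / (double) nbits, nhashes);
}

so a filter with half its bits set and 3 hash functions lets through
roughly 12.5% of lookups that have no match.)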

> In general, hash joins work best when the join qual is highly 
> selective. This is not the same thing as having a small hash table,
> of course.
> 
>> After thinking about this a bit more, I think this is pretty much
>> what we do during join cardinality estimation - we estimate how
>> many of the rows in "fact" have a matching row in "dim". I do think
>> we can reuse that to decide if to use a bloom filter or not.
>>
>> Of course, the decision may need to be more elaborate (and perhaps 
>> formulated as part of costing), but the filter selectivity is the 
>> crucial piece. The attached patch does not do that yet, though.
> 
> I suspect that it could make sense to use a Bloom filter to
> summarize the entire inner side of the join all at once, even when
> there are multiple batches. I also suspect that this is particularly
> beneficial with parallel hash joins, where IPC/synchronization
> overhead can be a big problem.
> 

But that's what the patch does, currently - the filter is built during
the initial pass through the data, and then used for all batches. This
comes from the original idea to throw away rows from the outer relation
(usually the larger one) that have no match in the hash table. Now we
stash them into a batch file, possibly shuffling them between the
batches if we happen to need to add more batches, only to throw them
away much later when we get to that batch.

Actually, now that I think about it - I think the patch should throw
away the filter away after the initial pass over the outer relation,
because at that point we've used all the information in the filter.

I'm not sure it would make sense to then build smaller bloom filters for
individual batches, but maybe it would?

>> 4) Initially, the patch was aimed at hashjoins with batches, on the
>> premise that it can save some of the serialize/deserialize costs. But as
>> mentioned in the discussion, bloom filters can be beneficial even in the
>> single-batch case, when three conditions are met:
>>
>> a) join is selective (i.e. some rows don't have match in hash table)
>>
>> b) hash table > CPU cache
>>
>> c) bloom filter < CPU cache
> 
> Seems plausible. CPU cache efficiency is also affected by how many
> hash functions you end up using -- use too many, and it slows things
> down.
> 

Yeah, I admit those are rather crude rules.

>> We don't have a good way to determine size of the CPU cache, but I
>> think we can use some crude arbitrary limits. E.g. require that
>> the hash table is larger than 8MB and the bloom filter is smaller than
>> 4MB, or something like that.
> 
> FWIW, the logical way to size the Bloom filter is based on the number
> of (distinct) values, not based on the projected size of 

Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Claudio Freire
On Tue, Feb 20, 2018 at 8:23 PM, Peter Geoghegan  wrote:
> On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire  
> wrote:
>> I've worked a lot with bloom filters, and for large false positive
>> rates and large sets (multi-million entries), you get bloom filter
>> sizes of about 10 bits per distinct item.
>
> It's generally true that you need 9.6 bits per element to get a 1%
> false positive rate. 1% could be considered much too low here.
>
> Do we need to eliminate 99% of all hash join probes (that find nothing
> to join on) to make this Bloom filter optimization worthwhile?
> Personally, I doubt it.

Even for 90% it's about 4.6 bits per element.

What I'm saying is that you can't assume you can fit the whole thing
in work_mem. If it could be said that any set worth a hash join will
fit, you can build a work_mem-sized bloom filter and just accept that
if you exceed its capacity its filtering efficiency will drop
benignly. But IMO that can't be taken for granted, so at some point
you should just give up, the overhead of querying a low-quality BF
won't be worth it.

The HLL is good for that. You can keep adding to the BF until the HLL
tells you you're way past the capacity of a work_mem-sized BF, then
you free that memory and go on without it.

You might also want to consider scalable BFs. The smaller you can keep
the BF, the better chance you'll have of fitting it into the L3 cache
(you can fit quite a lot of entries into a decent-sized L3 cache).



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Tom Lane
David Rowley  writes:
> It seems the difference between these two cases is down to
> slot_getsomeattrs being asked to deform up to attnum 1000 for the
> create-alter.sql case, and only up to attnum 10 for the create.sql
> case. Both plans are using physical tlists per EXPLAIN VERBOSE. I've
> not managed to narrow down the reason for the difference yet.

There's logic in the execExpr compiler that detects the last attnum
we actually reference, if memory serves.

regards, tom lane



Re: extern keyword incorrectly used in some function definitions

2018-02-20 Thread David Rowley
On 20 February 2018 at 06:09, Tom Lane  wrote:
> But the rest
> of these look good; pushed.

Great. Thanks!

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread David Rowley
On 21 February 2018 at 00:38, David Rowley  wrote:
> Using: select sum(c10) from t;
>
...
> v11 + create.sql: tps = 3330.131437
> v11 + create-alter.sql: tps = 1398.635251

It seems the difference between these two cases is down to
slot_getsomeattrs being asked to deform up to attnum 1000 for the
create-alter.sql case, and only up to attnum 10 for the create.sql
case. Both plans are using physical tlists per EXPLAIN VERBOSE. I've
not managed to narrow down the reason for the difference yet. Looks
like it might take a bit of time with a debugger to find the point
where the code paths of the two cases diverge.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Peter Geoghegan
On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire  wrote:
> I've worked a lot with bloom filters, and for large false positive
> rates and large sets (multi-million entries), you get bloom filter
> sizes of about 10 bits per distinct item.

It's generally true that you need 9.6 bits per element to get a 1%
false positive rate. 1% could be considered much too low here.

Do we need to eliminate 99% of all hash join probes (that find nothing
to join on) to make this Bloom filter optimization worthwhile?
Personally, I doubt it.

> That's efficient, but it's not magic. It can still happen that the
> whole set can't fit in work_mem with an acceptable false positive
> rate.

A merge join is always going to be the better choice when extremely
memory constrained.

-- 
Peter Geoghegan



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Claudio Freire
On Tue, Feb 20, 2018 at 8:06 PM, Peter Geoghegan  wrote:
> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible amount of memory that any hash table of 40 million elements
> would require is much greater than 22.85MB --
> certainly closer to 20x than to 8x. Even that is pretty conservative.
> I bet there are plenty of hash joins where the ratio could be 30x or
> more. Not to mention cases with multiple batches.

I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.

That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Peter Geoghegan
On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
 wrote:
> In 2015/2016 I've been exploring if we could improve hash joins by
> leveraging bloom filters [1], and I was reminded about this idea in a
> thread about amcheck [2]. I also see that bloom filters were briefly
> mentioned in the thread about parallel hash [3].
>
> So I've decided to revive the old patch, rebase it to current master,
> and see if we can resolve the issues that killed it in 2016.

Cool.

> The issues are essentially about predicting when the bloom filter can
> improve anything, which is more a question of the hash join selectivity
> than the bloom filter parameters.

Absolutely. A Bloom filter could be considered an opportunistic thing.
The false positive rate that you estimate is going to be based on the
planner's rowcount estimate for the inner side of the join, which is
liable to be wrong anyway. It's important to be resilient to
misestimations, which Bloom filters are good at.

> This is not the same as "false positive rate" which is a feature of the
> bloom filter itself, and can be measured by simply counting bits set in
> the filter. That is important too, of course, but simple to determine.

Perhaps I'm being pedantic, but it's not exactly true that you can
measure the false positive rate by counting the bits. I do agree with
what I think you meant here, though, which is that the precise false
positive rate is not necessarily that important, while the selectivity
of the join is very important.

In general, hash joins work best when the join qual is highly
selective. This is not the same thing as having a small hash table, of
course.

> After thinking about this a bit more, I think this is pretty much what
> we do during join cardinality estimation - we estimate how many of the
> rows in "fact" have a matching row in "dim". I do think we can reuse
> that to decide if to use a bloom filter or not.
>
> Of course, the decision may need to be more elaborate (and perhaps
> formulated as part of costing), but the filter selectivity is the
> crucial piece. The attached patch does not do that yet, though.

I suspect that it could make sense to use a Bloom filter to summarize
the entire inner side of the join all at once, even when there are
multiple batches. I also suspect that this is particularly beneficial
with parallel hash joins, where IPC/synchronization overhead can be a
big problem.

> 4) Initially, the patch was aimed at hashjoins with batches, on the
> premise that it can save some of the serialize/deserialize costs. But as
> mentioned in the discussion, bloom filters can be beneficial even in the
> single-batch case, when three conditions are met:
>
> a) join is selective (i.e. some rows don't have match in hash table)
>
> b) hash table > CPU cache
>
> c) bloom filter < CPU cache

Seems plausible. CPU cache efficiency is also affected by how many
hash functions you end up using -- use too many, and it slows things
down.

> We don't have a good way to determine size of the CPU cache, but I think
we can use some crude arbitrary limits. E.g. require that the hash
table is larger than 8MB and the bloom filter is smaller than 4MB, or
> something like that.

FWIW, the logical way to size the Bloom filter is based on the number
of (distinct) values, not based on the projected size of the hash
table. The lossy nature of Bloom filters makes the average size of the
original, hashed elements irrelevant to a sizing calculation that
targets a certain final false positive rate. Making Bloom filter size
any fixed fraction of hash table size seems wrong to me for that
reason alone.

You should try to exploit the fact that a Bloom filter can summarize a
large set reasonably well with a very compact, simple representation.
A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
for cases where Bloom probes will save a lot of work, it probably
doesn't make all that much difference -- the hash join is still much
faster. If you're willing to accept a 10% false positive rate, then
you can summarize a set of 40 million distinct values with only
22.85MB of memory and 3 hash functions. I think that the smallest
possible amount of memory that any hash table of 40 million elements
would require is much greater than 22.85MB --
certainly closer to 20x than to 8x. Even that is pretty conservative.
I bet there are plenty of hash joins where the ratio could be 30x or
more. Not to mention cases with multiple batches.
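
(For reference, the standard sizing math behind those numbers, as a
small standalone sketch rather than anything from the patch: for n
distinct elements and a target false positive rate p, the optimal
filter has m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash
functions.  With n = 40 million and p = 0.10 that works out to about
4.8 bits per element, ~22.85MB and 3 hash functions; p = 0.01 needs
about 9.6 bits per element.)

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  n = 40e6;                       /* distinct elements */
    double  p = 0.10;                       /* target false positive rate */
    double  ln2 = log(2.0);
    double  m = -n * log(p) / (ln2 * ln2);  /* filter size in bits */
    int     k = (int) round((m / n) * ln2); /* number of hash functions */

    printf("%.1f bits/element, %.2f MB, %d hash functions\n",
           m / n, m / 8.0 / (1024 * 1024), k);
    return 0;
}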

-- 
Peter Geoghegan



Re: SHA-2 functions

2018-02-20 Thread Tom Lane
Peter Eisentraut  writes:
> On 2/19/18 21:07, Michael Paquier wrote:
>> varlena.c is already large and messy.  I would suggest to split into a
>> new file all the user-facing cryptographic functions, including md5 and
>> hex functions, say in src/backend/utils/adt/crypt.c.

> I had originally started a new file called hash.c, but I figured that
> would be quite confusing.  I can use crypt.c or a similar name.
> Although crypt.c sounds a lot like crypt().

cryptohashes.c or some such?  I concur with Michael that dropping this
into varlena.c isn't a great plan.

regards, tom lane



Re: SHA-2 functions

2018-02-20 Thread Peter Eisentraut
On 2/19/18 21:07, Michael Paquier wrote:
> +   sha224('abc')
> +   
> \x23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
> Some bytea characters from the hash are not able to show up correctly?
> This does not result in spaces.

U+200B is a zero-width space, used here to hint for possible line breaks.

> +   Note that for historic reasons, the function md5
> +   returns a hex-encoded value of type text whereas the SHA-2
> +   functions return type bytea.  Use the functions
> +   encode and decode to convert
> +   between the two.
> Adding an example would be nice.

OK

> varlena.c is already large and messy.  I would suggest to split into a
> new file all the user-facing cryptographic functions, including md5 and
> hex functions, say in src/backend/utils/adt/crypt.c.

I had originally started a new file called hash.c, but I figured that
would be quite confusing.  I can use crypt.c or a similar name.
Although crypt.c sounds a lot like crypt().

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Tomas Vondra
Hi,

In 2015/2016 I've been exploring if we could improve hash joins by
leveraging bloom filters [1], and I was reminded about this idea in a
thread about amcheck [2]. I also see that bloom filters were briefly
mentioned in the thread about parallel hash [3].

So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.

The issues are essentially about predicting when the bloom filter can
improve anything, which is more a question of the hash join selectivity
than the bloom filter parameters.

Consider a join on tables with FK on the join condition. That is,
something like this:

CREATE TABLE dim (id serial primary key, ...);
CREATE TABLE fact (dim_id int references dim(id), ...);

Then if you do a join

SELECT * FROM fact JOIN dim ON (id = dim_id);

the bloom filter can't really help, because all the dim_id values are
guaranteed to be in the hash table. So it can only really help with
queries like this:

SELECT * FROM fact JOIN dim ON (id = dim_id) WHERE (dim.x = 1000);

where the additional conditions on dim make the hash table incomplete in
some sense (i.e. the bloom will return false, saving quite a bit of
expensive work - lookup in large hash table or spilling tuple to disk).

This is not the same as "false positive rate" which is a feature of the
bloom filter itself, and can be measured by simply counting bits set in
the filter. That is important too, of course, but simple to determine.

The bloom filter "selectivity" is more a feature of the join, i.e. what
fraction of the values looked up in the bloom filter are expected to be
rejected. If many are rejected, the bloom filter helps. If few, the
bloom filter is a waste of time.

After thinking about this a bit more, I think this is pretty much what
we do during join cardinality estimation - we estimate how many of the
rows in "fact" have a matching row in "dim". I do think we can reuse
that to decide if to use a bloom filter or not.

Of course, the decision may need to be more elaborate (and perhaps
formulated as part of costing), but the filter selectivity is the
crucial piece. The attached patch does not do that yet, though.

The attached patch:

1) Only works in non-parallel case for now. Fixing this should not be a
big deal, once the costing/planning gets addressed.


2) Adds details about the bloom filter to EXPLAIN ANALYZE output, with
various important metrics (size of the filter, number of hash functions,
lookups/matches, fill factor).


3) Removes the HLL counter. I came to the conclusion that using HLL to
size the bloom filter makes very little sense, because there are about
three cases:

a) no batching (hash table fits into work_mem)

We can easily walk the hash table and compute "good enough" ndistinct
estimate by counting occupied buckets (or just using number of tuples in
the hash table, assuming the values are distinct).

At this point, the patch does not build the bloom filter in single-batch
cases, unless forced to do that by setting force_hashjoin_bloom=true.
More about this later.


b) batching from the beginning

The HLL is useless, because we need to initialize the bloom filter
before actually seeing any values. Instead, the patch simply uses the
expected number of rows (assuming those are distinct).


c) starting in single-batch mode, switching to batches later

We could use HLL to estimate number of distinct values in the initial
batch (when starting to batch), but it's unclear how is that useful.
This case means our estimates are off, and we don't really know how many
batches will be there. We could use some arbitrary multiple, I guess.

What I decided to do instead in the patch is sizing the bloom filter for
1/8 of work_mem. That seems like a viable compromise, as it's unlikely
to increase the number of batches.


4) Initially, the patch was aimed at hashjoins with batches, on the
premise that it can save some of the serialize/deserialize costs. But as
mentioned in the discussion, bloom filters can be beneficial even in the
single-batch case, when three conditions are met:

a) join is selective (i.e. some rows don't have match in hash table)

b) hash table > CPU cache

c) bloom filter < CPU cache

We don't have a good way to determine size of the CPU cache, but I think
we can use some crude arbitrary limits. E.g. require that the hash
table is larger than 8MB and bloom filter is smaller than 4MB, or
something like that.

Opinions?


regards


[1] https://www.postgresql.org/message-id/5670946E.8070705%402ndquadrant.com

[2]
https://www.postgresql.org/message-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1%402ndquadrant.com

[3]
https://www.postgresql.org/message-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1%402ndquadrant.com

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-02-20 Thread Robert Haas
On Tue, Feb 13, 2018 at 5:42 AM, Masahiko Sawada  wrote:
>> The fdw-transactions section of the documentation seems to imply that
>> henceforth every FDW must call FdwXactRegisterForeignServer, which I
>> think is an unacceptable API break.
>>
>> It doesn't seem advisable to make this behavior depend on
>> max_prepared_foreign_transactions.  I think that it should be an
>> server-level option: use 2PC for this server, or not?  FDWs that don't
>> have 2PC default to "no"; but others can be set to "yes" if the user
>> wishes.  But we shouldn't force that decision to be on a cluster-wide
>> basis.
>
> Since I've added a new option two_phase_commit to postgres_fdw we need
> to ask FDW whether the foreign server is 2PC-capable server or not in
> order to register the foreign server information. That's why the patch
> asked FDW to call  FdwXactRegisterForeignServer. However we can
> register a foreign server information automatically by executor (e.g.
> at  BeginDirectModify and at BeginForeignModify) if a foreign server
> itself has that information. We can add two_phase_commit_enabled
> column to pg_foreign_server system catalog and that column is set to
> true if the foriegn server is 2PC-capable (i.g. has enough functions)
> and user want to use it.

I don't see why this would need a new catalog column.
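
Foreign server options are already stored in pg_foreign_server.srvoptions,
so -- just as a sketch, reusing the two_phase_commit option name mentioned
above and an invented server name -- the user-visible part could simply be:

ALTER SERVER my_postgres_server OPTIONS (ADD two_phase_commit 'true');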

>> But that brings up another issue: why is MyFdwConnections named that
>> way and why does it have those semantics?  That is, why do we really
>> need a list of every FDW connection? I think we really only need the
>> ones that are 2PC-capable writes.  If a non-2PC-capable foreign server
>> is involved in the transaction, then we don't really to keep it in a
>> list.  We just need to flag the transaction as not locally prepareable
>> i.e. clear TwoPhaseReady.  I think that this could all be figured out
>> in a much less onerous way: if we ever perform a modification of a
>> foreign table, have nodeModifyTable.c either mark the transaction
>> non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign
>> server is not 2PC capable, or otherwise add the appropriate
>> information to MyFdwConnections, which can then be renamed to indicate
>> that it contains only information about preparable stuff.  Then you
>> don't need each individual FDW to be responsible about calling some
>> new function; the core code just does the right thing automatically.
>
> I could not get this comment. Did you mean that the foreign
> transaction on not 2PC-capable foreign server should be end in the
> same way as before (i.g. by XactCallback)?
>
> Currently, because there is not FDW API to end foreign transaction,
> almost all FDWs use XactCallbacks to end the transaction. But after
> introduced new FDW APIs, I think it's better to use FDW APIs to end
> transactions rather than using XactCallbacks. Otherwise we end up with
> having FDW APIs for 2PC (prepare and resolve) and XactCallback for
> ending the transaction, which would be hard to understand. So I've
> changed the foreign transaction management so that core code
> explicitly asks FDW to end/prepare a foreign transaction instead of
> ending it by individual FDWs. After introduced new FDW APIs, core code
> can have the information of all foreign servers involved with the
> transaction and call each APIs at appropriate timing.

Well, it's one thing to introduce a new API.  It's another thing to
require existing FDWs to be updated to use it.  There are a lot of
existing FDWs out there, and I think that it is needlessly unfriendly
to force them all to be updated for v11 (or whenever this gets
committed) even if we think the new API is clearly better.  FDWs that
work today should continue working after this patch is committed.

Separately, I think there's a question of whether the new API is in
fact better -- I'm not sure I have a completely well-formed opinion
about that yet.

> In FdwXactResolveForeignTranasction(), resolver concludes the fate of
> transaction by seeing the status of fdwxact entry and the state of
> local transaction in clog. what I need to do is making that function
> log a complaint in commit case if couldn't find the prepared
> transaction, and not do that in abort case.

+1.

> Also, postgres_fdw don't
> raise an error even if we could not find prepared transaction on
> foreign server because it might have been resolved by other process.

+1.

> But this is now responsible by FDW. I should change it to resolver
> side. That is, FDW can raise error in ordinarily way but core code
> should catch and process it.

I don't understand exactly what you mean here.

> You're right. Perhaps we can deal with it by PrescanFdwXacts until
> reach consistent point, and then have vac_update_datfrozenxid check
> local xids of un-resolved fdwxact to determine the new datfrozenxid.
> Since the local xids of un-resolved fdwxacts would not be relevant
> with vacuuming, we don't need to include it to snapshot and
> GetOldestXmin etc. Also we hint to 

Re: pgsql: Allow UNIQUE indexes on partitioned tables

2018-02-20 Thread David G. Johnston
On Tue, Feb 20, 2018 at 8:36 AM, Alvaro Herrera 
wrote:

> Many thanks for reading through it!
>
> David G. Johnston wrote:
> > I found the following change to be confusing.
>  [...]
> > I was expecting the doc for ADD CONSTRAINT USING INDEX to note the
> > limitation explicitly - in lieu of the above paragraph.
>
> Agreed.  I moved the note to ADD CONSTRAINT and added a different on to
> ADD CONSTRAINT USING INDEX.
>
> > Also, I cannot reason out what the following limitation means:
> >
> > /doc/src/sgml/ref/create_table.sgml
> > +  If any partitions are in turn partitioned, all columns of each
> > partition
> > +  key are considered at each level below the
> UNIQUE
> > +  constraint.
>
> I can see that being unclear.  I tried to be very concise, to avoid
> spending too many words on what is mostly a fringe feature; but that
> probably didn't work very well.  Wording suggestions welcome.
>
​[...]​


> then you may create a unique or PK constraint on t only if you include
> both columns (a,b).  You may not create a PK on t (a), which is a bit
> surprising since (b) is not part of the partition key of t directly,
> only of t_1.
>
> Of course, if you create a unique constraint on t_1 (i.e. it doesn't
> cover all of t) then you may use (b) alone -- that's what "each level
> below the UNIQUE constraint" supposed to convey.
>

Something like:

When establishing a unique constraint for a multi-level partition hierarchy
all the "partition by" columns of the target partitioned table, as well as
those of all its descendant partitioned tables, must be included in the
constraint definition.

If I understand the above then the following failing test would be a worthy
addition to memorialize the behavior of ALTER TABLE ATTACH under this
constraint.

create table idxpart (a int primary key, b int) partition by range (a);
create table idxpart1 (a int not null, b int, primary key (a, b)) partition
by range (a, b);
alter table idxpart attach partition idxpart1 for values from (1) to (1000);
​

>
> I have trouble coming up with a real-world example where you would run
> into this limitation in practice.
>

​Indeed
​
David J.


Re: Option to ensure monotonic timestamps

2018-02-20 Thread Brent Kerby
The issue is that presence of timestamps is fundamental to the
functionality of temporal tables. The users need to be able to make queries
on temporal tables in terms of timestamps; LSNs won't mean anything to
them. It would be an option to implement the temporal tables using LSNs
under the hood, but then it is still required to construct a monotonic
mapping between LSNs and timestamps in order for it to be usable. The most
natural method for constructing such a mapping would be to use the stored
commit timestamps; if these are monotonic, then that works, but we gain
little by storing the LSNs, since they will just be converted to timestamps
anyway when they're used. But if the timestamps aren't monotonic, then
we're faced with the same problem as before; we could try to patch up the
non-monotonicity in the mapping after-the-fact, using clamping or possibly
some more sophisticated method, but this is inefficient and could get ugly
fast. It would seem preferable to just ensure that the timestamps are
monotonic to begin with. And based on your observations of why we shouldn't
try to enforce this on every application of GetCurrentTimestamp, I think
maybe it would be cleaner to just enforce this on commit timestamps (and
only if enabled by a configuration option, of course), since this is all
that is needed and should be simpler and less expensive. And if a violation
occurs, we can just abort the transaction with an error, rather than
clamping the timestamp. This way we don't end up with an ugly scenario like
you pointed out, where we have one set of timestamps that are clamped
(i.e., for commit timestamps) and others that are not; and also this way,
if the $idiotsysadmin sets the clock back an hour, then we get errors
immediately instead of a whole hour of temporal table history being messed
up.

On Tue, Feb 20, 2018 at 11:36 AM, Tom Lane  wrote:

> Andres Freund  writes:
> > On 2018-02-20 12:32:22 -0500, Tom Lane wrote:
> >> The "global" variable would actually need to be cluster-wide, ie in
> shared
> >> memory, which would imply contention and the need for locks.  I think
> the
> >> overhead of this would be mighty high, and the return pretty low.
>
> > I think if we wanted to go for something like this (which I doubt), we'd
> > have that global variable as an atomic 64bit variable in shmem, and
> > *only* use it for stuff where the ordering actually matters. I.e. not
> > for transaction start times etc...
>
> Then you've got problems with figuring out which usages "matter" and which
> don't, and being sure you don't ever compare timestamps from the two
> different sources.  Seems mighty fragile to me, and reminiscent of the
> replication problems that forced us to drop support for float timestamps.
>
> In any case I'd not much like a system that mostly reported in system
> clock time except transaction commit timestamps are on some other
> timescale.
>
> But really, the killer point here is your upthread comment that even if
> GetCurrentTimestamp were guaranteed to return monotonic time, that would
> not guarantee that commit timestamps match physical commit order, which
> was the OP's goal.  At least not unless we read the clock while holding
> WALWriteLock, which I'm pretty sure everyone will say Ain't Happening.
>
> I think your not-very-explicit suggestion that he should work in
> commit LSNs, not timestamps at all, is a far superior answer.
>
> regards, tom lane
>


Re: Option to ensure monotonic timestamps

2018-02-20 Thread Brent Kerby
Right, I'm talking about temporal tables in the sense of the SQL:2011
standard. I know there's a Postgres extension temporal_tables by Vlad
Arkhipov (https://github.com/arkhipov/temporal_tables/) that approximates
this. There's also a way of doing it using only triggers written in
PL/pgSQL, by Paolo Chiodi (
https://www.nearform.com/blog/time-travel-with-postgresql-on-amazon-rds/).
In these solutions, however, as well as in the SQL Server 2016 and many
other implementations, the transaction start time (as opposed to commit
time) is used as the time at which the data is considered to have changed,
which does not ensure consistency of the historical data: for instance, you
can end up with a situation where, when viewed "AS OF" certain time points,
the database will appear to have had non-unique primary keys and broken
foreign key references (e.g., see
https://dba.stackexchange.com/questions/143241/why-do-temporal-tables-log-the-begin-time-of-the-transaction/198204#198204
).

I wasn't aware of that recent work. The "AS OF" syntax seems useful,
although if I understand it correctly it doesn't provide the full power of
the temporal tables. With a full implementation of temporal tables, for
each temporal table there's a corresponding history table that can be
directly accessed by queries, making it possible for instance to see a list
of all changes that have affected rows satisfying certain conditions, or to
see the data "AS OF" not just constant times but "AS OF" some variable time
given by a column in another table that is being joined with (The ability
to do this is important in my application).
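
To make that concrete, the kind of query I have in mind looks roughly like
this -- the table and column names are invented, and the history table
follows the sys_period tstzrange layout used by the temporal_tables
extension:

SELECT o.id, h.balance
FROM orders o
JOIN accounts_history h
  ON h.account_id = o.account_id
 AND h.sys_period @> o.ordered_at;  -- "AS OF" a time taken from the joined row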

I agree with Tom's points and don't think that what I originally proposed
is a very good solution, but it still makes me uncomfortable to trust
blindly in the kernel clock when the integrity of the data hangs in the
balance. How about the following alternative proposal?: Instead of trying
to enforce monotonicity of all Postgres-generated timestamps, we look only
at the commit timestamps, and if at the time that we are about to commit we
detect that a violation occurs, instead of clamping (which I agree is ugly)
we abort the transaction with an error. And this should happen only if a
configuration option, say 'monotonic_commit_timestamp', is enabled. With
this approach, we only need to keep track of the previous commit timestamp,
which is already being done if "track_commit_timestamp" is enabled (which
should probably be a prerequisite for enabling
'monotonic_commit_timestamp'), so that should impose minimal overhead -- no
need for any additional locking or including anything more in the WAL,
right?
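
Usage would then look roughly like this (monotonic_commit_timestamp being
the hypothetical GUC proposed here, not something that exists today):

ALTER SYSTEM SET track_commit_timestamp = on;      -- takes effect after restart
ALTER SYSTEM SET monotonic_commit_timestamp = on;  -- proposed option, sketch only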


On Tue, Feb 20, 2018 at 11:09 AM, Patrick Krecker 
wrote:

> On Tue, Feb 20, 2018 at 9:51 AM, Andres Freund  wrote:
> > Hi,
> >
> > Leaving Tom's concerns aside:
> >
> > On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:
> >> Hi, I'm new to Postgres hacking, and I'm interested in the possibility
> of a
> >> new feature to make it possible to ensure that Postgres-generated
> >> timestamps never decrease even if the system clock may step backwards.
> My
> >> use case is that I'm implementing a form of temporal tables based on
> >> transaction commit timestamps (as returned by pg_xact_commit_timestamp),
> >> and to ensure the integrity of the system I need to know that the
> ordering
> >> of the commit timestamps will always be consistent with the order in
> which
> >> the transactions actually committed.
> >
> > The acquiration of the commit timestamp and the actual visibility of the
> > commit will not necessarily be sufficient for many things. A backend can
> > theoretically sleep for an hour between
> >
> > static TransactionId
> > RecordTransactionCommit(void)
> > {
> > ...
> > SetCurrentTransactionStopTimestamp();
> > /* here */
> > XactLogCommitRecord(xactStopTimestamp,
> > nchildren,
> children, nrels, rels,
> > nmsgs,
> invalMessages,
> >
>  RelcacheInitFileInval, forceSyncCommit,
> > MyXactFlags,
> >
>  InvalidTransactionId /* plain commit */ );
> > }
> >
> > static void
> > CommitTransaction(void)
> > {
> > ...
> > /*
> >  * We need to mark our XIDs as committed in pg_xact.
> This is where we
> >  * durably commit.
> >  */
> > latestXid = RecordTransactionCommit();
> >
> > /* here */
> >
> > /*
> >  * Let others know about no transaction in progress by me. Note
> that this
> >  * must be done _before_ releasing locks we hold and _after_
> >  * RecordTransactionCommit.
> >  */
> > ProcArrayEndTransaction(MyProc, latestXid);
> >
> > whether that affects your approach I do not know.
> >
> >
> >> Any thoughts?
> >
> > Why are you looking to do something 

Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Andres Freund
Hi,

On 2018-02-20 21:28:40 +0100, Tomas Vondra wrote:
> I don't quite understand why would this case need the TPC-H tests, or
> why would TPC-H give us more than the very focused tests we've already
> done.

Because a more complex query shows the cost of changing cache access
costs better than a trivial query. Simplistic queries will often
e.g. not show cost of additional branch predictor usage, because the
branch history is large enough to fit the simple query. But once you go
to a more complex query, and that's not necessarily the case anymore.


> The first test was testing a fairly short query where any such
> additional overhead would be much more obvious, compared to the TPC-H
> queries that usually do a lot of other expensive stuff.

Unfortunately such reasoning IME doesn't work well with cpu-bound stuff.

Greetings,

Andres Freund



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Tomas Vondra

On 02/20/2018 09:14 PM, Tom Lane wrote:
> Andres Freund  writes:
>> On 2018-02-20 20:57:36 +0100, Tomas Vondra wrote:
>>> The question is how should the schema for TPC-H look like. Because if
>>> you just do the usual test without any ALTER TABLE ADD COLUMN, then you
>>> really won't get any difference at all. The fast default stuff will be
>>> completely "inactive".
> 
>> I think just verifying that there's no meaningful effect with/without
>> the patch is good enough. As soon as there's actual new columns that
>> take advantage of the new functionality I think some degradation is fair
>> game, you got some benefit for it.
> 
> Yeah, I think what we want to know is how much penalty people are paying
> for the feature to exist even though they aren't using it.  That seems
> to break down into two cases:
> 
> (1) use of a table that's never had any added columns;
> 
> (2) use of a table that has had columns added, but they just default
> to null the same as before.  Is there more relcache setup time anyway?
> Is deforming slower even if there are no fast defaults?
> 
> Case (1) could be answered by a straight TPC-H test with/without patch.

I don't quite understand why would this case need the TPC-H tests, or
why would TPC-H give us more than the very focused tests we've already
done. The first test was testing a fairly short query where any such
additional overhead would be much more obvious, compared to the TPC-H
queries that usually do a lot of other expensive stuff.

I'm fine with doing the tests, of course.

> Case (2) seems a bit harder, since presumably you want to measure
> accesses to tuples that are "short" compared to the current table
> tupdesc, but for which the extra columns are still default-null.
> 

Yeah. I think we can add a column or two to the "fact" tables
(particularly lineitem), and then tweak the queries to also compute
aggregates on this new column.
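
Roughly like this, say -- the column name and default are arbitrary, and the
query is just a Q1-style aggregate extended to touch the new column:

ALTER TABLE lineitem ADD COLUMN l_extradiscount numeric DEFAULT 0.01;

SELECT l_returnflag, l_linestatus,
       sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       avg(l_extradiscount) AS avg_extra
  FROM lineitem
 GROUP BY l_returnflag, l_linestatus;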

Unless someone has better ideas, I'll do such tests once the machine
completes the stuff that's running now.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-02-20 20:57:36 +0100, Tomas Vondra wrote:
>> The question is how should the schema for TPC-H look like. Because if
>> you just do the usual test without any ALTER TABLE ADD COLUMN, then you
>> really won't get any difference at all. The fast default stuff will be
>> completely "inactive".

> I think just verifying that there's no meaningful effect with/without
> the patch is good enough. As soon as there's actual new columns that
> take advantage of the new functionality I think some degradation is fair
> game, you got some benefit for it.

Yeah, I think what we want to know is how much penalty people are paying
for the feature to exist even though they aren't using it.  That seems
to break down into two cases:

(1) use of a table that's never had any added columns;

(2) use of a table that has had columns added, but they just default
to null the same as before.  Is there more relcache setup time anyway?
Is deforming slower even if there are no fast defaults?

Case (1) could be answered by a straight TPC-H test with/without patch.
Case (2) seems a bit harder, since presumably you want to measure accesses
to tuples that are "short" compared to the current table tupdesc, but for
which the extra columns are still default-null.

regards, tom lane



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Andres Freund
Hi,

On 2018-02-20 20:57:36 +0100, Tomas Vondra wrote:
> The question is how should the schema for TPC-H look like. Because if
> you just do the usual test without any ALTER TABLE ADD COLUMN, then you
> really won't get any difference at all. The fast default stuff will be
> completely "inactive".

Well, there'll still be changed behaviour due to the changed sizes of
structures and new out of line function calls. The deforming code is
quite sensitive to such changes.

I think just verifying that there's no meaningful effect with/without
the patch is good enough. As soon as there's actual new columns that
take advantage of the new functionality I think some degradation is fair
game, you got some benefit for it.

Greetings,

Andres Freund



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Tomas Vondra


On 02/20/2018 06:43 PM, Andres Freund wrote:
> 
> 
> On February 20, 2018 5:03:58 AM PST, Petr Jelinek 
>  wrote:
>> On 20/02/18 07:42, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2018-02-17 00:23:40 +0100, Tomas Vondra wrote:
 Anyway, I consider the performance to be OK. But perhaps Andres
>> could
 comment on this too, as he requested the benchmarks.
>>>
>>> My performance concerns were less about CREATE TABLE related things
>> than
>>> about analytics workloads or such, where deforming is the primary
>>> bottleneck.  I think it should be ok, but doing a before/after tpc-h
>> of
>>> scale 5-10 or so wouldn't be a bad thing to verify.
>>>
>>
>> The test Tomas is doing is analytical query, it's running sum on the
>> new
>> fast default column.
>>
>> He uses create and create-alter names as comparison between when
>> the table was created with the columns and when the columns were
>> added using fast default.
> 
> It's still a fairly simplistic test case. Running some queries with 
> reasonably well known characteristics seems like a good idea
> regardless. It's not like a scale 5 run takes that long.
>

I can do that, although I don't really expect any surprises there.

The simple tests I did were supposed to test two corner cases - best
case, when there are no columns with "fast default", and worst case when
all columns (and tuples) have "fast default". Per David's findings, the
second test was not perfectly right, but that's a separate issue.

The question is how should the schema for TPC-H look like. Because if
you just do the usual test without any ALTER TABLE ADD COLUMN, then you
really won't get any difference at all. The fast default stuff will be
completely "inactive".

So, we need to tweak the TPC-H schema somehow ... We need (a) to add
some new columns with default values, (b) tweak some of the queries to
use those new columns. Suggestions?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Andres Freund
Hi,

On 2018-02-20 13:50:54 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-02-20 13:07:30 -0500, Tom Lane wrote:
> >> I can easily think of problems that will ensue if we try to support that
> >> case, because right now the toast mechanisms assume that OOL toasted
> >> values can only be referenced from the associated table.
> 
> > What problem are you seeing with treating the toasted value to be from
> > pg_attribute? I'm only drinking my first coffee just now, so I might be
> > missing something...
> 
> Locking/vacuuming is exactly what I'm worried about.  Right now, a
> transaction cannot reference a toast value without holding at least
> AccessShare on the parent table.

Is that actually reliably true? Doesn't e.g. using a variable from a
rolled-back subtransaction in plpgsql circumvent this already?  There
are a few bugs around this though, so that's not really a convincing
argument of there not being any danger :/


> That would not be true of OOL fast defaults that are living in
> pg_attribute's toast table (if it had one).  If you think this isn't
> potentially problematic, see the problems that forced us into hacks
> like 08e261cbc.

I don't think the equivalent problem quite applies here, the OOL default
should afaics be immutable. But yea, it's a concern.


> I guess the fix equivalent to 08e261cbc would be to require any toasted
> fast-default value to be detoasted into line whenever it's copied into
> a tupledesc, rather than possibly being fetched later.

Yea, that's probably a good idea. Correctness aside, it's probably better
to detoast once rather than do so for every single row for performance
reasons (detoasting the same row over and over in multiple backends
would lead to quite the contention on used buffer headers and relation
locks).


> > Now we certainly would need to make sure that the corresponding
> > pg_attribute row containing the default value doesn't go away too early,
> > but that shouldn't be much of a problem given that we never remove
> > them. I wondered for a second if there's problematic cases where the
> > default value is referenced by an index, and then the default-adding
> > transaction rolls back. But I can't construct anything realistically
> > problematic.
> 
> Oooh ... but no, because we don't allow toasted values to be copied into
> indexes, for (I think) exactly this reason.  See index_form_tuple.

I don't think that'd necessarily provide full protection - what about a
index recheck during a lossy bitmap index scan for example? But given
that any such index creation would also be rolled back I'm not too
concerned.

Greetings,

Andres Freund



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-02-20 13:07:30 -0500, Tom Lane wrote:
>> I can easily think of problems that will ensue if we try to support that
>> case, because right now the toast mechanisms assume that OOL toasted
>> values can only be referenced from the associated table.

> What problem are you seeing with treating the toasted value to be from
> pg_attribute? I'm only drinking my first coffee just now, so I might be
> missing something...

Locking/vacuuming is exactly what I'm worried about.  Right now, a
transaction cannot reference a toast value without holding at least
AccessShare on the parent table.  That would not be true of OOL fast
defaults that are living in pg_attribute's toast table (if it had one).
If you think this isn't potentially problematic, see the problems that
forced us into hacks like 08e261cbc.

I guess the fix equivalent to 08e261cbc would be to require any toasted
fast-default value to be detoasted into line whenever it's copied into
a tupledesc, rather than possibly being fetched later.

> Now we certainly would need to make sure that the corresponding
> pg_attribute row containing the default value doesn't go away too early,
> but that shouldn't be much of a problem given that we never remove
> them. I wondered for a second if there's problematic cases where the
> default value is referenced by an index, and then the default-adding
> transaction rolls back. But I can't construct anything realistically
> problematic.

Oooh ... but no, because we don't allow toasted values to be copied into
indexes, for (I think) exactly this reason.  See index_form_tuple.

regards, tom lane



Re: Option to ensure monotonic timestamps

2018-02-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-02-20 12:32:22 -0500, Tom Lane wrote:
>> The "global" variable would actually need to be cluster-wide, ie in shared
>> memory, which would imply contention and the need for locks.  I think the
>> overhead of this would be mighty high, and the return pretty low.

> I think if we wanted to go for something like this (which I doubt), we'd
> have that global variable as an atomic 64bit variable in shmem, and
> *only* use it for stuff where the ordering actually matters. I.e. not
> for transaction start times etc...

Then you've got problems with figuring out which usages "matter" and which
don't, and being sure you don't ever compare timestamps from the two
different sources.  Seems mighty fragile to me, and reminiscent of the
replication problems that forced us to drop support for float timestamps.

In any case I'd not much like a system that mostly reported in system
clock time except transaction commit timestamps are on some other
timescale.

But really, the killer point here is your upthread comment that even if
GetCurrentTimestamp were guaranteed to return monotonic time, that would
not guarantee that commit timestamps match physical commit order, which
was the OP's goal.  At least not unless we read the clock while holding
WALWriteLock, which I'm pretty sure everyone will say Ain't Happening.

I think your not-very-explicit suggestion that he should work in
commit LSNs, not timestamps at all, is a far superior answer.

regards, tom lane



Re: Option to ensure monotonic timestamps

2018-02-20 Thread Andres Freund
Hi,

On 2018-02-20 12:32:22 -0500, Tom Lane wrote:
> The "global" variable would actually need to be cluster-wide, ie in shared
> memory, which would imply contention and the need for locks.  I think the
> overhead of this would be mighty high, and the return pretty low.

I think if we wanted to go for something like this (which I doubt), we'd
have that global variable as an atomic 64bit variable in shmem, and
*only* use it for stuff where the ordering actually matters. I.e. not
for transaction start times etc...


> It's also worth pointing out that if you don't trust the kernel clock,
> simply clamping to the last returned value isn't likely to be terribly
> satisfactory.  What if $idiotsysadmin steps the clock back an hour?
> We've had actual problems of that sort, for example with the stats
> collector going AWOL for awhile because it thought it'd already written a
> sufficiently new stats file.  There's now an explicit check for clock-
> went-backwards in pgstat_recv_inquiry, which will be broken in that sort
> of scenario if you cause GetCurrentTimestamp to do clamping internally.

I guess you could hack something together with CLOCK_MONOTONIC or such,
but b.

Greetings,

Andres Freund



Re: master check fails on Windows Server 2008

2018-02-20 Thread Tom Lane
Marina Polyakova  writes:
> On 20-02-2018 3:37, Tom Lane wrote:
>> 4. Try to tweak the stats_ext.sql test conditions in some more refined
>> way to get the test to pass everywhere.  This'd be a lot of work with
>> no guarantee of success, so I'm not too excited about it.

> Thank you for your explanations! I'll try to do something in this 
> direction..

OK.  The least painful fix might be to establish a different work_mem
setting just for that one query.
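
I.e. the usual pattern in the regression scripts, with a value picked per
the numbers above (the exact figure would need a bit of experimentation):

SET work_mem = '200kB';   -- illustrative value only
-- ... the one stats_ext query that needs the hashagg plan ...
RESET work_mem;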

However, if you're intent on putting work into continued support of
--disable-float8-byval, I would *strongly* suggest setting up a buildfarm
member that runs that way, because otherwise we're pretty much guaranteed
to break it again.  I continue to wonder if it's not better to just remove
the option and thereby simplify our lives.  What's the actual value of
having it anymore?

regards, tom lane



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Andres Freund
Hi,

On 2018-02-20 13:07:30 -0500, Tom Lane wrote:
> ... btw, I've not read this patch in any detail, but the recent thread
> about toast tables for system catalogs prompts me to wonder what happens
> if a "fast default" value is large enough to require out-of-line toasting.

Hm, interesting.


> I can easily think of problems that will ensue if we try to support that
> case, because right now the toast mechanisms assume that OOL toasted
> values can only be referenced from the associated table.

What problem are you seeing with treating the toasted value to be from
pg_attribute? I'm only drinking my first coffee just now, so I might be
missing something...

Now we certainly would need to make sure that the corresponding
pg_attribute row containing the default value doesn't go away too early,
but that shouldn't be much of a problem given that we never remove
them. I wondered for a second if there's problematic cases where the
default value is referenced by an index, and then the default-adding
transaction rolls back. But I can't construct anything realistically
problematic.

Greetings,

Andres Freund



Re: master check fails on Windows Server 2008

2018-02-20 Thread Marina Polyakova

On 20-02-2018 3:37, Tom Lane wrote:

Ah-hah.  I can reproduce the described failure if I configure with
--disable-float8-byval on an otherwise 64-bit machine.  It appears that
the minimum work_mem setting that will allow this query to use a hashagg
plan on such a configuration is about 155kB (which squares with the
results you show).  On the other hand, in a normal 64-bit configuration,
work_mem settings of 160kB or more cause other failures (due to other
plans switching to hashagg), and on a 32-bit machine I see such failures
with work_mem of 150kB or more.  So there's basically no setting of
work_mem that would make all these tests pass everywhere.

I see several things we could do about this:

1. Nothing; just say "sorry, we don't promise that the regression tests
pass with no plan differences on nonstandard configurations".  Given that
--disable-float8-byval has hardly any real-world use, there is not a lot
of downside to that.

2. Decide that --disable-float8-byval, and for that matter
--disable-float4-byval, have no real-world use at all and take them out.
There was some point in those options back when we cared about binary
compatibility with version-zero C functions, but now I'm not sure why
anyone would use them.

3. Drop that one test case from stats_ext.sql; I'm not sure how much
additional test value it's actually bringing.

4. Try to tweak the stats_ext.sql test conditions in some more refined
way to get the test to pass everywhere.  This'd be a lot of work with
no guarantee of success, so I'm not too excited about it.


Thank you for your explanations! I'll try to do something in this 
direction..



5. Something else?

regards, tom lane


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Option to ensure monotonic timestamps

2018-02-20 Thread Patrick Krecker
On Tue, Feb 20, 2018 at 9:51 AM, Andres Freund  wrote:
> Hi,
>
> Leaving Tom's concerns aside:
>
> On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:
>> Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
>> new feature to make it possible to ensure that Postgres-generated
>> timestamps never decrease even if the system clock may step backwards. My
>> use case is that I'm implementing a form of temporal tables based on
>> transaction commit timestamps (as returned by pg_xact_commit_timestamp),
>> and to ensure the integrity of the system I need to know that the ordering
>> of the commit timestamps will always be consistent with the order in which
>> the transactions actually committed.
>
> The acquiration of the commit timestamp and the actual visibility of the
> commit will not necessarily be sufficient for many things. A backend can
> theoretically sleep for an hour between
>
> static TransactionId
> RecordTransactionCommit(void)
> {
> ...
> SetCurrentTransactionStopTimestamp();
> /* here */
> XactLogCommitRecord(xactStopTimestamp,
> nchildren, children, 
> nrels, rels,
> nmsgs, invalMessages,
> 
> RelcacheInitFileInval, forceSyncCommit,
> MyXactFlags,
> InvalidTransactionId 
> /* plain commit */ );
> }
>
> static void
> CommitTransaction(void)
> {
> ...
> /*
>  * We need to mark our XIDs as committed in pg_xact.  This is 
> where we
>  * durably commit.
>  */
> latestXid = RecordTransactionCommit();
>
> /* here */
>
> /*
>  * Let others know about no transaction in progress by me. Note that 
> this
>  * must be done _before_ releasing locks we hold and _after_
>  * RecordTransactionCommit.
>  */
> ProcArrayEndTransaction(MyProc, latestXid);
>
> whether that affects your approach I do not know.
>
>
>> Any thoughts?
>
> Why are you looking to do something timestamp based in the first place?
> It's a bit hard to give good advice without further information...
>
> Greetings,
>
> Andres Freund
>

Hi Brent --

I haven't heard of temporal tables before, but I guess it's a feature
of SQL Server 2016. It sounds similar to some recent work in progress
to add "AS OF" to SELECT statements:
https://www.postgresql.org/message-id/78aadf6b-86d4-21b9-9c2a-51f1efb8a...@postgrespro.ru

Patrick



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Tom Lane
... btw, I've not read this patch in any detail, but the recent thread
about toast tables for system catalogs prompts me to wonder what happens
if a "fast default" value is large enough to require out-of-line toasting.
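
That is, something along these lines (table name invented; note that a
highly compressible value like this would likely just be compressed inline,
so forcing actual out-of-line storage takes a less compressible default):

ALTER TABLE t ADD COLUMN filler text DEFAULT repeat('some long filler text ', 500);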

I can easily think of problems that will ensue if we try to support that
case, because right now the toast mechanisms assume that OOL toasted
values can only be referenced from the associated table.

regards, tom lane



Re: Option to ensure monotonic timestamps

2018-02-20 Thread Andres Freund
Hi,

Leaving Tom's concerns aside:

On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:
> Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
> new feature to make it possible to ensure that Postgres-generated
> timestamps never decrease even if the system clock may step backwards. My
> use case is that I'm implementing a form of temporal tables based on
> transaction commit timestamps (as returned by pg_xact_commit_timestamp),
> and to ensure the integrity of the system I need to know that the ordering
> of the commit timestamps will always be consistent with the order in which
> the transactions actually committed.

The acquiration of the commit timestamp and the actual visibility of the
commit will not necessarily be sufficient for many things. A backend can
theoretically sleep for an hour between

static TransactionId
RecordTransactionCommit(void)
{
...
SetCurrentTransactionStopTimestamp();
/* here */
XactLogCommitRecord(xactStopTimestamp,
nchildren, children, 
nrels, rels,
nmsgs, invalMessages,
RelcacheInitFileInval, 
forceSyncCommit,
MyXactFlags,
InvalidTransactionId /* 
plain commit */ );
}

static void
CommitTransaction(void)
{
...
/*
 * We need to mark our XIDs as committed in pg_xact.  This is 
where we
 * durably commit.
 */
latestXid = RecordTransactionCommit();

/* here */

/*
 * Let others know about no transaction in progress by me. Note that 
this
 * must be done _before_ releasing locks we hold and _after_
 * RecordTransactionCommit.
 */
ProcArrayEndTransaction(MyProc, latestXid);

whether that affects your approach I do not know.


> Any thoughts?

Why are you looking to do something timestamp based in the first place?
It's a bit hard to give good advice without further information...

Greetings,

Andres Freund



Re: [PATCH] Add a few suppression rules for Valgrind

2018-02-20 Thread Tom Lane
Andres Freund  writes:
>> I decided to run the code from master branch under Valgrind and
>> discovered that it reports some errors.

> On my systems I just include a global valgrind suppression file which
> includes libc specific things and then the postgres valgrind.supp. I'm
> very hesitant to add suppressions to the codebase for this, seems too
> likely to actually hide bugs.

Yeah, I think it's a much better idea to add a local suppression file
for problems in your local libc or other libraries.  That's what I do.
Looking at mine, it also suppresses some weirdnesses in RHEL6's version
of Perl, which would surely be inappropriate on other platforms.

regards, tom lane



Re: [PATCH] Add a few suppression rules for Valgrind

2018-02-20 Thread Andres Freund
Hi,

> I decided to run the code from master branch under Valgrind and
> discovered that it reports some errors.
> 
> There are multiple reports like this one (seems to be a false alarm):
> 
> ```
> Invalid read of size 16
>at 0x605F488: __wcsnlen_sse4_1 (in /usr/lib/libc-2.26.so)
>by 0x604F5C2: wcsrtombs (in /usr/lib/libc-2.26.so)
>by 0x5FE4C41: wcstombs (in /usr/lib/libc-2.26.so)
>by 0x6FFAFC: wchar2char (pg_locale.c:1641)
>by 0x6937A0: str_tolower (formatting.c:1599)
>by 0x6F88E8: lower (oracle_compat.c:50)
>by 0x43097D: ExecInterpExpr (execExprInterp.c:678)
>by 0x48B201: ExecEvalExpr (executor.h:286)
>by 0x48BBD6: tfuncInitialize (nodeTableFuncscan.c:369)
>by 0x48B91E: tfuncFetchRows (nodeTableFuncscan.c:299)
>by 0x48B245: TableFuncNext (nodeTableFuncscan.c:65)
>by 0x4462E6: ExecScanFetch (execScan.c:95)
> Address 0x513adab0 is 7,808 bytes inside a recently re-allocated
>   block of size 16,384 alloc'd
>at 0x4C2CEDF: malloc (vg_replace_malloc.c:299)
>by 0x7E2807: AllocSetAlloc (aset.c:934)
>by 0x7EF481: palloc (mcxt.c:848)
>by 0x42D2A7: ExprEvalPushStep (execExpr.c:2131)
>by 0x42D7DB: ExecPushExprSlots (execExpr.c:2286)
>by 0x42D723: ExecInitExprSlots (execExpr.c:2265)
>by 0x429B77: ExecBuildProjectionInfo (execExpr.c:370)
>by 0x44A39B: ExecAssignProjectionInfo (execUtils.c:457)
>by 0x4733FE: ExecInitNestLoop (nodeNestloop.c:308)
>by 0x4442F0: ExecInitNode (execProcnode.c:290)
>by 0x43C5EB: InitPlan (execMain.c:1043)
>by 0x43B31C: standard_ExecutorStart (execMain.c:262)

These are libc issues, I don't think we should carry exceptions for
those. Normally valgrind after a while will catch up and internally
ignore them. Here the issue is that the SSE implementation can read a
few trailing bytes at the end of the string, which is harmless.

Normally valgrind fixes this issue by "intercepting" such libc symbols,
but I suspect that some libc optimizations broke that.


On my systems I just include a global valgrind suppression file which
includes libc specific things and then the postgres valgrind.supp. I'm
very hesitant to add suppressions to the codebase for this, seems too
likely to actually hide bugs.


> ```
> 
> Also there are many reports like this one (definitely false alarm):
> 
> ```
> Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
> at 0x60A2F90: epoll_pwait (in /usr/lib/libc-2.26.so)
> by 0x5F7B24: WaitEventSetWaitBlock (latch.c:1048)
> by 0x5F79E8: WaitEventSetWait (latch.c:1000)
> by 0x493A79: secure_read (be-secure.c:170)
> by 0x4A3DE2: pq_recvbuf (pqcomm.c:963)
> by 0x4A3EAA: pq_getbyte (pqcomm.c:1006)
> by 0x627AD0: SocketBackend (postgres.c:339)
> by 0x628032: ReadCommand (postgres.c:512)
> by 0x62D243: PostgresMain (postgres.c:4086)
> by 0x581240: BackendRun (postmaster.c:4412)
> by 0x5808BE: BackendStartup (postmaster.c:4084)
> by 0x57CA20: ServerLoop (postmaster.c:1757)
> Address 0x0 is not stack'd, malloc'd or (recently) free'd

Yea, this one is just valgrind not really understanding the syscall.

It's been fixed in git's source tree:

commit 3ac87cf9277964802ddd9af9747a10ff0b838c29
Author: Mark Wielaard 
Date:   2017-06-17 13:49:22 +

epoll_pwait can have a NULL sigmask.

we just need a new valgrind release.


- Andres



Re: Contention preventing locking

2018-02-20 Thread Konstantin Knizhnik



On 20.02.2018 19:39, Simon Riggs wrote:

On 20 February 2018 at 16:07, Konstantin Knizhnik
 wrote:


On 20.02.2018 14:26, Simon Riggs wrote:

Try locking the root tid rather than the TID, that is at least unique
per page for a chain of tuples, just harder to locate.


As far as I understand, it is necessary to traverse the whole page to locate
root tuple, isn't it?
If so, then I expect it to be too expensive operation. Scanning the whole
page on tuple update seems to be not an acceptable solution.

Probably.

It occurs to me that you can lock the root tid in index_fetch_heap().
I hear other DBMS lock via the index.

However, anything you do with tuple locking could interact badly with
heap_update and the various lock modes, so be careful.

You also have contention for heap_page_prune_opt() and with SELECTs to
consider, so I think you need to look at both page and tuple locking.



So, if I understand correctly, the primary goal of setting the tuple lock in
heapam.c is to avoid contention caused by concurrent release of all waiters.
But my transaction lock chaining patch eliminates this problem in another way.
So what about combining your patch (do not lock Snapshot.xmax) with my
xlock patch and ... completely eliminating the tuple lock in heapam?
In this case, updating a tuple will require obtaining just one heavyweight
lock.


I made such an experiment and didn't find any synchronization problems with
my pgrw test.

Performance is almost the same as with vanilla+xlock patch:

https://docs.google.com/spreadsheets/d/1QOYfUehy8U3sdasMjGnPGQJY8JiRfZmlS64YRBM0YTo/edit?usp=sharing 



I wonder why the approach with an extra tuple lock was chosen instead of
chaining transaction locks (which can be done quite easily)?

Maybe I missed something?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Contention preventing locking

2018-02-20 Thread Simon Riggs
On 20 February 2018 at 16:07, Konstantin Knizhnik
 wrote:
>
>
> On 20.02.2018 14:26, Simon Riggs wrote:
>>
>> Try locking the root tid rather than the TID, that is at least unique
>> per page for a chain of tuples, just harder to locate.
>>
> As far as I understand, it is necessary to traverse the whole page to locate
> root tuple, isn't it?
> If so, then I expect it to be too expensive operation. Scanning the whole
> page on tuple update seems to be not an acceptable solution.

Probably.

It occurs to me that you can lock the root tid in index_fetch_heap().
I hear other DBMS lock via the index.

However, anything you do with tuple locking could interact badly with
heap_update and the various lock modes, so be careful.

You also have contention for heap_page_prune_opt() and with SELECTs to
consider, so I think you need to look at both page and tuple locking.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Contention preventing locking

2018-02-20 Thread Konstantin Knizhnik



On 20.02.2018 14:26, Simon Riggs wrote:

Try locking the root tid rather than the TID, that is at least unique
per page for a chain of tuples, just harder to locate.

As far as I understand, it is necessary to traverse the whole page to
locate the root tuple, isn't it?
If so, then I expect that to be too expensive an operation. Scanning the
whole page on every tuple update does not seem to be an acceptable solution.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT

2018-02-20 Thread Matheus de Oliveira
On Tue, Feb 20, 2018 at 12:38 PM, Fabrízio de Royes Mello <
fabriziome...@gmail.com> wrote:

>
>
> ...
>
> I didn't read your patch yet but make sure to register it to the next open
> commitfest.
>

Thanks a lot Fabrízio, I've done that already [1].

Please let me know if I did something wrong, and if you see improvements on
the patch ;)


[1] https://commitfest.postgresql.org/17/1533/

Regards,
-- 
Matheus de Oliveira


Re: pg_get_functiondef forgets about most GUC_LIST_INPUT GUCs

2018-02-20 Thread Arthur Zakirov
Hello,

On Fri, Jan 12, 2018 at 10:24:40AM +0900, Michael Paquier wrote:
> OK, I can live with that. What do you think about the attached? I'll be
> happy to produce patches for back-branches as necessary. When an option
> is not found, I have made the function return 0 as value for the flags,
> which is consistent with flatten_set_variable_args(). To make things
> behave more consistently with GUC_LIST_QUOTE GUCs, it seems to me that
> those should not be quoted as well (ALTER SYSTEM shares the same
> compatibility). And attached is a patch.

Just 2 cents from me. It seems that there is a problem with extensions
GUCs. For example:

=# CREATE FUNCTION func_with_set_params() RETURNS integer
AS 'select 1;'
LANGUAGE SQL
set plpgsql.extra_errors to 'shadowed_variables';
=# SELECT pg_get_functiondef('func_with_set_params'::regproc);
pg_get_functiondef
--
 CREATE OR REPLACE FUNCTION public.func_with_set_params()+
  RETURNS integer+
  LANGUAGE sql   +
  SET "plpgsql.extra_errors" TO 'shadowed_variables' +
 AS $function$select 1;$function$+

It is good while plpgsql is loaded. But when we exit the session and try
it again in another:

=# SELECT pg_get_functiondef('func_with_set_params'::regproc);
ERROR:  unrecognized configuration parameter "plpgsql.extra_errors"
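
It starts working again only once something defines the custom variable in
the new session, e.g. (assuming the patched pg_get_functiondef):

LOAD 'plpgsql';   -- registers the plpgsql.* custom GUCs again
SELECT pg_get_functiondef('func_with_set_params'::regproc);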

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [bug fix] Produce a crash dump before main() on Windows

2018-02-20 Thread Magnus Hagander
On Tue, Feb 20, 2018 at 3:18 PM, Craig Ringer  wrote:

> On 20 February 2018 at 21:47, Magnus Hagander  wrote:
>
>>
>>
>> On Fri, Feb 16, 2018 at 8:28 AM, Tsunakawa, Takayuki <
>> tsunakawa.ta...@jp.fujitsu.com> wrote:
>>
>>> Hello,
>>>
>>> postgres.exe on Windows doesn't output a crash dump when it crashes
>>> before main() is called.  The attached patch fixes this.  I'd like this to
>>> be back-patched.  I'll add this to the next CF.
>>>
>>> The original problem happened on our customer's production system.
>>> Their application sometimes failed to connect to the database.  That was
>>> because postgres.exe crashed due to access violation (exception code
>>> C005).  But there was no crash dump, so we had difficulty in finding
>>> the cause.  The frequency was low -- about ten times during half a year.
>>>
>>> What caused the access violation was Symantec's antivirus software.  It
>>> seems that sysfer.dll of the software intercepts registry access, during C
>>> runtime library initialization,  before main() is called.  So, the direct
>>> cause of this problem is not PostgreSQL.
>>>
>>> On the other hand, it's PostgreSQL's problem that we can't get the crash
>>> dump, which makes the investigation difficult.  The cause is that
>>> postmaster calls SetErrorMode() to disable the outputing of crash dumps by
>>> WER (Windows Error Reporting).  This error mode is inherited from
>>> postmaster to its children.  If a crash happens before the child sets up
>>> the exception handler, no crash dump is produced.
>>>
>>
>> The original call to SetErrorMode() was put in there to make sure we
>> didn't show a popup message which would then make everything freeze (see
>> very old commit 27bff7502f04ee01237ed3f5a997748ae43d3a81). Doesn't this
>> turn that back on, so that if you are not actually there to monitor
>> something you can end up with stuck processes and exactly the issues we had
>> before that one?
>>
>
> Ha, I just went digging for the same.
>
> We should not disable WER when running as a service (no UI access), it
> will not display an interactive dialog.
>
> I'm not convinced we should disable it at all personally. Things have come
> a long way from drwatson.exe . Disabling WER makes it hard to debug
> postgres by installing Visual Studio Debugger as the hander (I always
> wondered why that didn't work!) and is generally just painful. It prevents
> us from collecting data via Microsoft about crashes, should we wish to do
> so. And who runs Pg on windows except as a service?!
>
>
I've seen a number of usecases where apps start it alongside the app
instead of as a service. I'm not sure how recent those apps are though, and
I'm not sure it's better than using a service in the first place (but it
does let you install things without being an admin).

We really shouldn't *break* that scenario for people. But making it work
well for the service usecase should definitely be the priority.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT

2018-02-20 Thread Fabrízio de Royes Mello
On Tue, Feb 20, 2018 at 12:10 PM, Matheus de Oliveira <
matioli.math...@gmail.com> wrote:
>
> Hi all.
>
> I attached a patch to add support for changing ON UPDATE/DELETE actions
of a constraint using ALTER TABLE ... ALTER CONSTRAINT.
>
> Besides that, there is another change in this patch on the current ALTER
CONSTRAINT about deferrability options. Previously, if the user did ALTER
CONSTRAINT without specifying an option on deferrable or initdeferred, it
was implied the default options, so this:
>
> ALTER TABLE tbl
> ALTER CONSTRAINT con_name;
>
> Was equivalent to:
>
> ALTER TABLE tbl
> ALTER CONSTRAINT con_name NOT DEFERRABLE INITIALLY IMMEDIATE;
>
> If I kept it that way, it means that changing only ON UPDATE or ON DELETE
would cause deferrability options to be changed to the default. Now, I keep
track of which options have actually been changed, so only the
actual changes are persisted.
>
> But there are two exceptions (which I think that make sense):
> 1. If the user does only `ALTER CONSTRAINT ... INITIALLY DEFERRED`, no
matter the previous value of deferrable, it will be set to true.
> 2. If the user does only `ALTER CONSTRAINT ... NOT DEFERRABLE`, no matter
the previous value of initdeferred, it will be set to false.
>
> I pondered raising an exception in the above cases instead of
forcing deferrable/initdeferred to valid values, but since the same
behavior happens on ADD CONSTRAINT, I think this way is simpler.
>
> Since I'm a newbie on PG source code, this patch seems to be a bit big
for me. So please, let me know what you think about it. Especially the
change to the Constraint struct in parsenodes.h (and the support for it in
copyfuncs.c and outfuncs.c), I'm not 100% sure that is the best way to
track if deferrability options were changed.
>
> Thanks a lot.
>

Great!

I haven't read your patch yet, but make sure to register it in the next open
commitfest.

Regards,


--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


Re: pgsql: Allow UNIQUE indexes on partitioned tables

2018-02-20 Thread Alvaro Herrera
Many thanks for reading through it!

David G. Johnston wrote:
> I found the following change to be confusing.
 [...]
> I was expecting the doc for ADD CONSTRAINT USING INDEX to note the
> limitation explicitly - in lieu of the above paragraph.

Agreed.  I moved the note to ADD CONSTRAINT and added a different one to
ADD CONSTRAINT USING INDEX.

> Also, I cannot reason out what the following limitation means:
> 
> /doc/src/sgml/ref/create_table.sgml
> +  If any partitions are in turn partitioned, all columns of each
> partition
> +  key are considered at each level below the UNIQUE
> +  constraint.

I can see that being unclear.  I tried to be very concise, to avoid
spending too many words on what is mostly a fringe feature; but that
probably didn't work very well.  Wording suggestions welcome.  What this
means is that if you create a partition that is partitioned on a column
different from its parent, then a primary key that covers the whole
hierarchy (i.e. you're not just adding a PK to the partitioned
partition) must include all partition columns, not just the upper one.

Example:

create table t (a int, b int) partition by range (a);
create table t_1 partition of t for values from (0) to (1000) partition by 
range (b);

then you may create a unique or PK constraint on t only if you include
both columns (a,b).  You may not create a PK on t (a), which is a bit
surprising since (b) is not part of the partition key of t directly,
only of t_1.

Of course, if you create a unique constraint on t_1 (i.e. it doesn't
cover all of t) then you may use (b) alone -- that's what "each level
below the UNIQUE constraint" is supposed to convey.
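
Concretely, something like this (a sketch following the example above):

-- not allowed: (a) does not include t_1's partition column (b)
alter table t add primary key (a);

-- allowed: covers the partition columns at every level
alter table t add primary key (a, b);

-- allowed: a constraint on t_1 alone only needs t_1's own partition key
alter table t_1 add unique (b);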

I have trouble coming up with a real-world example where you would run
into this limitation in practice.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Contention preventing locking

2018-02-20 Thread Konstantin Knizhnik



On 20.02.2018 16:42, Simon Riggs wrote:

On 20 February 2018 at 13:19, Konstantin Knizhnik
 wrote:


On 20.02.2018 14:26, Simon Riggs wrote:

On 15 February 2018 at 16:00, Konstantin Knizhnik
 wrote:


So in heap_acquire_tuplock all competing transactions are waiting for TID
of
the updated version. When transaction which changed this tuple is
committed,
one of the competitors will grant this lock and proceed, creating new
version of the tuple. Then all other competitors will be awaken and ...
find
out that locked tuple is not the last version any more.
Then they will locate new version and try to lock it... The more
competitors
we have, then more attempts they all have to perform (quadratic
complexity).

What about the tuple lock? You are saying that is ineffective?

src/backend/access/heap/README.tuplock


In my last mail in this thread I mentioned that an update of a tuple causes
setting of three heavy-weight locks:

1. Locking of SnapshotDirty.xmax transaction (XactLockTableWait) in
EvalPlanQualFetch
2. Tuple lock (heap_acquire_tuplock) in heap_tuple_update
3. Transaction lock (XactLockTableWait) in heap_tuple_update

So what I see in the debugger and by monitoring pg_locks is that under high
contention there are a lot of transactions waiting for SnapshotDirty.xmax.
This lock is obtained before the tuple lock, so the tuple lock cannot help in
this case.

Hmm, that situation occurs erroneously, as I previously observed
https://www.postgresql.org/message-id/CANP8%2BjKoMAyXvMO2hUqFzHX8fYSFc82q9MEse2zAEOC8vxwDfA%40mail.gmail.com

So if you apply the patch on the thread above, does performance improve?

First of all, this patch is not correct: it causes a pin/unpin buffer
imbalance and I get a lot of errors:


Update failed: ERROR:  buffer 343 is not owned by resource owner Portal

When I slightly modify it by moving ReleaseBuffer(buffer) inside the switch,
I get almost the same performance as vanilla.
And when I combine it with the pklock patch, I get performance similar to
vanilla+pklock, but with much larger dips:


https://docs.google.com/spreadsheets/d/1QOYfUehy8U3sdasMjGnPGQJY8JiRfZmlS64YRBM0YTo/edit?usp=sharing 




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: committing inside cursor loop

2018-02-20 Thread Simon Riggs
On 20 February 2018 at 14:11, Peter Eisentraut
 wrote:
> Here is a patch that allows COMMIT inside cursor loops in PL/pgSQL.  As
> alluded to in earlier threads, this is done by converting such cursors
> to holdable automatically.  A special flag "auto-held" marks such
> cursors, so we know to clean them up on exceptions.
>
> This is currently only for PL/pgSQL, but the same technique could be
> applied easily to other PLs.

Amazingly clean, looks great.

I notice that PersistHoldablePortal() does ExecutorRewind().

In most cases, the cursor loop doesn't ever rewind. So it would be
good if we could pass a parameter that skips the rewind since it will
never be needed and causes a performance hit. What I imagine is we can
just persist the as-yet unfetched portion of the cursor from the
current row onwards, rather than rewind and store the whole cursor
result.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Add missing type conversion functions for PL/Python

2018-02-20 Thread Haozhou Wang
Thank you very much for your review!

I attached a new patch with typo fixed.

Regards,
Haozhou

On Mon, Feb 19, 2018 at 2:37 PM, Anthony Bykov 
wrote:

> On Wed, 31 Jan 2018 11:57:12 +0800
> Haozhou Wang  wrote:
>
> > Hi All,
> >
> > PL/Python already has different type conversion functions to
> > convert PostgreSQL datum to Python object. However, the conversion
> > functions from Python object to PostgreSQL datum only has Boolean,
> > Bytea and String functions.
> >
> > In this patch, we add rest of Integer and Float related datatype
> > conversion functions
> > and can increase the performance of data conversion greatly especially
> > when returning a large array.
> >
> > We did a quick test about the performance of returning array in
> > PL/Python:
> >
> > The following UDF is used for test:
> >
> > ```
> > CREATE OR REPLACE FUNCTION pyoutfloat8(num int) RETURNS float8[] AS $$
> > return [x/3.0 for x in range(num)]
> > $$ LANGUAGE plpythonu;
> > ```
> >
> > The test command is
> >
> > ```
> > select count(pyoutfloat8(n));
> > ```
> >
> > The count is used for avoid large output, where n is the number of
> > element in returned array, and the size is from 1.5k to15m.
> >
> > Size of Array   | 1.5k    | 15k      | 150k      | 1.5m      | 15m
> > Origin          | 2.324ms | 19.834ms | 194.991ms | 1927.28ms | 19982.1ms
> > With this patch | 1.168ms | 3.052ms  | 21.888ms  | 213.39ms  | 2138.5ms
> >
> > All test for both PG and PL/Python are passed.
> >
> > Thanks very much.
> >
> >
>
> Hello,
> sounds like a really nice patch. I've started looking
> through the code and noticed a sort of a typos (or I just couldn't
> understand what did you mean) in comments.
>
> file "src/pl/plpython/plpy_typeio.c"
> the comment is
> * If can not convert if directly, fallback to PLyObject_ToDatum
> * to convert it
>
> Maybe it should be something like ("it" instead of second "if")
> * If can not convert it directly, fallback to PLyObject_ToDatum
> * to convert it
>
> And the same typo is repeated several times in comments.
>
> --
> Anthony Bykov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


0001-Add-missing-type-conversion-functions-for-PL-Python-v2.patch
Description: Binary data


Re: [PATCH] Add a few suppression rules for Valgrind

2018-02-20 Thread Aleksander Alekseev
Hello hackers,

> I suggest a patch that adds corresponding suppression rules to the
> src/tools/valgrind.supp file. Though I'm having difficulties
> understanding why Valgrind complains on wcstombs and thus I can't
> explain why in fact everything is OK (or it's actually not?). Hopefully
> someone in the mailing list can explain this behavior. For now I left a
> TODO mark in the comment.

For the record - here are configure flags:

```
CFLAGS="-O0" ./configure --prefix=$PGINSTALL \
--with-libxml --with-libxslt \
--with-python --enable-tap-tests --enable-cassert --enable-debug \
--enable-nls --with-perl --with-tcl --with-gssapi --with-ldap
```

The system is Arch Linux x64, GCC version is 7.3.0.


-- 
Best regards,
Aleksander Alekseev


signature.asc
Description: PGP signature


Re: NEXT VALUE FOR sequence

2018-02-20 Thread Laurenz Albe
Tom Lane wrote:
> Laurenz Albe  writes:
> > The SQL standard has the expression "NEXT VALUE FOR asequence" to do
> > what we traditionally do with "nextval('asequence')".
> > This is an attempt to implement this on top of the recently introduced
> > NextValueExpr node.
> 
> This has been proposed repeatedly, and rejected repeatedly, because in
> fact the standard's semantics for NEXT VALUE FOR are *not* like nextval().
> See SQL:2011 4.22.2 "Operations involving sequence generators":
> 
> If there are multiple instances of <next value expression>s specifying
> the same sequence generator within a single SQL-statement, all those
> instances return the same value for a given row processed by that
> SQL-statement.
> 
> This is not terribly exact --- what is a "processed row" in a join query,
> for instance?  But it's certainly not supposed to act like independent
> executions of nextval() or NextValueExpr would.  Pending somebody doing
> the legwork to produce something that at least arguably conforms to the
> spec's semantics, we've left the syntax unimplemented.

Would it be reasonable to say that any two NextValueExpr in the same
target list are "in one row"?
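
For illustration (hypothetical, since the syntax is not implemented yet),
under that reading the two expressions in the first query below would have to
return the same value in each output row, unlike the two nextval() calls in
the second:

CREATE SEQUENCE s;

-- per the spec, both columns would have to show the same value in each row
SELECT NEXT VALUE FOR s AS a, NEXT VALUE FOR s AS b
FROM generate_series(1, 3);

-- whereas nextval() is evaluated independently for each call
SELECT nextval('s') AS a, nextval('s') AS b
FROM generate_series(1, 3);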

Yours,
Laurenz Albe




[PATCH] Add a few suppression rules for Valgrind

2018-02-20 Thread Aleksander Alekseev
Hello hackers,

I decided to run the code from master branch under Valgrind and
discovered that it reports some errors.

There are multiple reports like this one (seems to be a false alarm):

```
Invalid read of size 16
   at 0x605F488: __wcsnlen_sse4_1 (in /usr/lib/libc-2.26.so)
   by 0x604F5C2: wcsrtombs (in /usr/lib/libc-2.26.so)
   by 0x5FE4C41: wcstombs (in /usr/lib/libc-2.26.so)
   by 0x6FFAFC: wchar2char (pg_locale.c:1641)
   by 0x6937A0: str_tolower (formatting.c:1599)
   by 0x6F88E8: lower (oracle_compat.c:50)
   by 0x43097D: ExecInterpExpr (execExprInterp.c:678)
   by 0x48B201: ExecEvalExpr (executor.h:286)
   by 0x48BBD6: tfuncInitialize (nodeTableFuncscan.c:369)
   by 0x48B91E: tfuncFetchRows (nodeTableFuncscan.c:299)
   by 0x48B245: TableFuncNext (nodeTableFuncscan.c:65)
   by 0x4462E6: ExecScanFetch (execScan.c:95)
Address 0x513adab0 is 7,808 bytes inside a recently re-allocated
  block of size 16,384 alloc'd
   at 0x4C2CEDF: malloc (vg_replace_malloc.c:299)
   by 0x7E2807: AllocSetAlloc (aset.c:934)
   by 0x7EF481: palloc (mcxt.c:848)
   by 0x42D2A7: ExprEvalPushStep (execExpr.c:2131)
   by 0x42D7DB: ExecPushExprSlots (execExpr.c:2286)
   by 0x42D723: ExecInitExprSlots (execExpr.c:2265)
   by 0x429B77: ExecBuildProjectionInfo (execExpr.c:370)
   by 0x44A39B: ExecAssignProjectionInfo (execUtils.c:457)
   by 0x4733FE: ExecInitNestLoop (nodeNestloop.c:308)
   by 0x4442F0: ExecInitNode (execProcnode.c:290)
   by 0x43C5EB: InitPlan (execMain.c:1043)
   by 0x43B31C: standard_ExecutorStart (execMain.c:262)
```

Also there are many reports like this one (definitely false alarm):

```
Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
at 0x60A2F90: epoll_pwait (in /usr/lib/libc-2.26.so)
by 0x5F7B24: WaitEventSetWaitBlock (latch.c:1048)
by 0x5F79E8: WaitEventSetWait (latch.c:1000)
by 0x493A79: secure_read (be-secure.c:170)
by 0x4A3DE2: pq_recvbuf (pqcomm.c:963)
by 0x4A3EAA: pq_getbyte (pqcomm.c:1006)
by 0x627AD0: SocketBackend (postgres.c:339)
by 0x628032: ReadCommand (postgres.c:512)
by 0x62D243: PostgresMain (postgres.c:4086)
by 0x581240: BackendRun (postmaster.c:4412)
by 0x5808BE: BackendStartup (postmaster.c:4084)
by 0x57CA20: ServerLoop (postmaster.c:1757)
Address 0x0 is not stack'd, malloc'd or (recently) free'd
```

I suggest a patch that adds corresponding suppression rules to the
src/tools/valgrind.supp file. Though I'm having difficulties
understanding why Valgrind complains about wcstombs, and thus I can't
explain why in fact everything is OK (or is it actually not?). Hopefully
someone on the mailing list can explain this behavior. For now I left a
TODO mark in the comment.

TWIMC the full report can be found here:

https://afiskon.ru/s/3c/736207c9d2_valgrind-2018-02-20.tgz

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index af03051260..accf989047 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,47 @@
Memcheck:Cond
fun:PyObject_Realloc
 }
+
+
+# According to the man page for epoll_pwait:
+# """
+# The sigmask argument may be specified as NULL, in which case epoll_pwait() is
+# equivalent to epoll_wait().
+# """
+# Valgrind seems to be unaware of this and complains about passing NULL as a
+# sigmask, so we need this suppression rule.
+{
+   epoll_pwait_null_sigmask
+   Memcheck:Param
+   epoll_pwait(sigmask)
+   fun:epoll_pwait
+   fun:WaitEventSetWaitBlock
+   fun:WaitEventSetWait
+
+   ...
+   fun:ServerLoop
+}
+
+
+# TODO: why is this harmless?
+{
+   wcstombs
+   Memcheck:Addr16
+   fun:__wcsnlen_sse4_1
+   fun:wcsrtombs
+   fun:wcstombs
+
+   ...
+   fun:ExecInterpExpr
+}
+
+{
+   
+   Memcheck:Cond
+   fun:__wcsnlen_sse4_1
+   fun:wcsrtombs
+   fun:wcstombs
+
+   ...
+   fun:ExecInterpExpr
+}


signature.asc
Description: PGP signature


Re: committing inside cursor loop

2018-02-20 Thread Tom Lane
Peter Eisentraut  writes:
> Here is a patch that allows COMMIT inside cursor loops in PL/pgSQL.  As
> alluded to in earlier threads, this is done by converting such cursors
> to holdable automatically.  A special flag "auto-held" marks such
> cursors, so we know to clean them up on exceptions.

I haven't really read this patch, but this bit jumped out at me:

+Inside a cursor loop, ROLLBACK is not allowed.  (That
+would have to roll back the cursor query, thus invalidating the loop.)

Say what?  That seems to translate into "we have lost the ability to
deal with errors".  I don't think this is really what people are hoping
to get out of the transactional-procedure facility.

regards, tom lane



[PATCH] btree_gin, add support for uuid, bool, name, bpchar and anyrange types

2018-02-20 Thread Matheus de Oliveira
Hi all.

Here is a patch to add support for more types on btree_gin.

I was missing the UUID type, so I added it. Since I was there, I checked all
other built-in types with B-tree but not GIN support, and the remaining
list was: uuid, bool, name, bpchar and anyrange (at least the ones that seem
to make sense to me). So I added support for all of them.

If there is any other type I missed that you would like supported, please
let me know and I can add it.
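
For example, with the patch applied a GIN index can be built directly on
these types (a sketch; table and column names are made up):

CREATE EXTENSION btree_gin;

-- hypothetical table: the new operator classes allow a GIN index
-- on uuid, bool and name columns (multicolumn GIN is supported)
CREATE TABLE items (id uuid, active bool, label name);
CREATE INDEX items_gin_idx ON items USING gin (id, active, label);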

Thanks a lot.

Regards,
-- 
Matheus de Oliveira
diff --git a/contrib/btree_gin/Makefile b/contrib/btree_gin/Makefile
index 690e1d7..d36f5ad 100644
*** a/contrib/btree_gin/Makefile
--- b/contrib/btree_gin/Makefile
***
*** 5,17  OBJS = btree_gin.o $(WIN32RES)
  
  EXTENSION = btree_gin
  DATA = btree_gin--1.0.sql btree_gin--1.0--1.1.sql btree_gin--1.1--1.2.sql \
! 	btree_gin--unpackaged--1.0.sql
  PGFILEDESC = "btree_gin - B-tree equivalent GIN operator classes"
  
  REGRESS = install_btree_gin int2 int4 int8 float4 float8 money oid \
  	timestamp timestamptz time timetz date interval \
  	macaddr macaddr8 inet cidr text varchar char bytea bit varbit \
! 	numeric enum
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
--- 5,17 
  
  EXTENSION = btree_gin
  DATA = btree_gin--1.0.sql btree_gin--1.0--1.1.sql btree_gin--1.1--1.2.sql \
! 	 btree_gin--1.2--1.3.sql btree_gin--unpackaged--1.0.sql
  PGFILEDESC = "btree_gin - B-tree equivalent GIN operator classes"
  
  REGRESS = install_btree_gin int2 int4 int8 float4 float8 money oid \
  	timestamp timestamptz time timetz date interval \
  	macaddr macaddr8 inet cidr text varchar char bytea bit varbit \
! 	numeric enum uuid name bool anyrange bpchar
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
diff --git a/contrib/btree_gin/btree_gin--1.0--1.1.sql b/contrib/btree_gin/btree_gin--1.0--1.1.sql
index dd81d27..8965247 100644
*** a/contrib/btree_gin/btree_gin--1.0--1.1.sql
--- b/contrib/btree_gin/btree_gin--1.0--1.1.sql
***
*** 3,9 
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
  \echo Use "ALTER EXTENSION btree_gin UPDATE TO '1.1'" to load this file. \quit
  
! -- macaddr8 datatype support new in 10.0.
  CREATE FUNCTION gin_extract_value_macaddr8(macaddr8, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
--- 3,9 
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
  \echo Use "ALTER EXTENSION btree_gin UPDATE TO '1.1'" to load this file. \quit
  
! -- macaddr8 datatype support new in 1.0.
  CREATE FUNCTION gin_extract_value_macaddr8(macaddr8, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
diff --git a/contrib/btree_gin/btree_gin--1.1--1.2.sql b/contrib/btree_gin/btree_gin--1.1--1.2.sql
index 2a16837..bb3e1ba 100644
*** a/contrib/btree_gin/btree_gin--1.1--1.2.sql
--- b/contrib/btree_gin/btree_gin--1.1--1.2.sql
***
*** 1,7 
  /* contrib/btree_gin/btree_gin--1.1--1.2.sql */
  
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
! \echo Use "ALTER EXTENSION btree_gin UPDATE TO '1.1'" to load this file. \quit
  
  --
  --
--- 1,7 
  /* contrib/btree_gin/btree_gin--1.1--1.2.sql */
  
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
! \echo Use "ALTER EXTENSION btree_gin UPDATE TO '1.2'" to load this file. \quit
  
  --
  --
diff --git a/contrib/btree_gin/btree_gin--1.2--1.3.sql b/contrib/btree_gin/btree_gin--1.2--1.3.sql
new file mode 100644
index 000..f7523a3
*** /dev/null
--- b/contrib/btree_gin/btree_gin--1.2--1.3.sql
***
*** 0 
--- 1,164 
+ /* contrib/btree_gin/btree_gin--1.2--1.3.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "ALTER EXTENSION btree_gin UPDATE TO '1.3'" to load this file. \quit
+ 
+ -- uuid datatype support new in 1.3.
+ CREATE FUNCTION gin_extract_value_uuid(uuid, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION gin_compare_prefix_uuid(uuid, uuid, int2, internal)
+ RETURNS int4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION gin_extract_query_uuid(uuid, internal, int2, internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE OPERATOR CLASS uuid_ops
+ DEFAULT FOR TYPE uuid USING gin
+ AS
+ OPERATOR1   <,
+ OPERATOR2   <=,
+ OPERATOR3   =,
+ OPERATOR4   >=,
+ OPERATOR5   >,
+ FUNCTION1   uuid_cmp(uuid,uuid),
+ FUNCTION2   gin_extract_value_uuid(uuid, internal),
+ FUNCTION3   gin_extract_query_uuid(uuid, internal, int2, internal, internal),
+ FUNCTION4   gin_btree_consistent(internal, int2, anyelement, int4, internal, internal),
+ FUNCTION5   gin_compare_prefix_uuid(uuid,uuid,int2, internal),
+ STORAGE uuid;
+ 
+ -- name datatype support new in 1.3.
+ CREATE FUNCTION gin_extract_value_name(name, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION gin_compare_prefix_name(name, name, int2, 

Re: [bug fix] Produce a crash dump before main() on Windows

2018-02-20 Thread Craig Ringer
On 20 February 2018 at 22:18, Craig Ringer  wrote:


> So I'm all for just removing that.
>


... but just to be clear, about -1000 on backpatching any such thing. At
most, a new GUC that defaults to the current behaviour. But I think it's
pretty niche really.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [bug fix] Produce a crash dump before main() on Windows

2018-02-20 Thread Craig Ringer
On 20 February 2018 at 21:47, Magnus Hagander  wrote:

>
>
> On Fri, Feb 16, 2018 at 8:28 AM, Tsunakawa, Takayuki <
> tsunakawa.ta...@jp.fujitsu.com> wrote:
>
>> Hello,
>>
>> postgres.exe on Windows doesn't output a crash dump when it crashes
>> before main() is called.  The attached patch fixes this.  I'd like this to
>> be back-patched.  I'll add this to the next CF.
>>
>> The original problem happened on our customer's production system.  Their
>> application sometimes failed to connect to the database.  That was because
>> postgres.exe crashed due to access violation (exception code C005).
>> But there was no crash dump, so we had difficulty in finding the cause.
>> The frequency was low -- about ten times during half a year.
>>
>> What caused the access violation was Symantec's antivirus software.  It
>> seems that sysfer.dll of the software intercepts registry access, during C
>> runtime library initialization,  before main() is called.  So, the direct
>> cause of this problem is not PostgreSQL.
>>
>> On the other hand, it's PostgreSQL's problem that we can't get the crash
>> dump, which makes the investigation difficult.  The cause is that
>> postmaster calls SetErrorMode() to disable the outputing of crash dumps by
>> WER (Windows Error Reporting).  This error mode is inherited from
>> postmaster to its children.  If a crash happens before the child sets up
>> the exception handler, no crash dump is produced.
>>
>
> The original call to SetErrorMode() was put in there to make sure we
> didn't show a popup message which would then make everything freeze (see
> very old commit 27bff7502f04ee01237ed3f5a997748ae43d3a81). Doesn't this
> turn that back on, so that if you are not actually there to monitor
> something you can end up with stuck processes and exactly the issues we had
> before that one?
>

Ha, I just went digging for the same.

We should not disable WER when running as a service (no UI access), it will
not display an interactive dialog.

I'm not convinced we should disable it at all personally. Things have come
a long way from drwatson.exe. Disabling WER makes it hard to debug
postgres by installing Visual Studio Debugger as the handler (I always
wondered why that didn't work!) and is generally just painful. It prevents
us from collecting data via Microsoft about crashes, should we wish to do
so. And who runs Pg on Windows except as a service?!

So I'm all for just removing that.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


committing inside cursor loop

2018-02-20 Thread Peter Eisentraut
Here is a patch that allows COMMIT inside cursor loops in PL/pgSQL.  As
alluded to in earlier threads, this is done by converting such cursors
to holdable automatically.  A special flag "auto-held" marks such
cursors, so we know to clean them up on exceptions.

This is currently only for PL/pgSQL, but the same technique could be
applied easily to other PLs.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 649c84f49faa3f46e546ff9eb6d1b6e98483bdb8 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Feb 2018 08:52:23 -0500
Subject: [PATCH v1] PL/pgSQL: Allow committing inside cursor loop

Previously, committing inside a cursor loop was prohibited because that
would close and remove the cursor.  To allow that, automatically convert
such cursors to holdable cursors so they survive commits.  Portals now
have a new state "auto-held", which means they have been converted
automatically from pinned.  An auto-held portal is kept on transaction
commit but is still removed on transaction abort.
---
 doc/src/sgml/plpgsql.sgml  | 35 +-
 src/backend/utils/mmgr/portalmem.c | 49 --
 src/include/utils/portal.h |  3 +
 .../plpgsql/src/expected/plpgsql_transaction.out   | 74 +-
 src/pl/plpgsql/src/pl_exec.c   |  9 +--
 src/pl/plpgsql/src/sql/plpgsql_transaction.sql | 47 ++
 6 files changed, 199 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index c1e3c6a19d..6ac72cc5ac 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -3480,8 +3480,39 @@ Transaction Management

 

-A transaction cannot be ended inside a loop over a query result, nor
-inside a block with exception handlers.
+Special considerations apply to cursor loops.  Consider this example:
+
+CREATE PROCEDURE transaction_test2()
+LANGUAGE plpgsql
+AS $$
+DECLARE
+r RECORD;
+BEGIN
+FOR r IN SELECT * FROM test2 ORDER BY x LOOP
+INSERT INTO test1 (a) VALUES (r.x);
+COMMIT;
+END LOOP;
+END;
+$$;
+
+CALL transaction_test2();
+
+Normally, cursors are automatically closed at transaction commit.
+However, a cursor created as part of a loop like this is automatically
+converted to a holdable cursor by the first COMMIT.
+That means that the cursor is fully evaluated at the first
+COMMIT rather than row by row.  The cursor is still
+removed automatically after the loop, so this is mostly invisible to the
+user.
+   
+
+   
+Inside a cursor loop, ROLLBACK is not allowed.  (That
+would have to roll back the cursor query, thus invalidating the loop.)
+   
+
+   
+A transaction cannot be ended inside a block with exception handlers.

   
 
diff --git a/src/backend/utils/mmgr/portalmem.c 
b/src/backend/utils/mmgr/portalmem.c
index 75a6dde32b..983a6d4436 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -648,9 +648,10 @@ PreCommit_Portals(bool isPrepare)
 
/*
 * There should be no pinned portals anymore. Complain if 
someone
-* leaked one.
+* leaked one. Auto-held portals are allowed; we assume that 
whoever
+* pinned them is managing them.
 */
-   if (portal->portalPinned)
+   if (portal->portalPinned && !portal->autoHeld)
elog(ERROR, "cannot commit while a portal is pinned");
 
/*
@@ -766,9 +767,10 @@ AtAbort_Portals(void)
MarkPortalFailed(portal);
 
/*
-* Do nothing else to cursors held over from a previous 
transaction.
+* Do nothing else to cursors held over from a previous 
transaction,
+* except auto-held ones.
 */
-   if (portal->createSubid == InvalidSubTransactionId)
+   if (portal->createSubid == InvalidSubTransactionId && 
!portal->autoHeld)
continue;
 
/*
@@ -834,8 +836,11 @@ AtCleanup_Portals(void)
if (portal->status == PORTAL_ACTIVE)
continue;
 
-   /* Do nothing to cursors held over from a previous transaction 
*/
-   if (portal->createSubid == InvalidSubTransactionId)
+   /*
+* Do nothing to cursors held over from a previous transaction, 
except
+* auto-held ones.
+*/
+   if (portal->createSubid == InvalidSubTransactionId && 
!portal->autoHeld)
{
Assert(portal->status != PORTAL_ACTIVE);
Assert(portal->resowner == NULL);
@@ -1182,3 +1187,35 @@ ThereArePinnedPortals(void)
 
return false;
 }

Re: [bug fix] Produce a crash dump before main() on Windows

2018-02-20 Thread Magnus Hagander
On Fri, Feb 16, 2018 at 8:28 AM, Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> Hello,
>
> postgres.exe on Windows doesn't output a crash dump when it crashes before
> main() is called.  The attached patch fixes this.  I'd like this to be
> back-patched.  I'll add this to the next CF.
>
> The original problem happened on our customer's production system.  Their
> application sometimes failed to connect to the database.  That was because
> postgres.exe crashed due to access violation (exception code C005).
> But there was no crash dump, so we had difficulty in finding the cause.
> The frequency was low -- about ten times during half a year.
>
> What caused the access violation was Symantec's antivirus software.  It
> seems that sysfer.dll of the software intercepts registry access, during C
> runtime library initialization,  before main() is called.  So, the direct
> cause of this problem is not PostgreSQL.
>
> On the other hand, it's PostgreSQL's problem that we can't get the crash
> dump, which makes the investigation difficult.  The cause is that
> postmaster calls SetErrorMode() to disable the outputing of crash dumps by
> WER (Windows Error Reporting).  This error mode is inherited from
> postmaster to its children.  If a crash happens before the child sets up
> the exception handler, no crash dump is produced.
>

The original call to SetErrorMode() was put in there to make sure we didn't
show a popup message which would then make everything freeze (see very old
commit 27bff7502f04ee01237ed3f5a997748ae43d3a81). Doesn't this turn that
back on, so that if you are not actually there to monitor something you can
end up with stuck processes and exactly the issues we had before that one?

It might still be useful in debugging, but in that case we might need to
make it a configurable switch?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Contention preventing locking

2018-02-20 Thread Simon Riggs
On 20 February 2018 at 13:19, Konstantin Knizhnik
 wrote:
>
>
> On 20.02.2018 14:26, Simon Riggs wrote:
>>
>> On 15 February 2018 at 16:00, Konstantin Knizhnik
>>  wrote:
>>
>>> So in heap_acquire_tuplock all competing transactions are waiting for TID
>>> of
>>> the updated version. When transaction which changed this tuple is
>>> committed,
>>> one of the competitors will grant this lock and proceed, creating new
>>> version of the tuple. Then all other competitors will be awaken and ...
>>> find
>>> out that locked tuple is not the last version any more.
>>> Then they will locate new version and try to lock it... The more
>>> competitors
>>> we have, then more attempts they all have to perform (quadratic
>>> complexity).
>>
>> What about the tuple lock? You are saying that is ineffective?
>>
>> src/backend/access/heap/README.tuplock
>
>
> In my last mail ni this thread I have mentioned that update of tuple cause
> setting of three heavy weight locks:
>
> 1. Locking of SnapshotDirty.xmax transaction (XactLockTableWait) in
> EvalPlanQualFetch
> 2. Tuple lock (heap_acquire_tuplock) in heap_tuple_update
> 3. Transaction lock (XactLockTableWait) in heap_tuple_update
>
> So what I see in debugger and monitoring pg_locks, is that under high
> contention there are a lot transactions waiting for SnapshotDirty.xmax.
> This lock is obtained before tuple lock, so tuple lock can not help in this
> case.

Hmm, that situation occurs erroneously, as I previously observed
https://www.postgresql.org/message-id/CANP8%2BjKoMAyXvMO2hUqFzHX8fYSFc82q9MEse2zAEOC8vxwDfA%40mail.gmail.com

So if you apply the patch on the thread above, does performance improve?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Contention preventing locking

2018-02-20 Thread Konstantin Knizhnik



On 20.02.2018 14:26, Simon Riggs wrote:

On 15 February 2018 at 16:00, Konstantin Knizhnik
 wrote:


So in heap_acquire_tuplock all competing transactions are waiting for TID of
the updated version. When transaction which changed this tuple is committed,
one of the competitors will grant this lock and proceed, creating new
version of the tuple. Then all other competitors will be awaken and ... find
out that locked tuple is not the last version any more.
Then they will locate new version and try to lock it... The more competitors
we have, then more attempts they all have to perform (quadratic complexity).

What about the tuple lock? You are saying that is ineffective?

src/backend/access/heap/README.tuplock


In my last mail in this thread I mentioned that an update of a tuple
causes setting of three heavy-weight locks:


1. Locking of SnapshotDirty.xmax transaction (XactLockTableWait) in 
EvalPlanQualFetch

2. Tuple lock (heap_acquire_tuplock) in heap_tuple_update
3. Transaction lock (XactLockTableWait) in heap_tuple_update

So what I see in the debugger and by monitoring pg_locks is that under high
contention there are a lot of transactions waiting for SnapshotDirty.xmax.
This lock is obtained before the tuple lock, so the tuple lock cannot help in
this case.






My idea was that we can avoid such performance degradation if we somehow
queue competing transactions so that they will get control one-by-one,
without doing useless work.
First of all I tried to replace TID  lock with PK (primary key) lock. Unlike
TID, PK of record  is not changed during hot update. The second idea is that
instead immediate releasing lock after update we can hold it until the end
of transaction. And this optimization really provides improve of performance
- it corresponds to pg_f1_opt configuration at the attached diagram.
For some workloads it provides up to two times improvement comparing with
vanilla Postgres. But there are many problems with correct (non-prototype)
implementation of this approach:
handling different types of PK, including compound keys, PK updates,...

Try locking the root tid rather than the TID, that is at least unique
per page for a chain of tuples, just harder to locate.



But locking the primary key is more efficient, isn't it?
The problem with the primary key is that it doesn't always exist or can be
compound, but if it exists and has an integer type, then it can be obtained
more easily than the root tid.
For the pgrw benchmark, the primary key lock provides even better results than
transaction lock chaining:


https://docs.google.com/spreadsheets/d/1QOYfUehy8U3sdasMjGnPGQJY8JiRfZmlS64YRBM0YTo/edit?usp=sharing

But for the YCSB benchmark, the xlock optimization is more efficient:

https://docs.google.com/spreadsheets/d/136BfsABXEbzrleZHYoli7XAyDFUAxAXZDHl5EvJR_0Q/edit?usp=sharing




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread Petr Jelinek
On 20/02/18 07:42, Andres Freund wrote:
> Hi,
> 
> On 2018-02-17 00:23:40 +0100, Tomas Vondra wrote:
>> Anyway, I consider the performance to be OK. But perhaps Andres could
>> comment on this too, as he requested the benchmarks.
> 
> My performance concerns were less about CREATE TABLE related things than
> about analytics workloads or such, where deforming is the primary
> bottleneck.  I think it should be ok, but doing a before/after tpc-h of
> scale 5-10 or so wouldn't be a bad thing to verify.
> 

The test Tomas is doing is an analytical query; it's running sum() on the new
fast-default column.

He uses the create and create-alter names to compare the case where the
table was created with the columns against the case where the columns were
added using the fast default.
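
Roughly, the two cases being compared look like this (a sketch; the actual
benchmark scripts may differ in defaults and column count):

-- create.sql: all columns declared at CREATE TABLE time
CREATE TABLE t (c1 int DEFAULT 1, c2 int DEFAULT 2 /* ... up to c1000 */);

-- create-alter.sql: columns added afterwards, so they take the fast-default path
CREATE TABLE t ();
ALTER TABLE t ADD COLUMN c1 int DEFAULT 1;
-- ... repeated up to c1000

-- query.sql: an analytical query that forces deforming of the added column
SELECT sum(c1000) FROM t;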

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: heap_lock_updated_tuple_rec can leak a buffer refcount

2018-02-20 Thread Amit Kapila
On Tue, Feb 13, 2018 at 10:11 AM, Amit Kapila  wrote:
> It seems to me that heap_lock_updated_tuple_rec can lead to a buffer
> refcount leak while locking an updated tuple by an aborted
> transaction.  In commit - 5c609a74, we have added the code to deal
> with aborted transactions as below:
>
> heap_lock_updated_tuple_rec()
> {
> ..
>
> if (PageIsAllVisible(BufferGetPage(buf)))
> visibilitymap_pin(rel, block, &vmbuffer);
> else
> vmbuffer = InvalidBuffer;
>
> LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> ..
> -- below code is added by commit -5c609a74 ---
> if (TransactionIdDidAbort(HeapTupleHeaderGetXmin(mytup.t_data)))
> {
> UnlockReleaseBuffer(buf);
> return HeapTupleMayBeUpdated;
> }
> -
>
> I think the above code forgets to deal with vmbuffer and can lead to a
> leak of the same.  Attached patch ensures that it deals with vmbuffer
> when required.
>

Registered the patch for next CF:
https://commitfest.postgresql.org/17/1531/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] why not parallel seq scan for slow functions

2018-02-20 Thread Amit Kapila
On Mon, Feb 19, 2018 at 9:56 AM, Ashutosh Bapat
 wrote:
> On Mon, Feb 19, 2018 at 9:35 AM, Ashutosh Bapat
>  wrote:
>> On Sat, Feb 17, 2018 at 8:17 AM, Amit Kapila  wrote:
>>> On Fri, Feb 16, 2018 at 9:29 AM, Ashutosh Bapat
>>>  wrote:
 On Thu, Feb 15, 2018 at 7:47 PM, Amit Kapila  
 wrote:
> On Thu, Feb 15, 2018 at 4:18 PM, Ashutosh Bapat
>  wrote:
>> I happened to look at the patch for something else. But here are some
>> comments. If any of those have been already discussed, please feel
>> free to ignore. I have gone through the thread cursorily, so I might
>> have missed something.
>>
>> In grouping_planner() we call query_planner() first which builds the
>> join tree and creates paths, then calculate the target for scan/join
>> rel which is applied on the topmost scan join rel. I am wondering
>> whether we can reverse this order to calculate the target list of
>> scan/join relation and pass it to standard_join_search() (or the hook)
>> through query_planner().
>>
>
> I think the reason for doing in that order is that we can't compute
> target's width till after query_planner().  See the below note in
> code:
>
> /*
> * Convert the query's result tlist into PathTarget format.
> *
> * Note: it's desirable to not do this till after query_planner(),
> * because the target width estimates can use per-Var width numbers
> * that were obtained within query_planner().
> */
> final_target = create_pathtarget(root, tlist);
>
> Now, I think we can try to juggle the code in a way that the width can
> be computed later, but the rest can be done earlier.

 /* Convenience macro to get a PathTarget with valid cost/width fields */
 #define create_pathtarget(root, tlist) \
 set_pathtarget_cost_width(root, make_pathtarget_from_tlist(tlist))
 create_pathtarget already works that way. We will need to split it.

 Create the Pathtarget without widths. Apply width estimates once we
 know the width of Vars somewhere here in query_planner()
 /*
  * We should now have size estimates for every actual table involved in
  * the query, and we also know which if any have been deleted from the
  * query by join removal; so we can compute total_table_pages.
  *
  * Note that appendrels are not double-counted here, even though we 
 don't
  * bother to distinguish RelOptInfos for appendrel parents, because the
  * parents will still have size zero.
  *
 The next step is building the join tree. Set the pathtarget before that.

> However, I think
> that will be somewhat major change

 I agree.

>  still won't address all kind of
> cases (like for ordered paths) unless we can try to get all kind of
> targets pushed down in the call stack.

 I didn't understand that.

>>>
>>> The places where we use a target different than the target which is
>>> pushed via query planner can cause a similar problem (ex. see the
>>> usage of adjust_paths_for_srfs) because the cost of that target
>>> wouldn't be taken into consideration for Gather's costing.
>>>
>>
>> Right. Right now apply_projection_to_path() or adjust_paths_for_srfs()
>> do not take into consideration the type of path whose cost is being
>> adjusted for the new targetlist. That works for most of the paths but
>> not all the paths like custom, FDW or parallel paths. The idea I am
>> proposing is to compute final targetlist before query planner so that
>> it's available when we create paths for the topmost scan/join
>> relation. That way, any path creation logic can then take advantage of
>> this list to compute costs, not just parallel paths.
>
> In fact, we should do this not just for scan/join relations, but all
> the upper relations as well. Upper relations too create parallel
> paths, whose costs need to be adjusted after their targetlists are
> updated by adjust_paths_for_srfs(). Similar adjustments are needed for
> any FDWs, custom paths which cost targetlists differently.
>

I think any such change in the planner can be quite tricky and can lead to
a lot of work.  I am not denying that it is possible to think
along the lines you are suggesting, but OTOH, I don't see it as a
realistic approach for this patch, where we can deal with the majority
of cases with a much smaller patch.  In future, if you or someone else
has a patch along those lines for some other purpose (assuming it is
feasible to do so, which I am not completely sure about), then we can
adjust things for parallel paths as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread David Rowley
On 20 February 2018 at 23:10, David Rowley  wrote:
> Nevertheless, it would be interesting to see how much a bsearch in
> getmissingattr would help Tomas' case. Though, perhaps you're happy
> enough with the performance already.

I thought I'd give this a quick test using the attached (incomplete) patch.

My findings made me realise that Tomas' case actually exploited a best
case for the patch.

Tomas' query.sql is performing a sum(c1000), which is the final column
in the table. No doubt he's done this since it's the typical thing to
do to get a worst case for tuple deforming, but he might not have
realised it was the best case for your patch due to the way you're
performing the for loop in getmissingattr, starting from the final
missing value.

I noticed this when I tested the mockup bsearch code. I was surprised
to see that it actually slowed performance a little with the
sum(c1000) case.

The entire results for those using:

pgbench -n -T 60 -j 1 -c 1 -f query.sql postgres

is:

Using: select sum(c1000) from t;

v11 + bsearch + create.sql: tps = 1479.267518
v11 + bsearch + create-alter.sql: tps = 1366.885968
v11 + create.sql: tps = 1544.378091
v11 + create-alter.sql: tps = 1416.608320

(both the create.sql results should be about the same here since
they're not really exercising any new code.)

Notice that the bsearch version is slightly slower, 1366 tps vs 1416.

If I use sum(c10) instead, I get.

Using: select sum(c10) from t;

v11 + bsearch + create-alter.sql: tps = 1445.940061
v11 + bsearch + create.sql: tps = 3320.102543
v11 + create.sql: tps = 3330.131437
v11 + create-alter.sql: tps = 1398.635251

The bsearch here does help recover some performance, but it's a small
improvement on what is quite a big drop from the create.sql case.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


bsearch_getmissingattr.patch
Description: Binary data


Option to ensure monotonic timestamps

2018-02-20 Thread Brent Kerby
Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
new feature to make it possible to ensure that Postgres-generated
timestamps never decrease even if the system clock may step backwards. My
use case is that I'm implementing a form of temporal tables based on
transaction commit timestamps (as returned by pg_xact_commit_timestamp),
and to ensure the integrity of the system I need to know that the ordering
of the commit timestamps will always be consistent with the order in which
the transactions actually committed. I don't need the timestamps to be
unique; i.e., if transactions occur close together in time, then it's fine
for them to have the same timestamp -- just if the timestamps are different
then they must be in the right order. I would guess there may be other
scenarios where users may want to ensure the timestamps are monotonic, and
in general it would probably be desired for the monotonicity to apply
across all timestamps generated by a given Postgres server, not only the
commit timestamps.
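
For context, here is a minimal sketch of the usage described above (the table
is hypothetical; track_commit_timestamp must be on):

-- hypothetical history table; ordering rows by the commit timestamp of the
-- transaction that wrote them is only safe if timestamps never go backwards
CREATE TABLE history (id int, payload text);
INSERT INTO history VALUES (1, 'a');

SELECT pg_xact_commit_timestamp(xmin) AS committed_at, *
FROM history
ORDER BY committed_at;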

I'm aware of the obvious alternative, which is simply to try to configure
the system clock so that it can't go backwards (e.g., using the option
"stepback 0" for ntpd). However, in virtual environments this could
potentially be difficult to achieve in a reliable way. And in any case,
since in my application the integrity of the data history hinges on the
timestamps being monotonic, I think it makes sense that this be enforceable
on the database level.

What I propose is that we could add a boolean configuration option, say
'ensure_monotonic_timestamps', that enables the following behavior: when
GetCurrentTimestamp is called (
https://doxygen.postgresql.org/backend_2utils_2adt_2timestamp_8c.html#a9822cdf3fd41b15851c0c18ddc80143c),
before it returns, it checks whether `result` is less than what was returned
the last time (if any) GetCurrentTimestamp was called; if so, it returns the
result from the previous call (after logging a warning), otherwise it
proceeds as normal.
adding a global variable lastGetCurrentTimestamp that stores the result of
the previous call. Since GetCurrentTimestamp appears to be the source of
all of the significant system-generated timestamps, including commit
timestamps, this should produce the behavior I'm looking for.

One tricky thing is to figure out how to make this reliable even in the
situation where the database engine has to be restarted. When we're
starting up and have to initialize lastGetCurrentTimestamp, we need to make
sure to make sure we initialize it to be at least as large as the largest
previous result of GetCurrentTimestamp that made its way into the WAL
before shutdown, i.e., the largest previous result of GetCurrentTimestamp
that has the potential to be written out to tables upon recovery. What's
fuzzy to me is whether this would require writing new data to the WAL
specifically for this, or whether there are already timestamps (e.g., as
part of WAL metadata) that could serve this purpose.

Any thoughts?

- Brent Kerby


Re: Contention preventing locking

2018-02-20 Thread Simon Riggs
On 15 February 2018 at 16:00, Konstantin Knizhnik
 wrote:

> So in heap_acquire_tuplock all competing transactions are waiting for TID of
> the updated version. When transaction which changed this tuple is committed,
> one of the competitors will grant this lock and proceed, creating new
> version of the tuple. Then all other competitors will be awaken and ... find
> out that locked tuple is not the last version any more.
> Then they will locate new version and try to lock it... The more competitors
> we have, then more attempts they all have to perform (quadratic complexity).

What about the tuple lock? You are saying that is ineffective?

src/backend/access/heap/README.tuplock


> My idea was that we can avoid such performance degradation if we somehow
> queue competing transactions so that they will get control one-by-one,
> without doing useless work.
> First of all I tried to replace TID  lock with PK (primary key) lock. Unlike
> TID, PK of record  is not changed during hot update. The second idea is that
> instead immediate releasing lock after update we can hold it until the end
> of transaction. And this optimization really provides improve of performance
> - it corresponds to pg_f1_opt configuration at the attached diagram.
> For some workloads it provides up to two times improvement comparing with
> vanilla Postgres. But there are many problems with correct (non-prototype)
> implementation of this approach:
> handling different types of PK, including compound keys, PK updates,...

Try locking the root tid rather than the TID, that is at least unique
per page for a chain of tuples, just harder to locate.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Typo in procarray.c

2018-02-20 Thread Magnus Hagander
On Tue, Feb 20, 2018 at 2:47 AM, Masahiko Sawada 
wrote:

> Hi,
>
> Attached patch for fixing $subject.
>
> s/replicaton/replication/
>

Applied, thanks.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] Runtime Partition Pruning

2018-02-20 Thread Rajkumar Raghuwanshi
On Sat, Feb 17, 2018 at 2:27 PM, David Rowley 
wrote:

> Hi,
>
> I've attached an updated patch, now at v10. v9 was short lived due to
> the evolution of Amit's which which this based on.
>
> This version is based on Amit's v27 of faster partition pruning [1]
> which can be applied atop of ad7dbee36.
>

Hi,

I have applied the v10 patch on top of Amit's v27 over head ad7dbee36. With
the patch I got "ERROR: partition missing from Append subplans". On head, and
with only Amit's patches, the query works fine. Please find the test case below.

CREATE TABLE part ( c1 INT2, c2 DATE) PARTITION BY RANGE (c1);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (0) TO (141)
PARTITION BY RANGE(c2);
CREATE TABLE part_p11 PARTITION OF part_p1 FOR VALUES FROM ('1/1/1997') TO
('2/1/1999');
CREATE TABLE part_p12 PARTITION OF part_p1 FOR VALUES FROM ('2/1/1999') TO
('2/1/2000');
CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (141) TO (211)
PARTITION BY RANGE(c2);
CREATE TABLE part_p21 PARTITION OF part_p2 FOR VALUES FROM ('1/1/2000') TO
('2/1/2001');
CREATE TABLE part_p22 PARTITION OF part_p2 FOR VALUES FROM ('2/1/2001') TO
('2/1/2006');

INSERT INTO part VALUES (100,'1/1/1999');
INSERT INTO part VALUES (110,'1/1/1998');
INSERT INTO part VALUES (130,'1/1/2000');
INSERT INTO part VALUES (170,'1/1/2000');
INSERT INTO part VALUES (180,'1/1/2001');
INSERT INTO part VALUES (190,'1/1/2006');
INSERT INTO part VALUES (200,'1/1/2000');
INSERT INTO part VALUES (210,'1/1/2002');

postgres=# PREPARE RTP AS SELECT * FROM PART WHERE c2 BETWEEN '1/1/1998'
AND '1/1/1999';
PREPARE
postgres=# EXPLAIN execute RTP;
 QUERY
PLAN
-
 Append  (cost=0.00..46.00 rows=12 width=6)
   ->  Seq Scan on part_p11  (cost=0.00..46.00 rows=12 width=6)
 Filter: ((c2 >= '1998-01-01'::date) AND (c2 <= '1999-01-01'::date))
(3 rows)

postgres=# execute RTP;
*ERROR:  partition missing from Append subplans*

deallocate RTP;
DROP TABLE part;

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: ALTER TABLE ADD COLUMN fast default

2018-02-20 Thread David Rowley
Thanks for making those updates.
On 20 February 2018 at 19:33, Andrew Dunstan
 wrote:
> On Mon, Feb 19, 2018 at 1:18 PM, David Rowley
>  wrote:
>> Since the attnum order in the missing values appears to be well
>> defined in ascending order, you can also simplify the comparison logic
>> in equalTupleDescs(). The inner-most nested loop shouldn't be
>> required.
>
> Well, this comment in the existing code in equalTupleDescs() suggests
> that in fact we can't rely on the attrdef items being in attnum order:
>
> /*
>  * We can't assume that the items are always read from the system
>  * catalogs in the same order; so use the adnum field to identify
>  * the matching item to compare.
>  */

On closer look, perhaps the reason that comment claims the order is
not well defined is that the use of the index depends on the value of
criticalRelcachesBuilt in the following:

pg_attribute_scan = systable_beginscan(pg_attribute_desc,
   AttributeRelidNumIndexId,
   criticalRelcachesBuilt,
   NULL,
   2, skey);

Otherwise, I see no reason for them to be out of order, since if I grep
for instances of "->missing = " I only see the place where the missing
attr details are set in RelationBuildTupleDesc() and one other place
in CreateTupleDescCopyConstr(), where it's simply copied via
memcpy().

Nevertheless, it would be interesting to see how much a bsearch in
getmissingattr would help Tomas' case. Though, perhaps you're happy
enough with the performance already.

I'll make another pass over the updated code soon to see if I can see
anything else.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: unique indexes on partitioned tables

2018-02-20 Thread Amit Langote
Hi.

On 2018/02/20 5:45, Alvaro Herrera wrote:
> I pushed this now, with fixes for the last few comments there were.

I noticed with the commit that, while ON CONFLICT (conflict_target) DO
UPDATE gives a less surprising error message by catching it in the parser,
ON CONFLICT (conflict_target) DO NOTHING will go into the executor without
the necessary code to handle the case.  Example:

create table p (a int primary key, b text) partition by list (a);
create table p12 partition of p for values in (1, 2);
create table p3 partition of p (a unique) for values in (3);

insert into p values (1, 'a') on conflict (a) do nothing;
ERROR:  unexpected failure to find arbiter index

Attached is a patch to fix that.  Actually, there are two -- one that
adjusts the partitioned table tests in insert_conflict.sql to have a
partitioned unique index and another that fixes the code.

I suppose we'd need to apply this temporarily until we fix the ON CONFLICT
(conflict_target) case to be able to use partitioned indexes.

Thanks,
Amit
From c5536f89c6723db7b2af8d93ab544254763b522a Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 20 Feb 2018 18:02:30 +0900
Subject: [PATCH v1 1/2] Adjust partitioned table tests in insert_conflict.sql

---
 src/test/regress/expected/insert_conflict.out | 24 +---
 src/test/regress/sql/insert_conflict.sql  | 16 +---
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 2650faedee..a46fe7ec60 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -788,14 +788,24 @@ select * from selfconflict;
 drop table selfconflict;
 -- check that the following works:
 -- insert into partitioned_table on conflict do nothing
-create table parted_conflict_test (a int, b char) partition by list (a);
-create table parted_conflict_test_1 partition of parted_conflict_test (b unique) for values in (1);
-insert into parted_conflict_test values (1, 'a') on conflict do nothing;
-insert into parted_conflict_test values (1, 'a') on conflict do nothing;
+create table parted_conflict_test (a int unique, b char) partition by list (a);
+create table parted_conflict_test_1 partition of parted_conflict_test (b unique) for values in (1, 2);
+insert into parted_conflict_test values (1, 'a') on conflict (a) do nothing;
+ERROR:  unexpected failure to find arbiter index
+insert into parted_conflict_test values (1, 'a') on conflict (a) do nothing;
+ERROR:  unexpected failure to find arbiter index
 -- however, on conflict do update is not supported yet
-insert into parted_conflict_test values (1) on conflict (b) do update set a = excluded.a;
+insert into parted_conflict_test values (1, 'a') on conflict (b) do update set a = excluded.a;
+ERROR:  ON CONFLICT DO UPDATE cannot be applied to partitioned table "parted_conflict_test"
+insert into parted_conflict_test values (1, 'a') on conflict (a) do update set b = excluded.b;
 ERROR:  ON CONFLICT DO UPDATE cannot be applied to partitioned table "parted_conflict_test"
 -- but it works OK if we target the partition directly
-insert into parted_conflict_test_1 values (1) on conflict (b) do
-update set a = excluded.a;
+insert into parted_conflict_test_1 values (2, 'a') on conflict (b) do update set a = excluded.a;
+insert into parted_conflict_test_1 values (2, 'b') on conflict (a) do update set b = excluded.b;
+select * from parted_conflict_test order by a;
+ a | b 
+---+---
+ 2 | b
+(1 row)
+
 drop table parted_conflict_test;
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index 32c647e3f8..97affc3726 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -474,13 +474,15 @@ drop table selfconflict;
 
 -- check that the following works:
 -- insert into partitioned_table on conflict do nothing
-create table parted_conflict_test (a int, b char) partition by list (a);
-create table parted_conflict_test_1 partition of parted_conflict_test (b unique) for values in (1);
-insert into parted_conflict_test values (1, 'a') on conflict do nothing;
-insert into parted_conflict_test values (1, 'a') on conflict do nothing;
+create table parted_conflict_test (a int unique, b char) partition by list (a);
+create table parted_conflict_test_1 partition of parted_conflict_test (b unique) for values in (1, 2);
+insert into parted_conflict_test values (1, 'a') on conflict (a) do nothing;
+insert into parted_conflict_test values (1, 'a') on conflict (a) do nothing;
 -- however, on conflict do update is not supported yet
-insert into parted_conflict_test values (1) on conflict (b) do update set a = excluded.a;
+insert into parted_conflict_test values (1, 'a') on conflict (b) do update set a = excluded.a;
+insert into parted_conflict_test values (1, 'a') on conflict (a) do update set 

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-02-20 Thread Ashutosh Bapat
On Tue, Feb 13, 2018 at 6:21 PM, Ashutosh Bapat
 wrote:
>
> 1. Push down ConvertRowtypeExpr and include it in the pushed down targetlist.
> This would solve both the problems described above. Both set_plan_ref() and
> get_relation_column_alias_ids() will find ConvertRowtypeExpr, they are looking
> for and won't throw an error.
>
> This requires two parts
> a. In build_tlist_to_deparse(), instead of pulling
> Var node from ConvertRowtypeExpr, we pull whole ConvertRowtypeExpr and include
> it in the targetlist being deparsed which is also used as to set
> fdw_scan_tlist. In order to pull any ConvertRowtypeExpr's in the local quals,
> which may be hidden in the expression tree, we will add two more options to
> flags viz. PVC_INCLUDE_CONVERTROWTYPEEXPR and PVC_RECURSE_CONVERTROWTYPEEXPR.
> Unlike the other PVC_* options, which do not default to a value, we may want 
> to
> default to PVC_RECURSE_CONVERTROWTYPEEXPR if nothing is specified. That will
> avoid, possibly updating every pull_var_clause call with
> PVC_RECURSE_CONVERTROWTYPEEXPR.
> b. deparse ConvertRowtypeExpr
> For this we need to get the conversion map between the parent and child. We
> then deparse ConvertRowtypeExpr as a ROW() with the attributes of child
> rearranged per the conversion map. A multi-level partitioned table will have
> nested ConvertRowtypeExpr. To deparse such expressions, we need to find the
> conversion map between the topmost parent and the child, by ignoring any
> intermediate parents.
>
>

Here's patchset implementing this solution.

0001 adds PVC_*_CONVERTROWTYPEEXPR to pull_var_clause() and adjusts its callers.

0002 fixes a similar bug for regular partitioned tables. The patch has a
testcase; the commit message explains the bug in more detail.

0003 has postgres_fdw fixes as outlined above with tests.
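
To illustrate the deparsing in (b) above, a made-up example (table names,
column order and the exact output are hypothetical, not taken from the
patches; in the postgres_fdw case the partition would be a foreign table,
a plain table is used here only to keep the sketch short):

create table parent (a int, b text) partition by range (a);
create table child1 (b text, a int);  -- physical column order differs
alter table parent attach partition child1 for values from (0) to (100);

-- A whole-row reference to child1 converted to parent's rowtype, i.e. the
-- ConvertRowtypeExpr over child1, would be deparsed for the remote server
-- as a ROW() listing child1's columns in the parent's attribute order,
-- roughly ROW(a, b).  For a multi-level hierarchy the conversion map is
-- taken between the topmost parent and the leaf, skipping any
-- intermediate parents.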

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


expr_ref_error_pwj.tar.gz
Description: GNU Zip compressed data


Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-20 Thread Masahiko Sawada
On Fri, Feb 16, 2018 at 5:00 AM, Claudio Freire  wrote:
> On Thu, Feb 15, 2018 at 4:47 PM, Claudio Freire  
> wrote:
>> On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada  
>> wrote:
>
 The final FSM vacuum pass isn't partial, to finally correct all those
 small inconsistencies.
>>>
>>> Yes, but the purpose of this patch is to prevent table bloating from
>>> concurrent modifications?
>>
>> Yes, by *marking* unmarked space. If the slot is above the threshold,
>> it means there's already marked free space on that branch, and search
>> will go into it already, so no pressing need to refine the mark. The
>> space already marked can serve while vacuum makes further progress.
>> Ie: it can wait.
>
> Although... I think I see what you mean.
>
> If you have this:
>
> 255
> .   0
> .   .   0
> .   .   255
> .   0
> .   .   64
> .   .   64
> .   0
> .   .   0
> .   .   64
>
>
> A partial vacuum won't even recurse beyond the root node, so it won't
> expose the free space 2 levels down.

Yeah, that's what I meant. I think this can happen somewhat more easily
if a table has fillfactor < 100 -- for example, when tuples on pages that
are almost packed (apart from the fillfactor reserve) are updated with
larger values.
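
A rough sketch of the kind of workload I mean (hypothetical, I have not
measured whether this actually reproduces the case):

create table t (id int primary key, v text) with (fillfactor = 90);
-- fill the table so each page is about 90% full
insert into t select g, repeat('x', 100) from generate_series(1, 100000) g;
-- enlarge some rows; new versions that no longer fit on their original
-- page move elsewhere in the table
update t set v = repeat('y', 2000) where id % 10 = 0;
-- vacuum records the freed space in the FSM leaf pages; the question is
-- when that space becomes discoverable from the upper FSM levels
vacuum t;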

>
> This could be arrived at by having an almost fully packed table that
> contains some free space near the end, and it gets deletions near the
> beginning. Vacuum will mark the leaf levels at the beginning of the
> table, and we'd like for that free space to be discoverable to avoid
> packing more rows near the end where they prevent truncation.
>
> The patch as it is now will only vacuum that part of the FSM when the
> root value drops, which usually requires all the free space on that
> region of the heap to be exhausted.
>
> With the current recursive FSM vacuum method, however, that's
> unavoidable. We can't detect that there's some free space below the
> current level to be exposed without recursing into the tree, and
> recursing is exactly the kind of work we want to prevent with tree
> pruning.
>
> In all my trials, however, I did not run into that case. It must not
> be easy to get into that situation, because the root node already has
> ~4000 slots in it. Each of those 4000 slots should be in that
> situation to keep the space at the end of the table as the sole
> discoverable free space, which seems like a rather contorted worst
> case.

Okay, I see that this patch cannot help in the case where the free space
of pages recorded in the FSM is not exposed by vacuum because the
upper-level node doesn't reflect the value on the lower page. However, as
you said, that doesn't happen often, and even when it does we are no worse
off than we are today. So I also think it would not become a problem.

I'll review v4 patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center