Re: [HACKERS] Crash on promotion when recovery.conf is renamed

2016-12-19 Thread Michael Paquier
On Sat, Dec 17, 2016 at 9:23 PM, Magnus Hagander  wrote:
> On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier 
> wrote:
>> Looking at PrescanPreparedTransactions(), I am thinking as well that it
>> would
>> be better to get a hard failure when bumping on a corrupted 2PC file.
>> Future
>> files are one thing, but corrupted files should be treated more carefully.
>
>
> Again without looking at it, I agree (so much easier that way :P). Ignoring
> corruption is generally a bad idea. Failing hard makes the user notice the
> error, and makes it possible to initiate recovery from a standby or from
> backups or something, or to *intentionally* remove/clear/ignore it.

And I am finishing with the two patches attached:
- 0001 changes the 2PC checks so that corrupted entries cause a FATAL error.
PrescanPreparedTransactions() is used when a hot standby is initialized.
In this case a failure protects the range of XIDs generated,
potentially preventing data corruption. At the end of recovery,
this is done before any on-disk actions are taken.
- 0002 is the patch that Heikki sent previously to minimize the
window between writing the end-of-recovery record and archiving the
timeline history file.

I am attaching that to the next CF.
-- 
Michael
From 0e816454c57b851f4aa2743ea776b1e604a27d77 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 20 Dec 2016 16:47:35 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the new timeline number get
booked earlier, because the history file is archived immediately. This
way the other nodes are aware that this new timeline number is taken and
will not assign it to themselves.

The window between archiving the history file and writing the
end-of-recovery record cannot be reduced to zero, but this way it is
minimized as much as possible. The new order of actions also prevents a
corrupted data directory on failure.
---
 src/backend/access/transam/xlog.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ea2c523e6e..f23feb47e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7210,6 +7210,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if it
+		 * was active, and make a writable copy of the last WAL segment. (Note
+		 * that we also have a copy of the last block of the old WAL in readBuf;
+		 * we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 			 EndRecPtr, reason);
 	}
@@ -7219,15 +7237,6 @@ StartupXLOG(void)
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
 	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
 	 * Prepare to write WAL starting at EndOfLog position, and init xlog
 	 * buffer cache using the block containing the last record from the
 	 * previous incarnation.
-- 
2.11.0

From 4037db778a4353fc79b15eb683a5b9e3f648db24 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 20 Dec 2016 16:41:44 +0900
Subject: [PATCH 1/2] Report corrupted 2PC files as FATAL errors

When scanning 2PC files, be it for initializing a hot standby node or
for finding the XID range at the end of recovery, a corrupted 2PC file
is currently reported as a WARNING and then blindly discarded. If the
corrupted file was actually legit, there is a real risk of data
corruption, so switch that to a hard failure.
---
 src/backend/access/transam/twophase.c | 23 ---
 src/backend/access/transam/xlog.c | 10 +++---
 2 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604993..9cbcb83062 100644
--- a/src/backend/access/transam/twophase.c
+++ 

Re: [HACKERS] increasing the default WAL segment size

2016-12-19 Thread Beena Emerson
Hello all,

On Mon, Dec 19, 2016 at 3:14 PM, Beena Emerson 
wrote:

> Hello all,
>
> Please find attached a patch to make wal segment size initdb configurable.
>
> The attached patch removes the --with-wal-segsize configure option and adds a
> new initdb option --wal-segsize. initdb passes the wal-segsize value in an
> environment variable which is used to overwrite the GUC value
> wal_segment_size and set the internal variables XLogSegSize and
> XLOG_SEG_SIZE (xlog_internal.h). The default wal_segment_size is not
> changed but I have increased the maximum size to 1GB.
>
> Since XLOG_SEG_SIZE is now variable, it can no longer be used directly in
> src/bin modules and a few macros, so a few changes had to be made:
>   - In guc.c, remove GUC_UNIT_XSEGS, which used XLOG_SEG_SIZE, and
> introduce show functions for the GUCs that used that unit (min_wal_size
> and max_wal_size).
>   - For pg_basebackup, add a new replication command SHOW_WAL_SEGSZ to
> fetch the wal_segment_size in bytes.
>   - pg_controldata, pg_resetxlog and pg_rewind fetch the xlog_seg_size from
> the ControlFile.
>   - Since pg_xlogdump reads the WAL files themselves, it uses the file size
> to determine the xlog_seg_size (see the sketch further below).
>   - In pg_test_fsync, a buffer of size XLOG_SEG_SIZE was created, filled
> with random data and written to a temporary file to check for any
> write/fsync error before performing the tests. Since it does not affect the
> actual performance results, the XLOG_SEG_SIZE in the module is replaced
> with the default value (16MB).
>
> Please note that the documents are not updated in this patch.
>
> Feedback and suggestions are welcome.
>

This patch has been added to the commit fest
(https://commitfest.postgresql.org/12/921/).

After further testing, I found that the pg_standby contrib module does not
work with the changes. I will fix it in the next version of the patch.
Comments on the current patch are welcome.
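
As a side note on the pg_xlogdump point above: determining the segment size
from a WAL file itself essentially means checking that the file's size is a
power of two within the allowed range and then using that size. Below is a
minimal sketch of that idea in C; the names and limits are assumptions for
illustration only and are not taken from the patch.

#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

#define MIN_WAL_SEGMENT_SIZE (1024 * 1024)          /* assumed lower bound */
#define MAX_WAL_SEGMENT_SIZE (1024 * 1024 * 1024)   /* 1GB, the new maximum */

/* Infer the WAL segment size from the size of an existing segment file. */
static bool
infer_wal_segment_size(const char *path, uint32_t *seg_size)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;

    /* A plausible segment size is a power of two within the allowed range. */
    if (st.st_size < MIN_WAL_SEGMENT_SIZE ||
        st.st_size > MAX_WAL_SEGMENT_SIZE ||
        (st.st_size & (st.st_size - 1)) != 0)
        return false;

    *seg_size = (uint32_t) st.st_size;
    return true;
}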


Thank you,

Beena Emerson

Have a Great Day!


Re: [HACKERS] too low cost of Bitmap index scan

2016-12-19 Thread Pavel Stehule
2016-12-19 23:28 GMT+01:00 Robert Haas :

> On Sat, Dec 17, 2016 at 3:30 AM, Pavel Stehule 
> wrote:
> > ->  Bitmap Heap Scan on "Zasilka"  (cost=5097.39..5670.64 rows=1
> width=12)
> > (actual time=62.253..62.400 rows=3 loops=231)
> ...
> When I disable bitmap scan, the query is 6x faster
> 
> >->  Index Scan using "Zasilka_idx_Dopravce" on "Zasilka"
> > (cost=0.00..30489.04 rows=1 width=12) (actual time=15.651..17.187 rows=3
> > loops=231)
> > Index Cond: ("Dopravce" = "Dopravce_Ridic_1"."ID")
> >Filter: (("StavDatum" > (now() - '10 days'::interval)) AND
> (("Stav" =
> > 34) OR ("Stav" = 29) OR ("Stav" = 180) OR ("Stav" = 213) OR ("Stav" =
> 46) OR
> > (("Produkt" = 33) AND ("Stav" = 179)) OR ((("ZpetnaZasilka" =
> '-1'::integer)
> > OR ("PrimaZasilka" = '-1'::integer)) AND ("Stav" = 40
> > Rows Removed by Filter: 7596
>
> I'm not sure, but my guess would be that the query planner isn't
> getting a very accurate selectivity estimate for that giant filter
> condition, and that's why the cost estimate is off.
>

Maybe the operator cost is too high?

Regards

Pavel


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Logical Replication WIP

2016-12-19 Thread Erik Rijkers

On 2016-12-19 08:04, Erik Rijkers wrote:

On 2016-12-18 11:12, Petr Jelinek wrote:

(now using the latest patchset:)

0001-Add-PUBLICATION-catalogs-and-DDL-v14.patch
0002-Add-SUBSCRIPTION-catalog-and-DDL-v14.patch
0003-Define-logical-replication-protocol-and-output-plugi-v14.patch
0004-Add-logical-replication-workers-v14.patch
0005-Add-separate-synchronous-commit-control-for-logical--v14.patch

Sometimes replication (caused by a pgbench run) runs for a few
seconds replicating all 4 pgbench tables correctly, but never longer
than 10 to 20 seconds.



I've concocted pgbench_derail.sh.  It assumes 2 instances are running,
initially without the publication and subscription.


There are two separate installations, on the same machine.

To start up the two instances I use instances.sh:

# ./instances.sh
#!/bin/sh
port1=6972
port2=6973
project1=logical_replication
project2=logical_replication2
pg_stuff_dir=$HOME/pg_stuff
PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin:$PATH
PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin:$PATH
server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
data_dir1=$server_dir1/data
data_dir2=$server_dir2/data
options1="
-c wal_level=logical
-c max_replication_slots=10
-c max_worker_processes=12
-c max_logical_replication_workers=10
-c max_wal_senders=10
-c logging_collector=on
-c log_directory=$server_dir1
-c log_filename=logfile.${project1} "

options2="
-c wal_level=replica
-c max_replication_slots=10
-c max_worker_processes=12
-c max_logical_replication_workers=10
-c max_wal_senders=10
-c logging_collector=on
-c log_directory=$server_dir2
-c log_filename=logfile.${project2} "
which postgres
export PATH=$PATH1; postgres -D $data_dir1 -p $port1 ${options1} &
export PATH=$PATH2; postgres -D $data_dir2 -p $port2 ${options2} &
# end ./instances.sh







#--- pgbench_derail.sh
#!/bin/sh

# assumes both instances are running

# clear logs
# echo > $HOME/pg_stuff/pg_installations/pgsql.logical_replication/logfile.logical_replication
# echo > $HOME/pg_stuff/pg_installations/pgsql.logical_replication2/logfile.logical_replication2


port1=6972
port2=6973


function cb()
{
  #  display the 4 pgbench tables' accumulated content as md5s
  #  a,b,t,h stand for:  pgbench_accounts, -branches, -tellers, -history
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid" | psql -qtAX -p $port | md5sum | cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid" | psql -qtAX -p $port | md5sum | cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers  order by tid" | psql -qtAX -p $port | md5sum | cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history  order by aid,bid,tid" | psql -qtAX -p $port | md5sum | cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" | psql -qtAX -p $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" | psql -qtAX -p $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers"  | psql -qtAX -p $port)
    cnt_h=$(echo "select count(*) from pgbench_history"  | psql -qtAX -p $port)
    printf "$port a,b,t,h: %6d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n "   $md5_a  $md5_b  $md5_t  $md5_h"
    if   [[ $port -eq $port1 ]]; then echo "   master"
    elif [[ $port -eq $port2 ]]; then echo "   replica"
    else echo " ERROR"
    fi
  done
}


echo "
drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -X -p $port1 \
  && echo "
drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -X -p $port2 \
  && pgbench -p $port1 -qis 1 \
  && echo "alter table pgbench_history replica identity full;" | psql -1p $port1 \
  && pg_dump -F c  -p $port1 \
 -t pgbench_accounts \
 -t pgbench_branches \
 -t pgbench_tellers  \
 -t pgbench_history  \
| pg_restore -p $port2 -d testdb

echo  "$(cb)"

sleep 2

echo  "$(cb)"

echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX

echo "
create subscription sub1
  connection 'port=${port1}'
  publication pub1
  with (disabled);
alter subscription sub1 enable;
" | psql -p $port2 -aqtAX
#

# repeat a short (10 s) pgbench run to show that during such
# short runs the logical replication often remains intact.
# Longer pgbench runs always derail the logrep of one or more
# of these 4 tables
#
# bug:  pgbench_history no longer replicates
#   sometimes the other 3 tables also get de-synced.

echo  "$(cb)"
echo "-- pgbench -c 1 -T 10 -P 5 (short run, first)"
 pgbench -c 1 -T 10 -P 5
sleep 2
echo  "$(cb)"

echo "-- pgbench -c 1 -T 10 -P 5 (short run, second)"
 pgbench -c 1 -T 10 -P 5
sleep 2
echo  "$(cb)"

echo "-- pgbench -c 1 -T 120 -P 15 (long run)"
   

Re: [HACKERS] Logical decoding on standby

2016-12-19 Thread Petr Jelinek
On 07/12/16 07:05, Craig Ringer wrote:
> On 21 November 2016 at 16:17, Craig Ringer  wrote:
>> Hi all
>>
>> I've prepared a working initial, somewhat raw implementation for
>> logical decoding on physical standbys.
> 
> Hi all
> 
> I've attached a significantly revised patch, which now incorporates
> safeguards to ensure that we prevent decoding if the master has not
> retained needed catalogs and cancel decoding sessions that are holding
> up apply because they need too-old catalogs
> 
> The biggest change in this patch, and the main intrusive part, is that
> procArray->replication_slot_catalog_xmin is no longer directly used by
> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
> added, with a corresponding CheckPoint field. Vacuum notices if
> procArray->replication_slot_catalog_xmin has advanced past
> ShmemVariableCache->oldestCatalogXmin and writes a new xact rmgr
> record with the new value before it copies it to oldestCatalogXmin.
> This means that a standby can now reliably tell when catalogs are
> about to be removed or become candidates for removal, so it can pause
> redo until logical decoding sessions on the standby advance far enough
> that their catalog_xmin passes that point. It also means that if our
> hot_standby_feedback somehow fails to lock in the catalogs our slots
> need on a standby, we can cancel sessions with a conflict with
> recovery.
> 
> If wal_level is < logical this won't do anything, since
> replication_slot_catalog_xmin and oldestCatalogXmin will both always
> be 0.
> 
> Because oldestCatalogXmin advances eagerly as soon as vacuum sees the
> new replication_slot_catalog_xmin, this won't impact catalog bloat.
> 
> Ideally this mechanism won't generally actually be needed, since
> hot_standby_feedback stops the master from removing needed catalogs,
> and we make an effort to ensure that the standby has
> hot_standby_feedback enabled and is using a replication slot. We
> cannot prevent the user from dropping and re-creating the physical
> slot on the upstream, though, and it doesn't look simple to stop them
> turning off hot_standby_feedback or turning off use of a physical slot
> after creating logical slots, either. So we try to stop users shooting
> themselves in the foot, but if they do it anyway we notice and cope
> gracefully. Logging catalog_xmin also helps slots created on standbys
> know where to start, and makes sure we can deal gracefully with a race
> between hs_feedback and slot creation on a standby.
> 

Hi,

If this mechanism would not be needed most of the time, wouldn't it be
better not to have it and instead have a way to ask the physical slot
what the current reserved catalog_xmin is (in most cases the standby
should actually know what it is, since it's sending the hs_feedback, but
the first slot created on such a standby may need to check)? WRT
preventing hs_feedback from being turned off, we can refuse to start with
hs_feedback off when logical slots are detected. We can also refuse to
connect to the master without a physical slot if logical slots are
detected. I don't see a problem with either of those.

You may ask what happens if the user drops and recreates the slot or
otherwise messes up catalog_xmin on the master. Well, since under my
proposal we'd first (on connect) check the slot for catalog_xmin, we'd
know about it, so we'd either mark the local slots broken/drop them, or
plainly refuse to connect to the master, the same way as if it no longer
had the required WAL (not sure which behavior is better). Note that the
user could mess up catalog_xmin the same way in your design, so it's not
really a regression.

In general I don't think it's necessary to WAL-log anything for this. It
will not work without a slot, and therefore not via archiving anyway, so
writing to WAL does not seem to buy us anything. There are some
interesting side effects of cascading (i.e. having A->B->C replication and
creating a logical slot on C), but those should not be insurmountable.
Plus it might even be okay to only allow creating logical slots on
standbys connected directly to the master in v1.


That's about the approach, but since there are prerequisite patches in the
patchset that don't really depend on the approach, I will comment on
them as well.

0001 and 0002 add testing infrastructure and look fine to me, possibly
committable.

But in 0003 I don't understand following code:
> + if (endpos != InvalidXLogRecPtr && !do_start_slot)
> + {
> + fprintf(stderr,
> + _("%s: cannot use --create-slot or --drop-slot 
> together with --endpos\n"),
> + progname);
> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
> + progname);
> + exit(1);
> + }

Why are --create-slot and --endpos not allowed together?

0004 again looks good but depends on 0003.

0005 is timeline following which is IMHO ready for committer, as is 

Re: [HACKERS] invalid combination of options "-D - -F t -X stream" in pg_basebackup

2016-12-19 Thread Fujii Masao
On Tue, Dec 20, 2016 at 1:43 AM, Magnus Hagander  wrote:
>
>
> On Mon, Dec 19, 2016 at 5:39 PM, Fujii Masao  wrote:
>>
>> Hi,
>>
>> Isn't it better to forbid the combination of the options "-D -", "-F t"
>> and "-X stream" in pg_basebackup? This is an obviously invalid setting and
>> the docs warn about it as follows. But currently users can specify such a
>> setting and pg_basebackup can exit unexpectedly with an error.
>>
>> ---
>> If the value - (dash) is specified as target directory, the tar contents
>> will
>> be written to standard output, suitable for piping to for example gzip.
>> This is only possible if the cluster has no additional tablespaces.
>> ---
>
>
> Yes, definitely. I'd say it's an oversight in the implementation of the
> stream-to-tar support that it did not detect this issue.
>
> Do you want to provide a patch for it, or should I?

What about the attached patch?

+fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+progname);

I added the above hint message because other code checking for invalid
option combinations also prints such a hint. But there is no additional
useful information about valid combinations of options in the help
output, so I'm a bit tempted to remove the hint message.

Regards,

-- 
Fujii Masao
*** a/src/bin/pg_basebackup/pg_basebackup.c
--- b/src/bin/pg_basebackup/pg_basebackup.c
***
*** 2268,2273  main(int argc, char **argv)
--- 2268,2283 
  		exit(1);
  	}
  
+ 	if (format == 't' && streamwal && strcmp(basedir, "-") == 0)
+ 	{
+ 		fprintf(stderr,
+ 			_("%s: cannot stream transaction logs in tar mode to stdout\n"),
+ progname);
+ 		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ 		exit(1);
+ 	}
+ 
  	if (replication_slot && !streamwal)
  	{
  		fprintf(stderr,



Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-19 Thread Michael Paquier
On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada  wrote:
> Do we need to consider the sorting method and the selecting k-th
> latest LSN method?

Honestly, nah. Tests are showing that there are many more bottlenecks
before that with just memory allocation and parsing.
-- 
Michael




Re: [HACKERS] pg_background contrib module proposal

2016-12-19 Thread amul sul
On Tue, Dec 20, 2016 at 12:21 AM, David Fetter  wrote:
> On Thu, Nov 24, 2016 at 09:16:53AM +0530, amul sul wrote:
>> Hi All,
>>
>> I would like to take over pg_background patch and repost for
>> discussion and review.
>
> This looks great.
>
> Sadly, it now breaks initdb when applied atop
> dd728826c538f000220af98de025c00114ad0631 with:
>
> performing post-bootstrap initialization ... TRAP: 
> FailedAssertion("!(const Node*)(rinfo))->type) == T_RestrictInfo))", 
> File: "costsize.c", Line: 660)
> sh: line 1:  2840 Aborted (core dumped) 
> "/home/shackle/pggit/postgresql/tmp_install/home/shackle/10/bin/postgres" 
> --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true template1 > 
> /dev/null
>

I've compiled the binary with the --enable-cassert flag and the
pg_background patch on top of the described commit as well as on latest
HEAD, but I am not able to reproduce this crash; see this:

[VM postgresql]$ /home/amul/Public/pg_inst/pg-master/bin/postgres
--single -F -O -j -c search_path=pg_catalog -c exit_on_error=true
template1

PostgreSQL stand-alone backend 10devel
backend> CREATE EXTENSION pg_background;

backend> select * from pg_extension;

1: extname (typeid = 19, len = 64, typmod = -1, byval = f)
2: extowner (typeid = 26, len = 4, typmod = -1, byval = t)
3: extnamespace (typeid = 26, len = 4, typmod = -1, byval = t)
4: extrelocatable (typeid = 16, len = 1, typmod = -1, byval = t)
5: extversion (typeid = 25, len = -1, typmod = -1, byval = f)
6: extconfig (typeid = 1028, len = -1, typmod = -1, byval = f)
7: extcondition (typeid = 1009, len = -1, typmod = -1, byval = f)

1: extname = "plpgsql" (typeid = 19, len = 64, typmod = -1, byval = f)
2: extowner = "10" (typeid = 26, len = 4, typmod = -1, byval = t)
3: extnamespace = "11" (typeid = 26, len = 4, typmod = -1, byval = t)
4: extrelocatable = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
5: extversion = "1.0" (typeid = 25, len = -1, typmod = -1, byval = f)

1: extname = "pg_background" (typeid = 19, len = 64, typmod = -1, byval = f)
2: extowner = "10" (typeid = 26, len = 4, typmod = -1, byval = t)
3: extnamespace = "11" (typeid = 26, len = 4, typmod = -1, byval = t)
4: extrelocatable = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
5: extversion = "1.0" (typeid = 25, len = -1, typmod = -1, byval = f)

backend>

I hope I am not missing anything. Thanks!

Regards,
Amul




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-19 Thread Masahiko Sawada
On Mon, Dec 19, 2016 at 9:49 PM, Fujii Masao  wrote:
> On Sun, Dec 18, 2016 at 9:36 PM, Michael Paquier
>  wrote:
>> On Fri, Dec 16, 2016 at 10:42 PM, Fujii Masao  wrote:
>>> Attached is the modified version of the patch. Barring objections, I will
>>> commit this version.
>>
>> There is a whitespace:
>> $ git diff master --check
>> src/backend/replication/syncrep.c:39: trailing whitespace.
>> + *
>
> Okay, pushed the patch with this fix. Thanks!

Thank you for reviewing and commit!

> Regarding this feature, there are some loose ends. We should work on
> and complete them until the release.
>
> (1)
> Which synchronous replication method, priority or quorum, should be
> chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
> priority-based sync replication is chosen, to keep backward
> compatibility. However, some hackers argued to change this decision
> so that a quorum commit is chosen, because they think that most users
> prefer a quorum.
>
> (2)
> There are still many source comments and documentation sections that
> we need to update, for example in high-availability.sgml. We need to
> check and update them thoroughly.

Will try to update them.

> (3)
> The priority value is assigned to each standby listed in s_s_names
> even in quorum commit though those priority values are not used at all.
> Users can see those priority values in pg_stat_replication.
> Isn't this confusing? If yes, it might be better to always assign 1 as
> the priority, for example.
>
>
> Any other?
>

Do we need to consider the sorting method and the selecting k-th
latest LSN method?
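
(For context, the straightforward way to pick the k-th latest LSN is to
collect the standbys' flush positions, sort them in descending order, and
take the k-th element; a rough sketch follows. The names are illustrative
only, this is not the committed code.)

#include <stdint.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;        /* 64-bit WAL position */

static int
lsn_cmp_desc(const void *a, const void *b)
{
    XLogRecPtr  lhs = *(const XLogRecPtr *) a;
    XLogRecPtr  rhs = *(const XLogRecPtr *) b;

    if (lhs > rhs)
        return -1;
    if (lhs < rhs)
        return 1;
    return 0;
}

/* Return the k-th latest LSN (k is 1-based), or 0 if there are fewer than k. */
static XLogRecPtr
nth_latest_lsn(XLogRecPtr *lsns, int nstandbys, int k)
{
    if (k < 1 || k > nstandbys)
        return 0;
    qsort(lsns, nstandbys, sizeof(XLogRecPtr), lsn_cmp_desc);
    return lsns[k - 1];
}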

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Declarative partitioning - another take

2016-12-19 Thread Amit Langote
On 2016/12/20 12:59, Robert Haas wrote:
> On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote
>  wrote:
>> Here are updated patches including the additional information.
> 
> Thanks.  Committed 0001.  Will have to review the others when I'm less tired.

Thanks!

> BTW, elog() is only supposed to be used for can't happen error
> messages; when it is used, no translation is done.  So this is wrong:
> 
> if (skip_validate)
> elog(NOTICE, "skipping scan to validate partition constraint");

You're right.  I was using it for debugging when I first wrote that code,
but then thought it would be better to eventually turn that into
ereport(INFO/NOTICE) for the final submission, which I forgot to do. Sorry
about that.  Here's a patch.

Thanks,
Amit
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1c219b03dd..6a179596ce 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13297,8 +13297,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 		}
 	}
 
+	/* It's safe to skip the validation scan after all */
 	if (skip_validate)
-		elog(NOTICE, "skipping scan to validate partition constraint");
+		ereport(INFO,
+(errmsg("skipping scan to validate partition constraint")));
 
 	/*
 	 * Set up to have the table to be scanned to validate the partition
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 99e20eb922..2a324c0b49 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3179,7 +3179,7 @@ ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
 ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
 ALTER TABLE part_3_4 ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
-NOTICE:  skipping scan to validate partition constraint
+INFO:  skipping scan to validate partition constraint
 -- check validation when attaching range partitions
 CREATE TABLE range_parted (
 	a int,
@@ -3204,7 +3204,7 @@ CREATE TABLE part2 (
 	b int NOT NULL CHECK (b >= 10 AND b < 18)
 );
 ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
-NOTICE:  skipping scan to validate partition constraint
+INFO:  skipping scan to validate partition constraint
 -- check that leaf partitions are scanned when attaching a partitioned
 -- table
 CREATE TABLE part_5 (
@@ -3219,7 +3219,7 @@ ERROR:  partition constraint is violated by some row
 DELETE FROM part_5_a WHERE a NOT IN (3);
 ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
-NOTICE:  skipping scan to validate partition constraint
+INFO:  skipping scan to validate partition constraint
 -- check that the table being attached is not already a partition
 ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
 ERROR:  "part_2" is already a partition



Re: [HACKERS] Declarative partitioning - another take

2016-12-19 Thread Robert Haas
On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote
 wrote:
> Here are updated patches including the additional information.

Thanks.  Committed 0001.  Will have to review the others when I'm less tired.

BTW, elog() is only supposed to be used for can't happen error
messages; when it is used, no translation is done.  So this is wrong:

if (skip_validate)
elog(NOTICE, "skipping scan to validate partition constraint");


-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Measuring replay lag

2016-12-19 Thread Thomas Munro
On Mon, Dec 19, 2016 at 10:46 PM, Simon Riggs  wrote:
> On 26 October 2016 at 11:34, Thomas Munro  
> wrote:
>
>> It works by taking advantage of the { time, end-of-WAL } samples that
>> sending servers already include in message headers to standbys.  That
>> seems to provide a pretty good proxy for when the WAL was written, if
>> you ignore messages where the LSN hasn't advanced.  The patch
>> introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
>> control how often the standby should record these timestamped LSNs
>> into a small circular buffer.  When its recovery process eventually
>> replays a timestamped LSN, the timestamp is sent back to the upstream
>> server in a new reply message field.  The value visible in
>> pg_stat_replication.replay_lag can then be updated by comparing with
>> the current time.
>
> Why not just send back the lag as calculated by max_standby_streaming_delay?
> I.e. at the end of replay of each chunk record the current delay in
> shmem, then send it back periodically.
>
> If we have two methods of calculation it would be confusing.

Hmm.  If I understand correctly, GetStandbyLimitTime is measuring
something a bit different: it computes how long it has been since the
recovery process received the chunk that it's currently trying to
apply, most interestingly in the case where we are waiting due to
conflicts.  It doesn't include time in walsender, on the network, in
walreceiver, or writing and flushing and reading before it arrives in
the recovery process.  If I'm reading it correctly, it only updates
XLogReceiptTime when it's completely caught up applying all earlier
chunks, so when it falls behind, its measure of lag has a growing-only
phase and a reset that can only be triggered by catching up to the
latest chunk.  That seems OK for its intended purpose of putting a cap
on the delay introduced by conflicts.  But that's not what I'm trying
to provide here.

The purpose of this proposal is to show the replay_lag as judged by
the sending server: in the case of a primary server, that is an
indication of how long commits done here will take to show up to users
over there, and how long a COMMIT with remote_apply will take to come
back. It measures the WAL's whole journey, and does so in a smooth way
that shows accurate information even if the standby never quite catches
up during long periods.

Example 1:  Suppose I have two servers right next each other, and the
primary server has periods of high activity which exceed the standby's
replay rate, perhaps because of slower/busier hardware, or because of
conflicts with other queries, or because our single-core 'REDO' can't
always keep up with multi-core 'DO'.  By the logic of
max_standby_streaming_delay, if it never catches up to the latest
chunk but remains a fluctuating number of chunks behind, then AIUI the
standby will compute a constantly increasing lag.  By my logic, the
primary will tell you quite accurately how far behind the standby's
recovery is at regular intervals, showing replay_lag fluctuating up
and down as appropriate, even if it never quite catches up.  It can do
that because it has a buffer full of regularly spaced out samples to
work through, and even if you exceed the buffer size (8192 seconds'
worth by default) it'll just increase the interval between samples.

Example 2:  Suppose I have servers on opposite sides of the world with
a ping time of 300ms.  By the logic used for
max_standby_streaming_delay, the lag computed by the standby would be
close to zero when there is no concurrent activity to conflict with.
I don't think that's what users other than the recovery-conflict
resolution code want to know.  By my logic, replay_lag computed by the
primary would show 300ms + a tiny bit more, which is how long it takes
for committed transactions to be visible to user queries on the
standby and for us to know that that is the case.  That's interesting
because it tells you how long synchronous_commit = remote_apply would
make you wait (if that server is waited for according to syncrep
config).

In summary, the max_standby_streaming_delay approach only measures
activity inside the recovery process on the standby, and only uses a
single variable for timestamp tracking, so although it's semi-related
it's not what I wanted to show.
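
To make the sampling mechanism a bit more concrete: the standby-side buffer
is conceptually just a small ring of {LSN, timestamp} pairs. A new sample is
appended when enough time has passed since the previous one, and whenever
recovery replays past the oldest sampled LSN, its timestamp is popped and
reported back upstream. A rough sketch (data structure and function names
are illustrative, not the patch's actual code):

#include <stdint.h>
#include <time.h>

#define LAG_SAMPLES 8192            /* matches the default mentioned above */

typedef struct LagSample
{
    uint64_t    lsn;        /* end-of-WAL position reported by the sender */
    time_t      sent_time;  /* timestamp taken from the message header */
} LagSample;

typedef struct LagBuffer
{
    LagSample   samples[LAG_SAMPLES];
    int         head;       /* next slot to write */
    int         tail;       /* oldest sample not yet replayed */
    int         count;
} LagBuffer;

/* Record a sample; the real code also spaces samples out in time. */
static void
lag_record(LagBuffer *buf, uint64_t lsn, time_t sent_time)
{
    if (buf->count == LAG_SAMPLES)
        return;                 /* or widen the sampling interval */
    buf->samples[buf->head].lsn = lsn;
    buf->samples[buf->head].sent_time = sent_time;
    buf->head = (buf->head + 1) % LAG_SAMPLES;
    buf->count++;
}

/* On replay progress, pop every sample now replayed and report its send time. */
static void
lag_replayed_upto(LagBuffer *buf, uint64_t replayed_lsn, void (*report)(time_t))
{
    while (buf->count > 0 && buf->samples[buf->tail].lsn <= replayed_lsn)
    {
        report(buf->samples[buf->tail].sent_time);
        buf->tail = (buf->tail + 1) % LAG_SAMPLES;
        buf->count--;
    }
}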

(I suppose there might be an argument that max_standby_streaming_delay
should also track received-on-standby-time for each sampled LSN in a
circular buffer, and then use that information to implement
max_standby_streaming_delay more fairly.  We only need to cancel
queries that conflict with WAL records that have truly been waiting
max_standby_streaming_delay since receive time, instead of cancelling
everything that conflicts with recovery until we're caught up to the
last chunk as we do today as soon as max_standby_streaming_delay is
exceeded while trying to apply *any* WAL record.  This may not make
any sense or be worth 

Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)

2016-12-19 Thread Robert Haas
On Sat, Dec 17, 2016 at 5:48 PM, Michael Paquier
 wrote:
> On Sun, Dec 18, 2016 at 3:59 AM, Robert Haas  wrote:
>> On Fri, Dec 16, 2016 at 5:30 PM, Michael Paquier
>>  wrote:
>>> From the discussions of last year on -hackers, it was decided to *not*
>>> have an additional column per complains from a couple of hackers
>>> (Robert you were in this set at this point), ...
>>
>> Hmm, I don't recall taking that position, but then there are a lot of
>> things that I ought to recall and don't.  (Ask my wife!)
>
> [... digging objects of the past ...]
> From the past thread:
> https://www.postgresql.org/message-id/ca+tgmoy790rphhbogxmbtg6mzsendoxdbxebekaet9zpz8g...@mail.gmail.com
> The complain is directed directly to multiple verifiers per users
> though, not to have the type in a separate column.

Yes, I rather like the separate column.  But since Heikki is doing the
work (or if he is) I'm not going to gripe too much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Creating a DSA area to provide work space for parallel execution

2016-12-19 Thread Robert Haas
On Mon, Dec 19, 2016 at 6:35 PM, Thomas Munro
 wrote:
> On Tue, Dec 20, 2016 at 11:12 AM, Robert Haas  wrote:
>> On Thu, Dec 1, 2016 at 6:35 AM, Thomas Munro
>>  wrote:
>>> On Sat, Nov 26, 2016 at 1:55 AM, Thomas Munro
>>>  wrote:
 Here's a new version to apply on top of dsa-v7.patch.
>>>
>>> Here's a version to go with dsa-v8.patch.
>>
>> All right, so I've committed this, but not before (1) renaming some
>> things that you added, (2) adding documentation for the new LWLock
>> tranche, (3) not creating the DSA in the "simulated DSM using
>> backend-private memory" case, and (4) some cosmetic tweaks.
>
> Thanks!  Hmm, so now in rare circumstances we can finish up running
> ExecXXXInitializeDSM, but later have estate->es_query_dsa == NULL.
> This is something to be careful about.

Yes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Logical Replication WIP

2016-12-19 Thread Petr Jelinek
On 19/12/16 15:39, Steve Singer wrote:
> On 12/18/2016 09:04 PM, Petr Jelinek wrote:
>> On 18/12/16 19:02, Steve Singer wrote:
>>
>>> pg_dump is also generating warnings
>>>
>>> pg_dump: [archiver] WARNING: don't know how to set owner for object type
>>> SUBSCRIPTION
>>>
>>> I know that the plan is to add proper ACL's for publications and
>>> subscriptions later. I don't know if we want to leave the warning in
>>> until then or do something about it.
>>>
>> No, ACLs are separate from the owner. This is a thinko on my side. I was
>> thinking we could live without ALTER ... OWNER TO for now, but we actually
>> need it for pg_dump and for REASSIGN OWNED. So now I have added OWNER TO
>> for both PUBLICATION and SUBSCRIPTION.
> 
> 
> When I try to restore my pg_dump with publications I get
> 
> ./pg_dump  -h localhost --port 5440 test |./psql -h localhost --port
> 5440 test2
> 
> 
> ALTER TABLE
> CREATE PUBLICATION
> ERROR:  unexpected command tag "PUBLICATION
> 
> This comes from a
> ALTER PUBLICATION mypub OWNER TO ssinger;
> 
> 
> Does the OWNER TO  clause need to be added to AlterPublicationStmt:
> instead of AlterOwnerStmt ?

Nah, that's just a bug in what command tag string we return in
utility.c. I noticed this myself after sending v14; it's a one-line fix.

> Also, we should update the tab completion for ALTER PUBLICATION to show the
> OWNER TO option, plus the \h help in psql and the reference SGML.
> 

Yeah.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol

2016-12-19 Thread Michael Paquier
On Thu, Dec 15, 2016 at 3:17 PM, Michael Paquier
 wrote:
> In the case where the binaries are *not* built with libidn, I think
> that we had better reject valid UTF-8 string directly and just allow
> ASCII? SASLprep is a no-op on ASCII characters.
>
> Thoughts about this approach?

And Heikki has mentioned to me that he'd prefer not to have an extra
dependency for the normalization (libidn is LGPL-licensed, by the way).
So I have looked at the SASLprep business to see what should be done
to get a complete implementation in core, completely independent of
any external library.

The first thing is to be able to tell in the SCRAM code whether a
string is UTF-8 or not, and this code lives in src/common/. pg_wchar.c
offers a set of routines exactly for this purpose, but it is built with
libpq and not available for src/common/. So instead of moving the whole
file, I'd like to create a new file src/common/utf8.c which includes
pg_utf_mblen() and pg_utf8_islegal(). On top of that I think that having
a routine able to check a full string would be useful for many users, as
pg_utf8_islegal() can only check one multibyte character at a time.
If the password string is found to be valid UTF-8, SASLprep is applied.
If not, the string is copied as-is, with perhaps unexpected effects for
the client. But the client is in trouble already if it is not using UTF-8.
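
To illustrate what such a full-string check could look like on top of those
two existing routines (a sketch of the idea only, with simplified error
handling, not actual submitted code):

#include <stdbool.h>

/* Both routines already exist in the backend; declared here for the example. */
extern int  pg_utf_mblen(const unsigned char *s);
extern bool pg_utf8_islegal(const unsigned char *source, int length);

static bool
pg_utf8_string_is_legal(const char *str, int len)
{
    const unsigned char *s = (const unsigned char *) str;
    const unsigned char *end = s + len;

    while (s < end)
    {
        int     mblen = pg_utf_mblen(s);

        if (s + mblen > end || !pg_utf8_islegal(s, mblen))
            return false;
        s += mblen;
    }
    return true;
}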

Then comes the real business... Note that this is my first time touching
encoding, particularly UTF-8, in depth, so please be nice. I may write
things that are incorrect, or that sound so, from here on :)

The second thing is the normalization itself. Per RFC 4013, NFKC needs
to be applied to the string.  The operation is described completely in
[1], and consists of 1) a compatibility decomposition of the bytes of
the string, followed by 2) a canonical composition.

About 1): the compatibility decomposition is defined in [2] as being done
"by recursively applying the canonical and compatibility mappings, then
applying the canonical reordering algorithm". The canonical and
compatibility mappings are data available in UnicodeData.txt, the 6th
column of the set defined in [3] to be precise. The meaning of the
decomposition mappings is defined in [2] as well. Canonical decomposition
basically means looking up a given character and replacing it with the
multiple characters of its decomposed shape. The compatibility mapping
should be applied as well, but [5], a perl tool called charlint.pl that
does this normalization work, does not care about that phase... Do we?

About 2): once the decomposition has been applied, those characters need
to be recomposed using the Canonical_Combining_Class field of
UnicodeData.txt in [3], which is the 3rd column of the set. Its values
are defined in [4]. Another interesting thing: charlint.pl [5] does not
care about this phase either. I am wondering if we should just drop this
part as well...

Once 1) and 2) are done, NFKC is complete, and so is SASLprep.

So what we need from the Postgres side is a mapping table with the
following fields:
1) Hex sequence of the UTF-8 character.
2) Its canonical combining class.
3) The kind of decomposition mapping, if defined.
4) The decomposition mapping, in hexadecimal format.
Based on what I looked at, either perl or python could be used to
process UnicodeData.txt and generate a header file that would be
included in the tree. There are 30k entries in UnicodeData.txt, 5k of
which have a mapping, so that will result in fairly large tables. One
thing to improve performance would be to store the length of the table
in a static variable, order the entries by their hexadecimal keys and do
a dichotomy (binary search) lookup to find an entry. We could as well use
fancier things like a set of tables forming a radix tree keyed on the
decomposed bytes. We should end up doing just one table lookup per
character anyway.
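
To make the table shape concrete, the lookup side could look roughly like
the sketch below. The layout and names are assumptions for illustration;
the real table would be generated from UnicodeData.txt, and only two
hand-picked canonical entries are shown here.

#include <stdint.h>
#include <stdlib.h>

typedef struct UnicodeDecompEntry
{
    uint32_t    codepoint;          /* the character, as a Unicode codepoint */
    uint8_t     comb_class;         /* Canonical_Combining_Class */
    uint8_t     is_compat;          /* compatibility (vs. canonical) mapping? */
    uint8_t     decomp_len;         /* number of codepoints in the mapping */
    const uint32_t *decomp;         /* decomposition mapping, NULL if none */
} UnicodeDecompEntry;

/* Tiny excerpt for illustration; entries are sorted by codepoint. */
static const uint32_t decomp_00C0[] = {0x0041, 0x0300};    /* A + grave */
static const uint32_t decomp_00C9[] = {0x0045, 0x0301};    /* E + acute */

static const UnicodeDecompEntry UnicodeDecompTable[] =
{
    {0x00C0, 0, 0, 2, decomp_00C0},
    {0x00C9, 0, 0, 2, decomp_00C9},
};

static const int UnicodeDecompTableSize =
    sizeof(UnicodeDecompTable) / sizeof(UnicodeDecompTable[0]);

static int
decomp_entry_cmp(const void *key, const void *elem)
{
    uint32_t    cp = *(const uint32_t *) key;
    const UnicodeDecompEntry *e = (const UnicodeDecompEntry *) elem;

    if (cp < e->codepoint)
        return -1;
    if (cp > e->codepoint)
        return 1;
    return 0;
}

/* Dichotomy (binary search) lookup over the codepoint-ordered table. */
static const UnicodeDecompEntry *
find_decomp_entry(uint32_t codepoint)
{
    return bsearch(&codepoint, UnicodeDecompTable, UnicodeDecompTableSize,
                   sizeof(UnicodeDecompEntry), decomp_entry_cmp);
}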

In conclusion, at this point I am looking for feedback regarding the
following items:
1) Where to put the UTF8 check routines and what to move.
2) How to generate the mapping table using UnicodeData.txt. I'd think
that using perl would be better.
3) The shape of the mapping table, which depends on how many
operations we want to support in the normalization of the strings.
The decisions for those items will drive the implementation in one
sense or another.

[1]: http://www.unicode.org/reports/tr15/#Description_Norm
[2]: 
http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Character_Decomposition_Mappings
[3]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
[4]: 
http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Canonical_Combining_Class_Values
[5]: https://www.w3.org/International/charlint/

Heikki, others, thoughts?
-- 
Michael




Re: [HACKERS] Logical replication existing data copy

2016-12-19 Thread Petr Jelinek
On 19/12/16 18:25, Peter Eisentraut wrote:
> On 12/19/16 4:30 AM, Petr Jelinek wrote:
>> So existing table data can be copied once subscription is created, but
>> also new tables can be added and their data copied. This is where the
>> REFRESH PUBLICATION comes into play. Adding table to publication does
>> not make it automatically replicated by the subscription as the
>> subscription does not have tracking info for that table. So to add new
>> table user must call ALTER SUBSCRIPTION ... REFRESH PUBLICATION on
>> subscriber otherwise the data won't be replicated.
> 
> Couldn't the subscriber automatically add tracking info when apply
> stream data arrives for a relation it has not seen before?
> 

Sure, but it has many caveats:
- what if the table does not exist
- what if it exists and already has data
- what if the table is rarely written to

We can't control any of that until we have DDL replication/automatic
structure dumping. Once we have those, we can add options to control the
default behavior per subscriber, but with the current feature set,
anything that does not require user action will behave
non-deterministically, which is usually confusing.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Minor correction in alter_table.sgml

2016-12-19 Thread Amit Langote
On 2016/12/20 4:08, Peter Eisentraut wrote:
> On 11/30/16 8:47 PM, Amit Langote wrote:
>>> So maybe something like
>>>
>>> All the forms of ALTER TABLE that act on a single table,
>>> except RENAME and SET SCHEMA, can be combined into a
>>> list of multiple alterations to be applied together.
>>
>> Updated patch attached.
> 
> Could you please rebase this patch?  Some new stuff about partitions has
> been added in the paragraph you are changing.

Sure, here it is.

Thanks,
Amit
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 8ea6624147..cce853be8a 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -767,13 +767,12 @@ ALTER TABLE [ IF EXISTS ] name
   
 
   
-   All the actions except RENAME,
-   SET TABLESPACE, SET SCHEMA,
-   ATTACH PARTITION, and
-   DETACH PARTITION can be combined into
-   a list of multiple alterations to apply in parallel.  For example, it
-   is possible to add several columns and/or alter the type of several
-   columns in a single command.  This is particularly useful with large
+   All the forms of ALTER TABLE that act on a single table,
+   except RENAME, SET SCHEMA,
+   ATTACH PARTITION, and DETACH PARTITION
+   can be combined into a list of multiple alterations to be applied together.
+   For example, it is possible to add several columns and/or alter the type of
+   several columns in a single command.  This is particularly useful with large
tables, since only one pass over the table need be made.
   
 



Re: [HACKERS] Logical replication existing data copy

2016-12-19 Thread Craig Ringer
On 20 Dec. 2016 1:27 am, "Peter Eisentraut" <
peter.eisentr...@2ndquadrant.com> wrote:

On 12/19/16 4:30 AM, Petr Jelinek wrote:
> So existing table data can be copied once subscription is created, but
> also new tables can be added and their data copied. This is where the
> REFRESH PUBLICATION comes into play. Adding table to publication does
> not make it automatically replicated by the subscription as the
> subscription does not have tracking info for that table. So to add new
> table user must call ALTER SUBSCRIPTION ... REFRESH PUBLICATION on
> subscriber otherwise the data won't be replicated.

Couldn't the subscriber automatically add tracking info when apply
stream data arrives for a relation it has not seen before?


If no table has been created by the user and we start trying to apply a
data stream, apply will break.

Since manual action is needed to create the destination, I don't see a
problem with requiring manual enabling too, personally.

Let the fully transparent way wait until we can do DDL replication in v11+


Re: [HACKERS] Creating a DSA area to provide work space for parallel execution

2016-12-19 Thread Thomas Munro
On Tue, Dec 20, 2016 at 11:12 AM, Robert Haas  wrote:
> On Thu, Dec 1, 2016 at 6:35 AM, Thomas Munro
>  wrote:
>> On Sat, Nov 26, 2016 at 1:55 AM, Thomas Munro
>>  wrote:
>>> Here's a new version to apply on top of dsa-v7.patch.
>>
>> Here's a version to go with dsa-v8.patch.
>
> All right, so I've committed this, but not before (1) renaming some
> things that you added, (2) adding documentation for the new LWLock
> tranche, (3) not creating the DSA in the "simulated DSM using
> backend-private memory" case, and (4) some cosmetic tweaks.

Thanks!  Hmm, so now in rare circumstances we can finish up running
ExecXXXInitializeDSM, but later have estate->es_query_dsa == NULL.
This is something to be careful about.
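
In other words, any node code that wants to use the per-query DSA area now
has to tolerate its absence. A hypothetical fragment just to illustrate the
pattern (the Foo* names are made up; only estate->es_query_dsa and the
dsa_* calls are real):

/* Hypothetical executor-node fragment, not from the committed code. */
static void
ExecFooInitializeDSM(FooState *node)
{
    EState     *estate = node->ps.state;

    /*
     * When DSM is simulated with backend-private memory, no per-query DSA
     * area is created, so es_query_dsa is NULL and must not be used.
     */
    if (estate->es_query_dsa == NULL)
    {
        node->shared = NULL;
        return;
    }

    node->shared_dp = dsa_allocate(estate->es_query_dsa, sizeof(FooShared));
    node->shared = dsa_get_address(estate->es_query_dsa, node->shared_dp);
}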

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] too low cost of Bitmap index scan

2016-12-19 Thread Robert Haas
On Sat, Dec 17, 2016 at 3:30 AM, Pavel Stehule  wrote:
> ->  Bitmap Heap Scan on "Zasilka"  (cost=5097.39..5670.64 rows=1 width=12)
> (actual time=62.253..62.400 rows=3 loops=231)
...
> When I disable bitmap scan, the query is 6x faster

>->  Index Scan using "Zasilka_idx_Dopravce" on "Zasilka"
> (cost=0.00..30489.04 rows=1 width=12) (actual time=15.651..17.187 rows=3
> loops=231)
> Index Cond: ("Dopravce" = "Dopravce_Ridic_1"."ID")
>Filter: (("StavDatum" > (now() - '10 days'::interval)) AND (("Stav" =
> 34) OR ("Stav" = 29) OR ("Stav" = 180) OR ("Stav" = 213) OR ("Stav" = 46) OR
> (("Produkt" = 33) AND ("Stav" = 179)) OR ((("ZpetnaZasilka" = '-1'::integer)
> OR ("PrimaZasilka" = '-1'::integer)) AND ("Stav" = 40
> Rows Removed by Filter: 7596

I'm not sure, but my guess would be that the query planner isn't
getting a very accurate selectivity estimate for that giant filter
condition, and that's why the cost estimate is off.
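
(As a rough sketch of why such estimates go wrong: the planner combines the
per-clause selectivities assuming the clauses are statistically independent,
roughly s(A AND B) = s(A) * s(B) and s(A OR B) = s(A) + s(B) - s(A) * s(B).
With that many OR'd and AND'd conditions, small per-clause errors compound
into a badly skewed row estimate, which in turn skews the cost comparison
between the bitmap scan and the plain index scan.)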

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Cache Hash Index meta page.

2016-12-19 Thread Robert Haas
On Fri, Dec 16, 2016 at 5:16 AM, Mithun Cy 
wrote:

> Shouldn't _hash_doinsert() be using the cache, too
>>
>
> Yes, we have an opportunity there; I added the same in the code. But one
> difference is that at the end we need to read the meta page at least once,
> to split and/or write. The performance improvement might not be as much as
> for read-only.
>

Why would you need to read it at least once in one case but not the other?


>
>> I think it's probably not a good idea to cache the entire metapage.  The
>> only things that you are "really" trying to cache, I think, are
>> hashm_maxbucket, hashm_lowmask, and hashm_highmask.  The entire
>> HashPageMetaData structure is 696 bytes on my machine, and it doesn't
>> really make sense to copy the whole thing into memory if you only need 16
>> bytes of it.  It could even be dangerous -- if somebody tries to rely on
>> the cache for some other bit of data and we're not really guaranteeing that
>> it's fresh enough for that.
>>
>> I'd suggest defining a new structure HashMetaDataCache with members
>> hmc_maxbucket, hmc_lowmask, and hmc_highmask.  The structure can have a
>> comment explaining that we only care about having the data be fresh enough
>> to test whether the bucket mapping we computed for a tuple is still
>> correct, and that for that to be the case we only need to know whether a
>> bucket has suffered a new split since we last refreshed the cache.
>>
>
> It is not only hashm_maxbucket, hashm_lowmask, and hashm_highmask (3
> uint32s) but we also need
>
> *uint32  hashm_spares[HASH_MAX_SPLITPOINTS],* for bucket number to
> block mapping in "BUCKET_TO_BLKNO(metap, bucket)".
>
> Note : #define HASH_MAX_SPLITPOINTS 32, so it is (3*uint32 + 32*uint32) =
> 35*4 = 140 bytes.
>
Well, I guess that makes it more appealing to cache the whole page at least
in terms of raw number of bytes, but I suppose my real complaint here is
that there don't seem to be any clear rules for where, whether, and to what
extent we can rely on the cache to be valid.  Without that, I think this
patch is creating an extremely serious maintenance hazard for anyone who
wants to try to modify this code in the future.  A future patch author
needs to be able to tell what data they can use and what data they can't
use, and why.
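
(For reference, the kind of structure being discussed, i.e. the three masks
named above plus the spares array needed for BUCKET_TO_BLKNO, would look
something like the sketch below. It is illustrative only, not a concrete
proposal.)

#include <stdint.h>

#define HASH_MAX_SPLITPOINTS 32     /* value quoted above */

typedef struct HashMetaDataCache
{
    uint32_t    hmc_maxbucket;      /* current maximum bucket number */
    uint32_t    hmc_lowmask;        /* mask for modulo into lower half of table */
    uint32_t    hmc_highmask;       /* mask for modulo into whole table */
    uint32_t    hmc_spares[HASH_MAX_SPLITPOINTS];   /* for BUCKET_TO_BLKNO() */
} HashMetaDataCache;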

> Apart from this, there seems to be a pre-existing bug in _hash_doinsert():
> +* XXX this is useless code if we are only storing hash keys.
>
> +   */
>
> +if (itemsz > HashMaxItemSize((Page) metap))
>
> +ereport(ERROR,
>
> +(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>
> + errmsg("index row size %zu exceeds hash maximum %zu",
>
> +itemsz, HashMaxItemSize((Page) metap)),
>
> +   errhint("Values larger than a buffer page cannot be
> indexed.")));
>
> "metap" (HashMetaPage) and Page are different data structures whose member
> types are not in sync, so we should not blindly typecast as above. I think we
> should remove this part of the code as we only store hash keys. So I have
> removed it but kept the assert below as it is.
>
Any existing bugs should be the subject of a separate patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Creating a DSA area to provide work space for parallel execution

2016-12-19 Thread Robert Haas
On Thu, Dec 1, 2016 at 6:35 AM, Thomas Munro
 wrote:
> On Sat, Nov 26, 2016 at 1:55 AM, Thomas Munro
>  wrote:
>> Here's a new version to apply on top of dsa-v7.patch.
>
> Here's a version to go with dsa-v8.patch.

All right, so I've committed this, but not before (1) renaming some
things that you added, (2) adding documentation for the new LWLock
tranche, (3) not creating the DSA in the "simulated DSM using
backend-private memory" case, and (4) some cosmetic tweaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal for changes to recovery.conf API

2016-12-19 Thread Simon Riggs
On 19 December 2016 at 21:29, Peter Eisentraut
 wrote:
> On 12/16/16 8:52 PM, Robert Haas wrote:
>> If the explanation is just a few sentences long, I see no reason not
>> to include it in the release notes.
>
> As far as I can tell from the latest posted patch, the upgrade
> instructions are
>
> - To trigger recovery, create an empty file recovery.trigger instead of
> recovery.conf.
>
> - All parameters formerly in recovery.conf are now regular
> postgresql.conf parameters.  For backward compatibility, recovery.conf
> is read after postgresql.conf, but the parameters can now be put into
> postgresql.conf if desired.
>
> Some of that might be subject to patch review, but it's probably not
> going to be much longer than that.  So I think that will fit well into
> the usual release notes section.

Although that's right, there is more detail.

The current list is this, based upon discussion at the Tokyo dev meeting:

* Any use of the earlier recovery.conf or any of the old recovery
parameter names causes an ERROR, so we have a clear API break

* To trigger recovery, create an empty recovery.trigger file. Same as
recovery.conf with standby_mode = off

* To trigger standby, create an empty standby.trigger file. Same as
recovery.conf with standby_mode = on

* Trigger files can be placed in a writable directory outside of the
data directory, trigger_directory
(I would also like to rename the concept of "trigger file" to another
word/phrase not already used for something else, c.f. table triggers)

* recovery.conf parameters are all moved to postgresql.conf, with these changes
a) standby_mode no longer exists
b) trigger_file is replaced by promote_trigger_file
equivalent parameters will be added for c) recovery_trigger_file and
d) standby_trigger_file
e) recovery_target is a new parameter with the two part content
mentioned upthread, e.g. 'xid 1234'
f) recovery_target and all trigger_file parameters are now reloadable,
rather than settable only at server start
g) recovery parameters are shown in the postgresql.conf.sample, but
commented out
h) recovery.conf.sample is removed
which leaves many of the parameters with the same name and meaning as
before, but with enough change to be a clear API break that people need
to understand

* pg_basebackup -R will generate a parameter file written to the data
directory that will allow it to work, even when postgresql.conf is in
a different directory
(this bit was the most awkward part of the list of changes)

* hot_standby parameter would be removed

I think I can work this into the docs without anything worth further
discussion, but let me write that first.

The release notes can warn about the old API generating an ERROR, with
multiple links to discussion of how it now works.

Merry Christmas everybody, new patch in time for New Year, further
discussion welcome

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: add error message in backend/catalog/index.c

2016-12-19 Thread Peter Eisentraut
On 12/13/16 1:15 AM, Ioseph Kim wrote:
> I propose to append an error message when index name and table name are
> same.
> 
> 
> example:
> 
> postgres@postgres=# create table t (a int not null, constraint t primary
> key (a));
> ERROR:  relation "t" already exists

The code to detect that would probably be fairly complicated for such a
rare case.

A more principled approach might be to create an error context for the
index creation so that you can tell what is going on when the error happens.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackups and slots

2016-12-19 Thread Peter Eisentraut
On 12/17/16 9:55 AM, Magnus Hagander wrote:
> That makes a lot of sense now that we have temporary replication slots,
> I agree. And then max_replication_slots could be something like
> persistent_replication_slots?

What I was thinking was more like each walsender getting one fixed slot,
which can be temporary or persistent.

Or maybe default max_replication_slots to something like -1, which means
equal to max_wal_senders.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-12-19 Thread Peter Eisentraut
On 12/16/16 8:52 PM, Robert Haas wrote:
> If the explanation is just a few sentences long, I see no reason not
> to include it in the release notes.

As far as I can tell from the latest posted patch, the upgrade
instructions are

- To trigger recovery, create an empty file recovery.trigger instead of
recovery.conf.

- All parameters formerly in recovery.conf are now regular
postgresql.conf parameters.  For backward compatibility, recovery.conf
is read after postgresql.conf, but the parameters can now be put into
postgresql.conf if desired.

Some of that might be subject to patch review, but it's probably not
going to be much longer than that.  So I think that will fit well into
the usual release notes section.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to drop old-style (V0) functions?

2016-12-19 Thread Robert Haas
On Mon, Dec 19, 2016 at 3:13 PM, Peter Eisentraut
 wrote:
> On 12/9/16 7:52 AM, Robert Haas wrote:
>> It's kind of ironic, at least IMHO, that the DirectFunctionCall
>> macros are anything but direct.  Each one is a non-inlined function
>> call that does a minimum of 8 variable assignments before actually
>> calling the function.
>
> If this is a problem (it might be), then we can just make those calls,
> er, direct C function calls to different function.  For example,
>
> result = DatumGetObjectId(DirectFunctionCall1(oidin,
>   CStringGetDatum(pro_name_or_oid)));
>
> could just be
>
> result = oidin_internal(pro_name_or_oid);

Yeah, true -- that works for simple cases, and might be beneficial
where you're doing lots and lots of calls in a tight loop.

For the more general problem, I wonder if we should try to figure out
something where we have one calling convention for "simple" functions
(those that need little or none of the information in fcinfo and can pretty
much just take their SQL args as C args) and another for "complex"
functions (where we do the full push-ups).
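
For illustration, a minimal sketch of the refactoring quoted above -- the
oidin_internal helper and its body are hypothetical here; only the fmgr
macros (PG_FUNCTION_ARGS, PG_GETARG_CSTRING, PG_RETURN_OID) and
DirectFunctionCall1 are the existing interfaces:

  #include "postgres.h"
  #include "fmgr.h"

  /* hypothetical: core logic factored out into a plain C function */
  static Oid
  oidin_internal(const char *s)
  {
      /* real code would use the backend's error-checked parsing */
      return (Oid) strtoul(s, NULL, 10);
  }

  /* the V1 wrapper keeps the fmgr calling convention for SQL callers */
  Datum
  oidin(PG_FUNCTION_ARGS)
  {
      char   *s = PG_GETARG_CSTRING(0);

      PG_RETURN_OID(oidin_internal(s));
  }

  /*
   * Backend call sites can then choose:
   *   via fmgr: DatumGetObjectId(DirectFunctionCall1(oidin, CStringGetDatum(s)))
   *   directly: oidin_internal(s), with no fcinfo setup at all
   */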

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to drop old-style (V0) functions?

2016-12-19 Thread Peter Eisentraut
On 12/9/16 7:52 AM, Robert Haas wrote:
> It's kind of ironic, at least IMHO, that the DirectFunctionCall
> macros are anything but direct.  Each one is a non-inlined function
> call that does a minimum of 8 variable assignments before actually
> calling the function.

If this is a problem (it might be), then we can just make those calls,
er, direct C function calls to different function.  For example,

result = DatumGetObjectId(DirectFunctionCall1(oidin,
  CStringGetDatum(pro_name_or_oid)));

could just be

result = oidin_internal(pro_name_or_oid);

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor correction in alter_table.sgml

2016-12-19 Thread Peter Eisentraut
On 11/30/16 8:47 PM, Amit Langote wrote:
>> So maybe something like
>>
>>  All the forms of ALTER TABLE that act on a single table,
>>  except RENAME and SET SCHEMA, can be combined into a
>>  list of multiple alterations to be applied together.
> 
> Updated patch attached.

Could you please rebase this patch?  Some new stuff about partitions has
been added in the paragraph you are changing.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background contrib module proposal

2016-12-19 Thread David Fetter
On Thu, Nov 24, 2016 at 09:16:53AM +0530, amul sul wrote:
> Hi All,
> 
> I would like to take over pg_background patch and repost for
> discussion and review.

This looks great.

Sadly, it now breaks initdb when applied atop
dd728826c538f000220af98de025c00114ad0631 with:

performing post-bootstrap initialization ... TRAP: FailedAssertion("!(const 
Node*)(rinfo))->type) == T_RestrictInfo))", File: "costsize.c", Line: 660)
sh: line 1:  2840 Aborted (core dumped) 
"/home/shackle/pggit/postgresql/tmp_install/home/shackle/10/bin/postgres" 
--single -F -O -j -c search_path=pg_catalog -c exit_on_error=true template1 > 
/dev/null

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PSQL commands: \quit_if, \quit_unless

2016-12-19 Thread David G. Johnston
On Mon, Dec 19, 2016 at 11:23 AM, Alvaro Herrera 
wrote:

> David G. Johnston wrote:
>
> > Being able to do more conditional work in psql would make setting up more
> > robust scripts easier without either losing transaction capabilities or
> > session pooling for improved performance when large numbers of small
> > commands are run in between flow control done in bash.
>
> Have you tried to script processes in shell using a single background
> psql process with which the shell code communicates using a pipe?  I've
> long been curious about that approach, but never had a strong need
> enough to actually write the code.  It should be possible.
>

​I've envisioned and read up a bit on the approach but the cost-benefit
hasn't yet made actually doing it worthwhile.

Right now I pretty much run all of the scripts I've been writing myself - the
cost-benefit ratio is likely to change once they are turned over to a
non-programmer to run.

David J.​


Re: [HACKERS] PSQL commands: \quit_if, \quit_unless

2016-12-19 Thread Alvaro Herrera
David G. Johnston wrote:

> Being able to do more conditional work in psql would make setting up more
> robust scripts easier without either losing transaction capabilities or
> session pooling for improved performance when large numbers of small
> commands are run in between flow control done in bash.

Have you tried to script processes in shell using a single background
psql process with which the shell code communicates using a pipe?  I've
long been curious about that approach, but never had a strong need
enough to actually write the code.  It should be possible.
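
For what it's worth, a rough sketch of that kind of setup using a bash
coprocess (untested; the database name and queries are made up, and error
handling and psql output buffering are glossed over):

  #!/bin/bash
  # one long-lived psql session, driven from the shell through a pipe
  coproc PSQL { psql -qAtX mydb 2>&1; }

  run_sql() {
      printf '%s\n' "$1" >&"${PSQL[1]}"     # send one statement
      IFS= read -r reply <&"${PSQL[0]}"     # read back a single result line
      printf '%s\n' "$reply"
  }

  ntables=$(run_sql 'SELECT count(*) FROM pg_class;')
  if [ "$ntables" -gt 100 ]; then
      run_sql 'SELECT now();' > /dev/null
  fi

  printf '\\q\n' >&"${PSQL[1]}"             # shut the session down
  wait "$PSQL_PID"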

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PSQL commands: \quit_if, \quit_unless

2016-12-19 Thread Corey Huinker
>
> >
> >   -- this is how I fake an 'exit 0' now:
> >   \set work_needs_to_be_done t
> >   select
> > case
> >   when :'work_needs_to_be_done'::boolean then ''
> >   else '\q'
> > end as cmd
> >   \gset
> >   :cmd
> >
> >   -- ridiculous example to illustrate complications in remembering past
> > commands
> >   select
> > case
> >when random() < 0.5 then '\ir my_script.sql'
> >when random() < 0.7 then 'select * from a_table; select count(*) from
> > another_table;'
> >else 'select null as foo;'
> > end as cmd
> >   \gset
> >   :cmd
> >
> > And even then, things get complicated, because an \ir include which
> makes it
> > this iteration might not make it the next, and the \endwhile might have
> been
> > inside that include, or vice-versa, an included file starts a \while it
> > doesn't finish.
>
> I see your point.  Just out of curiosity, why in the world don't you
> use something other than psql for scripting?  I mean, if you accessed
> the data from Perl or Python or
> $INSERT_YOUR_FAVORITE_SCRIPTING_LANGUAGE_HERE, you could do all of
> this stuff very easily without any contortions.  I've always thought
> of psql as something that's fine for interactive use and goofy trivial
> scripting but not really suitable for serious work.  I grant that you
> seem to be making it serve the purpose, but, man.
>

Since you asked:

Heh. I *don't* do the second example; I was just pointing out that those
things could be done, not that they should be done, and how hard it would
be to implement loops when the source code is potentially coming from a
stream.

My current client mostly uses python, but also perl, ruby, PHP,
reactjs, and $NEW_THING.

Here are the reasons I often prefer psql:
- Wiring up a python script to do one if-then in the middle of 40 SQL
statements goes a long way toward obfuscating what SQL is going to be run.
- Packaging up the SQL statements in a DO $$ $$; block conceals what
statements were run, and how long they took.
- In python, etc, it's up to me to show rowcounts and timings.
- On very small docker-ish systems, the fewer things I have to install, the
better, and golly, python is large these days.
- When doing work for regulated industry clients (SOX, HIPAA, school
district PII, etc), the auditors like seeing clearly what SQL _will_ run,
what SQL _did_ run, and what was affected. psql scripts with echo-queries
set and captured output do that nicely. Installing extra scripting
languages gives them the vapors, and now we need an auditor that thinks
they know two languages, not one. I'm not saying it makes sense, I'm saying
fewer dependencies gets an auditor's checkbox checked sooner.


> Right, I think that while/for can be left for another time, as long as
> the plan doesn't preclude doing it someday.


+1


Re: [HACKERS] PSQL commands: \quit_if, \quit_unless

2016-12-19 Thread David G. Johnston
On Mon, Dec 19, 2016 at 10:30 AM, Robert Haas  wrote:

>
> I see your point.  Just out of curiosity, why in the world don't you
> use something other than psql for scripting?  I mean, if you accessed
> the data from Perl or Python or
> $INSERT_YOUR_FAVORITE_SCRIPTING_LANGUAGE_HERE, you could do all of
> this stuff very easily without any contortions.  I've always thought
> of psql as something that's fine for interactive use and goofy trivial
> scripting but not really suitable for serious work.  I grant that you
> seem to be making it serve the purpose, but, man.
>

I'm coming to the realization that this sentiment, when applied to my
primary application, is probably correct...

In my situation the scripting language of choice is Bash - which largely
acts as glue for programs such as psql, pdftk, enscript, and the R language.

Being able to do more conditional work in psql would make setting up more
robust scripts easier without either losing transaction capabilities or
session pooling for improved performance when large numbers of small
commands are run in between flow control done in bash.

David J.


Re: [HACKERS] PSQL commands: \quit_if, \quit_unless

2016-12-19 Thread Pavel Stehule
2016-12-19 18:30 GMT+01:00 Robert Haas :

> On Sat, Dec 17, 2016 at 3:39 PM, Corey Huinker 
> wrote:
> >> To implement \while, we'd also need to remember previous commands so
> >> that when we reach the end of the loop, we can rewind and put all of
> >> those commands back on the stack to be executed again, or perhaps to
> >> be skipped if the \while condition turns out now to be false.
> >
> > This might be what you meant when you said "those commands back on the
> > stack", but I think we'd have to remember not a list of commands, but the
> > raw string of bytes from the start of the \while to the \endwhile (or
> > equivalent), because any psql vars within that block could themselves be
> a
> > non-parameter part of a command:
> >
> >   -- this is how I fake an 'exit 0' now:
> >   \set work_needs_to_be_done t
> >   select
> > case
> >   when :'work_needs_to_be_done'::boolean then ''
> >   else '\q'
> > end as cmd
> >   \gset
> >   :cmd
> >
> >   -- ridiculous example to illustrate complications in remembering past
> > commands
> >   select
> > case
> >when random() < 0.5 then '\ir my_script.sql'
> >when random() < 0.7 then 'select * from a_table; select count(*) from
> > another_table;'
> >else 'select null as foo;'
> > end as cmd
> >   \gset
> >   :cmd
> >
> > And even then, things get complicated, because an \ir include which
> makes it
> > this iteration might not make it the next, and the \endwhile might have
> been
> > inside that include, or vice-versa, an included file starts a \while it
> > doesn't finish.
>
> I see your point.  Just out of curiosity, why in the world don't you
> use something other than psql for scripting?  I mean, if you accessed
> the data from Perl or Python or
> $INSERT_YOUR_FAVORITE_SCRIPTING_LANGUAGE_HERE, you could do all of
> this stuff very easily without any contortions.  I've always thought
> of psql as something that's fine for interactive use and goofy trivial
> scripting but not really suitable for serious work.  I grant that you
> seem to be making it serve the purpose, but, man.
>

The integration of any scripting environment with SQL is much weaker than
in psql - some easy scenarios are only natural in psql.

It is similar to why some people, like me, prefer PL/pgSQL over Java, Perl,
...

Years ago there was a bash integrated with Postgres - a nice idea in my view,
but that project is dead.


>
> > So maybe what we store is a stack of buffers that are currently open
> (STDIN
> > being captured as a buffer only when a \while starts, everything else
> being
> > files), and additionally have a stack of positions where a \while started
> > (buffer_id, position in buffer).
>
> Yeah, sounds about right.
>
> > Additionally, we could assert that all \while-\endwhile pairs must
> happen in
> > the same MainLoop (aka file), and mismatches are an error.
>
> Sounds prudent.
>
> > I'm happy to keep sketching out what control structures might look like
> and
> > how to implement them. But I'm also happy to leave while/for loops out
> for
> > now.
>
> Right, I think that while/for can be left for another time, as long as
> the plan doesn't preclude doing it someday.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Hash Indexes

2016-12-19 Thread Robert Haas
On Sun, Dec 18, 2016 at 8:54 AM, Amit Kapila  wrote:
>> I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
>> some reservations about fix_lock_chaining_v1.  ISTM that the natural
>> fix here would be to change the API contract for _hash_freeovflpage so
>> that it doesn't release the lock on the write buffer.  Why does it
>> even do that?  I think that the only reason why _hash_freeovflpage
>> should be getting wbuf as an argument is so that it can handle the
>> case where wbuf happens to be the previous block correctly.
>
> Yeah, as of now that is the only case, but for the WAL patch, I think we
> need to ensure that the action of moving all the tuples to the page
> being written and freeing the overflow page is logged
> together as an atomic operation.

Not really.  We can have one operation that empties the overflow page
and another that unlinks it and makes it free.

> Now apart from that, it is
> theoretically possible that write page will remain locked for multiple
> overflow pages being freed (when the page being written has enough
> space that it can accommodate tuples from multiple overflow pages).  I
> am not sure if it is worth worrying about such a case because
> practically it might happen rarely.  So, I have prepared a patch to
> retain a lock on wbuf in _hash_freeovflpage() as suggested by you.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PSQL commands: \quit_if, \quit_unless

2016-12-19 Thread Robert Haas
On Sat, Dec 17, 2016 at 3:39 PM, Corey Huinker  wrote:
>> To implement \while, we'd also need to remember previous commands so
>> that when we reach the end of the loop, we can rewind and put all of
>> those commands back on the stack to be executed again, or perhaps to
>> be skipped if the \while condition turns out now to be false.
>
> This might be what you meant when you said "those commands back on the
> stack", but I think we'd have to remember not a list of commands, but the
> raw string of bytes from the start of the \while to the \endwhile (or
> equivalent), because any psql vars within that block could themselves be a
> non-parameter part of a command:
>
>   -- this is how I fake an 'exit 0' now:
>   \set work_needs_to_be_done t
>   select
> case
>   when :'work_needs_to_be_done'::boolean then ''
>   else '\q'
> end as cmd
>   \gset
>   :cmd
>
>   -- ridiculous example to illustrate complications in remembering past
> commands
>   select
> case
>when random() < 0.5 then '\ir my_script.sql'
>when random() < 0.7 then 'select * from a_table; select count(*) from
> another_table;'
>else 'select null as foo;'
> end as cmd
>   \gset
>   :cmd
>
> And even then, things get complicated, because an \ir include which makes it
> this iteration might not make it the next, and the \endwhile might have been
> inside that include, or vice-versa, an included file starts a \while it
> doesn't finish.

I see your point.  Just out of curiosity, why in the world don't you
use something other than psql for scripting?  I mean, if you accessed
the data from Perl or Python or
$INSERT_YOUR_FAVORITE_SCRIPTING_LANGUAGE_HERE, you could do all of
this stuff very easily without any contortions.  I've always thought
of psql as something that's fine for interactive use and goofy trivial
scripting but not really suitable for serious work.  I grant that you
seem to be making it serve the purpose, but, man.

> So maybe what we store is a stack of buffers that are currently open (STDIN
> being captured as a buffer only when a \while starts, everything else being
> files), and additionally have a stack of positions where a \while started
> (buffer_id, position in buffer).

Yeah, sounds about right.

> Additionally, we could assert that all \while-\endwhile pairs must happen in
> the same MainLoop (aka file), and mismatches are an error.

Sounds prudent.

> I'm happy to keep sketching out what control structures might look like and
> how to implement them. But I'm also happy to leave while/for loops out for
> now.

Right, I think that while/for can be left for another time, as long as
the plan doesn't preclude doing it someday.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2016-12-19 Thread Peter Eisentraut
On 12/19/16 4:30 AM, Petr Jelinek wrote:
> So existing table data can be copied once a subscription is created, but
> also new tables can be added and their data copied. This is where
> REFRESH PUBLICATION comes into play. Adding a table to a publication does
> not make it automatically replicated by the subscription, as the
> subscription does not have tracking info for that table. So to add a new
> table the user must call ALTER SUBSCRIPTION ... REFRESH PUBLICATION on the
> subscriber, otherwise the data won't be replicated.

Couldn't the subscriber automatically add tracking info when apply
stream data arrives for a relation it has not seen before?
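
For reference, the manual workflow described above as SQL -- the
publication and subscription names are made up, and the exact syntax is
whatever the patch series settles on:

  -- on the publisher
  ALTER PUBLICATION mypub ADD TABLE new_table;

  -- on the subscriber, so the new table gets tracked and its data copied
  ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;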

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Creating a DSA area to provide work space for parallel execution

2016-12-19 Thread Robert Haas
On Sun, Dec 18, 2016 at 10:33 PM, Thomas Munro
 wrote:
> On Sat, Dec 17, 2016 at 5:41 AM, Robert Haas  wrote:
>> On Wed, Dec 14, 2016 at 3:25 PM, Robert Haas  wrote:
>>> Thoughts?
>>
>> Hearing no objections, I've gone ahead and committed this.  If that
>> makes somebody really unhappy I can revert it, but I am betting that
>> the real story is that nobody cares about preserving T_ID().
>
> I suppose LWLock could include a uint16 member 'id' without making
> LWLock any larger, since it would replace the padding between
> 'tranche' and 'state'.  But I think a better solution, if anyone
> really wants these T_ID numbers from a dtrace script, would be to add
> lock address to the existing lwlock probes, and then figure out a way
> to add some new probes to report the base and stride in the right
> places so you can do the book keeping in dtrace variables.

Interesting idea.  My bet is that nobody cares about dtrace very much.
probes.d was first added in 2006, and continued to gradually get new
probes (all from submissions by Robert Lor) until 2009.  Since then,
nothing has happened except for perfunctory maintenance by various
committers trying to solve other problems who had to maintain the work
that had already been done whether they cared about it or not.  (I
notice that the probes lwlock-acquire-or-wait and
lwlock-acquire-or-wait-fail have never been documented.)  I would be
tempted to rip the whole thing out as an attractive nuisance, except
that I bet somebody would complain about that...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning vs. sql_inheritance

2016-12-19 Thread Robert Haas
On Mon, Dec 19, 2016 at 11:48 AM, Alvaro Herrera
 wrote:
> Any particular reason not to change inhOpt to be a simple boolean, and
> remove the enum?

No, no particular reason.  I thought about it, but I didn't really see
any advantage in getting rid of the typedef.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning vs. sql_inheritance

2016-12-19 Thread Alvaro Herrera
Robert Haas wrote:
> On Fri, Dec 16, 2016 at 7:39 PM, Tom Lane  wrote:
> > Peter Eisentraut  writes:
> >> On 12/16/16 11:05 AM, Robert Haas wrote:
> >>> If we were going to do anything about this,
> >>> my vote would be to remove sql_inheritance.
> >
> >> Go for it.
> >
> >> Let's also remove the table* syntax then.
> >
> > Meh --- that might break existing queries, to what purpose?
> >
> > We certainly shouldn't remove query syntax without a deprecation period.
> > I'm less concerned about that for GUCs.
> 
> I agree.  Patch attached, just removing the GUC and a fairly minimal
> amount of the supporting infrastructure.

Any particular reason not to change inhOpt to be a simple boolean, and
remove the enum?

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-19 Thread Alvaro Herrera
Fujii Masao wrote:

> Regarding this feature, there are some loose ends. We should work on
> and complete them before the release.

Please list these in https://wiki.postgresql.org/wiki/Open_Items so that we
don't forget.


-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] invalid combination of options "-D - -F t -X stream" in pg_basebackup

2016-12-19 Thread Magnus Hagander
On Mon, Dec 19, 2016 at 5:39 PM, Fujii Masao  wrote:

> Hi,
>
> Isn't it better to forbid the combination of the options "-D -", "-F t" and
> "-X stream" in pg_basebackup? This is an obviously invalid setting, and the
> docs warn about it as follows. But currently users can specify this
> combination and pg_basebackup can exit unexpectedly with an error.
>
> ---
> If the value - (dash) is specified as target directory, the tar contents
> will
> be written to standard output, suitable for piping to for example gzip.
> This is only possible if the cluster has no additional tablespaces.
> ---
>

Yes, definitely. I'd say that's an oversight in the implementation of
stream-to-tar support that this combination is not detected.

Do you want to provide a patch for it, or should I?
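
Something along these lines, perhaps -- note that the variable names below
are placeholders rather than the ones pg_basebackup.c actually uses; this is
just a self-contained illustration of the check:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int
  main(void)
  {
      char        format = 't';       /* -F t          */
      const char *basedir = "-";      /* -D - (stdout) */
      int         stream_wal = 1;     /* -X stream     */

      if (format == 't' && strcmp(basedir, "-") == 0 && stream_wal)
      {
          fprintf(stderr,
                  "pg_basebackup: cannot stream WAL in tar mode to stdout\n");
          exit(1);
      }
      return 0;
  }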

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


[HACKERS] invalid combination of options "-D - -F t -X stream" in pg_basebackup

2016-12-19 Thread Fujii Masao
Hi,

Isn't it better to forbid the combination of the options "-D -", "-F t" and
"-X stream" in pg_basebackup? This is an obviously invalid setting, and the docs
warn about it as follows. But currently users can specify this combination and
pg_basebackup can exit unexpectedly with an error.

---
If the value - (dash) is specified as target directory, the tar contents will
be written to standard output, suitable for piping to for example gzip.
This is only possible if the cluster has no additional tablespaces.
---

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background contrib module proposal

2016-12-19 Thread Andrew Borodin
2016-12-19 4:21 GMT+05:00 David Fetter :
> Couldn't it sleep in increments smaller than a second?  Like maybe
> milliseconds?  Also, it's probably cleaner (or at least more
> comprehensible) to write something using format() and dollar quoting
> than the line with the double 's.

Right. Here's a version which waits for at most half a second. I do not see
how to pass doubles via dollar quoting; pg_background_launch() gets a string
as a parameter, it's not EXECUTE USING.

Best regards, Andrey Borodin.
diff --git a/contrib/pg_background/Makefile b/contrib/pg_background/Makefile
index c4e717d..085fbff 100644
--- a/contrib/pg_background/Makefile
+++ b/contrib/pg_background/Makefile
@@ -6,6 +6,8 @@ OBJS = pg_background.o
 EXTENSION = pg_background
 DATA = pg_background--1.0.sql
 
+REGRESS = pg_background
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pg_background/expected/pg_background.out 
b/contrib/pg_background/expected/pg_background.out
new file mode 100644
index 000..a6613ce
--- /dev/null
+++ b/contrib/pg_background/expected/pg_background.out
@@ -0,0 +1,31 @@
+CREATE EXTENSION pg_background;
+--run 6 workers which wait .0, .1, .2, .3, .4, .5 seconds respectively
+CREATE TABLE input as SELECT x FROM generate_series(0,.5,0.1) x ORDER BY x 
DESC;
+CREATE TABLE output(place int,value float);
+--sequence for indication of who's finished on which place
+CREATE sequence s start 1;
+CREATE TABLE handles as SELECT pg_background_launch(format('select 
pg_sleep(%s); insert into output values (nextval(''s''),%s);',x,x)) h FROM 
input;
+--wait until everyone finishes
+SELECT (SELECT * FROM pg_background_result(h) as (x text) limit 1) FROM 
handles;
+x 
+--
+ SELECT 1
+ SELECT 1
+ SELECT 1
+ SELECT 1
+ SELECT 1
+ SELECT 1
+(6 rows)
+
+--output results
+SELECT * FROM output ORDER BY place;
+ place | value 
+---+---
+ 1 | 0
+ 2 |   0.1
+ 3 |   0.2
+ 4 |   0.3
+ 5 |   0.4
+ 6 |   0.5
+(6 rows)
+
diff --git a/contrib/pg_background/sql/pg_background.sql 
b/contrib/pg_background/sql/pg_background.sql
new file mode 100644
index 000..770f0b8
--- /dev/null
+++ b/contrib/pg_background/sql/pg_background.sql
@@ -0,0 +1,12 @@
+CREATE EXTENSION pg_background;
+
+--run 6 workers which wait .0, .1, .2, .3, .4, .5 seconds respectively
+CREATE TABLE input as SELECT x FROM generate_series(0,.5,0.1) x ORDER BY x 
DESC;
+CREATE TABLE output(place int,value float);
+--sequence for indication of who's finished on which place
+CREATE sequence s start 1;
+CREATE TABLE handles as SELECT pg_background_launch(format('select 
pg_sleep(%s); insert into output values (nextval(''s''),%s);',x,x)) h FROM 
input;
+--wait until everyone finishes
+SELECT (SELECT * FROM pg_background_result(h) as (x text) limit 1) FROM 
handles;
+--output results
+SELECT * FROM output ORDER BY place;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Retire src/backend/port/dynloader/linux.c ?

2016-12-19 Thread Andres Freund
On 2016-12-18 22:19:36 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Shouldn't we just remove that code?
> 
> What for?

Every now and then I end up looking at it for a few minutes, wondering
what the hell dld is, just to see that it's old stuff.

> It's maintenance-free ... hasn't been touched since 2004.
> While I agree with you that it's *probably* dead code, it's hard to
> see much upside from removing it.
> 
> If we want to get into arguing whether code is dead or not, there's
> an awful lot of potentially removable stuff in the tree, but I doubt
> it's worth the trouble to figure out what's really dead.

Well, it's sometimes annoying, while looking around, to hit code that's
actually dead. But I don't feel that strongly about it.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw bug in 9.6

2016-12-19 Thread Tom Lane
Etsuro Fujita  writes:
> On 2016/12/17 1:13, Tom Lane wrote:
>> So I think the rule could be

>> "When first asked to produce a path for a given foreign joinrel, collect
>> the cheapest paths for its left and right inputs, and make a nestloop path
>> (or hashjoin path, if full join) from those, using the join quals needed
>> for the current input relation pair.

> Seems reasonable.

>> Use this as the fdw_outerpath for
>> all foreign paths made for the joinrel."

> I'm not sure that would work well for foreign joins with sort orders. 
> Consider a merge join, whose left input is a 2-way foreign join with a 
> sort order that implements a full join and whose right input is a sorted 
> local table scan.  If the EPQ subplan for the foreign join wouldn't 
> produce the right sort order, the merge join might break during EPQ 
> rechecks (note that in this case the EPQ subplan for the foreign join 
> might produce more than a single row during an EPQ recheck).

How so?  We only recheck one row at a time, therefore it can be claimed to
have any sort order you care about.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-12-19 Thread Steve Singer

On 12/18/2016 09:04 PM, Petr Jelinek wrote:

On 18/12/16 19:02, Steve Singer wrote:


pg_dump is also generating warnings

pg_dump: [archiver] WARNING: don't know how to set owner for object type
SUBSCRIPTION

I know that the plan is to add proper ACL's for publications and
subscriptions later. I don't know if we want to leave the warning in
until then or do something about it.


No, ACLs are separate from owner. This is a thinko on my side. I was
thinking we can live without ALTER ... OWNER TO for now, but we actually
need it for pg_dump and for REASSIGN OWNED. So now I added the OWNER TO
for both PUBLICATION and SUBSCRIPTION.



When I try to restore my pg_dump with publications I get

./pg_dump  -h localhost --port 5440 test |./psql -h localhost --port 
5440 test2



ALTER TABLE
CREATE PUBLICATION
ERROR:  unexpected command tag "PUBLICATION

This comes from:
ALTER PUBLICATION mypub OWNER TO ssinger;


Does the OWNER TO clause need to be added to AlterPublicationStmt
instead of AlterOwnerStmt?
Also, we should update the tab completion for ALTER PUBLICATION to show the
OWNER TO option, plus the \h help in psql and the reference SGML.






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-19 Thread Fujii Masao
On Sun, Dec 18, 2016 at 9:36 PM, Michael Paquier
 wrote:
> On Fri, Dec 16, 2016 at 10:42 PM, Fujii Masao  wrote:
>> Attached is the modified version of the patch. Barring objections, I will
>> commit this version.
>
> There is a whitespace:
> $ git diff master --check
> src/backend/replication/syncrep.c:39: trailing whitespace.
> + *

Okey, pushed the patch with this fix. Thanks!

Regarding this feature, there are some loose ends. We should work on
and complete them before the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
priority-based sync replication is chosen, to keep backward
compatibility. However, some hackers argued to change this decision
so that a quorum commit is chosen instead, because they think that most
users prefer a quorum (see the examples after this list).

(2)
There are still many source comments and documentation sections that
we need to update, for example in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit, though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If so, it might be better to always assign 1 as
the priority, for example.
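
For reference on point (1), the three spellings in question look like this
(standby names made up):

  synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'   # priority-based
  synchronous_standby_names = 'ANY 2 (s1, s2, s3)'     # quorum
  synchronous_standby_names = '2 (s1, s2, s3)'         # no keyword; currently treated as FIRST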


Any other?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw bug in 9.6

2016-12-19 Thread Etsuro Fujita

On 2016/12/19 13:59, Ashutosh Bapat wrote:

On Fri, Dec 16, 2016 at 9:43 PM, Tom Lane  wrote:

Etsuro Fujita  writes:

On 2016/12/16 11:25, Etsuro Fujita wrote:

As I said upthread, an alternative I am thinking is (1) to create an
equivalent nestloop join path using inner/outer paths of a foreign join
path, except when that join path implements a full join, in which case a
merge/hash join path is used, (2) store it in fdw_outerpath as-is, and
(3) during an EPQ recheck, apply postgresRecheckForeignScan to the outer
subplan created from the fdw_outerpath as-is.  What do you think about
that?



Let me explain about #1 and #2 a bit more.  What I have in mind is:



* modify postgresGetForeignPaths so that a simple foreign table scan
path is stored into the base relation's PgFdwRelationInfo.
* modify postgresGetForeignJoinPaths so that
 (a) a local join path for a 2-way foreign join is created using
simple foreign table scan paths stored in the base relations'
PgFdwRelationInfos, and stored into the join relation's PgFdwRelationInfo.
 (b) a local join path for a 3-way foreign join, whose left input is
a 2-way foreign join, is created using a local join path stored in the
left input join relation's PgFdwRelationInfo and a simple foreign table
scan path stored into the right input base relation's PgFdwRelationInfo.
 (c) Likewise for higher level foreign joins.
 (d) local join paths created are passed to create_foreignscan_path
and stored into the fdw_outerpaths of the resulting ForeignPahts.



Hm, isn't this overcomplicated?  As you said earlier, we don't really
care all that much whether the fdw_outerpath is free of lower foreign
joins, because EPQ setup will select those lower join's substitute EPQ
plans anyway.  All that we need is that the EPQ tree be a legal
implementation of the join order with join quals applied at the right
places.  So I think the rule could be

"When first asked to produce a path for a given foreign joinrel, collect
the cheapest paths for its left and right inputs, and make a nestloop path
(or hashjoin path, if full join) from those, using the join quals needed
for the current input relation pair.  Use this as the fdw_outerpath for
all foreign paths made for the joinrel."



We could use fdw_outerpath of the left and right inputs instead of
looking for the cheapest paths. For base relations as left or right
relations, use the unparameterized cheapest foreign path. This will
ensure that all joins in fdw_outerpath are local joins, making EPQ
checks slightly efficient by avoiding redirection at every foreign
path.


That seems close to what I had in mind, described above.  One additional 
piece of work required for that would be to store the fdw_outerpath into the 
RelOptInfo's fdw_private (such as PgFdwRelationInfo) before calling add_path 
for the foreign-join path containing the fdw_outerpath, because add_path 
might reject the foreign-join path.  I think that additional work would 
make this rather complicated.


Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Protect syscache from bloating with negative cache entries

2016-12-19 Thread Kyotaro HORIGUCHI
Hello, recently one of my customers stumbled over immoderate
catcache bloat.

This is a known issue living on the Todo page in the PostgreSQL
wiki.

https://wiki.postgresql.org/wiki/Todo#Cache_Usage
> Fix memory leak caused by negative catcache entries

https://www.postgresql.org/message-id/51c0a1ff.2050...@vmware.com

This patch addresses the two cases of syscache bloat by using
invalidation callback mechanism.


Overview of the patch

The bloat is caused by negative cache entries in catcaches. They
are crucial for performance but it is a problem that there's no
way to remove them. They last for the backends' lifetime.

The first patch provides a means to flush catcache negative
entries, then defines a relcache invalidation callback to flush
negative entries in syscaches for pg_statistic(STATRELATTINH) and
pg_attribute(ATTNAME, ATTNUM).  The second patch implements a
syscache invalidation callback so that deletion of a schema
causes a flush for pg_class (RELNAMENSP).

Neither of the above is hard-coded; both are defined in cacheinfo
using four additional members.


Remaining problems

Still, the catcache can bloat by repeatedly accessing non-existent
tables with unique names in a permanently-living schema, but that
seems a bit too artificial (or malicious). Since such negative
entries have no event to trigger their removal, caps would be needed
to prevent them from bloating syscaches, but reasonable limits seem
hard to determine.


Defects or disadvantages

This patch scans the whole target catcache to find negative
entries to remove, which might take a (comparatively) long time on a
catcache with many entries. With the second patch, unrelated
negative entries may also be flushed, since they are keyed
by hash value, not by the exact key values.



The attached files are the following.

1. 0001-Cleanup-negative-cache-of-pg_statistic-when-dropping.patch
   Negative entry flushing by relcache invalidation using
   relcache invalidation callback.

2. 0002-Cleanup-negative-cache-of-pg_class-when-dropping-a-s.patch
   Negative entry flushing by catcache invalidation using
   catcache invalidation callback.

3. gen.pl
   a test script for STATRELATTINH bloating.

4. gen2.pl
   a test script for RELNAMENSP bloating.

3 and 4 are used as the following,

./gen.pl | psql postgres > /dev/null
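
The attached scripts are the authoritative reproducers; the kind of
workload being described is roughly the following (a sketch in plain SQL,
not the attached script itself):

  DO $$
  BEGIN
      FOR i IN 1..10000 LOOP
          EXECUTE format('CREATE TEMP TABLE t%s (a int, b int)', i);
          -- planning this lookup leaves negative pg_statistic entries behind
          EXECUTE format('SELECT * FROM t%s WHERE a = 0', i);
          EXECUTE format('DROP TABLE t%s', i);
      END LOOP;
  END;
  $$;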



regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 203735e13d2b8584e1ddc652b602465a4f839355 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 15 Dec 2016 17:43:03 +0900
Subject: [PATCH 1/2] Cleanup negative cache of pg_statistic when dropping a
 relation.

Accessing columns that don't have statistics leaves negative
entries in the catcache for pg_statistic, but there's no chance to remove
them. In particular, repeatedly creating then dropping temporary
tables bloats the catcache so much that memory pressure can become
significant. This patch removes negative entries in STATRELATTINH,
ATTNAME and ATTNUM when the corresponding relation is dropped.
---
 src/backend/utils/cache/catcache.c |  57 +++-
 src/backend/utils/cache/syscache.c | 274 +++--
 src/include/utils/catcache.h   |   3 +
 src/include/utils/syscache.h   |   2 +
 4 files changed, 263 insertions(+), 73 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 6016d19..c1d9d2f 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -304,10 +304,11 @@ CatCachePrintStats(int code, Datum arg)
 
 		if (cache->cc_ntup == 0 && cache->cc_searches == 0)
 			continue;			/* don't print unused caches */
-		elog(DEBUG2, "catcache %s/%u: %d tup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
+		elog(DEBUG2, "catcache %s/%u: %d tup, %d negtup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
 			 cache->cc_relname,
 			 cache->cc_indexoid,
 			 cache->cc_ntup,
+			 cache->cc_nnegtup,
 			 cache->cc_searches,
 			 cache->cc_hits,
 			 cache->cc_neg_hits,
@@ -374,6 +375,10 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 	/* free associated tuple data */
 	if (ct->tuple.t_data != NULL)
 		pfree(ct->tuple.t_data);
+
+	if (ct->negative)
+		--cache->cc_nnegtup;
+
 	pfree(ct);
 
 	--cache->cc_ntup;
@@ -637,6 +642,49 @@ ResetCatalogCache(CatCache *cache)
 }
 
 /*
+ *		CleanupCatCacheNegEntries
+ *
+ *	Remove negative cache tuples matching a partial key.
+ *
+ */
+void
+CleanupCatCacheNegEntries(CatCache *cache, ScanKeyData *skey)
+{
+	int i;
+
+	/* If this cache has no negative entries, nothing to do */
+	if (cache->cc_nnegtup == 0)
+		return;
+
+	/* searching with a partial key needs scanning the whole cache */
+	for (i = 0; i < cache->cc_nbuckets; i++)
+	{
+		dlist_head *bucket = &cache->cc_bucket[i];
+		dlist_mutable_iter iter;
+
+		dlist_foreach_modify(iter, bucket)
+		{
+			CatCTup*ct = dlist_container(CatCTup, cache_elem, iter.cur);
+			bool	

Re: [HACKERS] Measuring replay lag

2016-12-19 Thread Thomas Munro
On Mon, Dec 19, 2016 at 4:03 PM, Peter Eisentraut
 wrote:
> On 11/22/16 4:27 AM, Thomas Munro wrote:
>> Thanks very much for testing!  New version attached.  I will add this
>> to the next CF.
>
> I don't see it there yet.

Thanks for the reminder.  Added here:  https://commitfest.postgresql.org/12/920/

Here's a rebased patch.

-- 
Thomas Munro
http://www.enterprisedb.com


replay-lag-v14.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make pg_basebackup -x stream the default

2016-12-19 Thread Vladimir Rusinov
On Sat, Dec 17, 2016 at 2:37 PM, Magnus Hagander 
wrote:

> Attached is an updated patch that does this. As a bonus it simplifies the
> code a bit. I also fixed an error message that I missed updating in the
> previous patch.


looks good to me.

Still applies cleanly at head with all tests pass.
I have no further comments, although since I'm not an experienced Postgres
reviewer, a committer may want to take another look before merging.

-- 
Vladimir Rusinov
Storage SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [HACKERS] Measuring replay lag

2016-12-19 Thread Simon Riggs
On 26 October 2016 at 11:34, Thomas Munro  wrote:

> It works by taking advantage of the { time, end-of-WAL } samples that
> sending servers already include in message headers to standbys.  That
> seems to provide a pretty good proxy for when the WAL was written, if
> you ignore messages where the LSN hasn't advanced.  The patch
> introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
> control how often the standby should record these timestamped LSNs
> into a small circular buffer.  When its recovery process eventually
> replays a timestamped LSN, the timestamp is sent back to the upstream
> server in a new reply message field.  The value visible in
> pg_stat_replication.replay_lag can then be updated by comparing with
> the current time.

Why not just send back the lag as calculated by max_standby_streaming_delay?
I.e. at the end of replay of each chunk, record the current delay in
shmem, then send it back periodically.

If we have two methods of calculation it would be confusing.

Admittedly the approach here is the same one I advocated some years
back when Robert and I were discussing time-delayed standbys.

> Compared to the usual techniques people use to estimate replay lag,
> this approach has the following advantages:
>
> 1.  The lag is measured in time, not LSN difference.
> 2.  The lag time is computed using two observations of a single
> server's clock, so there is no clock skew.
> 3.  The lag is updated even between commits (during large data loads etc).

Yes, good reasons.

> In the previous version I was effectively showing the ping time
> between the servers during idle times when the standby was fully
> caught up because there was nothing happening.  I decided that was not
> useful information and that it's more newsworthy and interesting to
> see the estimated replay lag for the most recent real replayed
> activity, so I changed that.
>
> In the last thread[1], Robert Haas wrote:
>> Well, one problem with this is that you can't put a loop inside of a
>> spinlock-protected critical section.
>
> Fixed.
>
>> In general, I think this is a pretty reasonable way of attacking this
>> problem, but I'd say it's significantly under-commented.  Where should
>> someone go to get a general overview of this mechanism?  The answer is
>> not "at place XXX within the patch".  (I think it might merit some
>> more extensive documentation, too, although I'm not exactly sure what
>> that should look like.)
>
> I have added lots of comments.
>
>> When you overflow the buffer, you could thin in out in a smarter way,
>> like by throwing away every other entry instead of the oldest one.  I
>> guess you'd need to be careful how you coded that, though, because
>> replaying an entry with a timestamp invalidates some of the saved
>> entries without formally throwing them out.
>
> Done, by overwriting the newest sample rather than the oldest if the
> buffer is full.  That seems to give pretty reasonable degradation,
> effectively lowering the sampling rate, without any complicated buffer
> or rate management code.
>
>> Conceivably, 0002 could be split into two patches, one of which
>> computes "stupid replay lag" considering only records that naturally
>> carry timestamps, and a second adding the circular buffer to handle
>> the case where much time passes without finding such a record.
>
> I contemplated this but decided that it'd be best to use ONLY samples
> from walsender headers, and never use the time stamps from commit
> records for this.  If we use times from commit records, then a
> cascading sending server will not be able to compute the difference in
> time without introducing clock skew (not to mention the difficulty of
> combining timestamps from two sources if we try to do both).  I
> figured that it's better to have value that shows a cascading
> sender->standby->cascading sender round trip time that is free of
> clock skew, than a master->cascading sender->standby->cascading sender
> incomplete round trip that includes clock skew.
>
> By the same reasoning I decided against introducing a new periodic WAL
> record or field from the master to hold extra time stamps in between
> commits to do this, in favour of the buffered transient timestamp
> approach I took in this patch.  That said, I can see there are
> arguments for doing it with extra periodic WAL timestamps, if people
> don't think it'd be too invasive to mess with the WAL for this, and
> don't care about cascading standbys giving skewed readings.  One
> advantage would be that persistent WAL timestamps would still be able
> to provide lag estimates if a standby has been down for a while and
> was catching up, and this approach can't until it's caught up due to
> lack of buffered transient timestamps.  Thoughts?

-1 for adding anything to the WAL for this.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] increasing the default WAL segment size

2016-12-19 Thread Beena Emerson
Hello all,

Please find attached a patch to make wal segment size initdb configurable.

The attached patch removes the --with-wal-segsize configure option and adds a
new initdb option --wal-segsize. initdb passes the wal-segsize value into an
environment variable, which is used to overwrite the GUC value
wal_segment_size and set the internal variables XLogSegSize and
XLOG_SEG_SIZE (xlog_internal.h). The default wal_segment_size is not
changed, but I have increased the maximum size to 1GB.
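
A usage sketch, assuming the option is spelled as described (the path, the
value, and its unit here are only illustrative):

  initdb -D /path/to/data --wal-segsize=64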

Since XLOG_SEG_SIZE is now variable, it can no longer be used directly in
src/bin modules and a few macros, so a few changes had to be made:
  -  in guc.c, remove GUC_UNIT_XSEGS, which used XLOG_SEG_SIZE, and
introduce show functions for the GUCs which used the unit (min_wal_size and
max_wal_size).
  -  For pg_basebackup, add new replication command SHOW_WAL_SEGSZ to fetch
the wal_segment_size in bytes.
  - pg_controldata, pg_resetxlog, pg_rewind, fetch the xlog_seg_size from
the ControlFile.
  - Since pg_xlogdump reads the wal files, it uses the file size to
determine the xlog_seg_size.
  - In pg_test_fsync, a buffer of size XLOG_SEG_SIZE was created, filled
with random data and written to a temporary file to check for any
write/fsync error before performing the tests. Since it does not affect the
actual performance results, the XLOG_SEG_SIZE in the module is replaced
with the default value (16MB).

Please note that the documentation is not updated in this patch.

Feedback and suggestions are welcome.
--

Beena Emerson

Have a Great Day!


initdb-walsegsize_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow I/O can break throttling of base backup

2016-12-19 Thread Magnus Hagander
On Fri, Dec 16, 2016 at 11:24 AM, Antonin Houska  wrote:

> Antonin Houska  wrote:
>
> > It seems to be my bug. I'll check tomorrow.
>
> I could reproduce the problem by adding sufficient sleep time to the
> loop.
>
> > Magnus Hagander  wrote:
> >> I wonder if the else if (sleep > 0) at the bottom of throttle() should
> just
> >> be a simple else and always run, resetting last_throttle?
>
> I agree. In fact, I could simplify the code even more.
>
> Since (elapsed + sleep) almost equals to GetCurrentIntegerTimestamp(), we
> can
> use the following statement unconditionally (I think I tried too hard to
> avoid
> calling GetCurrentIntegerTimestamp too often in the original patch):
>
> throttled_last = GetCurrentIntegerTimestamp();
>
> Thus we can also get rid of the "else" branch that clears both "sleep" and
> "wait_result".
>
> (The patch contains minor comment refinement that I found useful when
> seeing
> the code after years.)
>
>
Thanks! Applied and backpatched.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Logical Replication WIP

2016-12-19 Thread Petr Jelinek
On 19/12/16 08:04, Erik Rijkers wrote:
> On 2016-12-18 11:12, Petr Jelinek wrote:
> 
> (now using latest: patchset:)
> 
> 0001-Add-PUBLICATION-catalogs-and-DDL-v14.patch
> 0002-Add-SUBSCRIPTION-catalog-and-DDL-v14.patch
> 0003-Define-logical-replication-protocol-and-output-plugi-v14.patch
> 0004-Add-logical-replication-workers-v14.patch
> 0005-Add-separate-synchronous-commit-control-for-logical--v14.patch
> 
>> BTW you don't need to add primary key to pgbench_history. Simply ALTER
>> TABLE pgbench_history REPLICA IDENTITY FULL; should be enough.
> 
> Either should, but neither is.
> 
> set-up:
> Before creating the publication/subscription:
> On master I run   pgbench -qis 1,  then set replica identity (and/or add
> serial column) for pgbench_history, then dump/restore the 4 pgbench
> tables from master to replica.
> Then I enable publication/subscription.  The logs look fine.  (Other tests 
> I've devised earlier (on other tables) still work nicely.)
> 
> Now when I do a pgbench-run on master, something like:
> 
>pgbench -c 1 -T 20 -P 1
> 
> I often see this (when running pgbench):
> 
> ERROR:  publisher does not send replica identity column expected by the
> logical replication target public.pgbench_tellers
> or, sometimes  (less often) the same ERROR for pgbench_accounts appears
> (as in the subsciber-log below)
> 
> -- publisher log
> 2016-12-19 07:44:22.738 CET 22690 LOG:  logical decoding found
> consistent point at 0/14598C78
> 2016-12-19 07:44:22.738 CET 22690 DETAIL:  There are no running
> transactions.
> 2016-12-19 07:44:22.738 CET 22690 LOG:  exported logical decoding
> snapshot: "000130FA-1" with 0 transaction IDs
> 2016-12-19 07:44:22.886 CET 22729 LOG:  starting logical decoding for
> slot "sub1"
> 2016-12-19 07:44:22.886 CET 22729 DETAIL:  streaming transactions
> committing after 0/14598CB0, reading WAL from 0/14598C78
> 2016-12-19 07:44:22.886 CET 22729 LOG:  logical decoding found
> consistent point at 0/14598C78
> 2016-12-19 07:44:22.886 CET 22729 DETAIL:  There are no running
> transactions.
> 2016-12-19 07:45:25.568 CET 22729 LOG:  could not receive data from
> client: Connection reset by peer
> 2016-12-19 07:45:25.568 CET 22729 LOG:  unexpected EOF on standby
> connection
> 2016-12-19 07:45:25.580 CET 26696 LOG:  starting logical decoding for
> slot "sub1"
> 2016-12-19 07:45:25.580 CET 26696 DETAIL:  streaming transactions
> committing after 0/1468E0D0, reading WAL from 0/1468DC90
> 2016-12-19 07:45:25.589 CET 26696 LOG:  logical decoding found
> consistent point at 0/1468DC90
> 2016-12-19 07:45:25.589 CET 26696 DETAIL:  There are no running
> transactions.
> 
> -- subscriber log
> 2016-12-19 07:44:22.878 CET 17027 LOG:  starting logical replication
> worker for subscription 24581
> 2016-12-19 07:44:22.883 CET 22726 LOG:  logical replication apply for
> subscription sub1 started
> 2016-12-19 07:45:11.069 CET 22726 WARNING:  leaked hash_seq_search scan
> for hash table 0x2def1a8
> 2016-12-19 07:45:25.566 CET 22726 ERROR:  publisher does not send
> replica identity column expected by the logical replication target
> public.pgbench_accounts
> 2016-12-19 07:45:25.568 CET 16984 LOG:  worker process: logical
> replication worker 24581 (PID 22726) exited with exit code 1
> 2016-12-19 07:45:25.568 CET 17027 LOG:  starting logical replication
> worker for subscription 24581
> 2016-12-19 07:45:25.574 CET 26695 LOG:  logical replication apply for
> subscription sub1 started
> 2016-12-19 07:46:10.950 CET 26695 WARNING:  leaked hash_seq_search scan
> for hash table 0x2def2c8
> 2016-12-19 07:46:10.950 CET 26695 WARNING:  leaked hash_seq_search scan
> for hash table 0x2def2c8
> 2016-12-19 07:46:10.950 CET 26695 WARNING:  leaked hash_seq_search scan
> for hash table 0x2def2c8
> 
> 
> Sometimes  replication (caused by a pgbench run)  runs for a few seconds
> replicating all 4 pgbench tables correctly, but never longer than 10 to
> 20 seconds.
> 
> If you cannot reproduce it with the provided info, I will make a more
> precise setup description, but it's so solidly failing here that I hope
> that won't be necessary.
> 

Hi,

Nope, I can't reproduce that. I can reproduce the leaked hash_seq_search;
the attached fixes that. But there are no issues with replication itself.

The error basically means that the pkey on publisher and subscriber isn't
the same.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index d795c5b..7ab94f5 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -59,6 +59,7 @@ logicalrep_relmap_invalidate_cb(Datum arg, Oid reloid)
if (entry->reloid == reloid)
{
entry->reloid = InvalidOid;
+   hash_seq_term(&status);
break;