Re: Should we add GUCs to allow partition pruning to be disabled?
On Thu, Apr 11, 2019 at 03:34:30PM +1200, David Rowley wrote: > On Thu, 21 Mar 2019 at 00:51, David Rowley > wrote: > > Just so I don't forget about this, I've added it to the July 'fest. > > > > https://commitfest.postgresql.org/23/2065/ > > Now that we have 428b260f8, I think the version of this that goes into > master should be more like the attached. I tweaked this patch some more (sorry): - remove "currently" since that's not expected to be changed (right?); - remove "especially"; - refer to "partition hierarchies" not "partitioning hierarchies"; - rewrite the bit about "When partition pruning is not possible" Also, I noticed a while ago while grepping for "probably be fixed in future releases" that some items under ddl-inherit-caveats are actually possible for relkind=p partitions in v11. I assume those will never be implemented for inheritance partitioning, so I propose another update to docs (if preferred, I'll bring up on a new thread). - unique constraints on parent table; - FK constraints on parent table; Note that FK constraints *referencing* a partitioned table are possible in v12 but not in v11. So if there's any finer-grained update to documentation of the individual limitations, it'd need to be tweaked for back branches (v10 and 11). Justin >From 3a787b95f5a35b53cd958855ec6fc4ff9fc9a455 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 11 Apr 2019 00:24:44 -0500 Subject: [PATCH v1 1/2] Reinstate warnings regarding large hierarchies Put back warnings regarding high planning time and/or RAM use for large inheritance hierarchies, and high planning time for large numbers of partitions not pruned during planning with declaratively partitioned tables.
Discussion: https://www.postgresql.org/message-id/CAKJS1f8RW-mHQ8aEWD5Dv0%2B8A1wH5tHHdYMGW9y5sXqnE0X9wA%40mail.gmail.com https://commitfest.postgresql.org/23/2065/ Author: Robert Haas, David Rowley Reviewed by: Amit Langote, Justin Pryzby --- doc/src/sgml/ddl.sgml | 20 1 file changed, 20 insertions(+) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 244d5ce..83cbc66 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3582,6 +3582,26 @@ VALUES ('Albany', NULL, NULL, 'NY'); offer flexibility but do not have some of the performance benefits of built-in declarative partitioning. + + + + When using table inheritance, partition hierarchies with more than a few + hundred partitions are not recommended. Larger partition hierarchies may + incur long planning time, and, in the case of UPDATE + and DELETE, excessive memory usage. When inheritance + is used, see also the limitations described in + . + + + + When using declarative partitioning, the overhead of query planning + is directly related to the number of unpruned partitions. Planning is + generally fast with small numbers of unpruned partitions, even in + partition hierarchies containing many thousands of partitions. However, + long planning time will be incurred by large partition hierarchies if + partition pruning is not possible during the planning phase. + + -- 2.1.4 >From 6bd80e7cdddc3c9552d44439b4b8e9843c1007e4 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 11 Apr 2019 00:22:56 -0500 Subject: [PATCH v1 2/2] Document features of declarative partitioning.. ..which will never be implemented for legacy inheritance. 
--- doc/src/sgml/ddl.sgml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 83cbc66..3495a66 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3450,8 +3450,9 @@ VALUES ('Albany', NULL, NULL, 'NY'); - These deficiencies will probably be fixed in some future release, - but in the meantime considerable care is needed in deciding whether + Some functionality not implemented for inheritance hierarchies is + implemented for declarative partitioning. + Considerable care is needed in deciding whether partitioning with legacy inheritance is useful for your application. -- 2.1.4
Re: bug in update tuple routing with foreign partitions
Fujita-san, Thanks for the review. On 2019/04/10 17:38, Etsuro Fujita wrote: > (2019/03/06 18:33), Amit Langote wrote: >> I noticed a bug with how UPDATE tuple routing initializes ResultRelInfos >> to use for partition routing targets. Specifically, the bug occurs when >> UPDATE targets include a foreign partition that is locally modified (as >> opposed to being modified directly on the remote server) and its >> ResultRelInfo (called subplan result rel in the source code) is reused for >> tuple routing: > >> The solution is to teach ExecSetupPartitionTupleRouting to avoid using a >> subplan result rel if its ri_FdwState is already set. > > I'm not sure that is a good idea, because that requires to create a new > ResultRelInfo, which is not free; I think it requires a lot of work. After considering this a bit more and studying your proposed solution, I tend to agree. Besides the performance considerations you point out that I too think are valid, I realize that going with my approach would mean embedding some assumptions in the core code about ri_FdwState; in this case, assuming that a subplan result rel's ri_FdwState is not to be re-purposed for tuple-routing inserts, which is only based on looking at postgres_fdw's implementation. Not to mention the ensuing complexity in the core tuple routing code -- it now needs to account for the fact that not all subplan result rels may have been re-purposed for tuple-routing, something that's too complicated to handle in PG 11. IOW, let's call this a postgres_fdw bug and fix it there as you propose. > Another solution to avoid that would be to store the fmstate created by > postgresBeginForeignInsert() into the ri_FdwState, not overwrite the > ri_FdwState, like the attached. This would not need any changes to the > core, and this would not cause any overheads either, IIUC. What do you > think about that? +1.
Just one comment: +/* + * If the given resultRelInfo already has PgFdwModifyState set, it means + * the foreign table is an UPDATE subplan resultrel; in which case, store + * the resulting state into the PgFdwModifyState. + */ The last "PgFdwModifyState" should be something else? Maybe, sub-PgModifyState or sub_fmstate? >> I've also added a >> test in postgres_fdw.sql to exercise this test case. > > Thanks again, but the test case you added works well without any fix: Oops, I should really adopt a habit of adding a test case before code. > +-- Check case where the foreign partition is a subplan target rel and > +-- foreign parttion is locally modified (target table being joined > +-- locally prevents a direct/remote modification plan). > +explain (verbose, costs off) > +update utrtest set a = 1 from (values (1), (2)) s(x) where a = s.x > returning *; > + QUERY PLAN > +-- > > + Update on public.utrtest > + Output: remp.a, remp.b, "*VALUES*".column1 > + Foreign Update on public.remp > + Remote SQL: UPDATE public.loct SET a = $2 WHERE ctid = $1 RETURNING > a, b > + Update on public.locp > + -> Hash Join > + Output: 1, remp.b, remp.ctid, "*VALUES*".*, "*VALUES*".column1 > + Hash Cond: (remp.a = "*VALUES*".column1) > + -> Foreign Scan on public.remp > + Output: remp.b, remp.ctid, remp.a > + Remote SQL: SELECT a, b, ctid FROM public.loct FOR UPDATE > + -> Hash > + Output: "*VALUES*".*, "*VALUES*".column1 > + -> Values Scan on "*VALUES*" > + Output: "*VALUES*".*, "*VALUES*".column1 > + -> Hash Join > + Output: 1, locp.b, locp.ctid, "*VALUES*".*, "*VALUES*".column1 > + Hash Cond: (locp.a = "*VALUES*".column1) > + -> Seq Scan on public.locp > + Output: locp.b, locp.ctid, locp.a > + -> Hash > + Output: "*VALUES*".*, "*VALUES*".column1 > + -> Values Scan on "*VALUES*" > + Output: "*VALUES*".*, "*VALUES*".column1 > +(24 rows) > > In this test case, the foreign partition is updated before ri_FdwState is > overwritten by postgresBeginForeignInsert() that's invoked by the tuple > 
routing code, so it would work without any fix. So I modified this test > case as such. Thanks for fixing that. I hadn't noticed that foreign partition remp is updated before local partition locp; should've added a locp0 for values (0) so that remp appears second in the ModifyTable's subplans. Or, well, make the foreign table remp appear later by making it a partition for values in (3) as your patch does. BTW, have you noticed that the RETURNING clause returns the same row twice? +update utrtest set a = 3 from (values (2), (3)) s(x) where a = s.x returning *; + a | b | x +---+---+--- + 3 | qux triggered ! | 2 + 3 | xyzzy triggered ! | 3 + 3 | qux triggered ! | 3 +(3 rows) You can see that
Re: speeding up planning with partitions
On 2019/04/11 14:03, David Rowley wrote: > On Fri, 5 Apr 2019 at 19:50, Amit Langote > wrote: >> While we're on the topic of the relation between constraint exclusion and >> partition pruning, I'd like to (re-) propose this documentation update >> patch. The partitioning chapter in ddl.sgml says update/delete of >> partitioned tables uses constraint exclusion internally to emulate >> partition pruning, which is no longer true as of 428b260f8. > > Update-docs-that-update-delete-no-longer-use-cons.patch looks good to > me. It should be changed as what the docs say is no longer true. Thanks for the quick review. :) Regards, Amit
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 19:50, Amit Langote wrote: > While we're on the topic of the relation between constraint exclusion and > partition pruning, I'd like to (re-) propose this documentation update > patch. The partitioning chapter in ddl.sgml says update/delete of > partitioned tables uses constraint exclusion internally to emulate > partition pruning, which is no longer true as of 428b260f8. Update-docs-that-update-delete-no-longer-use-cons.patch looks good to me. It should be changed as what the docs say is no longer true. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Zedstore - compressed in-core columnar storage
> On Apr 10, 2019, at 9:08 PM, Mark Kirkwood > wrote: > > >> On 11/04/19 4:01 PM, Mark Kirkwood wrote: >>> On 9/04/19 12:27 PM, Ashwin Agrawal wrote: >>> >>> Heikki and I have been hacking recently for few weeks to implement >>> in-core columnar storage for PostgreSQL. Here's the design and initial >>> implementation of Zedstore, compressed in-core columnar storage (table >>> access method). Attaching the patch and link to github branch [1] to >>> follow along. >>> >>> >> >> Very nice. I realize that it is very early days, but applying this patch >> I've managed to stumble over some compression bugs doing some COPY's: >> >> benchz=# COPY dim1 FROM '/data0/dump/dim1.dat' >> USING DELIMITERS ','; >> psql: ERROR: compression failed. what now? >> CONTEXT: COPY dim1, line 458 >> >> The log has: >> >> 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what >> now? >> 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458 >> 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, >> zedstore_compression.c:287 >> 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM >> '/data0/dump/dim1.dat' >> USING DELIMITERS ','; >> >> The dataset is generated from an old DW benchmark I wrote >> (https://sourceforge.net/projects/benchw/). >> The row concerned looks like: >> >> 457,457th interesting measure,1th measure >> type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp >> 458,458th interesting measure,2th measure >> type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg >> >> >> I'll see if changing to LZ4 makes any difference. >> >> > > The COPY works with LZ4 configured. Thank you for trying it out.
Yes, we noticed that for certain patterns pg_lzcompress() actually requires a much larger output buffer: for one 86-byte source it required a 2296-byte output buffer. The current zedstore code doesn't handle this case and errors out. LZ4 works fine for the same patterns; I'd highly recommend using LZ4 only, as it is very fast as well.
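To illustrate why a compressor can need an output buffer larger than its input, here is a small Python sketch using zlib as a stand-in for pg_lzcompress() (the 86-byte length mirrors the case above; this is only an illustration, not zedstore code):

```python
import hashlib
import zlib

def incompressible_bytes(n, seed=b"zedstore"):
    """Deterministic, effectively random (hence incompressible) byte string."""
    out = bytearray()
    block = seed
    while len(out) < n:
        block = hashlib.sha256(block).digest()
        out.extend(block)
    return bytes(out[:n])

data = incompressible_bytes(86)
compressed = zlib.compress(data, 9)

# For incompressible input the "compressed" form is larger than the
# original, so an output buffer sized to the input would overflow.
print(len(data), len(compressed))
```

Any general-purpose compressor has this worst case, which is why callers either size the output buffer for the maximum expansion (as LZ4_compressBound() does for LZ4) or fall back to storing the data uncompressed.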
Re: Should we add GUCs to allow partition pruning to be disabled?
On 2019/04/11 13:50, David Rowley wrote: > On Thu, 11 Apr 2019 at 16:06, Amit Langote > wrote: >> I've posted a patch last week on the "speed up partition planning" thread >> [1] which modifies ddl.sgml to remove the text about UPDATE/DELETE using >> constraint exclusion under the covers. Do you think there's any merit to >> combining that with this one? > > Probably separate is better. I don't think anything you're proposing > there is for back-patching, but I think the original patch over here > should be. OK, no problem. I just thought to point out my patch because you've posted a version of the patch here for HEAD *because of* 428b260f8, the commit which also obsoleted the text that the other patch fixes. Anyway, let's leave the other patch on its own thread where there are a few other things to be sorted out. Thanks, Amit
Re: Should we add GUCs to allow partition pruning to be disabled?
On Thu, 11 Apr 2019 at 16:06, Amit Langote wrote: > > On 2019/04/11 12:34, David Rowley wrote: > > Now that we have 428b260f8, I think the version of this that goes into > > master should be more like the attached. > > Thanks, looks good. Thanks for looking. > I've posted a patch last week on the "speed up partition planning" thread > [1] which modifies ddl.sgml to remove the text about UPDATE/DELETE using > constraint exclusion under the covers. Do you think there's any merit to > combining that with this one? Probably separate is better. I don't think anything you're proposing there is for back-patching, but I think the original patch over here should be. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Reducing the runtime of the core regression tests
David Rowley writes: > I was surprised to see nothing mentioned about attempting to roughly > sort the test order in each parallel group according to their runtime. I'm confused about what you have in mind here? I'm pretty sure pg_regress launches all the scripts in a group at the same time, so that just rearranging the order they're listed in on the schedule line shouldn't make any noticeable difference. If you meant changing the order of operations within each script, I don't really want to go there. It'd require careful per-script analysis, and to the extent that the existing tests have some meaningful ordering (admittedly, many don't), we'd lose that. >> Thoughts? Anyone object to making these sorts of changes >> post-feature-freeze? > I think it's a good time to do this sort of thing. It should be > easier to differentiate tests destabilising due to this change out > from the noise of other changes that are going in since currently, > the rate of those other changes should not be very high. Doing it any > later in the freeze does not seem better since we might discover some > things that need to be fixed due to this. Yeah. I wouldn't propose pushing this in shortly before beta, but if we do it now then we've probably got a month to sort out any problems that may appear. regards, tom lane
Re: Cleanup/remove/update references to OID column
On Wed, Apr 10, 2019 at 11:59:18AM -0500, Justin Pryzby wrote: > I found and included fixes for a few more references: > > doc/src/sgml/catalogs.sgml | 2 +- > doc/src/sgml/ddl.sgml| 3 +-- > doc/src/sgml/information_schema.sgml | 4 ++-- > doc/src/sgml/ref/create_trigger.sgml | 2 +- > doc/src/sgml/spi.sgml| 2 +- Open Item++. Andres, as this is a consequence of 578b229, could you look at what is proposed here? -- Michael
Re: Proper usage of ndistinct vs. dependencies extended statistics
On Thu, 11 Apr 2019 at 11:53, Paul Martinez wrote: > > I have some questions about the different types of extended statistics > that were introduced in Postgres 10. > - Which types of queries are each statistic type supposed to improve? Multivariate ndistinct stats aim to improve distinct estimates over groups of columns. These can help in cases like GROUP BY a,b, SELECT DISTINCT a,b, SELECT a,b FROM x UNION SELECT a,b FROM y; They also help in determining the number of times an index will be rescanned in cases like nested loops with a parameterised inner path. I see multivariate ndistinct estimates are not used for normal selectivity estimates for unknown values. e.g PREPARE q1 (int, int) AS SELECT * FROM t1 WHERE a = $1 and b = $2; still assumes a and b are independent even when ndistinct stats exist on the two columns. There are a few other usages too. See the calls of estimate_num_groups(). Dependency stats just handle WHERE clauses (or more accurately, clauses containing a reference to a single relation). These only handle equality OpExprs. e.g "a = 10 and y = 3", not "a < 6 and y = 3". Further stat types (most common values) added in PG12 aim to allow inequality operators too. > - When should one type of statistic be used over the other? Should they > both always be used? If they both should always be used then we'd likely not have bothered making the types optional. Both ndistinct and dependency stats are fairly cheap to calculate and store, so it might not be too big an issue adding both types if you're not sure. With these two types there's not really any choice for the planner to decide to use one or the other, it just makes use of the ones it can use for the given situation. That won't be the case as more stats types get added. In PG12, for example, we had to choose whether MCV stats should be applied before dependency stats.
That might be a no-brainer, but perhaps in the future there will be stats types where the order in which to apply them is not so clear, although in those cases it might be questionable why you'd want to define more than one type of stats on the same set of columns. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
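To make the ndistinct case concrete, here is a small Python sketch of why an independence assumption misestimates the number of groups for correlated columns (this shows only the shape of the problem with made-up data, not the planner's actual clamped estimation formula):

```python
# Hypothetical correlated columns: b is fully determined by a, in the
# spirit of city/state style dependencies.
rows = [(a, a % 10) for a in range(1000)]

ndistinct_a = len({a for a, _ in rows})   # distinct values of a: 1000
ndistinct_b = len({b for _, b in rows})   # distinct values of b: 10
actual_groups = len(set(rows))            # what GROUP BY a, b really produces

# Treating the columns as independent suggests ndistinct(a) * ndistinct(b)
# groups; multivariate ndistinct stats let the planner see the true count.
independent_estimate = ndistinct_a * ndistinct_b

print(actual_groups, independent_estimate)  # 1000 vs 10000
```

Here the independence-based estimate is 10x too high, which is exactly the kind of error that can push the planner toward a poor grouping strategy or a wrong nested-loop rescan count.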
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 09:42:47PM +0200, Peter Eisentraut wrote: > That is a great analysis. Seems like block-level is the preferred way > forward. In any solution related to incremental backups that I have seen from the community, all of them tend to prefer block-level backups, because of the filtering that is possible based on the LSN of the page header. The holes in the middle of a page are also easier to handle, reducing the size of each incremental page in the actual backup. My preference tends toward a block-level approach if we were to do something in this area, though I fear that performance will be bad if we begin to scan all the relation files to fetch the set of blocks changed since a past LSN. Hence we need some kind of LSN map so that it is possible to skip one block or a group of blocks (say one LSN every 8/16 blocks, for example) at once for a given relation if the relation is mostly read-only. -- Michael
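The LSN-map idea sketched above could look roughly like this (a hypothetical Python sketch: the group size, the integer LSNs, and the function names are all illustrative assumptions, not proposed PostgreSQL code):

```python
GROUP_SIZE = 16  # blocks per map entry, as in the "one LSN every 8/16 blocks" idea

def build_lsn_map(block_lsns):
    """Record the highest page LSN per group of GROUP_SIZE consecutive blocks."""
    lsn_map = {}
    for blkno, lsn in enumerate(block_lsns):
        g = blkno // GROUP_SIZE
        lsn_map[g] = max(lsn_map.get(g, 0), lsn)
    return lsn_map

def blocks_to_back_up(block_lsns, lsn_map, since_lsn):
    """Skip whole groups untouched since the last backup, then filter blocks."""
    result = []
    for g, group_lsn in sorted(lsn_map.items()):
        if group_lsn <= since_lsn:
            continue  # the entire group is unchanged: no page reads needed
        start = g * GROUP_SIZE
        for blkno in range(start, min(start + GROUP_SIZE, len(block_lsns))):
            if block_lsns[blkno] > since_lsn:
                result.append(blkno)
    return result

# Mostly read-only relation: only block 40 was modified after LSN 100.
lsns = [50] * 64
lsns[40] = 120
m = build_lsn_map(lsns)
print(blocks_to_back_up(lsns, m, since_lsn=100))  # -> [40]
```

The point of the map is that for a mostly read-only relation, 63 of the 64 page reads are avoided: only the one group whose recorded LSN exceeds the backup's start LSN needs its pages inspected.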
Re: pg_rewind vs superuser
On Tue, Apr 09, 2019 at 10:38:19AM +0900, Michael Paquier wrote: > Sure. With something like the attached? I don't think that there is > much point to complicate the test code with multiple roles if the > default is a superuser. As this topic differs from the original thread, I have started a new thread, so let's discuss the proposed patch there: https://www.postgresql.org/message-id/20190411041336.gm2...@paquier.xyz -- Michael
Switch TAP tests of pg_rewind to use role with only function permissions
Hi all, Recent commit bfc80683 has added some documentation in pg_rewind about the fact that it is possible to do the operation with a non-superuser, assuming that this role has sufficient grant rights to execute the functions used by pg_rewind. Peter Eisentraut has suggested having some tests for this kind of user here: https://www.postgresql.org/message-id/e1570ba6-4459-d9b2-1321-9449adaae...@2ndquadrant.com Attached is a patch which switches all the TAP tests of pg_rewind to do that. As of now, the tests depend on a superuser for everything, and it seems to me that it makes little sense to make the tests more pluggable by being able to switch the roles used on-the-fly (the invocation of pg_rewind is stuck into RewindTest.pm) as a superuser has no restrictions. Any thoughts? -- Michael diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm index 900d452d8b..618de85161 100644 --- a/src/bin/pg_rewind/t/RewindTest.pm +++ b/src/bin/pg_rewind/t/RewindTest.pm @@ -144,6 +144,20 @@ sub start_master { $node_master->start; + # Create a custom role which will be used to run pg_rewind. This has + # minimal permissions to make pg_rewind able to work with an online + # source. + $node_master->psql('postgres', " + CREATE ROLE rewind_user LOGIN; + GRANT EXECUTE ON function pg_catalog.pg_ls_dir(text, boolean, boolean) + TO rewind_user; + GRANT EXECUTE ON function pg_catalog.pg_stat_file(text, boolean) + TO rewind_user; + GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text) + TO rewind_user; + GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, boolean) + TO rewind_user;"); + # Now run the test-specific parts to initialize the master before setting # up standby @@ -207,6 +221,9 @@ sub run_pg_rewind my $standby_connstr = $node_standby->connstr('postgres'); my $tmp_folder = TestLib::tempdir; + # Append the rewind role to the connection string.
+ $standby_connstr = "$standby_connstr user=rewind_user"; + # Stop the master and be ready to perform the rewind $node_master->stop;
Re: Zedstore - compressed in-core columnar storage
On 11/04/19 4:01 PM, Mark Kirkwood wrote: On 9/04/19 12:27 PM, Ashwin Agrawal wrote: Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and link to github branch [1] to follow along. Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's: benchz=# COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ','; psql: ERROR: compression failed. what now? CONTEXT: COPY dim1, line 458 The log has: 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now? 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ','; The dataset is generated from an old DW benchmark I wrote (https://sourceforge.net/projects/benchw/). The row concerned looks like: 457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp 458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg I'll see if changing to LZ4 makes any difference. The COPY works with LZ4 configured.
Re: Zedstore - compressed in-core columnar storage
On 9/04/19 12:27 PM, Ashwin Agrawal wrote: Heikki and I have been hacking recently for few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and link to github branch [1] to follow along. Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's: benchz=# COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ','; psql: ERROR: compression failed. what now? CONTEXT: COPY dim1, line 458 The log has: 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now? 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ','; The dataset is generated from an old DW benchmark I wrote (https://sourceforge.net/projects/benchw/). The row concerned looks like: 457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp 458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg I'll see if changing to LZ4 makes any difference. best wishes Mark
Re: REINDEX CONCURRENTLY 2.0
On Tue, Apr 09, 2019 at 03:50:27PM +0900, Michael Paquier wrote: > And here is the patch to address this issue. It happens that a bit > more than the dependency switch was lacking here: > - At swap time, we need to have the new index definition track > relispartition from the old index. > - Again at swap time, the inheritance link needs to be updated between > the old/new index and its parent when reindexing a partition index. Peter, this is an open item, and I think as the committer of the feature you are its owner. Well, in this case, I don't mind taking the ownership as need be as I know this stuff. Anyway, could you have a look at the patch proposed and see if you have any issues with it? -- Michael
Re: Should we add GUCs to allow partition pruning to be disabled?
On Thu, 21 Mar 2019 at 00:51, David Rowley wrote: > Just so I don't forget about this, I've added it to the July 'fest. > > https://commitfest.postgresql.org/23/2065/ Now that we have 428b260f8, I think the version of this that goes into master should be more like the attached. I think the original patch is fine for the back branches. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services docs_partitioning_warning_master.patch Description: Binary data
RE: Contribution to Perldoc for TestLib module in Postgres
Hi Ram, I think this documentation helps people who want to understand these functions. >Please find the updated patch. Added to the commitfest as well I have some comments. I think some users who would like to learn about the custom functions will check src/test/perl/README first. How about adding a comment to the README file, such as "See TestLib.pm if you want to know more details"? In the above document, why not write the description after the function name? I think it is better to write the function name first and then the description. In your code; #Checks if all the tests passed or not all_tests_passing() In my suggestion; all_tests_passing() Checks if all the tests passed or not Also, some functions return a value. How about adding return-value information to the above doc? I found ^M characters in your new code. Please use LF line endings, not CRLF, or get rid of them in the next patch. Regards, Aya Iwata
Re: Should the docs have a warning about pg_stat_reset()?
On Thu, 11 Apr 2019 at 06:52, Bruce Momjian wrote: > > OK, let me step back. Why are people resetting the statistics > regularly? Based on that purpose, does it make sense to clear the > stats that effect autovacuum? I can't speak for everyone, but once upon a time when I first started using PostgreSQL, to monitor the application's use of the database I recorded the output of pg_stat_user_tables once per day and then reset the statistics. It was useful to know the number of inserted tuples, seq scans, index scans etc so I could understand the load on the database better. Of course, nowadays with LEAD()/LAG() it's pretty easy to find the difference from the previous day. I'd have done it differently if those had existed back then. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
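That LEAD()/LAG()-style workflow, keeping cumulative counters and differencing successive snapshots instead of resetting, can be sketched as follows (a Python sketch with made-up numbers standing in for a counter such as pg_stat_user_tables.n_tup_ins; in SQL one would use LAG() over the recorded snapshots):

```python
# Daily snapshots of a cumulative statistics counter, recorded without
# ever calling pg_stat_reset().
snapshots = [
    ("2019-04-08", 1_000),
    ("2019-04-09", 1_450),
    ("2019-04-10", 2_100),
    ("2019-04-11", 2_150),
]

# The LAG()-style difference from the previous day's snapshot yields the
# same per-day numbers that resetting used to, without disturbing the
# counters that autovacuum relies on.
deltas = [
    (day, total - prev_total)
    for (_prev_day, prev_total), (day, total) in zip(snapshots, snapshots[1:])
]
print(deltas)  # [('2019-04-09', 450), ('2019-04-10', 650), ('2019-04-11', 50)]
```

The design point is that the raw counters stay monotonically increasing, so any consumer (monitoring, autovacuum scheduling) keeps a consistent view, while per-interval rates are derived at query time.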
Re: [PATCH v20] GSSAPI encryption support
Greetings, * Robbie Harwood (rharw...@redhat.com) wrote: > Bruce Momjian writes: > > On Wed, Apr 3, 2019 at 08:49:25AM +0200, Magnus Hagander wrote: > >> On Wed, Apr 3, 2019 at 12:22 AM Joe Conway wrote: > >> > >> Personally I don't find it as confusing as is either, and I find > >> hostgss to be a good analog of hostssl. On the other hand hostgssenc > >> is long and unintuitive. So +1 for leaving as is and -1 one for > >> changing it IMHO. > >> > >> I think for those who are well versed in pg_hba (and maybe gss as > >> well), it's not confusing. That includes me. > >> > >> However, for a new user, I can definitely see how it can be > >> considered confusing. And confusion in *security configuration* is > >> always a bad idea, even if it's just potential. > >> > >> Thus +1 on changing it. > >> > >> If it was on the table it might have been better to keep hostgss and > >> change the authentication method to gssauth or something, but that > >> ship sailed *years* ago. > > > > Uh, did we consider keeping hostgss and changing the auth part at the > > end to "gssauth"? > > I think that was implicitly rejected because we'd have to keep the > capability to configure "gss" there else break compatibility. Right, if we changed the name of the auth method then everyone who is using the "gss" auth method would have to update their pg_hba.conf files... That would be very ugly. Also, it wasn't implicitly rejected, it was discussed up-thread (see the comments between Magnus and I, specifically, quoted above- "that ship sailed *years* ago") and explicitly rejected. Thanks! Stephen signature.asc Description: PGP signature
Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid
On Mon, Apr 8, 2019 at 6:42 PM Noah Misch wrote: > - AIX animals failed two ways. First, I missed a "use" statement such that > poll_start() would fail if it needed more than one attempt. Second, I > assumed $pid would be gone as soon as kill(9, $pid) returned[1]. > [1] POSIX says "sig or at least one pending unblocked signal shall be > delivered to the sending thread before kill() returns." I doubt the > postmaster had another signal pending often enough to explain the failures, so > AIX probably doesn't follow POSIX in this respect. It looks like you fixed this, but I was curious about this observation as someone interested in learning more about kernel stuff and portability... Maybe I misunderstood, but isn't POSIX referring to kill(sig, $YOUR_OWN_PID) there? That is, if you signal *yourself*, and no other thread exists that could handle the signal, it will be handled by the sending thread, and in the case of SIGKILL it will therefore never return. But here, you were talking about a Perl script that kills the postmaster, no? If so, that passage doesn't seem to apply. In any case, regardless of whether the signal handler has run to completion when kill() returns, doesn't the pid have to continue to exist in the process table until it is reaped by its parent (possibly in response to SIGCHLD), with one of the wait*() family of system calls? -- Thomas Munro https://enterprisedb.com
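The reaping behaviour described here is easy to demonstrate (a POSIX-only Python sketch: it forks a child, SIGKILLs it, and shows that the pid survives in the process table as a zombie until waitpid() reaps it):

```python
import os
import signal
import time

# Fork a child that just waits around to be killed.
pid = os.fork()
if pid == 0:
    time.sleep(60)
    os._exit(0)

os.kill(pid, signal.SIGKILL)

# kill() returning does not mean the pid is gone: the process-table
# entry survives as a zombie until the parent reaps it, so signalling
# it again (with the null signal 0) still succeeds.
time.sleep(0.2)          # give the kernel a moment to terminate the child
os.kill(pid, 0)          # no exception raised: pid still exists

reaped, status = os.waitpid(pid, 0)
print(reaped == pid, os.WIFSIGNALED(status),
      os.WTERMSIG(status) == signal.SIGKILL)  # -> True True True
```

After the waitpid() call the zombie is gone and `os.kill(pid, 0)` would raise ProcessLookupError, which matches the point above: the pid only disappears once a wait*() call reaps it.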
Question about .partial WAL file
Hi, Hackers I noticed something strange. Is it actually harmless? I haven't detected any problem, but it makes me uneasy. Steps: - There are two standbys that connect to the primary. - Kill the primary and promote one standby. - Restart the other standby with primary_conninfo reset to connect to the new primary. I expected that the latest WAL segment file on the old timeline would be renamed with a .partial suffix, but it is not renamed in the restarted standby. xlog.c says the following, but I don't understand what bad situation can arise. * the archive. It's physically present in the new file with new TLI, * but recovery won't look there when it's recovering to the older --> * timeline. On the other hand, if we archive the partial segment, and --> * the original server on that timeline is still running and archives --> * the completed version of the same segment later, it will fail. (We * used to do that in 9.4 and below, and it caused such problems). * * As a compromise, we rename the last segment with the .partial * suffix, and archive it. Archive recovery will never try to read * .partial segments, so they will normally go unused. But in the odd * PITR case, the administrator can copy them manually to the pg_wal * directory (removing the suffix). They can be useful in debugging, * too. Regards Ryo Matsumura
Re: finding changed blocks using WAL scanning
On Wed, Apr 10, 2019 at 5:49 PM Robert Haas wrote: > There is one thing that does worry me about the file-per-LSN-range > approach, and that is memory consumption when trying to consume the > information. Suppose you have a really high velocity system. I don't > know exactly what the busiest systems around are doing in terms of > data churn these days, but let's say just for kicks that we are > dirtying 100GB/hour. That means, roughly 12.5 million block > references per hour. If each block reference takes 12 bytes, that's > maybe 150MB/hour in block reference files. If you run a daily > incremental backup, you've got to load all the block references for > the last 24 hours and deduplicate them, which means you're going to > need about 3.6GB of memory. If you run a weekly incremental backup, > you're going to need about 25GB of memory. That is not ideal. One > can keep the memory consumption to a more reasonable level by using > temporary files. For instance, say you realize you're going to need > 25GB of memory to store all the block references you have, but you > only have 1GB of memory that you're allowed to use. Well, just > hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno, > writing each batch to a separate temporary file, and then process each > of those 32 files separately. That does add some additional I/O, but > it's not crazily complicated and doesn't seem too terrible, at least > to me. Still, it's something not to like. Oh, I'm being dumb. We should just have the process that writes out these files sort the records first. Then when we read them back in to use them, we can just do a merge pass like MergeAppend would do. Then you never need very much memory at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PostgreSQL pollutes the file system
>>> On 2019-03-29 20:32, Joe Conway wrote: pg_util
>>>
>>> How is that better than just renaming to pg_$oldname?
>>
>> As I already said upthread:
>>
>>> This way, we would be free from the command name conflict problem
>
> Well, whatever we do -- if anything -- we would certainly need to keep
> the old names around for a while somehow. So this doesn't really make
> that issue go away.

Another complaint was that it's hard for novice users to remember the tool names. I think this approach would solve that problem. I agree that the command name conflict problem will not be solved by the idea. However, I do not believe there is a name conflict problem in the first place, so I am happy to keep the old names as they are.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: Reducing the runtime of the core regression tests
On Wed, Apr 10, 2019 at 4:56 PM Peter Geoghegan wrote: > The original fastpath tests don't seem particularly effective to me, > even without the oversight I mentioned. I suggest that you remove > them, since the minimal btree_index.sql fast path test is sufficient. To be clear: I propose that you remove the tests entirely, and we leave it at that. I don't intend to follow up with my own patch because I don't think that there is anything in the original test case that is worth salvaging. -- Peter Geoghegan
Re: PostgreSQL pollutes the file system
On Fri, Mar 29, 2019 at 13:25, Tom Lane wrote:
>
> Maybe if we want to merge these things into one executable,
> it should be a new one. "pg_util createrole bob" ?
>
+1 as I proposed in https://www.postgresql.org/message-id/bdd1adb1-c26d-ad1f-2f15-cc52056065d4%40timbira.com.br

-- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: Reducing the runtime of the core regression tests
On Wed, Apr 10, 2019 at 4:19 PM Tom Lane wrote: > > I'll come up with a patch to deal with this situation, by > > consolidating the old and new tests in some way. I don't think that > > your work needs to block on that, though. > > Should I leave out the part of my patch that creates index_fastpath.sql? > If we're going to end up removing that version of the test, there's no > point in churning the related lines beforehand. The suffix truncation stuff made it tricky to force a B-Tree to be tall without also consisting of many blocks. Simply using large, random key values in suffix attributes didn't work anymore. The solution I came up with for the new fastpath test that made it into btree_index.sql was to have redundancy in leading keys, while avoiding TOAST compression by using plain storage in the table/index. > One way or the other I want to get that test out of where it is, > because indexing.sql is currently the slowest test in its group. > But if you prefer to make btree_index run a bit longer rather than > inventing a new test script, that's no problem from where I stand. The original fastpath tests don't seem particularly effective to me, even without the oversight I mentioned. I suggest that you remove them, since the minimal btree_index.sql fast path test is sufficient. If there ever was a problem in this area, then amcheck would certainly detect it -- there is precisely one place for every tuple on v4 indexes. The original fastpath tests won't tickle the implementation in any interesting way in my opinion. -- Peter Geoghegan
Proper usage of ndistinct vs. dependencies extended statistics
Hello, I have some questions about the different types of extended statistics that were introduced in Postgres 10.

- Which types of queries is each statistic type supposed to improve?
- When should one type of statistic be used over the other? Should both always be used?

We have a multi-tenant application and all of our tables have a denormalized tenant_id column. (Most tables actually use the tenant_id as part of a composite primary key on (tenant_id, id).) As the docs suggest, we haven't created extended STATISTICS except when we observe the query planner making poor query plans. We've seen poor query plans on queries involving filters on foreign keys:

Table: fk_table
  tenant_id | integer
  id        | integer
  fk_id     | integer
  PRIMARY KEY (tenant_id, id)
  FOREIGN KEY (tenant_id, fk_id) REFERENCES left_table(tenant_id, id)

The id columns on these tables are unique, so there is a functional dependence between fk_id and tenant_id: if the fk_id values are the same, then the tenant_id values must also be the same. This table has ~4.6 million rows, ~1300 distinct values for tenant_id, and ~13000 distinct values for fk_id. For a single SELECT query that filters on tenant_id and fk_id, the planner erroneously estimates that it will return a single row (4,600,000 / 1300 / 13,000 ~= 0.3):

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 1;
                                  QUERY PLAN
--
 Index Scan using fk_table_tenant_id_fk_id_index on fk_table (cost=0.43..4.45 rows=1 width=44) (actual time=0.016..1.547 rows=3113 loops=1)
   Index Cond: ((tenant_id = 100) AND (fk_id = 1))

In other places we've used an ndistinct statistic to solve this issue, but that doesn't help in this case. Postgres still estimates that the query will return a single row.
=> CREATE STATISTICS ndistinct_stat (ndistinct) ON tenant_id, fk_id FROM fk_table;
=> ANALYZE fk_table;
=> SELECT stxname, stxndistinct FROM pg_statistic_ext;
    stxname     |  stxndistinct
----------------+-----------------
 ndistinct_stat | {"1, 3": 3433}

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 1;
-- (unchanged)

Why doesn't the ndistinct statistic get used when planning this query? (We're currently on Postgres 10.6.)

In contrast, if we create a functional dependency statistic, then Postgres accurately predicts the result size:

=> CREATE STATISTICS dep_stat (dependencies) ON tenant_id, fk_id FROM fk_table;
=> ANALYZE fk_table;
=> SELECT stxname, stxdependencies FROM pg_statistic_ext;
 stxname  |           stxdependencies
----------+---------------------------------------
 dep_stat | {"1 => 3": 1.00, "3 => 1": 0.060300}

=> EXPLAIN ANALYZE SELECT * FROM fk_table WHERE tenant_id = 100 AND fk_id = 1;
                                  QUERY PLAN
--
 Index Scan using fk_table_tenant_id_fk_id_index on fk_table (cost=0.43..1042.23 rows=612 width=44) (actual time=0.011..0.813 rows=3056 loops=1)
   Index Cond: ((tenant_id = 100) AND (fk_id = 1))

So, in general, which type of extended statistic should be used? Where do the different kinds of statistics get used in the query planner? Is there an advantage to using one type of statistic over the other, or should we always create both? And in our specific example, with a schema designed for multi-tenancy, which types of statistics should we use for our foreign keys, where tenant_id is functionally dependent on the other foreign_id columns?

To explain where some of our confusion is coming from, here's the example where adding an ndistinct statistic helped: Postgres was adding a filter after an index scan instead of including the filter as part of the index scan. big_table had ~500,000,000 rows, ~3000 distinct values for column a, ~3000 distinct values for column b, but just ~4500 distinct values for the (a, b) tuple, and column b was functionally dependent on column c.
Postgres wanted to do:

=> SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...);
 Index Scan using big_table_a_b_c on big_table (cost=0.57..122.41 rows=1 width=16)
   Index Cond: ((a = 1) AND (b = 10))
   Filter: c = ANY ('{100, 101, 102, 103, 104, 105, ...}')

But then we did:

=> CREATE STATISTICS big_table_a_b_ndistinct (ndistinct) ON a, b FROM big_table;
=> ANALYZE big_table;
=> SELECT * FROM big_table WHERE a = 1 AND b = 10 AND c IN (100, 101, 102, ...);
 Index Scan using big_table_a_b_c on big_table (cost=0.57..122.41 rows=1 width=16)
   Index Cond: ((a = 1) AND (b = 10)) AND (c = ANY ('{100, 101, 102, 103, 104, 105, ...}'))

(This had very poor performance because Postgres thought it would have to filter 500,000,000 / 3000 / 3000 ~= 55 rows, but actually it had to filter 500,000,000 / 4500 ~= 110,000 rows.) Because of the functional
Re: Reducing the runtime of the core regression tests
Peter Geoghegan writes: > On Wed, Apr 10, 2019 at 3:35 PM Tom Lane wrote: >> * Likewise, I split up indexing.sql by moving the "fastpath" test into >> a new file index_fastpath.sql. > I just noticed that the "fastpath" test actually fails to test the > fastpath optimization -- the coverage we do have comes from another > test in btree_index.sql, that I wrote back in December. Oh! Hmm. > I'll come up with a patch to deal with this situation, by > consolidating the old and new tests in some way. I don't think that > your work needs to block on that, though. Should I leave out the part of my patch that creates index_fastpath.sql? If we're going to end up removing that version of the test, there's no point in churning the related lines beforehand. One way or the other I want to get that test out of where it is, because indexing.sql is currently the slowest test in its group. But if you prefer to make btree_index run a bit longer rather than inventing a new test script, that's no problem from where I stand. regards, tom lane
Re: serializable transaction: exclude constraint violation (backed by GIST index) instead of ssi conflict
On Thu, Apr 11, 2019 at 10:54 AM Thomas Munro wrote: > On Thu, Apr 11, 2019 at 9:43 AM Peter Billen wrote: > > I kinda expected/hoped that transaction t2 would get aborted by a > > serialization error, and not an exclude constraint violation. This makes > > the application session bound to transaction t2 failing, as only > > serialization errors are retried. > Yeah, I agree, the behaviour you are expecting is desirable and we > should figure out how to do that. The basic trick for btree unique > constraints was to figure out where the index *would* have written, to > give the SSI machinery a chance to object to that before raising the > UCV. I wonder if we can use the same technique here... at first > glance, check_exclusion_or_unique_constraint() is raising the error, > but is not index AM specific code, and it is somewhat removed from the > GIST code that would do the equivalent > CheckForSerializableConflictIn() call. I haven't looked into it > properly, but that certainly complicates matters somewhat... Perhaps > the index AM would actually need a new entrypoint that could be called > before the error is raised, or perhaps there is an easier way. Adding Kevin (architect of SSI and reviewer/committer of my UCV interception patch) and Shubham (author of GIST SSI support) to the CC list in case they have thoughts on this. -- Thomas Munro https://enterprisedb.com
Re: Libpq support to connect to standby server as priority
On Fri, Mar 29, 2019 at 7:06 PM Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > Hi Hari-san, > > I've reviewed all the files. The patch would be OK when the following > have been fixed, except for the complexity of fe-connect.c (which probably > cannot be improved.) > > Unfortunately, I'll be absent next week. The earliest date I can do the > test will be April 8 or 9. I hope someone could take care of this patch... > Thanks for the review. Apologies that I could not able finish it on time because of other work. > > (1) 0001 > With this read-only option type, application can connect to > to a read-only server in the list of hosts, in case > ... > before issuing the SHOW transaction_readonly to find out whether > > > "to" appears twice in a row. > transaction_readonly -> transaction_read_only > > > (2) 0001 > +succesful connection or failure. > > succesful -> successful > > > (3) 0008 > to conenct to a standby server with a faster check instead of > > conenct -> connect > > > (4) 0008 > Logically, recovery exit should be notified after the following statement: > > XLogCtl->SharedRecoveryInProgress = false; > > > (5) 0008 > + /* Update in_recovery status. */ > + if (LocalRecoveryInProgress) > + SetConfigOption("in_recovery", > + "on", > + PGC_INTERNAL, > PGC_S_OVERRIDE); > + > > This SetConfigOption() is called for every RecoveryInProgress() call on > the standby. It may cause undesirable overhead. How about just calling > SetConfigOption() once in InitPostgres() to set the initial value for > in_recovery? InitPostgres() and its subsidiary functions call > SetConfigOption() likewise. > > > (6) 0008 > async.c is for LISTEN/UNLISTEN/NOTIFY. How about adding the new functions > in postgres.c like RecoveryConflictInterrupt()? > > > (7) 0008 > + if (pid != 0) > + { > + (void) SendProcSignal(pid, reason, > procvxid.backendId); > + } > > The braces are not necessary because the block only contains a single > statement. 
> I fixed all the comments that you have raised above and attached the updated patches. Regards, Haribabu Kommi Fujitsu Australia 0001-Restructure-the-code-to-remove-duplicate-code.patch Description: Binary data 0002-New-TargetSessionAttrsType-enum.patch Description: Binary data 0003-Make-transaction_read_only-as-GUC_REPORT-varaible.patch Description: Binary data 0004-New-prefer-read-target_session_attrs-type.patch Description: Binary data 0005-New-read-only-target_session_attrs-type.patch Description: Binary data 0006-Primary-prefer-standby-and-standby-options.patch Description: Binary data 0007-New-function-to-rejecting-the-checked-write-connecti.patch Description: Binary data 0008-Server-recovery-mode-handling.patch Description: Binary data
Re: Reducing the runtime of the core regression tests
On Wed, Apr 10, 2019 at 3:35 PM Tom Lane wrote:
> I finally got some time to pursue that, and attached is a proposed patch
> that moves some tests around and slightly adjusts some other ones.
> To cut to the chase: on my workstation, this cuts the time for
> "make installcheck-parallel" from 21.9 sec to 13.9 sec, or almost 40%.
> I think that's a worthwhile improvement, considering how often all of us
> run those tests.

Great!

> * create_index.sql ran much longer than other tests in its parallel
> group, so I split out the SP-GiST-related tests into a new file
> create_index_spgist.sql, and moved the delete_test_table test case
> to btree_index.sql.

Putting the delete_test_table test case in btree_index.sql makes perfect sense.

> * Likewise, I split up indexing.sql by moving the "fastpath" test into
> a new file index_fastpath.sql.

I just noticed that the "fastpath" test actually fails to test the fastpath optimization -- the coverage we do have comes from another test in btree_index.sql, that I wrote back in December. While I did make a point of ensuring that we had test coverage for the nbtree fastpath optimization that went into Postgres 11, I also didn't consider the original fastpath test. I assumed that there were no tests to begin with, because gcov showed me that there was no test coverage back in December.

What happened here was that commit 074251db limited the fastpath to only be applied to B-Trees with at least 3 levels. While the original fastpath optimization tests actually tested the fastpath optimization when it first went in in March 2018, that only lasted a few weeks, since 074251db didn't adjust the test to still be effective.

I'll come up with a patch to deal with this situation, by consolidating the old and new tests in some way. I don't think that your work needs to block on that, though.

> Thoughts? Anyone object to making these sorts of changes
> post-feature-freeze?
IMV there should be no problem with pushing ahead with this after feature freeze. -- Peter Geoghegan
Re: Should the docs have a warning about pg_stat_reset()?
On Wed, Apr 10, 2019 at 16:33, Alvaro Herrera wrote:
>
> On 2019-Apr-10, Bruce Momjian wrote:
>
> > On Thu, Apr 11, 2019 at 04:14:11AM +1200, David Rowley wrote:
> >
> > > I still think we should start with a warning about pg_stat_reset().
> > > People are surprised by this, and these are just the ones who notice:
> > >
> > > https://www.postgresql.org/message-id/CAB_myF4sZpxNXdb-x=welpqbdou6ue8fhtm0fverpm-1j7p...@mail.gmail.com
> > >
> > > I imagine there are many others just suffering from bloat due to
> > > auto-vacuum not knowing how many dead tuples there are in the tables.
> >
> > OK, let me step back. Why are people resetting the statistics
> > regularly? Based on that purpose, does it make sense to clear the
> > stats that effect autovacuum?
>
> I agree that we should research that angle. IMO resetting stats should
> be seriously frowned upon. And if they do need to reset some counters
> for some valid reason, offer a mechanism that leaves the autovac-
> guiding counters alone.
>
Then you would have to change the way pg_stat_reset() works (it currently removes the hash tables). Even pg_stat_reset_single_table_counters() could cause trouble, although to a smaller extent. Resetting statistics leaves the autovacuum state machine in an invalid state. Since resetting statistics is a rare situation (at least I don't know of monitoring tools or practices that regularly execute those functions), would it be worth adding complexity to the pg_stat_reset* functions? autovacuum could handle those rare cases just fine.

> IMO the answer for $SUBJECT is yes.
>
+1. However, I also suggest a WARNING saying "autovacuum won't work because you reset statistics that it depends on" plus the detail "Consider executing ANALYZE on all your tables" / "Consider executing ANALYZE on table foo.bar".

-- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: Reducing the runtime of the core regression tests
Andres Freund writes: > On 2019-04-10 18:35:15 -0400, Tom Lane wrote: >> ... What I did instead was to shove >> that test case and some related ones into a new plpgsql test file, >> src/pl/plpgsql/src/sql/plpgsql_trap.sql, so that it's not part of the >> core regression tests at all. (We've talked before about moving >> chunks of plpgsql.sql into the plpgsql module, so this is sort of a >> down payment on that.) Now, if you think about the time to do >> check-world rather than just the core regression tests, this isn't >> obviously a win, and in fact it might be a loss because the plpgsql >> tests run serially not in parallel with anything else. > Hm, can't we "just" parallelize the plpgsql schedule instead? If somebody wants to work on that, I won't stand in the way, but it seems like material for a different patch. >> Thoughts? Anyone object to making these sorts of changes >> post-feature-freeze? > Hm. There's some advantage to doing so, because it won't break any large > pending changes. But it's also possible that it'll destabilize the > buildfarm some. In personal capacity I'm like +0.5. My thought was that there is (hopefully) going to be a lot of testing going on over the next few months, so making that faster would be a useful activity. regards, tom lane
Re: Reducing the runtime of the core regression tests
Hi, On 2019-04-10 18:35:15 -0400, Tom Lane wrote: > on my workstation, this cuts the time for "make installcheck-parallel" > from 21.9 sec to 13.9 sec, or almost 40%. I think that's a worthwhile > improvement, considering how often all of us run those tests. Awesome. > * The plpgsql test ran much longer than others, which turns out to be > largely due to the 2-second timeout in its test of statement_timeout. > In view of the experience reflected in commit f1e671a0b, just > reducing that timeout seems unsafe. What I did instead was to shove > that test case and some related ones into a new plpgsql test file, > src/pl/plpgsql/src/sql/plpgsql_trap.sql, so that it's not part of the > core regression tests at all. (We've talked before about moving > chunks of plpgsql.sql into the plpgsql module, so this is sort of a > down payment on that.) Now, if you think about the time to do > check-world rather than just the core regression tests, this isn't > obviously a win, and in fact it might be a loss because the plpgsql > tests run serially not in parallel with anything else. However, > by that same token, the parallel-testing overload we were concerned > about in f1e671a0b should be much less bad in the plpgsql context. > I therefore took a chance on reducing the timeout down to 1 second. > If the buildfarm doesn't like that, we can change it back to 2 seconds > again. It should still be a net win because of the fact that > check-world runs the core tests more than once. Hm, can't we "just" parallelize the plpgsql schedule instead? > Thoughts? Anyone object to making these sorts of changes > post-feature-freeze? Hm. There's some advantage to doing so, because it won't break any large pending changes. But it's also possible that it'll destabilize the buildfarm some. In personal capacity I'm like +0.5. Greetings, Andres Freund
Re: B-tree cache prefetches
On Tue, Nov 27, 2018 at 5:43 AM Andrey Borodin wrote:
> > On Aug 31, 2018, at 2:40, Thomas Munro wrote:
> > [1] https://arxiv.org/pdf/1509.05053.pdf
>
> I've implemented all of the strategies used in that paper.
> On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples at the end are mostly "appended", i.e. they usually are accommodated at the end of the free space.
> The idea of the attached patch is to try to order tuples with different strategies. This ordering happens during VACUUM and can easily be broken with a single insert.
> Strategies are switched by #define:
> 1. USE_SEQ_ORDER - just order tuples in order of line pointers. This strategy has no idea behind it and is expected to work similarly to the baseline (no changes to code at all)
> 2. USE_EYZINGER_ORDER - depth-first topological search. If you always choose the lower branch, your search is just a sequential read of bytes.
> 3. USE_VEB_ORDER - Van Emde Boas layout - try to pack 3 tuples (mid and two sub-mids) together. The key idea is that you cache-miss when reading the mid, but the tuples of both next steps are already fetched by that cache miss.
> 4. USE_BT_ORDER - order the two sub-mids together so that you can prefetch both possible next mids in a single cache prefetch.
>
> Cache prefetches of B-tree binsearch are controlled by USE_PREFETCH.
>
> I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, MacOS 10.14.1)
> ./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
> No changes in configs.
>
> Here are the results in tps:
> without prefetch
> baseline 1448.331199
> seq      1433.737204
> bt       1463.701527
> veb      1457.586089
> eyz      1483.654765
>
> with prefetch
> baseline 1486.585895
> seq      1471.757042
> bt       1480.169967
> veb      1464.834678
> eyz      1460.323392
>
> These numbers are not statistically rigorous - just one run was done for each number, and the experiment needs to be tuned anyway.
> From this I can conclude:
> 1.
It is hard to observe significant performance influence on pgbench. Maybe I should use some more representative B-tree performance tests?
> 2. Cache prefetches seem to increase performance without any layout strategies. But the difference is hair-splitting: 2.6%.
> 3. For the implemented strategies my preliminary results correspond to the result in the paper: the Eyzinger layout is the best.
> 4. Among layout strategies with prefetch, the bt-order strategy seems to be the winner.
>
> How can I improve experimental evaluation of these layout strategies? Other thoughts? I'd be happy to hear any comment on this.

What about read-only tests with prewarmed indexes? Those numbers look IO bound.

This is focusing on prefetch within btree code, but I wonder if there is some lower hanging fruit that doesn't require any layout changes: heap prefetch during index scans. By analogy to figure 8a "software-pipelined prefetching" of [1], I wonder if you could build a pipeline using dirty (unlocked) memory reads, like this:

1. Prefetch buffer mapping hash bucket for heap_tids[n + 3]
2. Prefetch page header for heap_tids[n + 2] using a dirty read of the first value in the buffer mapping hash bucket
3. Prefetch heap tuple for tids[n + 1] using a dirty read of the heap page
4. Return heap tid for tids[0] to caller

If you're lucky, by the time the caller uses the tid to find the buffer, read the line pointer and access the tuple, all three things will be in cache and hopefully won't have changed since the 'dirty' reads. Prefetching the wrong cachelines would be possible obviously, if there is a buffer mapping hash table collision and the one you wanted isn't first, or stuff just changed. Maybe there aren't enough cachelines for this 3-level pipeline to work, considering that in between random other executor nodes could run, so perhaps just a 2-level or 1-level pipeline would pay off.
As I showed in my toy hash join prefetch experiment[2], I could get 20-30% speedup from prefetching just hash buckets (equivalent to prefetching buffer mapping hash buckets), and then a further 5-10% speedup from prefetching the first item in the hash bucket, but I expect nested hash joins to interfere with that as they compete for cache lines. I suppose I could try to do three levels and fetch the next tuple in the chain, but that seems unlikely to help because we're getting into lower probabilities that I'd even want that data; in contrast, the index scan -> heap scan case *definitely* needs to go through a line pointer to a tuple somewhere on the page, so 3-level might be worth experimenting with. Just a thought, I have not tried any of this. Maybe there are really 4 levels, if you count the buffer descriptor/lock.

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf
[2] https://www.postgresql.org/message-id/CA%2BhUKGKB%3DEWq%2Brj4NK8CX6ZqHZzuQYWSb9iDfYKzGN6-ejShUQ%40mail.gmail.com

-- Thomas Munro
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-10, Alvaro Herrera wrote: > but the test immediately does this: > > alter table at_partitioned alter column b type numeric using b::numeric; > > and watch what happens! (1663 is pg_default) > > alvherre=# select relname, reltablespace from pg_class where relname like > 'at_partitioned%'; >relname| reltablespace > --+--- > at_partitioned | 0 > at_partitioned_a_idx | 0 > at_partitioned_b_idx | 1663 > (3 filas) > > Outrageous! This is because ruleutils.c attaches a TABLESPACE clause when asked to dump an index definition; and tablecmds.c uses ruleutils to deparse the index definition into something that can be replayed via CREATE INDEX commands (or ALTER TABLE ADD CONSTRAINT UNIQUE/PRIMARY KEY, if that's the case.) This patch (PoC quality) fixes that behavior, but I'm looking to see what else it breaks. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 83f72470b10..acb6b09a576 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -10567,7 +10567,13 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel, tab->changedIndexOids = lappend_oid(tab->changedIndexOids, foundObject.objectId); tab->changedIndexDefs = lappend(tab->changedIndexDefs, - pg_get_indexdef_string(foundObject.objectId)); + /* + * we include a tablespace clause + * only if it's a plain index. + * This likely breaks some stuff. + */ + pg_get_indexdef_string(foundObject.objectId, + relKind == RELKIND_INDEX)); } } else if (relKind == RELKIND_SEQUENCE) diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 0c7a533e697..41144d84a76 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -1142,11 +1142,11 @@ pg_get_indexdef_ext(PG_FUNCTION_ARGS) * Returns a palloc'd C string; no pretty-printing. 
*/ char * -pg_get_indexdef_string(Oid indexrelid) +pg_get_indexdef_string(Oid indexrelid, bool tablespace) { return pg_get_indexdef_worker(indexrelid, 0, NULL, false, false, - true, true, + tablespace, true, 0, false); } @@ -2170,6 +2170,7 @@ pg_get_constraintdef_worker(Oid constraintId, bool fullCommand, pfree(options); } + /* XXX this tablespace works even if it's 0 */ tblspc = get_rel_tablespace(indexId); if (OidIsValid(tblspc)) appendStringInfo(, " USING INDEX TABLESPACE %s", diff --git a/src/include/utils/ruleutils.h b/src/include/utils/ruleutils.h index 7c49e9d0a83..6ff3b841057 100644 --- a/src/include/utils/ruleutils.h +++ b/src/include/utils/ruleutils.h @@ -18,7 +18,7 @@ #include "nodes/pg_list.h" -extern char *pg_get_indexdef_string(Oid indexrelid); +extern char *pg_get_indexdef_string(Oid indexrelid, bool tablespace); extern char *pg_get_indexdef_columns(Oid indexrelid, bool pretty); extern char *pg_get_partkeydef_columns(Oid relid, bool pretty);
finding changed blocks using WAL scanning
Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com I mentioned parsing the WAL to extract block references so that incremental backup could efficiently determine which blocks needed to be copied. Ashwin replied in http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrqqsm8kf...@mail.gmail.com to mention that the same approach could be useful for pg_rewind:

# Currently, pg_rewind requires
# all the WAL logs to be present on source side from point of divergence to
# rewind. Instead just parse the wal and keep the changed blocks around on
# sourece. Then don't need to retain the WAL but can still rewind using the
# changed block map.

Since there are at least two possible use cases for an efficient way to learn when blocks last changed, and in each case the performance benefits of being able to learn that are potentially quite large, I'm starting this thread to brainstorm how such a system might work.

It seems to me that there are basically two ways of storing this kind of information, plus a bunch of variants. One way is to store files that cover a range of LSNs, and basically contain a synopsis of the WAL for those LSNs. You omit all the actual data and just mention which blocks were changed by some record in that part of the WAL. In this type of scheme, the storage required is roughly proportional to the volume of WAL for which you wish to retain data. Pruning old data is easy; just remove the files that provide information about LSNs that you don't care about any more. The other way is to store data about each block, or each range of blocks, or all the blocks that hash onto a certain slot; for each, store the newest LSN that has modified that block, or a block in that range, or a block that hashes onto that slot.
In this system, storage is roughly proportional to the size of the database cluster, except maybe in the hashing case, but I *think* that case will degrade unless you basically expand the map to be roughly proportional to the size of the cluster anyway. I may be wrong.

Of these two variants, I am inclined to prefer the version where each file is a summary of the block references within some range of LSNs. It seems simpler to implement to me. You just read a bunch of WAL files and then when you get tired you stop and emit an output file. You need to protect yourself against untimely crashes. One way is to stick a checksum into the output file. After you finish writing it, fsync() it before you start writing the next one. After a restart, read the latest such file and see if the checksum is OK. If not, regenerate it; if so, assume it's good and move on. Files other than the last one can be assumed good. Another way is to create the file with a temporary name, fsync() it, and then rename it into place and fsync() again. The background worker that generates the files can have a GUC to remove them when they are older than some threshold amount of time, or you can keep them forever and let the user manually remove stuff they no longer want based on LSN. That's pretty much it.

The version where you keep an LSN per block/range/hash bucket seems more complex in terms of durability. Now you have to worry not only about creating new files, but also about modifying them. That seems to add some complexity. I think it may be possible to make it work without doing write-ahead logging for every change, but it certainly needs careful thought to avoid torn page problems and checkpoint synchronization issues. Moreover, it potentially uses lots and lots of inodes if there are many relations in the cluster. You can avoid that by not creating maps for small files, if you like, or by switching to the hash bucket approach. But either of those approaches is lossy.
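The temporary-name-plus-rename recipe is easy to get subtly wrong, so here is a minimal sketch of it in Python; the file layout and the use of SHA-256 as the checksum are my own choices for illustration, not part of the proposal:

```python
import hashlib
import os

def write_summary_atomically(path, payload: bytes):
    """Write a summary file so a crash never leaves a torn copy visible.

    A checksum is prepended so a reader can also verify the last file
    written before a crash, as described above.
    """
    digest = hashlib.sha256(payload).digest()
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(digest + payload)
        f.flush()
        os.fsync(f.fileno())      # data is durable before the rename
    os.rename(tmp, path)          # atomic replacement on POSIX filesystems
    # fsync the directory so the rename itself survives a crash
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

def read_summary(path):
    """Return the payload if the checksum matches, else None (regenerate)."""
    with open(path, "rb") as f:
        data = f.read()
    digest, payload = data[:32], data[32:]
    return payload if hashlib.sha256(payload).digest() == digest else None
```

The directory fsync() after the rename is what makes the new name itself durable; without it, a crash shortly after the rename can still lose the new directory entry.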
Omitting the maps for small files means you always have to assume everything in those files changed. The hash bucket approach is vulnerable to hash collisions; you have to assume that all blocks that hash to a given bucket have changed. Those are probably manageable disadvantages, but I think they are real ones.

There is one thing that does worry me about the file-per-LSN-range approach, and that is memory consumption when trying to consume the information. Suppose you have a really high velocity system. I don't know exactly what the busiest systems around are doing in terms of data churn these days, but let's say just for kicks that we are dirtying 100GB/hour. That means roughly 12.5 million block references per hour. If each block reference takes 12 bytes, that's maybe 150MB/hour in block reference files. If you run a daily incremental backup, you've got to load all the block references for the last 24 hours and deduplicate them, which means you're going to need about 3.6GB of memory. If you run a weekly incremental backup, you're going to need about 25GB of memory. That is not ideal. One can keep the memory consumption to a more reasonable
serializable transaction: exclude constraint violation (backed by GIST index) instead of ssi conflict
Hi all,

I understood that v11 includes predicate locking for gist indexes, as per https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ad55863e9392bff73377911ebbf9760027ed405 . I tried this in combination with an exclude constraint as follows:

drop table if exists t;
create table t(period tsrange);
alter table t add constraint bla exclude using gist(period with &&);

-- t1
begin transaction isolation level serializable;
select * from t where period && tsrange(now()::timestamp, now()::timestamp + interval '1 hour');
insert into t(period) values(tsrange(now()::timestamp, now()::timestamp + interval '1 hour'));

-- t2
begin transaction isolation level serializable;
select * from t where period && tsrange(now()::timestamp, now()::timestamp + interval '1 hour');
insert into t(period) values(tsrange(now()::timestamp, now()::timestamp + interval '1 hour'));

-- t1
commit;

-- t2
ERROR: conflicting key value violates exclusion constraint "bla"
DETAIL: Key (period)=(["2019-04-10 20:59:20.6265","2019-04-10 21:59:20.6265")) conflicts with existing key (period)=(["2019-04-10 20:59:13.332622","2019-04-10 21:59:13.332622")).

I kinda expected/hoped that transaction t2 would get aborted by a serialization error, and not an exclude constraint violation. This makes the application session bound to transaction t2 fail, as only serialization errors are retried.

We introduced the same kind of improvement/fix for btree indexes earlier, see https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fcff8a575198478023ada8a48e13b50f70054766.

Should this also be applied for (exclude) constraints backed by a gist index (as gist indexes now support predicate locking), or am I making incorrect assumptions somewhere here?

Thanks.
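For what it's worth, the practical impact is in the application-side retry loop: such loops typically retry only SQLSTATE 40001 (serialization_failure), sometimes also 40P01 (deadlock_detected), while 23P01 (exclusion_violation) is treated as permanent. A minimal sketch of that classifier, with an invented driver-neutral exception class (the SQLSTATEs themselves are from PostgreSQL's error code list):

```python
# SQLSTATEs from PostgreSQL's error code appendix.
RETRYABLE_SQLSTATES = {
    "40001",  # serialization_failure
    "40P01",  # deadlock_detected
}

class DatabaseError(Exception):
    """Stand-in for a driver exception that carries an SQLSTATE."""
    def __init__(self, sqlstate, message=""):
        super().__init__(message)
        self.sqlstate = sqlstate

def run_with_retries(txn, max_attempts=3):
    """Re-run a transaction only on transient serialization conflicts;
    surface every other error immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DatabaseError as e:
            if e.sqlstate not in RETRYABLE_SQLSTATES or attempt == max_attempts:
                raise
```

Under the behavior reported above, t2 raises 23P01 and this loop gives up immediately; if the conflict were reported as 40001, it would be retried transparently.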
Re: PostgreSQL pollutes the file system
I just want to be on record that I don't think there is a problem here that needs to be solved. The choice to put Postgres-related binaries in /usr/bin or wherever is a distribution/packaging decision. As has been pointed out, if I download, build, and install Postgres, the binaries by default go in /usr/local/pgsql/bin.

It is a long-standing Unix tradition to have short-named commands from many sources in /usr/bin and /bin, not to mention other files, often with short names, in various directories all over the system. For example, on one of the Ubuntu machines at my work, take a look at all the 2-character commands in those directories, and how many different packages they come from, in the list at the bottom of this message.

At this point I think Postgres absolutely owns the name "psql" as a Unix binary and I would oppose any suggestion that this should be renamed. Just my own effort to teach my fingers to type something different would probably outweigh any benefit from renaming.

Having said this, if people are enthusiastic and can actually agree, there are a few changes that might make sense:

- move clusterdb, createdb, etc. (*db, but not initdb because that is a server, not client, program) into pg_db_util [subcommand] (or some such)
- move createuser, dropuser into pg_role_util [subcommand] (or some such)
- pgbench -> pg_bench (why no '_' anyway?)
- ecpg -> pg_ec (usually invoked by makefiles anyway, I'm guessing)

But I consider this worth doing only if people consider that it's an improvement for reasons other than just getting stuff out of /bin or /usr/bin.

List of 2-character commands and their source packages on one of our systems (the "no path found" ones are mostly symlinks into the Ubuntu "alternatives" system):

16:52 ijmorlan@ubuntu1604-102$ dpkg -S /usr/bin/?? /bin/?? | sort
dpkg-query: no path found matching pattern /usr/bin/cc
dpkg-query: no path found matching pattern /usr/bin/ex
dpkg-query: no path found matching pattern /usr/bin/fp
dpkg-query: no path found matching pattern /usr/bin/js
dpkg-query: no path found matching pattern /usr/bin/pc
dpkg-query: no path found matching pattern /usr/bin/rn
dpkg-query: no path found matching pattern /usr/bin/rt
dpkg-query: no path found matching pattern /usr/bin/vi
dpkg-query: no path found matching pattern /bin/mt
dpkg-query: no path found matching pattern /bin/nc
acct: /usr/bin/ac
apache2-utils: /usr/bin/ab
aspectj: /usr/bin/aj
at: /usr/bin/at
bc: /usr/bin/bc
bf: /usr/bin/bf
binutils: /usr/bin/ar
binutils: /usr/bin/as
binutils: /usr/bin/ld
binutils: /usr/bin/nm
bsdmainutils: /usr/bin/hd
bsdmainutils: /usr/bin/ul
byobu: /usr/bin/NF
coreutils: /bin/cp
coreutils: /bin/dd
coreutils: /bin/df
coreutils: /bin/ln
coreutils: /bin/ls
coreutils: /bin/mv
coreutils: /bin/rm
coreutils: /usr/bin/du
coreutils: /usr/bin/id
coreutils: /usr/bin/nl
coreutils: /usr/bin/od
coreutils: /usr/bin/pr
coreutils: /usr/bin/tr
coreutils: /usr/bin/wc
cups-client: /usr/bin/lp
dash: /bin/sh
dc: /usr/bin/dc
debhelper: /usr/bin/dh
diversion by dash from: /bin/sh
diversion by dash to: /bin/sh.distrib
ed: /bin/ed
ghostscript: /usr/bin/gs
graphviz: /usr/bin/gc
gv: /usr/bin/gv
i3-wm: /usr/bin/i3
ii: /usr/bin/ii
iproute2: /bin/ip
iproute2: /bin/ss
ispell: /usr/bin/sq
login: /bin/su
login: /usr/bin/sg
m4: /usr/bin/m4
mc: /usr/bin/mc
mercurial: /usr/bin/hg
mono-devel: /usr/bin/al
mono-devel: /usr/bin/lc
mono-devel: /usr/bin/sn
mtools: /usr/bin/lz
mtools: /usr/bin/uz
p7zip-full: /usr/bin/7z
procps: /bin/ps
rcs: /usr/bin/ci
rcs: /usr/bin/co
rs: /usr/bin/rs
ruby: /usr/bin/ri
sc: /usr/bin/sc
speech-tools: /usr/bin/dp
tex4ht: /usr/bin/ht
texlive-binaries: /usr/bin/mf
util-linux: /usr/bin/pg
xz-utils: /usr/bin/xz
Re: block-level incremental backup
On Wed, 10 Apr 2019 11:55:51 -0700 Andres Freund wrote:
> Hi,
>
> On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais wrote:
> > > In my current design, the scan is done backward from end to start and I
> > > keep all the records appearing after the last occurrence of their
> > > respective FPI.
> >
> > Oh, interesting. That seems like it would require pretty major
> > surgery on the WAL stream.
>
> Can't you just read each segment forward, and then reverse?

Not sure what you mean. I first look for the very last XLOG record by jumping to the last WAL segment and scanning it forward. Then, I do a backward scan from there to record the LSN of each xlogrecord to keep. Finally, I clone each WAL segment and edit it as needed (as described in my previous email). This is my current WIP though.

> That's not that much memory?

I don't know, yet. I did not measure it.
Re: block-level incremental backup
On Wed, 10 Apr 2019 14:38:43 -0400 Robert Haas wrote:
> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais wrote:
> > In my current design, the scan is done backward from end to start and I
> > keep all the records appearing after the last occurrence of their
> > respective FPI.
>
> Oh, interesting. That seems like it would require pretty major
> surgery on the WAL stream.

Indeed. Presently, the surgery in my code is replacing redundant xlogrecords with noops. I now have to deal with multi-block records. So far, I tried to mark non-needed blocks with !BKPBLOCK_HAS_DATA and made a simple patch in core to ignore such marked blocks, but it doesn't play well with dependencies between xlogrecords, eg. during UPDATE. So my plan is to rewrite them to remove non-needed blocks using eg. XLOG_FPI.

As I wrote, this is mainly a hobby project right now for my own education. Not sure where it leads me, but I learn a lot while working on it.
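To make the backward-scan idea concrete, here is a toy sketch of the record-selection pass. The record representation is invented for illustration, and each record is assumed to touch exactly one block; real xlogrecords can touch several, which is exactly the multi-block complication discussed in this thread:

```python
def select_records(records):
    """records: list of (lsn, block, is_fpi) tuples in WAL order.

    Returns the LSNs worth keeping: each block's last FPI and every
    record on that block that comes after it. Everything before a
    block's last FPI is redundant, since the FPI restores the whole page.
    """
    seen_fpi = set()   # blocks whose covering (last) FPI we already met
    keep = []
    for lsn, block, is_fpi in reversed(records):
        if block in seen_fpi:
            continue               # superseded by a later FPI
        keep.append(lsn)
        if is_fpi:
            seen_fpi.add(block)    # earlier records on this block are redundant
    keep.reverse()                 # restore WAL order for replay
    return keep
```

Scanning backward makes "is there a later FPI for this block?" a constant-time set lookup, which is why the design reads the stream end-to-start.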
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-10, Andres Freund wrote:
> Hi,
>
> On 2019-04-10 09:28:21 -0400, Alvaro Herrera wrote:
> > So I think that apart from David's patch, we should just document all
> > these things carefully.
>
> Yea, I think that's the most important part.
>
> I'm not convinced that we should have any inheriting behaviour btw - it
> seems like there's a lot of different ways to think this should behave,
> with different good reason each.

So, I ended up with the attached patch. I think it works pretty well, and it passes all my POLA tests. But it doesn't pass pg_upgrade tests! And investigating closer, it seems closely related to what David was complaining about elsewhere, namely the tablespace being improperly set by some rewrite operations.

Here's the setup as created by regress' create_table.sql:

create table at_partitioned (a int, b text) partition by range (a);
create table at_part_1 partition of at_partitioned for values from (0) to (1000);
insert into at_partitioned values (512, '0.123');
create table at_part_2 (b text, a int);
insert into at_part_2 values ('1.234', 1024);
create index on at_partitioned (b);
create index on at_partitioned (a);

If you examine state at this point, it's all good:

alvherre=# select relname, reltablespace from pg_class where relname like 'at_partitioned%';
       relname        | reltablespace
----------------------+---------------
 at_partitioned       |             0
 at_partitioned_a_idx |             0
 at_partitioned_b_idx |             0

but the test immediately does this:

alter table at_partitioned alter column b type numeric using b::numeric;

and watch what happens! (1663 is pg_default)

alvherre=# select relname, reltablespace from pg_class where relname like 'at_partitioned%';
       relname        | reltablespace
----------------------+---------------
 at_partitioned       |             0
 at_partitioned_a_idx |             0
 at_partitioned_b_idx |          1663
(3 filas)

Outrageous! I'm going to have a look at this behavior now. IMO it's a separate bug, but without fixing that we obviously cannot fix the other one.
-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
On 10.04.2019 19:51, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik wrote:
> > Some time ago I implemented an alternative version of the ptrack utility
> > (not the one used in pg_probackup) which detects updated blocks at file
> > level. It is very simple and maybe it can sometimes be integrated in
> > master.
>
> I don't think this is completely crash-safe. It looks like it arranges
> to msync() the ptrack file at appropriate times (although I haven't
> exhaustively verified the logic), but it uses MS_ASYNC, so it's
> possible that the ptrack file could get updated on disk either before
> or after the relation file itself. I think before is probably OK -- it
> just risks having some blocks look modified when they aren't really --
> but after seems like it is very much not OK. And changing this to use
> MS_SYNC would probably be really expensive. Likely a better approach
> would be to hook into the new fsync queue machinery that Thomas Munro
> added to PostgreSQL 12.

I do not think that MS_SYNC or the fsync queue is needed here. If a power failure or OS crash causes the loss of some writes to the ptrack map, then in any case Postgres will perform recovery, and updating pages from WAL causes them to be marked in the ptrack map once again. So, as with CLOG and many other Postgres files, it is not critical to lose some writes, because they will be restored from WAL. And before truncating WAL, Postgres performs a checkpoint which flushes all changes to disk, including ptrack map updates.

> It looks like your system maps all the blocks in the system into a
> fixed-size map using hashing. If the number of modified blocks between
> the full backup and the incremental backup is large compared to the
> size of the ptrack map, you'll start to get a lot of false-positives.
> It will look as if much of the database needs to be backed up. For
> example, in your sample configuration, you have ptrack_map_size =
> 1000003. If you've got a 100GB database with 20% daily turnover, that's
> about 2.6 million blocks. If you bump a random entry ~2.6 million times
> in a map with 1000003 entries, on the average ~92% of the entries end
> up getting bumped, so you will get very little benefit from incremental
> backup. This problem drops off pretty fast if you raise the size of the
> map, but it's pretty critical that your map is large enough for the
> database you've got, or you may as well not bother.

This is why the ptrack block size should be larger than the page size. Assume that it is 1MB. 1MB is considered to be the optimal amount of disk IO, where frequent seeks do not degrade read speed (this is most critical for HDD). In other words, reading 10 random pages (20%) from this 1MB block will take almost the same amount of time (or even longer) as reading the whole 1MB in one operation. There will be just 10 used entries in the ptrack map, with a very small probability of collision. Actually, I have chosen this size (1000003) for the ptrack map because, with a 1MB block size, it allows mapping a 1TB database without a noticeable number of collisions, which seems to be enough for most Postgres installations. But increasing the ptrack map size 10 or even 100 times should also not cause problems with modern RAM sizes.

> It also appears that your system can't really handle resizing of the
> map in any friendly way. So if your data size grows, you may be faced
> with either letting the map become progressively less effective, or
> throwing it out and losing all the data you have. None of that is to
> say that what you're presenting here has no value, but I think it's
> possible to do better (and I think we should try).

Definitely I didn't consider the proposed patch a perfect solution, and certainly it requires improvements (and maybe a complete redesign). I just wanted to present this approach (maintaining a hash of blocks' LSNs in mapped memory and keeping track of modified blocks at the file level, unlike the current ptrack implementation, which logs changes in all the places in the Postgres code where data is updated).

Also, despite the fact that this patch may be considered a raw prototype, I have spent some time thinking about all aspects of this approach, including fault tolerance and false positives.
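The ~92% figure follows from standard occupancy math: after N uniform random bumps into S slots, the expected fraction of touched slots is 1 - (1 - 1/S)^N, which is approximately 1 - e^(-N/S). A quick sanity check with the numbers from this exchange (100GB database, 20% daily turnover, 8kB blocks, a map of 1000003 entries); the 10x-larger map is my own addition, to show how quickly the problem drops off:

```python
def touched_fraction(bumps, slots):
    """Expected fraction of map entries hit by `bumps` uniform random bumps."""
    return 1.0 - (1.0 - 1.0 / slots) ** bumps

# 100GB database, 20% daily turnover, 8kB blocks -> ~2.6 million dirty blocks
dirty_blocks = 100 * 2**30 * 20 // 100 // 8192

frac = touched_fraction(dirty_blocks, 1000003)            # ~92% of entries bumped
frac_10x = touched_fraction(dirty_blocks, 10 * 1000003)   # ~23% with a 10x map
```

With the map sized as discussed, nearly every entry looks modified after a day of churn; multiplying the map size by 10 already brings the touched fraction down to under a quarter.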
Re: PostgreSQL pollutes the file system
On 2019-04-10 15:01, Tatsuo Ishii wrote: >> On 2019-03-29 20:32, Joe Conway wrote: >>> pg_util >> >> How is that better than just renaming to pg_$oldname? > > As I already said in up thread: > >> This way, we would be free from the command name conflict problem Well, whatever we do -- if anything -- we would certainly need to keep the old names around for a while somehow. So this doesn't really make that issue go away. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PostgreSQL pollutes the file system
On 2019-04-10 15:15, Fred .Flintstone wrote: > The warnings would only be printed if the programs were executed with > the old file names. > This in order to inform people relying on the old names that they are > deprecated and they should move to the new names with the pg_ prefix. Yeah, that would be annoying. Let's not do that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
On 2019-04-10 17:31, Robert Haas wrote: > I think the way to think about this problem, or at least the way I > think about this problem, is that we need to decide whether want > file-level incremental backup, block-level incremental backup, or > byte-level incremental backup. That is a great analysis. Seems like block-level is the preferred way forward. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Should the docs have a warning about pg_stat_reset()?
On 2019-Apr-10, Bruce Momjian wrote:

> On Thu, Apr 11, 2019 at 04:14:11AM +1200, David Rowley wrote:
> > I still think we should start with a warning about pg_stat_reset().
> > People are surprised by this, and these are just the ones who notice:
> >
> > https://www.postgresql.org/message-id/CAB_myF4sZpxNXdb-x=welpqbdou6ue8fhtm0fverpm-1j7p...@mail.gmail.com
> >
> > I imagine there are many others just suffering from bloat due to
> > auto-vacuum not knowing how many dead tuples there are in the tables.
>
> OK, let me step back. Why are people resetting the statistics
> regularly? Based on that purpose, does it make sense to clear the
> stats that affect autovacuum?

I agree that we should research that angle. IMO resetting stats should be seriously frowned upon. And if they do need to reset some counters for some valid reason, offer a mechanism that leaves the autovacuum-guiding counters alone.

IMO the answer for $SUBJECT is yes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
Hi,

On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais wrote:
> > In my current design, the scan is done backward from end to start and I
> > keep all the records appearing after the last occurrence of their
> > respective FPI.
>
> Oh, interesting. That seems like it would require pretty major
> surgery on the WAL stream.

Can't you just read each segment forward, and then reverse? That's not that much memory? And sure, there's some inefficient cases where records span many segments, but that's rare enough that reading a few segments several times doesn't strike me as particularly bad?

Greetings,

Andres Freund
Re: Should the docs have a warning about pg_stat_reset()?
On Thu, Apr 11, 2019 at 04:14:11AM +1200, David Rowley wrote:
> On Sat, 30 Mar 2019 at 00:59, Robert Haas wrote:
> >
> > On Wed, Mar 27, 2019 at 7:49 PM David Rowley wrote:
> > > Yeah, analyze, not vacuum. It is a bit scary to add new ways for
> > > auto-vacuum to suddenly have a lot of work to do. When all workers
> > > are busy it can lead to neglect of other duties. It's true that there
> > > won't be much in the way of routine vacuuming work for the database
> > > the stats were just reset on, as of course, all the n_dead_tup
> > > counters were just reset. However, it could starve other databases of
> > > vacuum attention. Anti-wraparound vacuums on the current database may
> > > get neglected too.
> > >
> > > I'm not saying let's not do it, I'm just saying we need to think of
> > > what bad things could happen as a result of such a change.
> >
> > +1. I think that if we documented that pg_stat_reset() is going to
> > trigger an auto-analyze of every table in your system, we'd have some
> > people who didn't read the documentation and unleashed a storm of
> > auto-analyze activity, and other people who did read the documentation
> > and then intentionally used it to unleash a storm of auto-analyze
> > activity. Neither sounds that great.
>
> I still think we should start with a warning about pg_stat_reset().
> People are surprised by this, and these are just the ones who notice:
>
> https://www.postgresql.org/message-id/CAB_myF4sZpxNXdb-x=welpqbdou6ue8fhtm0fverpm-1j7p...@mail.gmail.com
>
> I imagine there are many others just suffering from bloat due to
> auto-vacuum not knowing how many dead tuples there are in the tables.

OK, let me step back. Why are people resetting the statistics regularly? Based on that purpose, does it make sense to clear the stats that affect autovacuum?

-- 
  Bruce Momjian        http://momjian.us
  EnterpriseDB         http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+        Ancient Roman grave inscription              +
Postgres "failures" dataset for machine learning
Hi all,

I was wondering if there exists either a test suite of pathological failure cases for postgres, or a dataset of failure scenarios. I'm not exactly sure what such a dataset would look like; possibly a bunch of snapshots of test databases while undergoing a bunch of different failure scenarios?

I'm experimenting with machine learning and I had an idea to build a classifier to determine if a running postgres database is having issues. Right now "issues" is very ambiguously defined, but I'm thinking of problems I've encountered at work, such as resource saturation, long running transactions, lock contention, etc. I know a lot of this is already covered by existing monitoring solutions, but I'm specifically interested to see if a ML model can learn monitoring rules on its own.

If the classifier turns out to be feasible then my hope would be to expand the ML model to have some diagnostic capabilities -- I've had difficulty in the past figuring out exactly what is going wrong with postgres when my workplace's production environment was having database issues.

Thanks,
Ben Simmons
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais wrote: > In my current design, the scan is done backward from end to start and I keep > all > the records appearing after the last occurrence of their respective FPI. Oh, interesting. That seems like it would require pretty major surgery on the WAL stream. > Summary files looks like what Andrey Borodin described as delta-files and > change maps. Yep. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: block-level incremental backup
Hi,

First, thank you for your answer!

On Wed, 10 Apr 2019 12:21:03 -0400 Robert Haas wrote:
> On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais wrote:
> > My idea would be to create a new tool working on archived WAL. No burden
> > server side. Basic concept is:
> >
> > * parse archives
> > * record latest relevant FPW for the incr backup
> > * write new WALs with recorded FPW and removing/rewriting duplicated
> >   WAL records.
> >
> > It's just a PoC and I haven't finished the WAL writing part...not even
> > talking about the replay part. I'm not even sure this project is a good
> > idea, but it is a good educational exercise for me in the meantime.
> >
> > Anyway, using real life OLTP production archives, my stats were:
> >
> >   # WAL    xlogrec kept    Size WAL kept
> >     127        39%             50%
> >     383        22%             38%
> >     639        20%             29%
> >
> > Based on these stats, I expect this would save a lot of time during
> > recovery as a first step. If it gets mature, it might even save a lot of
> > archive space or extend the retention period with degraded granularity.
> > It would even help taking full backups with a lower frequency.
> >
> > Any thoughts about this design would be much appreciated. I suppose this
> > should be offlist or in a new thread to avoid polluting this thread as
> > this is a slightly different subject.
>
> Interesting idea, but I don't see how it can work if you only deal
> with the FPWs and not the other records. For instance, suppose that
> you take a full backup at time T0, and then at time T1 there are two
> modifications to a certain block in quick succession. That block is
> then never touched again. Since no checkpoint intervenes between the
> modifications, the first one emits an FPI and the second does not.
> Capturing the FPI is fine as far as it goes, but unless you also do
> something with the non-FPI change, you lose that second modification.
> You could fix that by having your tool replicate the effects of WAL
> apply outside the server, but that sounds like a ton of work and a ton
> of possible bugs.

In my current design, the scan is done backward from end to start and I keep all the records appearing after the last occurrence of their respective FPI. The next challenge I have to tackle is dealing with multi-block records where some blocks need to be removed and others are FPIs to keep (eg. UPDATE).

> I have a related idea, though. Suppose that, as Peter says upthread,
> you have a replication slot that prevents old WAL from being removed.
> You also have a background worker that is connected to that slot. It
> decodes WAL and produces summary files containing all block-references
> extracted from those WAL records and the associated LSN (or maybe some
> approximation of the LSN instead of the exact value, to allow for
> compression and combining of nearby references). Then you hold onto
> those summary files after the actual WAL is removed. Now, when
> somebody asks the server for all blocks changed since a certain LSN,
> it can use those summary files to figure out which blocks to send
> without having to read all the pages in the database. Although I
> believe that a simple system that finds modified blocks by reading
> them all is good enough for a first version of this feature and useful
> in its own right, a more efficient system will be a lot more useful,
> and something like this seems to me to be probably the best way to
> implement it.

Summary files look like what Andrey Borodin described as delta-files and change maps.

> With an approach based
> on WAL-scanning, the work is done in the background and nobody has to
> wait for it.

Agree with this.
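The summary-file lookup described here boils down to merging the per-LSN-range files and deduplicating block references newer than the backup's start LSN. A toy sketch, with an invented in-memory file layout standing in for the on-disk format:

```python
def blocks_changed_since(summaries, since_lsn):
    """summaries: iterable of (start_lsn, end_lsn, refs) tuples, where refs
    is a list of (relfile, block, lsn) block references from that WAL range.

    Returns the set of (relfile, block) pairs modified after since_lsn,
    deduplicated across all summary files.
    """
    changed = set()
    for start_lsn, end_lsn, refs in summaries:
        if end_lsn <= since_lsn:
            continue                           # whole range predates the backup
        for relfile, block, lsn in refs:
            if lsn > since_lsn:
                changed.add((relfile, block))  # set membership does the dedup
    return changed
```

The deduplication is also the source of the memory concern raised earlier in the thread: the `changed` set must hold every distinct block reference for the whole interval at once.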
Re: Status of the table access method work
On Wed, Apr 10, 2019 at 8:32 PM Andres Freund wrote:
> On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> > Your explanation of existing limitations looks very good and
> > convincing. But I think there is one you didn't mention. We require
> > new table AMs to basically save old "contract" between heap and
> > indexes. We have "all or nothing" choice during updates. So, storage
> > can either insert to none of indexes or insert to all of them
> > (HOT-like update).
>
> I think that's a problem, and yea, I should have mentioned it. I'd
> earlier thought about it and then forgot.
>
> I however don't think we should design the interface for this before we
> have at least one AM that's actually in decent-ish shape that needs
> it. I seriously doubt we'll get the interface right enough.
>
> Note: I'm *extremely* *extremely* doubtful that moving the full executor
> invocations for expression indices etc into the tableam is a sane thing
> to do. It's possible to convince me there's no alternative, but it'll be
> really hard.
>
> I suspect the right direction will be more going in a direction of
> computing new index tuples for expression indexes before tableam gets
> involved. If we do that right, we can also implement the stuff that
> 1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way.

Probably we can invent a few modes in which a table AM might work: calculation of all new index tuples; calculation of new and old index tuples for updated fields; calculation of all new and old index tuples; and so on. And then index tuples would be calculated either in advance or by a callback.

> > I think any storage, which is going to address "write amplification"
> > point raised by Uber, needs to break this "contract".
>
> FWIW, I don't think it makes much point in using Uber as a justification
> for anything here. Their analysis was so deeply flawed and motivated by
> non-technical reasons that it should just be ignored.

Yeah, Uber is just a buzz word here.
But the problem that an update of a single indexed field leads to insertions into every index is well known among PostgreSQL users.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: pgbench - add minimal stats on initialization
Hello,

Thanks for the feedback.

> I have one minor observation that in case of initDropTables you log
> 'drop' and in case of initCreateTables you log 'create table'. We need
> to be consistent. The "drop tables" and "create tables" are the best
> fit here.

Ok. Attached version does that, plus avoids re-assigning "first" on each loop, plus checks that --no-vacuum indeed removes all vacuums in the TAP test.

-- 
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b67ad5e823..7de0d914f1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3929,32 +3929,48 @@ checkInitSteps(const char *initialize_steps)
 static void
 runInitSteps(const char *initialize_steps)
 {
-	PGconn	   *con;
-	const char *step;
+	PQExpBufferData stats;
+	PGconn	   *con;
+	const char *step;
+	double		run_time = 0.0;
+	bool		first = true;
+
+	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
 		exit(1);
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
+		instr_time	start;
+		char	   *op = NULL;
+
+		INSTR_TIME_SET_CURRENT(start);
+
 		switch (*step)
 		{
 			case 'd':
+				op = "drop tables";
 				initDropTables(con);
 				break;
 			case 't':
+				op = "create tables";
 				initCreateTables(con);
 				break;
 			case 'g':
+				op = "generate";
 				initGenerateData(con);
 				break;
 			case 'v':
+				op = "vacuum";
 				initVacuum(con);
 				break;
 			case 'p':
+				op = "primary keys";
 				initCreatePKeys(con);
 				break;
 			case 'f':
+				op = "foreign keys";
 				initCreateFKeys(con);
 				break;
 			case ' ':
@@ -3965,10 +3981,30 @@ runInitSteps(const char *initialize_steps)
 				PQfinish(con);
 				exit(1);
 		}
+
+		if (op != NULL)
+		{
+			instr_time	diff;
+			double		elapsed_sec;
+
+			INSTR_TIME_SET_CURRENT(diff);
+			INSTR_TIME_SUBTRACT(diff, start);
+			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+
+			if (!first)
+				appendPQExpBufferStr(&stats, ", ");
+			else
+				first = false;
+
+			appendPQExpBuffer(&stats, "%s %.2f s", op, elapsed_sec);
+
+			run_time += elapsed_sec;
+		}
 	}
 
-	fprintf(stderr, "done.\n");
+	fprintf(stderr, "done in %.2f s (%s).\n", run_time, stats.data);
 	PQfinish(con);
+	termPQExpBuffer(&stats);
 }
 
 /*
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 62906d5e55..e910c2fc01 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -101,7 +101,7 @@ pgbench(
 	[qr{^$}],
 	[
 		qr{creating tables},
 		qr{vacuuming},
-		qr{creating primary keys}, qr{done\.}
+		qr{creating primary keys}, qr{done in \d+\.\d\d s}
 	],
 	'pgbench scale 1 initialization',);
@@ -116,7 +116,8 @@ pgbench(
 		qr{vacuuming},
 		qr{creating primary keys},
 		qr{creating foreign keys},
-		qr{done\.}
+		qr{(?!vacuuming)},	# no vacuum
+		qr{done in \d+\.\d\d s}
 	],
 	'pgbench scale 1 initialization');
@@ -131,7 +132,8 @@ pgbench(
 		qr{creating primary keys},
 		qr{.* of .* tuples \(.*\) done},
 		qr{creating foreign keys},
-		qr{done\.}
+		qr{(?!vacuuming)},	# no vacuum
+		qr{done in \d+\.\d\d s}
 	],
 	'pgbench --init-steps');
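The reporting the patch adds, per-step elapsed times rolled up into a single "done in ... s (...)" line, can be mimicked outside pgbench like this (timings are injected rather than measured so the output is deterministic; the step names follow the patch's labels):

```python
def format_init_report(step_timings):
    """step_timings: list of (step_name, elapsed_seconds) in execution order.
    Mirrors the 'done in %.2f s (%s).' line built with PQExpBuffer above."""
    total = sum(sec for _, sec in step_timings)
    parts = ", ".join(f"{name} {sec:.2f} s" for name, sec in step_timings)
    return f"done in {total:.2f} s ({parts})."

report = format_init_report([
    ("drop tables", 0.12),
    ("create tables", 0.05),
    ("generate", 1.30),
    ("primary keys", 0.40),
])
```

Accumulating the per-step values and printing one summary line at the end is what lets the patch keep a single `fprintf(stderr, ...)` call in place of the old bare "done.".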
Re: Status of the table access method work
Hi, On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote: > Your explanation of existing limitations looks very good and > convincing. But I think there is one you didn't mention. We require > new table AMs to basically preserve the old "contract" between heap and > indexes. We have an "all or nothing" choice during updates: storage > can either insert into none of the indexes or insert into all of them > (HOT-like update). I think that's a problem, and yea, I should have mentioned it. I'd earlier thought about it and then forgot. I however don't think we should design the interface for this before we have at least one AM that's actually in decent-ish shape that needs it. I seriously doubt we'll get the interface right enough. Note: I'm *extremely* *extremely* doubtful that moving the full executor invocations for expression indices etc. into the tableam is a sane thing to do. It's possible to convince me there's no alternative, but it'll be really hard. I suspect the right direction is to compute new index tuples for expression indexes before tableam gets involved. If we do that right, we can also implement the stuff that 1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way. > I think any storage that is going to address the "write amplification" > point raised by Uber needs to break this "contract". FWIW, I don't think there's much point in using Uber as a justification for anything here. Their analysis was so deeply flawed and motivated by non-technical reasons that it should just be ignored. Greetings, Andres Freund
Re: Status of the table access method work
On Fri, Apr 5, 2019 at 11:25 PM Andres Freund wrote: > I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the > others that collaborated on making tableam happen. It was/is a huge > project. Thank you so much for bringing this project to commit! Excellent work! Your explanation of existing limitations looks very good and convincing. But I think there is one you didn't mention. We require new table AMs to basically preserve the old "contract" between heap and indexes. We have an "all or nothing" choice during updates: storage can either insert into none of the indexes or insert into all of them (HOT-like update). I think any storage that is going to address the "write amplification" point raised by Uber needs to break this "contract". For example, zheap promises to implement delete-marking indexes. But it's not yet published, and for me it's not clear that this approach is the best among the alternatives. With the delete-marking approach you need to update index tuples corresponding to old values of updated fields. But in addition to that, it's not trivial to delete an index tuple. In order to do that, you need to both locate this index tuple and know that this index value isn't present in the undo chain. So, another index lookup is likely required while purging the undo chain. Thus, we basically need two random index lookups for every deleted index tuple. Also, it becomes more complex to look up the appropriate heap tuple during an index scan. You then need to check not only visibility, but also that the index value matches (here we need to adjust the index_fetch_tuple interface), because the version visible to you might have a different index value. That may lead to O(N^2) performance when accessing a single row with N versions (MySQL InnoDB has this problem). An alternative idea is to have MVCC-aware indexes. This approach looks more attractive to me. In this approach you basically need xmin and xmax fields in index tuples. On insertion of an index tuple you fill its xmin.
On update, the previous index tuple is marked with xmax. After that, outdated index tuples can be deleted in a lazy manner when page space is required. So, only one random access is required per deleted index tuple. With this approach fetching a single row is O(N). Also, index-only scans become very easy and don't even need a visibility map. The only problem here is the extra space required for index tuples. But I think this is a well-isolated problem, which is easy to attack. For instance, some visibility information could be evicted to the undo chain (like zheap does for its tuples). Also, we can have a special bit for "all visible" index tuples. With the "all visible" bit set, a tuple can get rid of its visibility fields. We can do this for index tuples because, if an index tuple requires extra space, we can split the page, unlike the heap, where tuples are fixed in their pages and xmax needs to be updated in place. I understand that delete-marking indexes have some advantages, and some people find them more appealing. But my point is that we shouldn't build one of these approaches into the API unless we have concrete proof that it clearly outperforms the other. It would be better to have our table-AM API flexible enough to implement both. I can imagine proper encapsulation here bringing more of the interaction with indexes to the table AM side. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
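Concretely, the xmin/xmax idea sketched above could look roughly like this (a minimal illustration only, not zheap's or anyone's actual design; the tuple layout, names, and the grossly simplified snapshot rule are all invented for the example):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexTuple:
    key: int
    heap_ptr: int
    xmin: int                   # xid that inserted this index entry
    xmax: Optional[int]         # xid that marked it dead on update, or None
    all_visible: bool = False   # "all visible" bit: visibility fields implied

def index_tuple_visible(tup: IndexTuple, snapshot_xid: int) -> bool:
    """True if this index entry should be returned to a snapshot that sees
    the effects of all transactions with xid <= snapshot_xid (a gross
    simplification of real snapshot semantics)."""
    if tup.all_visible:
        return True
    if tup.xmin > snapshot_xid:             # inserter not yet visible
        return False
    if tup.xmax is not None and tup.xmax <= snapshot_xid:
        return False                        # deleting xact already visible
    return True
```

The point of the sketch is the one made in the mail: with visibility carried in the index entry itself, an index-only scan can decide visibility without consulting a visibility map, and an updated key leaves behind an old entry that is filtered out lazily rather than deleted eagerly.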
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 12:56 PM Ashwin Agrawal wrote: > Not to fork the conversation from incremental backups, but a similar approach > is what we have been thinking of for pg_rewind. Currently, pg_rewind requires > all the WAL logs to be present on the source side from the point of divergence to > rewind. Instead, just parse the WAL and keep the changed blocks around on the > source. Then we don't need to retain the WAL but can still rewind using the > changed block map. So, rewind becomes very similar to the incremental backup > proposed here after performing rewind activity using target side WAL only. Interesting. So if we build a system like this for incremental backup, or for pg_rewind, the other one can use the same infrastructure. That sounds excellent. I'll start a new thread to talk about that, and hopefully you and Heikki and others will chime in with thoughts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 7:51 AM Andrey Borodin wrote: > > 9 апр. 2019 г., в 20:48, Robert Haas написал(а): > > - This is just a design proposal at this point; there is no code. If > > this proposal, or some modified version of it, seems likely to be > > acceptable, I and/or my colleagues might try to implement it. > > I'll be happy to help with code, discussion and patch review. That would be great! We should probably give this discussion some more time before we plunge into the implementation phase, but I'd love to have some help with that, whether it's with coding or review or whatever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Cleanup/remove/update references to OID column
On Wed, Apr 10, 2019 at 06:32:35PM +0200, Daniel Verite wrote: > Justin Pryzby wrote: > > > Cleanup/remove/update references to OID column... > > Just spotted a couple of other references that need updates: > #1. In catalogs.sgml: > #2. In ddl.sgml, when describing ctid: I found and included fixes for a few more references: doc/src/sgml/catalogs.sgml | 2 +- doc/src/sgml/ddl.sgml| 3 +-- doc/src/sgml/information_schema.sgml | 4 ++-- doc/src/sgml/ref/create_trigger.sgml | 2 +- doc/src/sgml/spi.sgml| 2 +- Justin >From d77f3d95b8cbb6bbc4addaaf4d9b1bcc11f10f10 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 2 Apr 2019 19:13:55 -0500 Subject: [PATCH v2] Cleanup/remove/update references to OID column... ..in wake of 578b229718e8f. See also 93507e67c9ca54026019ebec3026de35d30370f9 1464755fc490a9911214817fe83077a3689250ab Reviewed-by: Daniel Verite --- doc/src/sgml/catalogs.sgml | 2 +- doc/src/sgml/ddl.sgml| 12 +--- doc/src/sgml/information_schema.sgml | 4 ++-- doc/src/sgml/ref/create_trigger.sgml | 2 +- doc/src/sgml/ref/insert.sgml | 12 +--- doc/src/sgml/ref/psql-ref.sgml | 3 +++ doc/src/sgml/spi.sgml| 2 +- 7 files changed, 18 insertions(+), 19 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 40ddec4..d544e60 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -1047,7 +1047,7 @@ The number of the column. Ordinary columns are numbered from 1 - up. System columns, such as oid, + up. System columns, such as ctid, have (arbitrary) negative numbers. diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 9e761db..244d5ce 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -1202,8 +1202,7 @@ CREATE TABLE circles ( ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row - identifier. The OID, or even better a user-defined serial - number, should be used to identify logical rows. + identifier. 
A primary key should be used to identify logical rows. @@ -3672,11 +3671,10 @@ VALUES ('Albany', NULL, NULL, 'NY'); Partitions cannot have columns that are not present in the parent. It is not possible to specify columns when creating partitions with - CREATE TABLE, nor is it possible to add columns to - partitions after-the-fact using ALTER TABLE. Tables may be - added as a partition with ALTER TABLE ... ATTACH PARTITION - only if their columns exactly match the parent, including any - oid column. + CREATE TABLE, to add columns to + partitions after-the-fact using ALTER TABLE, nor to + add a partition with ALTER TABLE ... ATTACH PARTITION + if its columns would not exactly match those of the parent. diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml index 1321ade..9c618b1 100644 --- a/doc/src/sgml/information_schema.sgml +++ b/doc/src/sgml/information_schema.sgml @@ -1312,8 +1312,8 @@ The view columns contains information about all table columns (or view columns) in the database. System columns - (oid, etc.) are not included. Only those columns are - shown that the current user has access to (by way of being the + (ctid, etc.) are not included. The only columns shown + are those to which the current user has access (by way of being the owner or having some privilege). diff --git a/doc/src/sgml/ref/create_trigger.sgml b/doc/src/sgml/ref/create_trigger.sgml index 6456105..3339a4b 100644 --- a/doc/src/sgml/ref/create_trigger.sgml +++ b/doc/src/sgml/ref/create_trigger.sgml @@ -465,7 +465,7 @@ UPDATE OF column_name1 [, column_name2NEW row seen by the condition is the current value, as possibly modified by earlier triggers. Also, a BEFORE trigger's WHEN condition is not allowed to examine the - system columns of the NEW row (such as oid), + system columns of the NEW row (such as ctid), because those won't have been set yet. 
diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml index 62e142f..3e1be4c 100644 --- a/doc/src/sgml/ref/insert.sgml +++ b/doc/src/sgml/ref/insert.sgml @@ -552,13 +552,11 @@ INSERT INTO table_name [ AS oid count The count is the - number of rows inserted or updated. If count is exactly one, and the - target table has OIDs, then oid is the OID - assigned to the inserted row. The single row must have been - inserted rather than updated. Otherwise oid is zero. + number of rows inserted or updated. + oid used to be the object ID of the inserted row + if rows was 1 and the target table had OIDs, but + OIDs system columns are not supported anymore; therefore + oid is always 0.
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 9:21 AM Robert Haas wrote: > I have a related idea, though. Suppose that, as Peter says upthread, > you have a replication slot that prevents old WAL from being removed. > You also have a background worker that is connected to that slot. It > decodes WAL and produces summary files containing all block-references > extracted from those WAL records and the associated LSN (or maybe some > approximation of the LSN instead of the exact value, to allow for > compression and combining of nearby references). Then you hold onto > those summary files after the actual WAL is removed. Now, when > somebody asks the server for all blocks changed since a certain LSN, > it can use those summary files to figure out which blocks to send > without having to read all the pages in the database. Although I > believe that a simple system that finds modified blocks by reading > them all is good enough for a first version of this feature and useful > in its own right, a more efficient system will be a lot more useful, > and something like this seems to me to be probably the best way to > implement it. > Not to fork the conversation from incremental backups, but a similar approach is what we have been thinking of for pg_rewind. Currently, pg_rewind requires all the WAL logs to be present on the source side from the point of divergence to rewind. Instead, just parse the WAL and keep the changed blocks around on the source. Then we don't need to retain the WAL but can still rewind using the changed block map. So, rewind becomes very similar to the incremental backup proposed here after performing rewind activity using target side WAL only.
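The shared piece of infrastructure both posts describe — extract block references from decoded WAL into a changed-block map, then answer "which blocks changed since LSN X" for either a rewind or an incremental backup — can be sketched abstractly like this (all structures and names here are invented for illustration; real WAL decoding is of course far more involved):

```python
def build_changed_block_map(wal_records):
    """wal_records: iterable of (lsn, [(relfile, blockno), ...]) pairs, one
    per decoded WAL record. Returns a map from (relfile, blockno) to the
    highest LSN that touched that block."""
    changed = {}
    for lsn, block_refs in wal_records:
        for ref in block_refs:
            changed[ref] = max(lsn, changed.get(ref, 0))
    return changed

def blocks_changed_since(changed, start_lsn):
    """The set of blocks an incremental backup (or a rewind) must copy:
    everything last touched at or after start_lsn."""
    return {ref for ref, lsn in changed.items() if lsn >= start_lsn}
```

The appeal of this shape is exactly the point made above: once the map exists, the WAL segments it was built from can be discarded, yet since-LSN queries can still be answered.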
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik wrote: > Some times ago I have implemented alternative version of ptrack utility > (not one used in pg_probackup) > which detects updated block at file level. It is very simple and may be > it can be sometimes integrated in master. I don't think this is completely crash-safe. It looks like it arranges to msync() the ptrack file at appropriate times (although I haven't exhaustively verified the logic), but it uses MS_ASYNC, so it's possible that the ptrack file could get updated on disk either before or after the relation file itself. I think before is probably OK -- it just risks having some blocks look modified when they aren't really -- but after seems like it is very much not OK. And changing this to use MS_SYNC would probably be really expensive. Likely a better approach would be to hook into the new fsync queue machinery that Thomas Munro added to PostgreSQL 12. It looks like your system maps all the blocks in the system into a fixed-size map using hashing. If the number of modified blocks between the full backup and the incremental backup is large compared to the size of the ptrack map, you'll start to get a lot of false-positives. It will look as if much of the database needs to be backed up. For example, in your sample configuration, you have ptrack_map_size = 103. If you've got a 100GB database with 20% daily turnover, that's about 2.6 million blocks. If you bump a random entry ~2.6 million times in a map with 103 entries, on the average ~92% of the entries end up getting bumped, so you will get very little benefit from incremental backup. This problem drops off pretty fast if you raise the size of the map, but it's pretty critical that your map is large enough for the database you've got, or you may as well not bother. It also appears that your system can't really handle resizing of the map in any friendly way.
So if your data size grows, you may be faced with either letting the map become progressively less effective, or throwing it out and losing all the data you have. None of that is to say that what you're presenting here has no value, but I think it's possible to do better (and I think we should try). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
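The "~92% of the entries end up getting bumped" claim follows the standard occupancy formula for random hashing. A quick sketch (note: the map size used in the example call is an assumption chosen for illustration, since the sample configuration value quoted above appears truncated):

```python
def expected_bumped_fraction(n_bumps: int, map_entries: int) -> float:
    """Expected fraction of slots hit at least once when n_bumps hashes land
    uniformly at random in a map with map_entries slots:
    1 - (1 - 1/m)^n, which is approximately 1 - exp(-n/m)."""
    return 1.0 - (1.0 - 1.0 / map_entries) ** n_bumps

# ~2.6 million modified blocks hashed into a one-million-entry map (assumed
# size) marks roughly 92% of the map, so nearly the whole database would be
# resent; a map 100x larger keeps the false-positive rate under 3%.
dense = expected_bumped_fraction(2_600_000, 1_000_000)
sparse = expected_bumped_fraction(2_600_000, 100_000_000)
```

This makes the sizing argument concrete: the false-positive rate is governed by the ratio of modified blocks to map entries, which is why a map that is too small for the database's turnover "may as well not bother".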
Re: Cleanup/remove/update references to OID column
Justin Pryzby wrote: > Cleanup/remove/update references to OID column... > > ..in wake of 578b229718e8f. Just spotted a couple of other references that need updates: #1. In catalogs.sgml: attnum int2 The number of the column. Ordinary columns are numbered from 1 up. System columns, such as oid, have (arbitrary) negative numbers. oid should be replaced by xmin or some other system column. #2. In ddl.sgml, when describing ctid: The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows. "The OID" used to refer to an entry above in that list, now it's not clear what it refers to. "serial number" also sounds somewhat obsolete now that IDENTITY is supported. The last sentence could be changed to: "A primary key should be used to identify logical rows". Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Re: block-level incremental backup
On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais wrote: > My idea would be to create a new tool working on archived WAL. No burden > server side. Basic concept is: > > * parse archives > * record latest relevant FPW for the incr backup > * write new WALs with recorded FPW and removing/rewriting duplicated > WAL records. > > It's just a PoC and I hadn't finished the WAL writing part...not even talking > about the replay part. I'm not even sure this project is a good idea, but it > is > a good educational exercise to me in the meantime. > > Anyway, using real life OLTP production archives, my stats were:
> # WAL | xlogrec kept | Size WAL kept
>   127 |          39% |           50%
>   383 |          22% |           38%
>   639 |          20% |           29%
> Based on these stats, I expect this would save a lot of time during recovery in > a first step. If it gets mature, it might even save a lot of archive space or > extend the retention period with degraded granularity. It would even help > taking full backups with a lower frequency. > > Any thoughts about this design would be much appreciated. I suppose this > should > be offlist or in a new thread to avoid polluting this thread as this is a > slightly different subject. Interesting idea, but I don't see how it can work if you only deal with the FPWs and not the other records. For instance, suppose that you take a full backup at time T0, and then at time T1 there are two modifications to a certain block in quick succession. That block is then never touched again. Since no checkpoint intervenes between the modifications, the first one emits an FPI and the second does not. Capturing the FPI is fine as far as it goes, but unless you also do something with the non-FPI change, you lose that second modification. You could fix that by having your tool replicate the effects of WAL apply outside the server, but that sounds like a ton of work and a ton of possible bugs. I have a related idea, though.
Suppose that, as Peter says upthread, you have a replication slot that prevents old WAL from being removed. You also have a background worker that is connected to that slot. It decodes WAL and produces summary files containing all block-references extracted from those WAL records and the associated LSN (or maybe some approximation of the LSN instead of the exact value, to allow for compression and combining of nearby references). Then you hold onto those summary files after the actual WAL is removed. Now, when somebody asks the server for all blocks changed since a certain LSN, it can use those summary files to figure out which blocks to send without having to read all the pages in the database. Although I believe that a simple system that finds modified blocks by reading them all is good enough for a first version of this feature and useful in its own right, a more efficient system will be a lot more useful, and something like this seems to me to be probably the best way to implement it. The reason why I think this is likely to be superior to other possible approaches, such as the ptrack approach Konstantin suggests elsewhere on this thread, is because it pushes the work of figuring out which blocks have been modified into the background. With a ptrack-type approach, the server has to do some non-zero amount of extra work in the foreground every time it modifies a block. With an approach based on WAL-scanning, the work is done in the background and nobody has to wait for it. It's possible that there are other considerations which aren't occurring to me right now, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
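The "simple system that finds modified blocks by reading them all" baseline mentioned above — scan every relation block and keep the ones whose page-header LSN is at or past the threshold — can be sketched as follows (a deliberately naive illustration, not working backup code; the layout assumed here follows PostgreSQL's page header, where pd_lsn is stored first as two 32-bit halves, but endianness and everything else is simplified):

```python
import struct

BLCKSZ = 8192

def page_lsn(page: bytes) -> int:
    """Read pd_lsn from the start of a page: (xlogid, xrecoff) halves,
    assumed little-endian here for illustration."""
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff

def modified_blocks(relfile: bytes, threshold_lsn: int):
    """Yield (blockno, page) for every page stamped at or after
    threshold_lsn -- the blocks an incremental backup would send."""
    for blockno in range(len(relfile) // BLCKSZ):
        page = relfile[blockno * BLCKSZ:(blockno + 1) * BLCKSZ]
        if page_lsn(page) >= threshold_lsn:
            yield blockno, page
```

As the discussion notes, this reads every page (full read I/O) but sends only the changed ones (reduced network I/O), and it needs no prior setup; the WAL-summary approach trades background work for avoiding the full scan.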
Re: Should the docs have a warning about pg_stat_reset()?
On Sat, 30 Mar 2019 at 00:59, Robert Haas wrote: > > On Wed, Mar 27, 2019 at 7:49 PM David Rowley > wrote: > > Yeah, analyze, not vacuum. It is a bit scary to add new ways for > > auto-vacuum to suddenly have a lot of work to do. When all workers > > are busy it can lead to neglect of other duties. It's true that there > > won't be much in the way of routine vacuuming work for the database > > the stats were just reset on, as of course, all the n_dead_tup > > counters were just reset. However, it could starve other databases of > > vacuum attention. Anti-wraparound vacuums on the current database may > > get neglected too. > > > > I'm not saying let's not do it, I'm just saying we need to think of > > what bad things could happen as a result of such a change. > > +1. I think that if we documented that pg_stat_reset() is going to > trigger an auto-analyze of every table in your system, we'd have some > people who didn't read the documentation and unleashed a storm of > auto-analyze activity, and other people who did read the documentation > and then intentionally used it to unleash a storm of auto-analyze > activity. Neither sounds that great. I still think we should start with a warning about pg_stat_reset(). People are surprised by this, and these are just the ones who notice: https://www.postgresql.org/message-id/CAB_myF4sZpxNXdb-x=welpqbdou6ue8fhtm0fverpm-1j7p...@mail.gmail.com I imagine there are many others just suffering from bloat due to auto-vacuum not knowing how many dead tuples there are in the tables. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pg_dump is broken for partition tablespaces
Hi, On 2019-04-10 09:28:21 -0400, Alvaro Herrera wrote: > So I think that apart from David's patch, we should just document all > these things carefully. Yea, I think that's the most important part. I'm not convinced that we should have any inheriting behaviour, btw - it seems like there are a lot of different ways one could think this should behave, each with good reasons. Greetings, Andres Freund
Re: Minimal logical decoding on standbys
Hi, On 2019-04-10 12:11:21 +0530, tushar wrote: > > On 03/13/2019 08:40 PM, tushar wrote: > > Hi , > > > > I am getting a server crash on standby while executing > > pg_logical_slot_get_changes function , please refer this scenario > > > > Master cluster( ./initdb -D master) > > set wal_level='hot_standby in master/postgresql.conf file > > start the server , connect to psql terminal and create a physical > > replication slot ( SELECT * from > > pg_create_physical_replication_slot('p1');) > > > > perform pg_basebackup using --slot 'p1' (./pg_basebackup -D slave/ -R > > --slot p1 -v)) > > set wal_level='logical' , hot_standby_feedback=on, > > primary_slot_name='p1' in slave/postgresql.conf file > > start the server , connect to psql terminal and create a logical > > replication slot ( SELECT * from > > pg_create_logical_replication_slot('t','test_decoding');) > > > > run pgbench ( ./pgbench -i -s 10 postgres) on master and select > > pg_logical_slot_get_changes on Slave database > > > > postgres=# select * from pg_logical_slot_get_changes('t',null,null); > > 2019-03-13 20:34:50.274 IST [26817] LOG: starting logical decoding for > > slot "t" > > 2019-03-13 20:34:50.274 IST [26817] DETAIL: Streaming transactions > > committing after 0/6C60, reading WAL from 0/6C28. > > 2019-03-13 20:34:50.274 IST [26817] STATEMENT: select * from > > pg_logical_slot_get_changes('t',null,null); > > 2019-03-13 20:34:50.275 IST [26817] LOG: logical decoding found > > consistent point at 0/6C28 > > 2019-03-13 20:34:50.275 IST [26817] DETAIL: There are no running > > transactions. > > 2019-03-13 20:34:50.275 IST [26817] STATEMENT: select * from > > pg_logical_slot_get_changes('t',null,null); > > TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File: > > "decode.c", Line: 977) > > server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > The connection to the server was lost. 
Attempting reset: 2019-03-13 > > 20:34:50.276 IST [26809] LOG: server process (PID 26817) was terminated > > by signal 6: Aborted > > > Andres - Do you think - this is an issue which needs to be fixed ? Yes, it definitely needs to be fixed. I just haven't had sufficient time to look into it. Have you reproduced this with Amit's latest version? Amit, have you spent any time looking into it? I know that you're not that deeply steeped into the internals of logical decoding, but perhaps there's something obvious going on. Greetings, Andres Freund
Re: block-level incremental backup
On Tue, Apr 9, 2019 at 5:28 PM Alvaro Herrera wrote: > On 2019-Apr-09, Peter Eisentraut wrote: > > On 2019-04-09 17:48, Robert Haas wrote: > > > 3. There should be a new tool that knows how to merge a full backup > > > with any number of incremental backups and produce a complete data > > > directory with no remaining partial files. > > > > Are there by any chance standard file formats and tools that describe a > > binary difference between directories? That would be really useful here. > > VCDIFF? https://tools.ietf.org/html/rfc3284 I don't understand VCDIFF very well, but I see some potential problems with going in this direction. First, suppose we take a full backup on Monday. Then, on Tuesday, we want to take an incremental backup. In my proposal, the backup client only needs to provide the database server with one piece of information: the start-LSN of the previous backup. The server determines which blocks are recently modified and sends them to the client, which stores them. The end. On the other hand, storing a maximally compact VCDIFF seems to require that, for each block modified in the Tuesday backup, we go read the corresponding block as it existed on Monday. Assuming that the server is using some efficient method of locating modified blocks, this will approximately double the amount of read I/O required to complete the backup: either the server or the client must now read not only the current version of the block but also the previous version. If the previous backup is an incremental backup that does not contain full block images but only VCDIFF content, whoever is performing the VCDIFF calculation will need to walk the entire backup chain and reconstruct the previous contents of each block so that it can compute the newest VCDIFF. A customer who does an incremental backup every day and maintains a synthetic full backup from 1 week prior will see a roughly eightfold increase in read I/O compared to the design I proposed.
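The read-I/O arithmetic in this argument can be made explicit. A small sketch, where the figures simply restate the assumptions used in this discussion (a 7-backup chain back to a full backup, and 20% daily turnover):

```python
def backup_read_amplification(chain_length: int) -> int:
    """Reads needed per changed block to compute a minimal delta: the current
    version, plus one reconstruction step per older backup in the chain."""
    return 1 + chain_length

def restore_read_cost(turnover: float, incrementals: int) -> float:
    """Total data read at restore time, as a fraction of database size:
    the full backup (1.0) plus each incremental read in its entirety."""
    return 1.0 + turnover * incrementals

# A week-old synthetic full with daily incrementals: chain of 7 older backups.
assert backup_read_amplification(7) == 8            # the "eightfold" increase
assert abs(restore_read_cost(0.20, 6) - 2.2) < 1e-9  # 100% + 6*20% = 220%
```

The block-copy design, by contrast, keeps both functions constant: one read per changed block at backup time, and roughly one database's worth of reads at restore time.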
The same problem exists at restore time. In my design, the total read I/O required is equal to the size of the database, plus however much metadata needs to be read from older delta files -- and that should be fairly small compared to the actual data being read, at least in normal, non-extreme cases. But if we are going to proceed by applying a series of delta files, we're going to need to read every older backup in its entirety. If the turnover percentage is significant, say 20%/day, and if the backup chain is say 7 backups long to get back to a full backup, this is a huge difference. Instead of having to read ~100% of the database size, as in my proposal, we'll need to read 100% + (6 * 20%) = 220% of the database size. Since VCDIFF uses an add-copy-run language to describe differences, we could try to work around the problem that I just described by describing each changed data block as an 8192-byte add, and unchanged blocks as an 8192-byte copy. If we did that, then I think that the problem at backup time goes away: we can write out a VCDIFF-format file for the changed blocks based just on knowing that those are the blocks that have changed, without needing to access the older file. Of course, if we do it this way, the file will be larger than it would be if we actually compared the old and new block contents and wrote out a minimal VCDIFF, but it does make taking a backup a lot simpler. Even with this proposal, though, I think we still have trouble with restore time. I proposed putting the metadata about which blocks are included in a delta file at the beginning of the file, which allows a restore of a new incremental backup to relatively efficiently flip through older backups to find just the blocks that it needs, without having to read the whole file. But I think (although I am not quite sure) that in the VCDIFF format, the payload for an ADD instruction is stored near the instruction itself.
The result would be that you'd have to basically read the whole file at restore time to figure out which blocks were available from that file and which ones needed to be retrieved from an older backup. So while this approach would fix the backup-time problem, I believe that it would still require significantly more read I/O at restore time than my proposal. Furthermore, if, at backup time, we have to do anything that requires access to the old data, either the client or the server needs to have access to that data. Notwithstanding the costs of reading it, that doesn't seem very desirable. The server is quite unlikely to have access to the backups, because most users want to back up to a different server in order to guard against a hardware failure. The client is more likely to be running on a machine where it has access to the data, because many users back up to the same machine every day, so the machine that is taking the current backup probably has the older one. However, accessing that old backup might not be cheap. It could be located in an object store in the
Re: Failure in contrib test _int on loach
On 09/04/2019 19:11, Anastasia Lubennikova wrote: 05.04.2019 19:41, Anastasia Lubennikova writes: In attachment, you can find a patch with a test that allows the bug to be reproduced on every run rather than randomly. Now I'm trying to find a way to fix the issue. The problem was caused by incorrect detection of the page to insert the new tuple into after a split. If gistinserttuple() of the tuple formed by gistgetadjusted() had to split the page, we must go back to the parent and descend back to the child that's a better fit for the new tuple. Previously this was handled by the code block with the following comment: * Concurrent split detected. There's no guarantee that the * downlink for this page is consistent with the tuple we're * inserting anymore, so go back to parent and rechoose the best * child. After introducing GistBuildNSN this code path became unreachable. To fix it, I added a new flag to detect such splits during index build. Isn't it possible that the grandparent page is also split, so that we'd need to climb further up? - Heikki
Re: Berserk Autovacuum (let's save next Mandrill)
Hi, On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera wrote: >On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote: > >> Alternative point of "if your database is super large and actively >written, >> you may want to set autovacuum_freeze_max_age to even smaller values >so >> that autovacuum load is more evenly spread over time" may be needed. > >I don't think it's helpful to force emergency vacuuming more >frequently; >quite the contrary, it's likely to cause even more issues. We should >tweak autovacuum to perform freezing more preemptively instead. I still think the fundamental issue with making vacuum less painful is that all indexes have to be read entirely, even if there's not much work (say millions of rows frozen, hundreds removed). Without that issue we could vacuum much more frequently. And do it properly in insert-only workloads. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Berserk Autovacuum (let's save next Mandrill)
On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote: > Alternative point of "if your database is super large and actively written, > you may want to set autovacuum_freeze_max_age to even smaller values so > that autovacuum load is more evenly spread over time" may be needed. I don't think it's helpful to force emergency vacuuming more frequently; quite the contrary, it's likely to cause even more issues. We should tweak autovacuum to perform freezing more preemptively instead. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
Hi, On Tue, 9 Apr 2019 11:48:38 -0400 Robert Haas wrote: > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have > developed technology that permits a block-level incremental backup to > be taken from a PostgreSQL server. I believe the idea in all of those > cases is that non-relation files should be backed up in their > entirety, but for relation files, only those blocks that have been > changed need to be backed up. I would like to propose that we should > have a solution for this problem in core, rather than leaving it to > each individual PostgreSQL company to develop and maintain their own > solution. Generally my idea is: > > 1. There should be a way to tell pg_basebackup to request from the > server only those blocks where LSN >= threshold_value. There are > several possible ways for the server to implement this, the simplest > of which is to just scan all the blocks and send only the ones that > satisfy that criterion. That might sound dumb, but it does still save > network bandwidth, and it works even without any prior setup. +1 this is a simple design and probably an easy first step bringing a lot of benefits already. > It will probably be more efficient in many cases to instead scan all the WAL > generated since that LSN and extract block references from it, but > that is only possible if the server has all of that WAL available or > can somehow get it from the archive. I seize the opportunity to discuss this on the fly. I've been playing with the idea of producing incremental backups from archives for many years, but I've only started PoC'ing on it this year. My idea would be to create a new tool working on archived WAL. No burden server side. The basic concept is: * parse archives * record the latest relevant FPW for the incremental backup * write new WALs with the recorded FPWs, removing/rewriting duplicated WAL records. It's just a PoC and I haven't finished the WAL writing part... not even talking about the replay part.
I'm not even sure this project is a good idea, but it is a good educational exercise for me in the meantime. Anyway, using real-life OLTP production archives, my stats were:

  # WAL   xlogrec kept   Size WAL kept
    127   39%            50%
    383   22%            38%
    639   20%            29%

Based on these stats, I expect this would save a lot of time during recovery in a first step. If it gets mature, it might even save a lot of archive space or extend the retention period with degraded granularity. It would even help taking full backups with a lower frequency. Any thoughts about this design would be much appreciated. I suppose this should be offlist or in a new thread to avoid polluting this thread as this is a slightly different subject. Regards, PS: I was surprised to still find some existing piece of code related to pglesslog in core. This project has been discontinued and the WAL format changed in the meantime.
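The changemap bookkeeping in the PoC described above — scan archived WAL and remember only the latest record touching each block — can be sketched with a toy parser. Everything here is an assumption for illustration (it models `pg_waldump`-style `blkref #N: rel X/Y/Z blk B` text lines, not whatever format the PoC actually reads):

```python
import re

# "blkref" lines as printed by pg_waldump; treating them as the input
# format is an assumption of this sketch, not the PoC's actual design.
BLKREF = re.compile(r"blkref #\d+: rel (\d+/\d+/\d+)(?: fork \w+)? blk (\d+)")

def changed_blocks(waldump_lines):
    """Map each (relfilenode, block) touched by the given WAL-record
    descriptions to the index of the *last* record touching it -- the
    'keep only the latest FPW' idea from the post: later records simply
    overwrite earlier entries."""
    latest = {}
    for recno, line in enumerate(waldump_lines):
        for rel, blk in BLKREF.findall(line):
            latest[(rel, int(blk))] = recno
    return latest

lines = [
    "rmgr: Heap   len: 54  blkref #0: rel 1663/13593/16396 blk 1640",
    "rmgr: Heap   len: 54  blkref #0: rel 1663/13593/16396 blk 1640",
    "rmgr: Btree  len: 72  blkref #0: rel 1663/13593/16402 blk 3",
]
print(changed_blocks(lines))
```

Only the most recent image of each block needs to survive into the rewritten WAL, which is where the "xlogrec kept" reductions in the table above come from.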
Re: block-level incremental backup
On 09.04.2019 18:48, Robert Haas wrote: 1. There should be a way to tell pg_basebackup to request from the server only those blocks where LSN >= threshold_value. Some time ago I implemented an alternative version of the ptrack utility (not the one used in pg_probackup) which detects updated blocks at file level. It is very simple and maybe it can someday be integrated into master. I attached a patch against vanilla to this mail. Right now it contains just two GUCs: ptrack_map_size: Size of ptrack map (number of elements) used for incremental backup: 0 disabled. ptrack_block_log: Logarithm of ptrack block size (amount of pages) and one function: pg_ptrack_get_changeset(startlsn pg_lsn) returns {relid,relfilenode,reltablespace,forknum,blocknum,segsize,updlsn,path} The idea is very simple: it creates a hash map of fixed size (ptrack_map_size) and stores the LSN of written pages in this map. As the default postgres page size seems to be too small for a ptrack block (requiring too large a hash map or increasing the number of conflicts, as well as increasing the number of random reads), it is possible to configure a ptrack block to consist of multiple pages (a power of 2). This patch uses the memory mapping mechanism. Unfortunately there is no portable wrapper for it in Postgres, so I had to provide my own implementations for Unix/Windows. Certainly it is not good and should be rewritten. How to use?

1. Define ptrack_map_size in postgres.conf, for example (use a simple number for more uniform hashing): ptrack_map_size = 103

2. Remember the current lsn.

psql postgres -c "select pg_current_wal_lsn()"
 pg_current_wal_lsn
--------------------
 0/224A268
(1 row)

3. Do some updates. $ pgbench -T 10 postgres

4. Select changed blocks.
select * from pg_ptrack_get_changeset('0/224A268');

 relid | relfilenode | reltablespace | forknum | blocknum | segsize |  updlsn   | path
-------+-------------+---------------+---------+----------+---------+-----------+------------------
 16390 |       16396 |          1663 |       0 |     1640 |       1 | 0/224FD88 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1641 |       1 | 0/2258680 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1642 |       1 | 0/22615A0 | base/12710/16396
 ...

Certainly ptrack should be used as part of some backup tool (as pg_basebackup or pg_probackup). -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 61a8f11..f4b8506 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -25,9 +25,20 @@
 #include
 #include
+#include
+#ifndef WIN32
+#include "sys/mman.h"
+#endif
+
 #include "miscadmin.h"
+#include "funcapi.h"
+#include "access/hash.h"
+#include "access/table.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
+#include "access/htup_details.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_tablespace.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
@@ -36,6 +47,7 @@
 #include "storage/relfilenode.h"
 #include "storage/smgr.h"
 #include "storage/sync.h"
+#include "utils/builtins.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "pg_trace.h"
@@ -116,6 +128,18 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
  */
 #define EXTENSION_DONT_CHECK_SIZE (1 << 4)

+/*
+ * Size of ptrack map (number of entries)
+ */
+int ptrack_map_size;
+
+/*
+ * Logarithm of ptrack block size (amount of pages)
+ */
+int ptrack_block_log;
+
+static int ptrack_fd;
+static pg_atomic_uint64* ptrack_map;

 /* local routines */
 static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
@@ -138,7 +162,7 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
 			BlockNumber blkno, bool skipFsync, int behavior);
 static BlockNumber
 	_mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg);
-
+static void ptrack_mark_block(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno);

 /*
  * mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -422,6 +446,8 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		register_dirty_segment(reln, forknum, v);

 	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+	ptrack_mark_block(reln, forknum, blocknum);
 }

 /*
@@ -575,6 +601,8 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		nblocks -= nflush;
 		blocknum += nflush;
 	}
+
+	ptrack_sync();
 }

 /*
@@ -700,6 +728,8 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	if (!skipFsync && !SmgrIsTemp(reln))
 		register_dirty_segment(reln, forknum, v);
+
+	ptrack_mark_block(reln, forknum, blocknum);
 }

 /*
@@ -886,6 +916,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 			FilePathName(v->mdfd_vfd;
 		segno--;
 	}
+
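Stripped to its essentials, the ptrack scheme in the patch excerpt above is a fixed-size map keyed by a hash of the block address, where each slot stores the highest LSN written. A toy model of that bookkeeping (a sketch only — names and values are illustrative, not the patch's C code):

```python
PTRACK_MAP_SIZE = 103    # "use a simple number for more uniform hashing"
PTRACK_BLOCK_LOG = 2     # a ptrack block = 2**2 = 4 heap pages

ptrack_map = [0] * PTRACK_MAP_SIZE  # slot -> max LSN seen for that slot

def slot_for(relfilenode, blocknum):
    # group pages into ptrack blocks, then hash into the fixed-size map
    return hash((relfilenode, blocknum >> PTRACK_BLOCK_LOG)) % PTRACK_MAP_SIZE

def mark_block(relfilenode, blocknum, lsn):
    # what the mdextend()/mdwrite() hooks in the patch conceptually do
    s = slot_for(relfilenode, blocknum)
    ptrack_map[s] = max(ptrack_map[s], lsn)

def maybe_changed(relfilenode, blocknum, since_lsn):
    # Conservative answer: hash collisions (or neighbors in the same
    # ptrack block) can only produce false positives, never a missed
    # change -- the property an incremental backup tool relies on.
    return ptrack_map[slot_for(relfilenode, blocknum)] >= since_lsn

mark_block(16396, 1640, lsn=200)
print(maybe_changed(16396, 1640, since_lsn=100))  # True
print(maybe_changed(16396, 1640, since_lsn=300))  # False
```

The tradeoff the post describes falls out directly: a larger `ptrack_block_log` shrinks the map but coarsens the answer, since every page in a ptrack block shares one slot.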
Re: Berserk Autovacuum (let's save next Mandrill)
On Sat, Apr 6, 2019 at 9:56 AM Darafei "Komяpa" Praliaskouski wrote:
>> Invoking autovacuum on a table based on inserts, not only deletes
>> and updates, seems a good idea to me. But in this case, I think that we
>> can not only freeze tuples but also update the visibility map even when
>> setting all-visible. Roughly speaking I think vacuum does the
>> following operations.
>>
>> 1. heap vacuum
>> 2. HOT pruning
>> 3. freezing tuples
>> 4. updating visibility map (all-visible and all-frozen)
>> 5. index vacuum/cleanup
>> 6. truncation
>>
>> With the proposed patch[1] we can control whether to do 5 or not. In addition
>> to that, another proposed patch[2] allows us to control 6.
>
> [1] is committed, [2] nears commit. Seems we have now all the infra to
> teach autovacuum to run itself based on inserts and not hurt anybody?
>
>> [1] https://commitfest.postgresql.org/22/1817/
>> [2] https://commitfest.postgresql.org/22/1981/

Reading the thread and the patch, I generally agree that: 1. With the current infrastructure, having autovacuum periodically scan append-only tables for freezing would be good, and 2. I can't think of any cases where this would be a bad thing. Also, I am not 100% convinced that the problems are avoidable by setting the wraparound prevention thresholds low enough. In cases where one is doing large bulk inserts all the time, vacuum freeze could have a lot of work to do, and in some cases I could imagine IO storms making that difficult. I plan to run some benchmarks on this to try to assess the performance impact of this patch in standard pgbench scenarios. I will also try to come up with some other benchmarks for append-only workloads.

> --
> Darafei Praliaskouski
> Support me: http://patreon.com/komzpa

-- Best Regards, Chris Travers Head of Database Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com Saarbrücker Straße 37a, 10405 Berlin
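For concreteness: today's autovacuum fires when dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`; what this thread discusses is adding an analogous trigger driven by inserted tuples, so append-only tables get freezing and visibility-map updates without waiting for wraparound. A sketch of that decision — the insert-side parameter names and defaults here are hypothetical, not anything committed:

```python
def need_autovacuum(n_dead, n_inserted_since_vacuum, reltuples,
                    vac_threshold=50, vac_scale=0.2,
                    ins_threshold=10_000, ins_scale=0.2):
    """First test mirrors the existing dead-tuple trigger
    (autovacuum_vacuum_threshold / _scale_factor); the second is the
    hypothetical insert-driven trigger discussed in the thread."""
    if n_dead > vac_threshold + vac_scale * reltuples:
        return True   # classic update/delete-driven vacuum
    if n_inserted_since_vacuum > ins_threshold + ins_scale * reltuples:
        return True   # append-only table: vacuum to freeze + set VM bits
    return False

# A 1M-row append-only table with zero dead tuples still gets vacuumed
# once enough rows have been inserted since the last run.
print(need_autovacuum(n_dead=0, n_inserted_since_vacuum=250_000,
                      reltuples=1_000_000))  # True
```

Spreading this work over many small insert-triggered runs is exactly the alternative to the anti-wraparound "emergency" vacuum that the thread argues against forcing more often.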
Re: PostgreSQL pollutes the file system
Re: Fred .Flintstone 2019-04-10 > Does anyone oppose the proposal? I don't think part #3 has been discussed, and I'd oppose printing these warnings. Christoph
Re: PostgreSQL pollutes the file system
Does anyone oppose the proposal? How can we determine consensus? Is there any voting process? Is there any developer more versed with C than me who can write this patch? On Wed, Apr 10, 2019 at 2:52 PM Christoph Berg wrote: > > Re: Fred .Flintstone 2019-04-10 > > > It seems we do have a clear path forward on how to accomplish this and > > implement this change. > > > > 1. Rename executables to carry the pg_ prefix. > > 2. Create symlinks from the old names to the new names. > > 3. Modify the executables to read argv[0] and print a warning if the > > executable is called from the old name (symlink). > > > > This seems technically feasible and easy. > > > > How can we proceed? > > You can send a patch. > > But I don't think there has been a "clear" agreement that this is a > good idea. > > Christoph
Re: PostgreSQL pollutes the file system
> On 2019-03-29 20:32, Joe Conway wrote: >> pg_util > > How is that better than just renaming to pg_$oldname? As I already said upthread: > This way, we would be free from the command name conflict problem > and plus, we could do: > > pgsql --help > > which will print subcommand names when a user is not sure what the > subcommand name is. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: pgbench - add minimal stats on initialization
Hi Fabien, I have one minor observation: in case of initDropTables you log 'drop' and in case of initCreateTables you log 'create table'. We need to be consistent. "drop tables" and "create tables" are the best fit here. Otherwise, the patch is good. On Wed, Apr 10, 2019 at 2:18 PM Ibrar Ahmed wrote: > > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: tested, passed > Documentation: not tested > > Please ignore the last email. > > Patch works perfectly and the code is well-written. I have one minor > observation that in case of initDropTables you log "drop" and in case of > initCreateTables you log "create table". I think you need to be consistent. > And why not "drop tables" and "create tables"? > > The new status of this patch is: Waiting on Author -- Ibrar Ahmed
Re: PostgreSQL pollutes the file system
Re: Fred .Flintstone 2019-04-10 > It seems we do have a clear path forward on how to accomplish this and > implement this change. > > 1. Rename executables to carry the pg_ prefix. > 2. Create symlinks from the old names to the new names. > 3. Modify the executables to read argv[0] and print a warning if the > executable is called from the old name (symlink). > > This seems technically feasible and easy. > > How can we proceed? You can send a patch. But I don't think there has been a "clear" agreement that this is a good idea. Christoph
Re: PostgreSQL pollutes the file system
It seems we do have a clear path forward on how to accomplish this and implement this change. 1. Rename executables to carry the pg_ prefix. 2. Create symlinks from the old names to the new names. 3. Modify the executables to read argv[0] and print a warning if the executable is called from the old name (symlink). This seems technically feasible and easy. How can we proceed? On Tue, Apr 2, 2019 at 11:24 PM Fred .Flintstone wrote: > > It looks like this thread is featured on LWN under the article: > Program names and "pollution". > https://lwn.net/ > https://lwn.net/Articles/784508/ (Subscription required) > > On Sat, Mar 30, 2019 at 12:27 PM Peter Eisentraut > wrote: > > > > On 2019-03-29 20:32, Joe Conway wrote: > > > pg_util > > > > How is that better than just renaming to pg_$oldname? > > > > -- > > Peter Eisentraut http://www.2ndQuadrant.com/ > > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
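Step 3 of the proposal above — printing a warning when a binary is invoked through its legacy name — amounts to a basename check on argv[0]. A sketch of that dispatch logic (the real change would be C code inside each executable, and the rename mapping here is purely illustrative, since the thread has not settled a final list):

```python
import os

# old name -> proposed pg_-prefixed name; this mapping is a hypothetical
# example, not an agreed-upon rename list.
RENAMED = {
    "createdb": "pg_createdb",
    "dropdb": "pg_dropdb",
    "initdb": "pg_initdb",
}

def legacy_warning(argv0):
    """Return the deprecation warning to print when the program was
    invoked via a legacy (un-prefixed) symlink, else None."""
    name = os.path.basename(argv0)
    if name in RENAMED:
        return ("%s: warning: this name is deprecated, use %s instead"
                % (name, RENAMED[name]))
    return None

print(legacy_warning("/usr/bin/createdb"))     # warns: use pg_createdb
print(legacy_warning("/usr/bin/pg_createdb"))  # None: new name, no warning
```

Because the check keys off argv[0], one binary can serve both the symlink and the new name, which is what makes steps 1-3 cheap to combine.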
Re: GSOC 2019 proposal 'WAL-G safety features'
Hi! > On 9 Apr 2019, at 18:20, Zhichao Liu wrote: > > Dear PostgreSQL community, > > I am a GSoC 2019 applicant and am working on 'WAL-G safety features'. I have > finished an initial draft of my proposal and I would appreciate your comments > and advice on my proposal. I know it is pretty late for the improvement of my > proposal, but I will be glad to join the project this summer even without > GSoC! Please help me make my proposal and ideas better. > Thank you! > > The link is > https://docs.google.com/document/d/18cxbj1zId1BpMjgUkZ0MZgb1HdMg1S9h0U1WecZON3U/edit?usp=sharing It is great that you want to work on WAL-G; you do not need a proposal to do this. Just make PRs, ask questions, etc. :) On the WAL-G GitHub page you can find a way to the Slack channel. This list is intended for PostgreSQL core features; in GSoC, WAL-G is hosted under the PostgreSQL umbrella. Thanks! Best regards, Andrey Borodin.
Re: block-level incremental backup
Hi! > On 9 Apr 2019, at 20:48, Robert Haas wrote: > > Thoughts? Thanks for this long and thoughtful post! At Yandex, we have been using incremental backups for some years now. Initially we used a patched pgbarman, then we implemented this functionality in WAL-G. And there are many things to be done yet. We have more than 1 PB of clusters backed up with this technology. Most of the time we use this technology as part of an HA setup in our managed PostgreSQL service. So, for us the main goals are to operate backups cheaply and restore a new node quickly. Here's what I see from our perspective.

1. Yes, this feature is important.
2. This importance comes not from reduced disk storage; magnetic disks and object storages are very cheap.
3. Incremental backups save a lot of network bandwidth. It is non-trivial for the storage system to ingest hundreds of TB daily.
4. Incremental backups are a redundancy of WAL, intended for parallel application. An incremental backup applied sequentially is not very useful; it will not be much faster than simple WAL replay in many cases.
5. As long as increments duplicate WAL functionality, it is not worth pursuing tradeoffs of storage utilization reduction. We scan WAL during archival, extract the numbers of changed blocks and store a changemap for a group of WALs in the archive.
6. These changemaps can be used for incrementing the visibility map (if I recall correctly). But you cannot compare LSNs on a page of the visibility map: some operations do not bump them.
7. We use changemaps during backups and during WAL replay - we know blocks that will change far in advance and prefetch them into the page cache like pg_prefaulter does.
8. There is similar functionality in RMAN for one well-known database. They used to store 8 sets of change maps. That database also has cool functionality "increment for catchup".
9. We call an incremental backup a "delta backup".
This wording describes the purpose more precisely: it is not "the next version of the DB", it is "the difference between two DB states". But the wording choice does not matter much. Here are slides from my talk at PgConf.APAC[0]. I've proposed a talk on this matter to PGCon, but it was not accepted. I will try next year :) > On 9 Apr 2019, at 20:48, Robert Haas wrote: > - This is just a design proposal at this point; there is no code. If > this proposal, or some modified version of it, seems likely to be > acceptable, I and/or my colleagues might try to implement it. I'll be happy to help with code, discussion and patch review. Best regards, Andrey Borodin. [0] https://yadi.sk/i/Y_S1iqNN5WxS6A
Re: hyrax vs. RelationBuildPartitionDesc
On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote: > The problem lies in all branches that have partitioning, so it should be > listed under Older Bugs, right? You may have noticed that I posted > patches for all branches down to 10. I have noticed. The message from Tom upthread outlined that an open item should be added, but this is not one. That's what I wanted to emphasize. Thanks for making it clearer. -- Michael
Re: Zedstore - compressed in-core columnar storage
On Wed, Apr 10, 2019 at 12:55 PM Heikki Linnakangas wrote: > > On 10/04/2019 09:29, Amit Kapila wrote: > > On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal wrote: > >> Row store > >> - > >> > >> The tuples are stored one after another, sorted by TID. For each > >> tuple, we store its 48-bit TID, an undo record pointer, and the actual > >> tuple data uncompressed. > >> > > > > Storing an undo record pointer with each tuple can take quite a lot of > > space in cases where you can't compress them. > > Yeah. This does depend on compression to eliminate the unused fields > quite heavily at the moment. But you could have a flag in the header to > indicate "no undo pointer needed", and just leave it out when it's not needed. > > > Have you thought about how you will implement the multi-locker scheme in > > this design? In zheap, we have used undo for the same and it is easy > > to imagine when you have separate transaction slots for each > > transaction. I am not sure how you will implement the same here. > I've been thinking that the undo record would store all the XIDs > involved. So if there are multiple lockers, the UNDO record would store > a list of XIDs. > This will be quite tricky. Whenever a new locker arrives, you first need to fetch the previous undo to see which XIDs already have a lock on it. Not only that, it will make discarding undos much more complicated. We had considered this approach before implementing the current approach in zheap. > Alternatively, I suppose you could store multiple UNDO > pointers for the same tuple. > This will not only make the tuple unnecessarily long but would make it much harder to reclaim that space once the corresponding undo is discarded. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
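To make the tradeoff in the exchange above concrete: if a single undo record carries the full list of locker XIDs, every new locker must first fetch the tuple's previous undo record before it can write its own extended copy. A toy model of that scheme (purely illustrative — not zedstore or zheap code):

```python
# Toy model of the "one undo record stores all locker XIDs" scheme.
undo_log = []  # append-only list standing in for the undo log

def add_locker(tuple_undo_ptr, xid):
    """A new locker must read the tuple's previous undo record to learn
    the existing lockers, then append a fresh record with the extended
    XID list -- the extra fetch per locker is the cost Amit points out.
    Returns the new undo pointer to stamp into the tuple header."""
    prev_xids = [] if tuple_undo_ptr is None else undo_log[tuple_undo_ptr]
    undo_log.append(prev_xids + [xid])
    return len(undo_log) - 1

ptr = add_locker(None, 100)   # first locker
ptr = add_locker(ptr, 101)    # must fetch record 0 first
ptr = add_locker(ptr, 102)    # must fetch record 1 first
print(undo_log[ptr])          # all three lockers visible from the latest record
```

The discard problem follows from the same structure: older records can only be thrown away once no live tuple pointer (directly or via the copied XID lists) still depends on them, which is what makes discarding harder than with per-transaction slots.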
Re: pgsql: Unified logging system for command-line programs
On 2019-04-09 13:58, Christoph Berg wrote: > I'm not entirely sure what happened here, but I think this made > pg_restore verbose by default (and there is no --quiet option). That was by accident. Fixed. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgbench - add minimal stats on initialization
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Please ignore the last email. Patch works perfectly and the code is well-written. I have one minor observation that in case of initDropTables you log "drop" and in case of initCreateTables you log "create table". I think you need to be consistent. And why not "drop tables" and "create tables"? The new status of this patch is: Waiting on Author
Re: pgbench - add minimal stats on initialization
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature:       tested, failed
Spec compliant:           tested, failed
Documentation:            not tested

Patch works perfectly and the code is well-written. I have one minor observation that in case of initDropTables you log "drop" and in case of initCreateTables you log "create table". I think you need to be consistent. And why not "drop tables" and "create tables"? The new status of this patch is: Waiting on Author
Re: Transaction commits VS Transaction commits (with parallel) VS query mean time
On Wed, Apr 10, 2019 at 3:32 PM Amit Kapila wrote: > On Mon, Apr 8, 2019 at 8:51 AM Amit Kapila > wrote: > > > > On Mon, Apr 8, 2019 at 7:54 AM Jamison, Kirk > wrote: > > > So I am marking this thread as “Ready for Committer”. > > > > > > > Thanks, Hari and Jamison for verification. The patches for > > back-branches looks good to me. I will once again verify them before > > commit. I will commit this patch tomorrow unless someone has > > objections. Robert/Alvaro, do let me know if you see any problem with > > this fix? > > > > Pushed. > Thanks Amit. Will look into it further to handle all the internally generated transactions. Regards, Haribabu Kommi Fujitsu Australia
Re: shared-memory based stats collector
On Wed, Apr 10, 2019 at 09:39:29AM +0900, Kyotaro HORIGUCHI wrote: At Tue, 9 Apr 2019 17:03:33 +0200, Tomas Vondra wrote in <20190409150333.5iashyjxm5jmraml@development> Unfortunately, now that we're past code freeze it's clear this is a PG13 matter now :-( I personally consider this to be a very worthwhile & beneficial improvement, but I agree with Andres the patch did not quite get to committable state in the last CF. Considering how sensitive a part it touches, I suggest we try to get it committed early in the PG13 cycle. I'm willing to spend some time on doing tests/benchmarks and reviewing the code, if needed. I'm very happy to be told that. Actually the code was rush work (mainly for reverting refactoring) and left some stupid mistakes. I'm going through the patch again and polishing the code. While reviewing the patch I've always had an issue with evaluating how it behaves for various scenarios / workloads. The reviews generally did one specific benchmark, but I find that unsatisfactory. I wonder whether we could develop a small set of more comprehensive workloads for this patch (i.e. different numbers of objects, access patterns, ...). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
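One way to make the review benchmarks suggested above systematic is to enumerate the whole workload matrix up front rather than picking one scenario ad hoc. A trivial sketch — the dimensions and values here are invented for illustration, not an agreed benchmark plan:

```python
from itertools import product

# Hypothetical benchmark dimensions for a stats-collector patch:
# number of stats objects (tables), concurrent backends, access pattern.
N_TABLES = [10, 1_000, 100_000]
N_CLIENTS = [1, 16, 128]
PATTERNS = ["uniform", "skewed", "read-mostly"]

def workload_matrix():
    """Return one run description per combination, so no corner of the
    (objects x concurrency x pattern) space is silently skipped."""
    return [
        {"tables": t, "clients": c, "pattern": p}
        for t, c, p in product(N_TABLES, N_CLIENTS, PATTERNS)
    ]

runs = workload_matrix()
print(len(runs))  # 3 * 3 * 3 = 27 runs
```

Each run description could then be turned into a pgbench invocation (custom scripts for the skewed and read-mostly patterns), with the same matrix replayed against master and the patched build for comparison.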
Re: bug in update tuple routing with foreign partitions
(2019/03/06 18:33), Amit Langote wrote: I noticed a bug with how UPDATE tuple routing initializes ResultRelInfos to use for partition routing targets. Specifically, the bug occurs when UPDATE targets include a foreign partition that is locally modified (as opposed to being modified directly on the remote server) and its ResultRelInfo (called subplan result rel in the source code) is reused for tuple routing:

-- setup
create extension postgres_fdw;
create server loopback foreign data wrapper postgres_fdw;
create user mapping for current_user server loopback;
create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2base (a int check (a = 2));
create foreign table p2 partition of p for values in (2) server loopback options (table_name 'p2base');
insert into p values (1), (2);

-- see in the plan that foreign partition p2 is locally modified
explain verbose update p set a = 2 from (values (1), (2)) s(x) where a = s.x returning *;

                                  QUERY PLAN
─────────────────────────────────────────────────────────────────────────────
 Update on public.p  (cost=0.05..236.97 rows=50 width=42)
   Output: p1.a, "*VALUES*".column1
   Update on public.p1
   Foreign Update on public.p2
     Remote SQL: UPDATE public.p2base SET a = $2 WHERE ctid = $1 RETURNING a
   ->  Hash Join  (cost=0.05..45.37 rows=26 width=42)
         Output: 2, p1.ctid, "*VALUES*".*, "*VALUES*".column1
         Hash Cond: (p1.a = "*VALUES*".column1)
         ->  Seq Scan on public.p1  (cost=0.00..35.50 rows=2550 width=10)
               Output: p1.ctid, p1.a
         ->  Hash  (cost=0.03..0.03 rows=2 width=32)
               Output: "*VALUES*".*, "*VALUES*".column1
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32)
                     Output: "*VALUES*".*, "*VALUES*".column1
   ->  Hash Join  (cost=100.05..191.59 rows=24 width=42)
         Output: 2, p2.ctid, "*VALUES*".*, "*VALUES*".column1
         Hash Cond: (p2.a = "*VALUES*".column1)
         ->  Foreign Scan on public.p2  (cost=100.00..182.27 rows=2409 width=10)
               Output: p2.ctid, p2.a
               Remote SQL: SELECT a, ctid FROM public.p2base FOR UPDATE
         ->  Hash  (cost=0.03..0.03 rows=2 width=32)
               Output: "*VALUES*".*, "*VALUES*".column1
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32)
                     Output: "*VALUES*".*, "*VALUES*".column1

-- as opposed to directly on the remote side (because there's no local join)
explain verbose update p set a = 2 returning *;

                              QUERY PLAN
─────────────────────────────────────────────────────────────────────
 Update on public.p  (cost=0.00..227.40 rows=5280 width=10)
   Output: p1.a
   Update on public.p1
   Foreign Update on public.p2
   ->  Seq Scan on public.p1  (cost=0.00..35.50 rows=2550 width=10)
         Output: 2, p1.ctid
   ->  Foreign Update on public.p2  (cost=100.00..191.90 rows=2730 width=10)
         Remote SQL: UPDATE public.p2base SET a = 2 RETURNING a
(8 rows)

Running the first update query crashes:

update p set a = 2 from (values (1), (2)) s(x) where a = s.x returning tableoid::regclass, *;
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The problem seems to occur because ExecSetupPartitionTupleRouting thinks it can reuse p2's ResultRelInfo for tuple routing. In this case, it can't be used, because its ri_FdwState contains information that will be needed for postgresExecForeignUpdate to work, but it's still used today. Because it's assigned to be used for tuple routing, its ri_FdwState will be overwritten by postgresBeginForeignInsert that's invoked by the tuple routing code using the aforementioned ResultRelInfo. The crash occurs when postgresExecForeignUpdate() can't find the information it's expecting in the ri_FdwState. Agreed, as I said before in another thread. The solution is to teach ExecSetupPartitionTupleRouting to avoid using a subplan result rel if its ri_FdwState is already set. I'm not sure that is a good idea, because that requires creating a new ResultRelInfo, which is not free; I think it requires a lot of work.
Another solution to avoid that would be to store the fmstate created by postgresBeginForeignInsert() into the ri_FdwState, rather than overwriting the ri_FdwState, like the attached. This would not need any changes to the core, and this would not cause any overhead either, IIUC. What do you think about that? I have attached 2 patches, one for PG 11 where the problem first appeared and another for HEAD. The patch for PG 11 is significantly bigger due to having to handle the complexities of mapping subplan result rel indexes to leaf partition indexes. Most of that
Re: Problem with default partition pruning
At Wed, 10 Apr 2019 14:55:48 +0900, Amit Langote wrote in > On 2019/04/10 12:53, Kyotaro HORIGUCHI wrote: > > At Wed, 10 Apr 2019 11:17:53 +0900, Amit Langote > > wrote: > >> Yeah, I think we should move the "if (partconstr)" block to the "if > >> (is_orclause(clause))" block as I originally proposed in: > >> > >> https://www.postgresql.org/message-id/9bb31dfe-b0d0-53f3-3ea6-e64b811424cf%40lab.ntt.co.jp > >> > >> It's kind of a hack to get over the limitation that > >> get_matching_partitions() can't prune default partitions for certain OR > >> clauses and I think we shouldn't let that hack grow into what seems like > >> almost duplicating relation_excluded_by_constraints() but without the > >> constraint_exclusion GUC guard. > > > > That leaves an issue of contradicting clauses that are not an arm > > of an OR-expr. Isn't that what Hosoya-san is trying to fix? > > Yes, that's right. But as I said, maybe we should try not to duplicate > the functionality of relation_excluded_by_constraints() in partprune.c. Currently we classify the partition constraint as a constraint, so it should be handled not in partition pruning, but in the constraint exclusion code. That sounds reasonable. > To summarize, aside from the problem described by the subject of this > thread (the patch for that is v4_default_partition_pruning.patch posted by > Hosoya-san on 2019/04/04), we have identified a couple of other issues: > > 1. One that Thibaut reported on 2019/03/04 > > > I kept on testing with sub-partitioning. Thanks.
> > I found a case, using 2 default partitions, where a default partition is
> > not pruned:
> >
> > create table test2(id int, val text) partition by range (id);
> > create table test2_20_plus_def partition of test2 default;
> > create table test2_0_20 partition of test2 for values from (0) to (20) partition by range (id);
> > create table test2_0_10 partition of test2_0_20 for values from (0) to (10);
> > create table test2_10_20_def partition of test2_0_20 default;
> >
> > # explain (costs off) select * from test2 where id=5 or id=25;
> >             QUERY PLAN
> > -----------------------------------------
> >  Append
> >    ->  Seq Scan on test2_0_10
> >          Filter: ((id = 5) OR (id = 25))
> >    ->  Seq Scan on test2_10_20_def
> >          Filter: ((id = 5) OR (id = 25))
> >    ->  Seq Scan on test2_20_plus_def
> >          Filter: ((id = 5) OR (id = 25))
> > (7 rows)
>
> For this, we can move the "if (partconstr)" block into the same if
> (is_orclause(clause)) block, as proposed in the v1-delta.patch that I
> proposed on 2019/03/20. Note that that patch restricts the scope of
> applying predicate_refuted_by() only to the problem that's currently
> tricky to solve by partition pruning alone -- pruning default partitions
> for OR clauses like in the above example.

This is a failure of partition pruning, which should be resolved in the partprune code.

> 2. Hosoya-san reported on 2019/03/22 that a contradictory WHERE clause
> applied to a *partition* doesn't return an empty plan:
>
> > I understood Amit's proposal. But I think the issue Thibaut reported
> > would occur regardless of whether clauses have OR clauses or not, as
> > follows.
> >
> > I tested a query which should output "One-Time Filter: false".
> > # explain select * from test2_0_20 where id = 25;
> >                            QUERY PLAN
> > ---------------------------------------------------------------------
> >  Append  (cost=0.00..25.91 rows=6 width=36)
> >    ->  Seq Scan on test2_10_20_def  (cost=0.00..25.88 rows=6 width=36)
> >          Filter: (id = 25)
>
> So, she proposed to apply predicate_refuted_by to the whole
> baserestrictinfo (at least in the latest patch), which is the same as
> always performing constraint exclusion on sub-partitioned partitions. I
> initially thought it might be a good idea, but only later realized that
> now there will be two places doing the same constraint exclusion proof --
> gen_partprune_steps_internal(), and set_rel_size() calling
> relation_excluded_by_constraints(). The latter depends on the
> constraint_exclusion GUC, whose default being 'partition' would mean we'd
> not get an empty plan with it. Even if you turn it to 'on', a bug in
> get_relation_constraints() will prevent the partition constraint from
> being loaded and performing constraint exclusion with it; I reported it in
> [1].

Hmm. One perplexing thing here is the fact that the partition constraint is not a table constraint but a partitioning classification in the users' view.

> I think that we may be better off solving the latter problem as follows:
>
> 1. Modify relation_excluded_by_constraints() to *always* try to exclude
> "baserel" partitions using their partition constraint (disregarding
> constraint_exclusion = off/partition).
>
> 2. Modify prune_append_rel_partitions(), which runs much earlier these
> days compared to set_rel_size(), to call relation_excluded_by_constraints()
> modified as described
Re: hyrax vs. RelationBuildPartitionDesc
Hi,

On 2019/04/10 15:42, Michael Paquier wrote:
> On Mon, Apr 08, 2019 at 10:40:41AM -0400, Robert Haas wrote:
>> On Mon, Apr 8, 2019 at 9:59 AM Tom Lane wrote:
>>> Amit Langote writes:
>>> Yeah, it's an open issue IMO. I think we've been focusing on getting
>>> as many feature patches done as we could during the CF, but now it's
>>> time to start mopping up problems like this one.
>
> Please note that it is registered as an older bug and not an open
> item.

The problem lies in all branches that have partitioning, so it should be listed under Older Bugs, right? You may have noticed that I posted patches for all branches down to 10.

>> Do you have any further thoughts based on my last response?
>
> So your last response is that:
> https://www.postgresql.org/message-id/ca+tgmoa5rt+zr+vv+q1xlwqtdmcqknl6b4pjr4v6yac9k_l...@mail.gmail.com
> And what are you proposing as patch? Perhaps something among those
> lines?
> https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515e...@lab.ntt.co.jp

AFAIK, the patch there isn't meant to solve the problems discussed at the first link. It's meant to fix poor cache memory management of partition constraint expression trees, which seems to be a separate issue from the PartitionDesc memory management issue (the latter being the main topic of this thread). The problem with partition constraint expression trees was only mentioned in passing in this thread [1], although it had also come up in the past, as I said when posting the patch.

>> Does anyone else wish to offer opinions?
>
> It seems to me that Tom's argument to push in the way relcache
> information is handled by copying its contents sounds sensible to me.
> That's not perfect, but it is consistent with what exists (note
> PublicationActions for a rather-still-not-much-complex example of
> structure which gets copied from the relcache).
I gave my vote for direct access to unchangeable relcache substructures (TupleDesc, PartitionDesc, etc.), because they're accessed during the planning of every user query and are fairly expensive to copy compared to the list of indexes or the PublicationActions you're citing.

To further my point a bit, I wonder why PublicationActions needs to be copied out of the relcache at all? Looking at its usage in execReplication.c, it seems we can do fine without copying, because we are holding both a lock and a relcache reference on the replication target relation during the entirety of apply_handle_insert(), which means that information can't change under us, neither logically nor physically.

Also, to reiterate what I think was one of Robert's points upthread [2], the reason for requiring some code to copy a relcache substructure out of the relcache should be that the caller might want to change its contents; for example, the planner or its hooks may want to add/remove an index to/from the list of indexes copied from the relcache. The reason for copying should not be that relcache content may change under us despite holding a lock and a relcache reference.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/7961.1552498252%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/CA%2BTgmoa5rT%2BZR%2BVv%2Bq1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw%40mail.gmail.com
Re: Zedstore - compressed in-core columnar storage
On 10/04/2019 10:38, Konstantin Knizhnik wrote:
> I am also a little bit confused about UNDO records and MVCC support in
> Zedstore. Actually, a columnar store is mostly needed for analytics on
> read-only or append-only data. One of the disadvantages of Postgres is
> the rather large per-record space overhead caused by MVCC. It may be
> critical if you want to store huge timeseries with a relatively small
> number of columns (like measurements of some sensor). It would be nice
> to have a storage format which reduces this overhead when it is not
> needed (data is not updated).

Sure. Definitely something we need to optimize.

> Right now, even without UNDO pages, the size of zedstore is larger than
> the size of the original Postgres table. It seems very strange.

If you have a table with a lot of columns, but each column is small, e.g. lots of boolean columns, the item headers that zedstore currently stores for each datum take up a lot of space. We'll need to squeeze those harder to make this competitive. Instead of storing a header for each datum, if a group of consecutive tuples have the same visibility information, we could store the header just once, with an array of the datums, for example.

- Heikki
Re: Zedstore - compressed in-core columnar storage
On 10.04.2019 10:25, Heikki Linnakangas wrote:
> On 10/04/2019 09:29, Amit Kapila wrote:
> > On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal wrote:
> > > Row store - The tuples are stored one after another, sorted by TID. For
> > > each tuple, we store its 48-bit TID, an undo record pointer, and the
> > > actual tuple data uncompressed.
> >
> > Storing an undo record pointer with each tuple can take quite a lot of
> > space in cases where you can't compress them.
>
> Yeah. This does depend on compression to eliminate the unused fields quite
> heavily at the moment. But you could have a flag in the header to indicate
> "no undo pointer needed", and just leave it out when it's not needed.
>
> > Have you thought about how you will implement the multi-locker scheme in
> > this design? In zheap, we have used undo for the same, and it is easy to
> > imagine when you have separate transaction slots for each transaction. I
> > am not sure how you will implement the same here.
>
> I've been thinking that the undo record would store all the XIDs involved.
> So if there are multiple lockers, the UNDO record would store a list of
> XIDs. Alternatively, I suppose you could store multiple UNDO pointers for
> the same tuple.
>
> - Heikki

I am also a little bit confused about UNDO records and MVCC support in Zedstore. Actually, a columnar store is mostly needed for analytics on read-only or append-only data. One of the disadvantages of Postgres is the rather large per-record space overhead caused by MVCC. It may be critical if you want to store huge timeseries with a relatively small number of columns (like measurements of some sensor). It would be nice to have a storage format which reduces this overhead when it is not needed (data is not updated). Right now, even without UNDO pages, the size of zedstore is larger than the size of the original Postgres table. It seems very strange.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Zedstore - compressed in-core columnar storage
On 10/04/2019 09:29, Amit Kapila wrote:
> On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal wrote:
> > Row store - The tuples are stored one after another, sorted by TID. For
> > each tuple, we store its 48-bit TID, an undo record pointer, and the
> > actual tuple data uncompressed.
>
> Storing an undo record pointer with each tuple can take quite a lot of
> space in cases where you can't compress them.

Yeah. This does depend on compression to eliminate the unused fields quite heavily at the moment. But you could have a flag in the header to indicate "no undo pointer needed", and just leave it out when it's not needed.

> Have you thought about how you will implement the multi-locker scheme in
> this design? In zheap, we have used undo for the same, and it is easy to
> imagine when you have separate transaction slots for each transaction. I
> am not sure how you will implement the same here.

I've been thinking that the undo record would store all the XIDs involved. So if there are multiple lockers, the UNDO record would store a list of XIDs. Alternatively, I suppose you could store multiple UNDO pointers for the same tuple.

- Heikki
Re: [survey] New "Stable" QueryId based on normalized query text
On Tue, Apr 9, 2019 at 11:26 PM Bruce Momjian wrote:
>
> On Wed, Mar 20, 2019 at 03:19:58PM -0700, legrand legrand wrote:
> > > The rest of the thread raises quite a lot of concerns about the semantics,
> > > the cost and the correctness of this patch. After 5 minutes checking,
> > > it wouldn't suit your need if you use custom functions, custom types,
> > > custom operators (say using the intarray extension) or if your tables
> > > don't have columns in the same order in every environment. And there
> > > are probably other caveats that I didn't see;
> >
> > Yes, I know. It would have to be extended at least to function, type,
> > operator ... names, with a GUC pg_stat_statements.queryid_based = 'names'
> > (default being 'oids'), and with a second GUC ('fullyqualified'?) that
> > should include table, function, type, operator ... namespaces.
> >
> > Let "users" specify their needs, we will see ;o)
>
> Why can't we just expose the hash computation as an SQL function and
> let people call it with pg_stat_activity.query or wherever they want the
> value? We can install multiple functions if needed.

It'd be very nice to expose the queryid computation at the SQL level. However, it would only allow computing the top-level queryid from pg_stat_activity. For monitoring and performance purposes, people would probably want to see the queryid of the underlying query actually running, I think.
Re: hyrax vs. RelationBuildPartitionDesc
On Mon, Apr 08, 2019 at 10:40:41AM -0400, Robert Haas wrote: > On Mon, Apr 8, 2019 at 9:59 AM Tom Lane wrote: >> Amit Langote writes: >> Yeah, it's an open issue IMO. I think we've been focusing on getting >> as many feature patches done as we could during the CF, but now it's >> time to start mopping up problems like this one. Please note that it is registered as an older bug and not an open item. > Do you have any further thoughts based on my last response? So your last response is that: https://www.postgresql.org/message-id/ca+tgmoa5rt+zr+vv+q1xlwqtdmcqknl6b4pjr4v6yac9k_l...@mail.gmail.com And what are you proposing as patch? Perhaps something among those lines? https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515e...@lab.ntt.co.jp > Does anyone else wish to offer opinions? It seems to me that Tom's argument to push in the way relcache information is handled by copying its contents sounds sensible to me. That's not perfect, but it is consistent with what exists (note PublicationActions for a rather-still-not-much-complex example of structure which gets copied from the relcache). -- Michael
Re: Minimal logical decoding on standbys
On 03/13/2019 08:40 PM, tushar wrote:

Hi, I am getting a server crash on standby while executing the pg_logical_slot_get_changes function. Please refer to this scenario:

1. Create the master cluster (./initdb -D master).
2. Set wal_level='hot_standby' in master/postgresql.conf.
3. Start the server, connect to a psql terminal and create a physical replication slot (SELECT * from pg_create_physical_replication_slot('p1');).
4. Perform pg_basebackup using --slot 'p1' (./pg_basebackup -D slave/ -R --slot p1 -v).
5. Set wal_level='logical', hot_standby_feedback=on, primary_slot_name='p1' in slave/postgresql.conf.
6. Start the server, connect to a psql terminal and create a logical replication slot (SELECT * from pg_create_logical_replication_slot('t','test_decoding');).
7. Run pgbench (./pgbench -i -s 10 postgres) on the master and select pg_logical_slot_get_changes on the slave:

postgres=# select * from pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.274 IST [26817] LOG: starting logical decoding for slot "t"
2019-03-13 20:34:50.274 IST [26817] DETAIL: Streaming transactions committing after 0/6C60, reading WAL from 0/6C28.
2019-03-13 20:34:50.274 IST [26817] STATEMENT: select * from pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.275 IST [26817] LOG: logical decoding found consistent point at 0/6C28
2019-03-13 20:34:50.275 IST [26817] DETAIL: There are no running transactions.
2019-03-13 20:34:50.275 IST [26817] STATEMENT: select * from pg_logical_slot_get_changes('t',null,null);
TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File: "decode.c", Line: 977)
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset:
2019-03-13 20:34:50.276 IST [26809] LOG: server process (PID 26817) was terminated by signal 6: Aborted

Andres - Do you think this is an issue which needs to be fixed?
-- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: Zedstore - compressed in-core columnar storage
On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal wrote:
>
> Heikki and I have been hacking recently for a few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and a link to the github branch [1] to
> follow along.
>
> The objective is to gather feedback on the design and approach to the
> same. The implementation has the core basic pieces working but is not close
> to complete.
>
> Big thank you to Andres, Haribabu and team for the table access method
> APIs. Leveraged the APIs for implementing zedstore, and this proves the API
> to be in very good shape. Had to enhance the same minimally, but
> in general didn't have to touch the executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting a subset of columns
>   (reduced IO).
> * Reduced on-disk footprint compared to heap table. Shorter tuple
>   headers and also leveraging compression of similar type data
> * Be a first-class citizen in the Postgres architecture (table data can
>   just independently live in columnar storage)
> * Fully MVCC compliant
> * All indexes supported
> * Hybrid row-column store, where some columns are stored together, and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored
>   together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate the need for separate toast tables
> * Faster add / drop column or changing the data type of a column by
>   avoiding a full rewrite of the table.
>
> High-level Design - B-trees for the win!
>
> To start simple, let's ignore the column store aspect for a moment and
> consider it as a compressed row store. The column store is a natural
> extension of this concept, explained in the next section.
>
> The basic on-disk data structure leveraged is a B-tree, indexed by
> TID. BTree being a great data structure, fast and versatile.
> Note this is not referring to existing Btree indexes, but instead a net
> new separate BTree for table data storage.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. Having a logical TID
> provides the flexibility to move the tuples around different pages when
> page splits or page merges are performed.
>
> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have a short uncompressed header,
> followed by btree items. Two kinds of items exist:
>
> - plain item, holds one tuple or one datum, uncompressed payload
> - a "container item", holds multiple plain items, compressed payload
>
> +----------------------------------------------------------------------
> | Fixed-size page header:
> |
> |   LSN
> |   TID low and hi key (for Lehman & Yao B-tree operations)
> |   left and right page pointers
> |
> | Items:
> |
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   ...
> |
> +----------------------------------------------------------------------
>
> Row store
> ---------
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, an undo record pointer, and the actual
> tuple data uncompressed.

Storing an undo record pointer with each tuple can take quite a lot of space in cases where you can't compress them. Have you thought about how you will implement the multi-locker scheme in this design? In zheap, we have used undo for the same, and it is easy to imagine when you have separate transaction slots for each transaction. I am not sure how you will implement the same here.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Speed up build on Windows by generating symbol definition in batch
Thanks for reviewing! I've updated the patch according to your comments.

Best regards,
Peifeng Qiu

On Sun, Apr 7, 2019 at 2:31 PM Noah Misch wrote:
> On Sat, Mar 30, 2019 at 03:42:39PM +0900, Peifeng Qiu wrote:
> > When I watched the whole build process with a task manager, I discovered
> > that a lot of time was spent on generating export symbol definitions,
> > without consuming much CPU or IO.
> > The script that does this is src/tools/msvc/gendef.pl; it enumerates the
> > whole directory for ".obj" files and calls the dumpbin utility to generate
> > ".sym" files one by one like this:
> >
> > dumpbin /symbols /out:a.sym a.obj >NUL
> >
> > Actually the dumpbin utility accepts a wildcard file name, so we can
> > generate the export symbols of all ".obj" files in batch.
> >
> > dumpbin /symbols /out:all.sym *.obj >NUL
> >
> > This will avoid wasting time by creating and destroying the dumpbin process
> > repeatedly and can speed up the build process considerably.
> > I've tested on my 4-core 8-thread Intel i7 CPU. I've set MSBFLAGS=/m to
> > ensure it can utilize all CPU cores.
> > Building without this patch takes about 370 seconds. Building with this
> > patch takes about 200 seconds. That's almost a 2x speed up.
>
> I, too, get a strong improvement, from 201s to 149s. I can confirm it yields
> identical *.def files. Thanks for identifying this improvement.
>
> > - my ($objfile, $symfile) = @_;
> > - my ($symvol, $symdirs, $symbase) = splitpath($symfile);
> > - my $tmpfile = catpath($symvol, $symdirs, "symbols.out");
>
> You removed the last use of File::Spec::Functions, so remove its "use"
> statement.
>
> > - system("dumpbin /symbols /out:$tmpfile $_ >NUL")
> > - && die "Could not call dumpbin";
>
> This error handling was crude, but don't replace it with zero error
> handling.
>
> > - rename($tmpfile, $symfile);
>
> Keep the use of a temporary file, too.
>
> > +system("dumpbin /symbols /out:$symfile $ARGV[0]/*obj >NUL");
>
> That should be *.obj, not *obj.
[Attachment: generate-symbols-in-batch-on-windows-v2.patch]