Re: [bug fix] pg_rewind takes long time because it mistakenly copies data files

2018-02-25 Thread Michael Paquier
On Mon, Feb 26, 2018 at 06:01:43AM +, Tsunakawa, Takayuki wrote:
> The cause was that pg_rewind failed to recognize data files in
> tablespace directories, resulting in the full copy of those files
> instead of WAL replay.

Ouch.  Confirmed.

If I test pg_rewind with a tablespace (primary and standby on same
server, base backup taken using --tablespace-mapping), then I bump
immediately into an assertion failure:
(lldb) bt
* thread #1: tid = 0x, 0x7fff99aa3f06 
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x7fff99aa3f06 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff8a30e4ec libsystem_pthread.dylib`pthread_kill + 90
frame #2: 0x7fff9a8d46df libsystem_c.dylib`abort + 129
frame #3: 0x7fff9a89bdd8 libsystem_c.dylib`__assert_rtn + 321
frame #4: 0x00010f7680f6 
pg_rewind`process_block_change(forknum=MAIN_FORKNUM, rnode=(spcNode = 16385, 
dbNode = 16384, relNode = 16386), blkno=4424) + 406 at filemap.c:374
frame #5: 0x00010f760c9f 
pg_rewind`extractPageInfo(record=0x7fc742800600) + 399 at parsexlog.c:401
frame #6: 0x00010f76055e 
pg_rewind`extractPageMap(datadir="/Users/XXX/data/5432", startpoint=0, 
tliIndex=0, endpoint=134217768) + 270 at parsexlog.c:97
frame #7: 0x00010f75f48c pg_rewind`main(argc=5, 
argv=0x7fff504a1450) + 1964 at pg_rewind.c:300
frame #8: 0x7fff88daf5ad libdyld.dylib`start + 1
(lldb) up 4
frame #4: 0x00010f7680f6 
pg_rewind`process_block_change(forknum=MAIN_FORKNUM, rnode=(spcNode = 16385, 
dbNode = 16384, relNode = 16386), blkno=4424) + 406 at filemap.c:374
   371  
   372  if (entry)
   373  {
-> 374  Assert(entry->isrelfile);
   375  
   376  switch (entry->action)
   377  {
(lldb) p *entry
(file_entry_t) $0 = {
  path = 0x7fc742424400 "pg_tblspc/16385/PG_11_201802061/16384/16386"
  type = FILE_TYPE_REGULAR
  action = FILE_ACTION_COPY
  oldsize = 0
  newsize = 36249600
  isrelfile = '\0'
  pagemap = (bitmap = , bitmapsize = 0)
  link_target = 0x7fc742424430 
  next = 0x
}

Your patch fixes that.  I have also checked that, after diverging the
promoted server with more data and inserting data on the old primary,
the correct set of blocks from the tablespace is fetched by pg_rewind.
This patch also behaves correctly when creating a new relation on the
promoted server, as it copies the whole relation.  In short, your patch
looks good to me.

Creating a test case for this patch would be nice, however this needs a
bit of work so that the tablespace map can be used with pg_basebackup and
PostgresNode.pm (or raw pg_basebackup commands could be used in the
pg_rewind tests):
- PostgresNode::init_from_backup needs to be able to understand extra
options given by the caller for pg_basebackup.
- RewindTest::create_standby should be extended with extra arguments
passed down for the previous extension.
:(
--
Michael


signature.asc
Description: PGP signature


RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-25 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> I have been playing more with that this morning, and trying to tweak the
> XLOG reader so that the fetched page is zeroed where necessary does not help
> much.  XLogReaderState->EndRecPtr is updated once the last record is set,
> so it is possible to use it and zero the rest of the page, which would prevent
> a lookup.  But this is actually not wise for performance:
> - The data after EndRecPtr could be perfectly valid, so you could zero data
> which could be reused later.
> - It is necessary to drop the quick-exit checks in ReadPageInternal.

First of all, thanks for your many experiments and ideas.

Yes, the above method doesn't work.  The reason is not only performance but
also correctness.  After you zero-fill the remaining portion of state->readbuf
based on state->EndRecPtr, you need to read the same page upon the request of
the next WAL record.  And that page read brings garbage again into
state->readbuf.  In the end, the callers of ReadPageInternal() have to face the
garbage.


> The WAL receiver approach also has a drawback.  If WAL is streamed at full
> speed, then the primary sends data with a maximum of 6 WAL pages.
> When beginning streaming with a new segment, then the WAL sent stops at a
> page boundary.  But if you stop once in the middle of a page then you need
> to zero-fill the page until the current segment is finished streaming.  So
> if the workload generates spiky WAL then the WAL receiver would do a lot
> of extra lseek() calls with the patch applied, while all the writes would
> be sequential on HEAD, so that's not good performance-wise IMO.

Does even the non-cascading standby stop in the middle of a page?  I thought 
the master always sends whole WAL blocks without stopping in the middle of a page.


> Another idea I am thinking about would be to zero-fill the segments when
> recycled instead of being just renamed when doing InstallXLogFileSegment().
> This would also have the advantage of making the segments ahead more
> compressible, which is a gain for custom backups, and the WAL receiver does
> not need any tweaks as it would write the data on a clean file.  Zero-filling
> the segments is done only when a new segment is created (see XLogFileInit).

Yes, I was (and am) inclined to take this approach; this is easy and clean, but 
not good for performance...  I hope there's something which justifies this 
approach.


Regards
Takayuki Tsunakawa





invalid memory alloc request size error with commit 4b93f579

2018-02-25 Thread Rushabh Lathia
Hi,

With commit 4b93f57999a2ca9b9c9e573ea32ab1aeaa8bf496, which makes plpgsql
use its DTYPE_REC code paths for composite-type variables, the test below
started failing with an "invalid memory alloc request size 2139062167"
error.

Testcase:

create table foo ( name varchar(20), type varchar(20));

insert into foo values ( 'Ford', 'Car');

CREATE OR REPLACE FUNCTION Trigger_Function() returns trigger as
$$
BEGIN
  RAISE NOTICE 'OLD: %, NEW: %', OLD, NEW;
  IF NEW.name = 'Ford' THEN
return OLD; -- return old tuple
  END IF;
  return NEW; -- return original tuple
END;
$$ language plpgsql;

CREATE TRIGGER Before_Update_Trigger BEFORE UPDATE ON foo FOR EACH ROW
EXECUTE PROCEDURE Trigger_Function();

UPDATE foo SET type = 'New Car' where name = 'Ford';

The error comes while trying to copy the invalid tuple
(heap_copytuple() <- ExecCopySlotTuple() <- ExecMaterializeSlot() <-
ExecUpdate() <- ExecModifyTable()).

Here ExecBRUpdateTriggers() returns the invalid tuple when the trigger
returns the old tuple.  Looking further, I found that the tuple is getting
freed in ExecBRUpdateTriggers(), in the code below:

if (trigtuple != fdw_trigtuple)
heap_freetuple(trigtuple);


It seems like before commit 4b93f57999a2ca9b9c9e573ea32ab1aeaa8bf496,
plpgsql_exec_trigger() always used to copy the old and new tuples, but
after that commit it doesn't copy the "old" or "new" tuple if the user
just did "return new" or "return old" without changing anything.

+   /*
+* Copy tuple to upper executor memory.  But if user just did
+* "return new" or "return old" without changing anything,
there's
+* no need to copy; we can return the original tuple (which will
+* save a few cycles in trigger.c as well as here).
+*/
+   if (rettup != trigdata->tg_newtuple &&
+   rettup != trigdata->tg_trigtuple)
+   rettup = SPI_copytuple(rettup);

In ExecBRUpdateTriggers(), we need to add a check so that if trigtuple is the
same as newtuple, we don't free trigtuple.

ExecBRDeleteTriggers() does similar things, but there we don't need such a
check because it doesn't care about the returned tuple.

PFA a patch which adds a check to not free trigtuple if newtuple is the same
as trigtuple, and also adds a related test case.


Thanks,
Rushabh Lathia
www.EnterpriseDB.com
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index fffc009..9315244 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2815,7 +2815,8 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 			return NULL;		/* "do nothing" */
 		}
 	}
-	if (trigtuple != fdw_trigtuple)
+	if (trigtuple != fdw_trigtuple &&
+		trigtuple != newtuple)
 		heap_freetuple(trigtuple);
 
 	if (newtuple != slottuple)
diff --git a/src/pl/plpgsql/src/expected/plpgsql_record.out b/src/pl/plpgsql/src/expected/plpgsql_record.out
index 29e42fd..0e898c2 100644
--- a/src/pl/plpgsql/src/expected/plpgsql_record.out
+++ b/src/pl/plpgsql/src/expected/plpgsql_record.out
@@ -479,6 +479,25 @@ table mutable;
  bar baz
 (2 rows)
 
+-- check returning old tuple
+create or replace function sillytrig() returns trigger language plpgsql as
+$$begin
+  raise notice 'old.ctid = %', old.ctid;
+  raise notice 'old.tableoid = %', old.tableoid::regclass;
+  return old;
+end$$;
+update mutable set f2 = f2 || ' baz';
+NOTICE:  old.ctid = (0,3)
+NOTICE:  old.tableoid = mutable
+NOTICE:  old.ctid = (0,4)
+NOTICE:  old.tableoid = mutable
+table mutable;
+   f2

Re: Precision loss casting float to numeric

2018-02-25 Thread Chapman Flack
Here are two patches.

The 0001-*.patch simply adds a regression test to numeric.sql checking that
bits aren't lost when casting from float[48] to numeric. Naturally, it fails
at this stage.

The 0002-*.patch is a proof-of-concept patching float4_numeric and
float8_numeric in the trivial way (just using FLT_DECIMAL_DIG and
DBL_DECIMAL_DIG in place of FLT_DIG and DBL_DIG). It makes the new
regression test pass. (It will only work under a compiler that has
__FLT_DECIMAL_DIG__ and __DBL_DECIMAL_DIG__ available, and I used
those internal versions to avoid mucking with build tooling to change
the target C standard, which I assume wouldn't be welcome anyway.
This patch is just POC and not proposed as anything else.)

It changes the output of an existing numeric test and a few numeric
aggregate tests.

I have adjusted the numeric test: its purpose is to check the direction in
which numeric rounds ties, but the value to be rounded was computed in
float8 and then cast to numeric. Because negative powers of ten aren't tidy
in binary, the float8 computation can produce a correctly-rounded 53-bit
result that is on the nearer-to-zero side of a tie, and now that that result
is correctly cast, the resulting numeric doesn't round in the away-from-zero
direction.

I changed that test because I concluded it wasn't meant to test
float8-to-numeric casting, but only the rounding of tied numerics,
so I just made the inner expression be typed numeric, and the expected
output is unchanged.

The three aggregate tests with changed output are working from a table
of float4 values, and my assumption is they are now simply producing
the correct aggregate results given the full precision of the input
values, but I haven't confirmed that yet, so this patch leaves those
three failing for now.

-Chap
>From 5b3b05009830b29d1d3167bcd3321228535161dc Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Sun, 25 Feb 2018 22:52:22 -0500
Subject: [PATCH 1/2] Add regression test for cast float[48] to numeric.

At this stage, as expected, it fails.
---
 src/test/regress/expected/numeric.out | 34 ++
 src/test/regress/sql/numeric.sql  | 31 +++
 2 files changed, 65 insertions(+)

diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out
index 17985e8..c186385 100644
--- a/src/test/regress/expected/numeric.out
+++ b/src/test/regress/expected/numeric.out
@@ -2063,3 +2063,37 @@ SELECT SUM((-)::numeric) FROM generate_series(1, 10);
  -0
 (1 row)
 
+--
+-- Tests for cast from float4, float8
+--
+-- check precision of mantissa not lost
+WITH RECURSIVE
+  f4bit(place) AS (
+VALUES (1::float4)
+UNION
+SELECT place/2::float4
+FROM f4bit
+WHERE 1::float4 + place/2::float4 > 1::float4
+  ),
+  f4(epsilon) AS (select min(place) FROM f4bit),
+  f8bit(place) AS (
+VALUES (1::float8)
+UNION
+SELECT place/2::float8
+FROM f8bit
+WHERE 1::float8 + place/2::float8 > 1::float8
+  ),
+  f8(epsilon) AS (SELECT min(place) FROM f8bit),
+  testvalues(f4, f8) AS (
+SELECT 1::float4+f4.epsilon, 1::float8+f8.epsilon
+FROM f4, f8
+  )
+SELECT
+  f4::numeric > 1::numeric AS f4numericpreserved,
+  f8::numeric > 1::numeric AS f8numericpreserved
+FROM testvalues;
+ f4numericpreserved | f8numericpreserved 
++
+ t  | t
+(1 row)
+
diff --git a/src/test/regress/sql/numeric.sql b/src/test/regress/sql/numeric.sql
index d77504e..dab9ae2 100644
--- a/src/test/regress/sql/numeric.sql
+++ b/src/test/regress/sql/numeric.sql
@@ -1036,3 +1036,34 @@ select scale(-13.000);
 -- cases that need carry propagation
 SELECT SUM(::numeric) FROM generate_series(1, 10);
 SELECT SUM((-)::numeric) FROM generate_series(1, 10);
+
+--
+-- Tests for cast from float4, float8
+--
+
+-- check precision of mantissa not lost
+WITH RECURSIVE
+  f4bit(place) AS (
+VALUES (1::float4)
+UNION
+SELECT place/2::float4
+FROM f4bit
+WHERE 1::float4 + place/2::float4 > 1::float4
+  ),
+  f4(epsilon) AS (select min(place) FROM f4bit),
+  f8bit(place) AS (
+VALUES (1::float8)
+UNION
+SELECT place/2::float8
+FROM f8bit
+WHERE 1::float8 + place/2::float8 > 1::float8
+  ),
+  f8(epsilon) AS (SELECT min(place) FROM f8bit),
+  testvalues(f4, f8) AS (
+SELECT 1::float4+f4.epsilon, 1::float8+f8.epsilon
+FROM f4, f8
+  )
+SELECT
+  f4::numeric > 1::numeric AS f4numericpreserved,
+  f8::numeric > 1::numeric AS f8numericpreserved
+FROM testvalues;
-- 
2.7.3

>From 4d721ac45967f64415bafb82039a652f0a451e2c Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Mon, 26 Feb 2018 01:26:01 -0500
Subject: [PATCH 2/2] Proof of concept fix of float[48]_numeric.

This patch simply uses the new FLT_DECIMAL_DIG/DBL_DECIMAL_DIG
constants in place of the FLT_DIG/DBL_DIG, just to see what happens.
It uses the 

Re: SSL passphrase prompt external command

2018-02-25 Thread Daniel Gustafsson

> On 23 Feb 2018, at 11:14, Peter Eisentraut wrote:
> 
> Here is a patch that adds a way to specify an external command for
> obtaining SSL passphrases.  There is a new GUC setting
> ssl_passphrase_command.

+1 on going down this route.

> Right now, we rely on the OpenSSL built-in prompting mechanism, which
> doesn't work in some situations, including under systemd.  This patch
> allows a configuration to make that work, e.g., with systemd-ask-password.

+replaced by a prompt string.  (Write %% for a
+literal %.)  Note that the prompt string will

I might be thick, but I don’t see where the %% is handled?  Also, AFAICT a
string ending with %\0 will print a literal % without requiring %% (which may
be a perfectly fine case to allow, depending on how strict we want to be with
the format).
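
For clarity, the kind of expansion loop I have in mind looks something like
this (hypothetical code, not taken from the patch); the '\0' case is the
trailing-% behaviour I mean:

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Hypothetical sketch of prompt expansion, not the patch's code: %p is
 * replaced by the prompt, %% becomes a literal %, and a trailing % is
 * simply kept as-is.
 */
static void
expand_passphrase_command(const char *fmt, const char *prompt, StringInfo out)
{
    const char *p;

    for (p = fmt; *p; p++)
    {
        if (*p == '%')
        {
            switch (p[1])
            {
                case 'p':               /* %p: substitute the prompt */
                    appendStringInfoString(out, prompt);
                    p++;
                    break;
                case '%':               /* %%: literal '%' */
                    appendStringInfoChar(out, '%');
                    p++;
                    break;
                case '\0':              /* trailing '%': kept literally */
                    appendStringInfoChar(out, '%');
                    break;
                default:                /* unknown escape: copy verbatim */
                    appendStringInfoChar(out, *p);
                    break;
            }
        }
        else
            appendStringInfoChar(out, *p);
    }
}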

cheers ./daniel


Re: remove pg_class.relhaspkey

2018-02-25 Thread Michael Paquier
On Mon, Feb 26, 2018 at 12:45:48AM -0500, Tom Lane wrote:
> Michael Paquier  writes:
> > On Sat, Feb 24, 2018 at 10:21:44PM -0500, Tom Lane wrote:
> >> We've discussed that at least twice before, and not pulled the trigger
> >> for fear of breaking client code.
> 
> > Speaking of which, I have looked at where relhaspkey is being used.  And
> > there are a couple of things using it:
> > - Greenplum has a consistency checker tool using it.
> > - https://github.com/no0p/pgsampler
> > - https://searchcode.com/codesearch/view/54937539/
> > - http://codegist.net/code/postgres%20drop%20tables/
> > - 
> > https://hackage.haskell.org/package/relational-schemas-0.1.3.4/src/src/Database/Relational/Schema/PgCatalog/PgClass.hs
> 
> Thanks for poking around.  Did you happen to notice how many of these
> clients are taking pains to deal with the existing inaccuracy of
> relhaspkey (ie, that it remains set after the pkey has been dropped)?

As far as I looked at things, those clients rely on how optimistic
relhaspkey is.  In short, if it is set to true, there may be a primary
key definition.  If set to false, then it is certain that no primary key
definition can be found.  If the flag is true, those clients then just
do an extra lookup on pg_index with indisprimary.  I think that this
just complicates the code involved, though.  When looking for primary keys
it is way better to just scan pg_index directly.

> I think there's possibly an argument that relhaspkey should be dropped
> because it's an attractive nuisance, encouraging clients to assume
> more than they should about what it means.  But we don't have a lot
> of evidence for such an argument right now.

The only breakage I could imagine here is an application which thinks
relhaspkey set to true implies that a primary key *has* to be present.
I have to admit that I have not found such a case.  Still I would not be
surprised if there are such applications unaware of being broken,
particularly plpgsql functions.
--
Michael


signature.asc
Description: PGP signature


[bug fix] pg_rewind takes long time because it mistakenly copies data files

2018-02-25 Thread Tsunakawa, Takayuki
Hello,

Our customer reported that pg_rewind took many hours to synchronize 400GB of 
data, even though the new primary didn't perform any updates.  The attached 
patch fixes that.

The cause was that pg_rewind failed to recognize data files in tablespace 
directories, resulting in the full copy of those files instead of WAL replay.

Regards
Takayuki Tsunakawa





pg_rewind_pathcmp.patch
Description: pg_rewind_pathcmp.patch


Re: remove pg_class.relhaspkey

2018-02-25 Thread Tom Lane
Michael Paquier  writes:
> On Sat, Feb 24, 2018 at 10:21:44PM -0500, Tom Lane wrote:
>> We've discussed that at least twice before, and not pulled the trigger
>> for fear of breaking client code.

> Speaking of which, I have looked at where relhaspkey is being used.  And
> there are a couple of things using it:
> - Greenplum has a consistency checker tool using it.
> - https://github.com/no0p/pgsampler
> - https://searchcode.com/codesearch/view/54937539/
> - http://codegist.net/code/postgres%20drop%20tables/
> - 
> https://hackage.haskell.org/package/relational-schemas-0.1.3.4/src/src/Database/Relational/Schema/PgCatalog/PgClass.hs

Thanks for poking around.  Did you happen to notice how many of these
clients are taking pains to deal with the existing inaccuracy of
relhaspkey (ie, that it remains set after the pkey has been dropped)?

I think there's possibly an argument that relhaspkey should be dropped
because it's an attractive nuisance, encouraging clients to assume
more than they should about what it means.  But we don't have a lot
of evidence for such an argument right now.

regards, tom lane



Re: [HACKERS] SERIALIZABLE with parallel query

2018-02-25 Thread Thomas Munro
On Sat, Feb 24, 2018 at 12:04 AM, Thomas Munro
 wrote:
> I'm testing another version that is a lot simpler: like v10, it relies
> on the knowledge that the leader's transaction will always end after
> the workers have finished, but it handles the RO_SAFE optimisation by
> keeping the SERIALIZABLEXACT alive but freeing its locks etc.  More
> soon.

I've now broken it into two patches.

Patch 0001 is like my original patch with some minor improvements,
except that it now disables the RO_SAFE optimisation completely in
parallel mode.  In other words, it's the stupidest fix possible to the
problem you flagged up.  I think the main questions to answer about
the 0001 patch are whether this new locking protocol is sufficient,
whether anything bad could happen as a result of lock
escalation/transfer, and whether the underlying assumption about the
SERIALIZABLEXACT's lifetime holds true (that the leader will never
call ReleasePredicateLocks() while a worker is still running).

There are a couple of easy incremental improvements that could be made
on top of that patch, but I didn't make them because I'm trying to be
conservative in the hope of landing at least the basic feature in
PostgreSQL 11.  Namely:

1.  We could still return false if we see SXACT_FLAG_RO_SAFE in
SerializationNeededForRead() (we just couldn't call
ReleasePredicateLocks()).

2.  We could set MySerializableXact to InvalidSerializableXact in
worker backends so at least they'd benefit from the optimisation (we
just couldn't do that in the leader or it'd leak resources).

Patch 0002 aims a bit higher than those ideas.  I wanted to make sure
that the leader wouldn't arbitrarily miss out on the optimisation, and
I also suspect that the optimisation might be contagious in the sense
that actually releasing sooner might cause the RO_SAFE flag to be set
on *other* transactions sooner.  Patch 0002 works like this:

The first backend to observe the RO_SAFE flag 'partially releases' the
SERIALIZABLEXACT, so that the SERIALIZABLEXACT itself remains valid.
(The concept of 'partial release' already existed, but I'm using it in
a new way.)  All backends clear their MySerializableXact variable so
that they drop to faster SI in their own time.  The leader keeps a
copy of it in SavedSerializableXact, so that it can fully release it
at the end of the transaction when we know that no other backend has a
reference to it.

These patches survive hammering with a simple test that generates a
mixture of read only and read write parallel queries that hit the
interesting case (attached; this test helped me understand that the
refcount scheme I considered was going to be hard).  I haven't
personally tried to measure the value of the optimisation (though I'm
pretty sure it exists, based on the VLDB paper and the knowledge that
REPEATABLE READ (what the optimisation effectively gives you) just has
to be faster than SERIALIZABLE 'cause I've seen all that code you get
to not run!).  I'd like to propose the 0001 patch for now, but keep
the 0002 patch back for a bit as it's very new and needs more
feedback, if possible from Kevin and others involved in the SSI
project.  Of course their input on the 0001 patch is also super
welcome.

-- 
Thomas Munro
http://www.enterprisedb.com
import psycopg2
import threading

DSN = "dbname=postgres"

def run2(i):
  conn = psycopg2.connect(DSN)
  cursor = conn.cursor()
  cursor.execute("""set parallel_setup_cost=0""")
  cursor.execute("""set parallel_tuple_cost=0""")
  cursor.execute("""set min_parallel_table_scan_size=0""")
  cursor.execute("""set max_parallel_workers_per_gather=4""")
  cursor.execute("""set default_transaction_isolation=serializable""")
  if i % 2 == 0:
cursor.execute("""set default_transaction_read_only=on""")
cursor.execute("""select count(*) from foo""")
cursor.execute("""select count(*) from foo""")
conn.commit()

for i in range(16):
  conn = psycopg2.connect(DSN)
  cursor = conn.cursor()
  cursor.execute("""drop table if exists foo""")
  cursor.execute("""create table foo as select generate_series(1, 10)::int as a""")
  cursor.execute("""alter table foo set (parallel_workers = 4)""")
  conn.commit()
  conn.close()
  threading.Thread(target=run2, args=[i]).start()


0001-Enable-parallel-query-with-SERIALIZABLE-isolatio-v12.patch
Description: Binary data


0002-Enable-the-read-only-SERIALIZABLE-optimization-f-v12.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-02-25 Thread Amit Langote
(Sorry, I had mistakenly replied only to Simon on Friday)

On Fri, Feb 23, 2018 at 9:04 PM, Simon Riggs  wrote:
> On 23 February 2018 at 11:40, David Rowley  
> wrote:
>> On 23 February 2018 at 04:11, Jesper Pedersen
>>  wrote:
>>> Are UPDATE and DELETE suppose to be supported ?
>>
>> To be honest, I had not even considered those.
>
> I can say that I had considered those. Handling of UPDATE and DELETE
> with partitions is significantly different, so it's not just an
> oversight, it's a different branch of the project.
>
> We need SELECT to work first and then we have a chance of making
> UPDATE/DELETE work.

+1

Thanks,
Amit



Re: PlaceHolderVars in pushed down child-join cause error

2018-02-25 Thread Ashutosh Bapat
On Thu, Feb 22, 2018 at 8:36 PM, Robert Haas  wrote:
> On Thu, Feb 22, 2018 at 7:41 AM, Ashutosh Bapat
>  wrote:
>> postgres_fdw isn't expected to push down joins with placeholder vars.
>> But the check for that in foreign_join_ok() only considers
>> joinrel->relids. For a child-join relids contains the child relids but
>> PlaceHolderInfo refers to the top parent's relids. Hence postgres_fdw
>> tries to push down a child-join with PlaceHolderVars in it and fails
>> with error "unsupported expression type for deparse: 198". 198 being
>> T_PlaceHolderVar.
>>
>> The fix is to use joinrel->top_parent_relids for a child-join.
>> Attached patch for the same.
>
> Committed, but I changed the formatting, because as you had it,
> pgindent would have mangled it.  While I was at it, I tweaked the
> wording of the comment a bit.

Thanks.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Precision loss casting float to numeric

2018-02-25 Thread Chapman Flack
Back in

https://www.postgresql.org/message-id/4e384467-f28a-69ce-75aa-4bc01125a39d%40anastigmatix.net

I got intrigued about casting float values to numeric. Two queries below
(one for float4, one for float8) show what happens for float values with
bits of precision from one up to the limit of the data type. Each value
generated has two 1 bits in the mantissa, one always at 2^0 and the other
at 2^(-p) for increasing p; in other words, each value is the sum of 1 and
2^(-p), and the last such value that compares unequal to 1 is the value
1 plus (FLT_EPSILON or DBL_EPSILON, respectively, for the machine). The
next value generated that way would be indistinguishable from 1.

I'm testing on Intel, IEEE-754 hardware, with 24 bits of precision for
float4 and 53 bits for float8.

TL;DR: casting these values to numeric loses their distinguishability
from 1, six bits early for float4 and five bits early for float8. Casting
to numeric and back again is even worse: you get only six of your 24 bits
back for float4, only 15 of 53 for float8.

Everyone expects the floats to be approximate types and that they
won't be able to exactly represent arbitrary values. However, that's
not how numeric is advertised, and the idea that a numeric can't safely
keep the precision of a float4 or float8, and give it back when you ask,
at best violates least astonishment and arguably is a bug.

I see where at least the first problem (5 or 6 bits lost casting to
numeric) comes in:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/utils/adt/numeric.c;h=6f400729713bfd942cc196b81d50bf25e4757315;hb=c4ba1bee68abe217e441fb81343e5f9e9e2a5353#l3298

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/utils/adt/numeric.c;h=6f400729713bfd942cc196b81d50bf25e4757315;hb=c4ba1bee68abe217e441fb81343e5f9e9e2a5353#l3227

These conversions work by sprintf of the float value to its text
representation with FLT_DIG or DBL_DIG decimal digits of precision.

The mistake is subtle and understandable, because even the C standards
didn't catch it until recently, when they introduced the new constants
FLT_DECIMAL_DIG and DBL_DECIMAL_DIG, because they realized the constant
you need depends on what you're doing.

The old FLT_DIG and DBL_DIG tell you how many digits of a decimal
number you can count on recovering intact after conversion to float
or double and back to decimal. That's inside-out from what's needed
here, where the constant of interest is the number of decimal digits
necessary to reliably represent any float or double so you can get
those original bits back, and that's what the FLT_DECIMAL_DIG and
DBL_DECIMAL_DIG constants are, and they're larger (by 3 and 2 decimal
digits, respectively, on IEEE-754 hardware) than their FLT_DIG and
DBL_DIG counterparts.
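
As a concrete illustration (a standalone sketch, assuming IEEE-754 doubles
and a compiler that provides DBL_DECIMAL_DIG; not part of any patch),
round-tripping 1 + DBL_EPSILON through a decimal string shows the difference:

#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    double  orig = 1.0 + DBL_EPSILON;   /* smallest double above 1 */
    char    buf[64];

    /* DBL_DIG (15) digits prints "1", so the round trip loses the low bit */
    snprintf(buf, sizeof(buf), "%.*g", DBL_DIG, orig);
    printf("%s round trips: %s\n", buf, strtod(buf, NULL) == orig ? "yes" : "no");

    /* DBL_DECIMAL_DIG (17) digits is enough to get the original bits back */
    snprintf(buf, sizeof(buf), "%.*g", DBL_DECIMAL_DIG, orig);
    printf("%s round trips: %s\n", buf, strtod(buf, NULL) == orig ? "yes" : "no");

    return 0;
}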

So, a trivial fix for float4_numeric and float8_numeric would be to
change the constant used in the sprintf, and that would (I think) solve
at least the first precision-loss problem. But it would only compile
where the compiler is new enough to define the newer constants, and
my preferred fix is to just open-code the conversion using frexp
rather than going through the text representation at all. I'm working
on that patch.
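
To sketch what the frexp approach looks like (standalone illustration only,
ignoring sign, zero, infinities and NaN; this is not the patch itself):

#include <float.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    double  val = 1.0 + DBL_EPSILON;
    int     exp;
    double  frac = frexp(val, &exp);            /* val == frac * 2^exp, 0.5 <= frac < 1 */
    int64_t mant = (int64_t) ldexp(frac, 53);   /* exact 53-bit integer mantissa */

    /* val == mant * 2^(exp - 53) exactly; a numeric can be built from that */
    printf("mantissa = %lld, binary exponent = %d\n", (long long) mant, exp - 53);
    return 0;
}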

The second problem (losing even more bits in the roundtrip to numeric
and back) suggests that in the other direction, numeric_float4 and
numeric_float8 need some love too, but right now I'm focused on the
to-numeric direction.

-Chap


Columns:
 place - the place value of the 1 bit on the right; in the last row,
 this is the machine epsilon for the type.

 float4gt  - whether 1+place is distinguishable from 1 using
 float8gt    all float4/float8 operations; true for every row.

 numericgt - whether 1+place is still distinguishable from 1 after
 casting to numeric.

 rtgt  - whether the roundtrip, 1+place cast to numeric and back,
 is still distinguishable from 1. Ends up same as numericgt
 (on my platform anyway).

 rteq  - whether the roundtrip, 1+place cast to numeric and back,
 equals the original 1+place. Starts failing quite early.

WITH RECURSIVE
  f4(place) AS (
VALUES (1::float4)
UNION
SELECT place/2::float4
FROM f4
WHERE 1::float4 + place/2::float4 > 1::float4
  )
SELECT
  place,
  1::float4 + place > 1::float4 AS float4gt,
  (1::float4 + place)::numeric > 1::numeric AS numericgt,
  (1::float4 + place)::numeric::float4 > 1::float4 AS rtgt,
  (1::float4 + place)::numeric::float4 = 1::float4 + place as rteq
FROM
  f4;

    place | float4gt | numericgt | rtgt | rteq
----------+----------+-----------+------+------
        1 | t        | t         | t    | t
      0.5 | t        | t         | t    | t
     0.25 | t        | t         | t    | t
    0.125 | t        | t         | t    | t
   0.0625 | t        | t         | t    | t
  0.03125 | t        | t         | t    | t
 0.015625 | t        | t         | t    | f
 

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-25 Thread Michael Paquier
On Fri, Feb 23, 2018 at 11:02:19PM +0900, Michael Paquier wrote:
> Tsunakawa-san has proposed upthread to fix the problem by zero-ing the
> page read in the WAL receiver.  While I agree that zeroing the page is
> the way to go, doing so in the WAL receiver does not take care of
> problems with the frontends.  I don't have the time to test that yet as
> it is late here, but based on my code lookups, tweaking
> ReadPageInternal() so that the page is zero'ed correctly should do it for
> all the cases.  This way, the checks happening after a page read would
> fail because of the zero'ed fields consistently instead of the garbage
> accumulated.

I have been playing more with that this morning, and trying to tweak the
XLOG reader so that the fetched page is zeroed where necessary does not
help much.  XLogReaderState->EndRecPtr is updated once the last record
is set, so it is possible to use it and zero the rest of the page, which
would prevent a lookup.  But this is actually not wise for performance:
- The data after EndRecPtr could be perfectly valid, so you could zero
data which could be reused later.
- It is necessary to drop the quick-exit checks in ReadPageInternal.

The WAL receiver approach also has a drawback.  If WAL is streamed at
full speed, then the primary sends data with a maximum of 6 WAL pages.
When beginning streaming with a new segment, then the WAL sent stops at
a page boundary.  But if you stop once in the middle of a page then you
need to zero-fill the page until the current segment is finished
streaming.  So if the workload generates spiky WAL then the WAL receiver
would do a lot of extra lseek() calls with the patch applied, while all
the writes would be sequential on HEAD, so that's not good
performance-wise IMO.

Another idea I am thinking about would be to zero-fill the segments
when recycled instead of being just renamed when doing
InstallXLogFileSegment().  This would also have the advantage of making
the segments ahead more compressible, which is a gain for custom
backups, and the WAL receiver does not need any tweaks as it would write
the data on a clean file.  Zero-filling the segments is done only when a
new segment is created (see XLogFileInit).
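
To illustrate what that approach would cost (hypothetical helper only, not
code from XLogFileInit() or InstallXLogFileSegment()), the zero-fill pass is
essentially one sequential rewrite of the segment plus an fsync:

/*
 * Hypothetical sketch: zero-fill an already-open recycled segment of size
 * segsize before reuse.  Assumes fd is positioned at the start of the file.
 */
static void
ZeroFillSegment(int fd, uint64 segsize)
{
    char       *zbuffer = palloc0(XLOG_BLCKSZ);     /* one zeroed block */
    uint64      written;

    for (written = 0; written < segsize; written += XLOG_BLCKSZ)
    {
        errno = 0;
        if (write(fd, zbuffer, XLOG_BLCKSZ) != XLOG_BLCKSZ)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not zero-fill WAL segment: %m")));
    }
    if (pg_fsync(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync WAL segment: %m")));
    pfree(zbuffer);
}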

Each approach has its own cost. Thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: Online enabling of checksums

2018-02-25 Thread Daniel Gustafsson
> On 26 Feb 2018, at 05:48, Tomas Vondra  wrote:
> On 02/24/2018 10:45 PM, Magnus Hagander wrote:
>> Is it really that invisible? Given how much we argue over adding
>> single counters to the stats system, I'm not sure it's quite that
>> low.
> 
> I'm a bit unsure where the flags would be stored - I initially assumed
> pg_database/pg_class, but now I see mentions of the stats system.
> 
> But I wonder why should this be stored in a catalog at all? The info is
> only needed by the bgworker(s), so they could easily flush the current
> status to a file every now and then and fsync it. Then after restart, if
> you find a valid file, use it to resume from the last OK position. If
> not, start from scratch.

Since this allows checksums to be turned off as well, storing a flag in the
catalog would mean issuing a fairly wide update in that case to switch it to
false, which might not be ideal.  Not that I expect (hope) turning checksums
off on a cluster will be a common operation, but that's a fairly large side
effect of doing so.

cheers ./daniel


Re: Online enabling of checksums

2018-02-25 Thread Tomas Vondra


On 02/24/2018 10:45 PM, Magnus Hagander wrote:
> On Sat, Feb 24, 2018 at 1:34 AM, Robert Haas wrote:
> 
> On Thu, Feb 22, 2018 at 3:28 PM, Magnus Hagander wrote:
> > I would prefer that yes. But having to re-read 9TB is still significantly
> > better than not being able to turn on checksums at all (state today). And
> > adding a catalog column for it will carry the cost of the migration
> > *forever*, both for clusters that never have checksums and those that had it
> > from the beginning.
> >
> > Accepting that the process will start over (but only read, not re-write, the
> > blocks that have already been processed) in case of a crash does
> > significantly simplify the process, and reduce the long-term cost of it in
> > the form of entries in the catalogs. Since this is a one-time operation (or
> > for many people, a zero-time operation), paying that cost that one time is
> > probably better than paying a much smaller cost but constantly.
> 
> That's not totally illogical, but to be honest I'm kinda surprised
> that you're approaching it that way.  I would have thought that
> relchecksums and datchecksums columns would have been a sort of
> automatic design choice for this feature.  The thing to keep in mind
> is that nobody's going to notice the overhead of adding those columns
> in practice, but someone will surely notice the pain that comes from
> having to restart the whole operation.  You're talking about trading
> an effectively invisible overhead for a very noticeable operational
> problem.
> 
> 
> Is it really that invisible? Given how much we argue over adding
> single counters to the stats system, I'm not sure it's quite that
> low.

I'm a bit unsure where the flags would be stored - I initially assumed
pg_database/pg_class, but now I see mentions of the stats system.

But I wonder why should this be stored in a catalog at all? The info is
only needed by the bgworker(s), so they could easily flush the current
status to a file every now and then and fsync it. Then after restart, if
you find a valid file, use it to resume from the last OK position. If
not, start from scratch.

FWIW this is pretty much what the stats collector does.
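
To illustrate what I mean (names and file layout entirely made up here, not
a concrete proposal), the worker side could be as small as something like:

typedef struct ChecksumProgress
{
    Oid         current_db;         /* database being processed */
    Oid         current_rel;        /* relation being processed */
    BlockNumber current_block;      /* next block to checksum */
} ChecksumProgress;

static void
save_progress(const ChecksumProgress *p)
{
    FILE       *fp = AllocateFile("checksum_progress.tmp", "wb");

    if (fp == NULL ||
        fwrite(p, sizeof(*p), 1, fp) != 1 ||
        FreeFile(fp) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write checksum progress file: %m")));

    /* atomically replace the previous state, fsyncing along the way */
    durable_rename("checksum_progress.tmp", "checksum_progress", ERROR);
}

static bool
load_progress(ChecksumProgress *p)
{
    FILE       *fp = AllocateFile("checksum_progress", "rb");
    bool        ok;

    if (fp == NULL)
        return false;               /* no file: start from scratch */
    ok = (fread(p, sizeof(*p), 1, fp) == 1);
    FreeFile(fp);
    return ok;
}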

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: TupleTableSlot abstraction

2018-02-25 Thread Alexander Korotkov
Hi!

Thank you for working on this subject.  See some my comments below.

On Wed, Feb 21, 2018 at 1:43 AM, Andres Freund  wrote:

> - TupleTableSlots have to contain all the necessary information for each
>   type of slot. Some implementations might require a buffer pin to be
>   hold (our heap), others pointers to multiple underlying points of data
>   (columns store), some need additional metadata (our MinimalTuple).
>
>   Unifying the information into one struct makes it much harder to
>   provide differing implementations without modifying core postgres
>   and/or increasing overhead for every user of slots.
>
>   I think the consequence of that is that we need to allow each type of
>   slot to have their own TupleTableSlot embedding struct.
>

+1, I think that it's reasonable to keep minimal common slot information in
the TupleTableSlot struct and to let particular slot implementations extend
it.

>   Haribabu's patch solved this by adding a tts_storage parameter that
>   contains additional data, but imo that's unconvincing due to the added
>   indirection in very performance critical codepaths.
>

+1

> - The way slots currently are implemented each new variant data is
>   stored in slots adds new branches to hot code paths like
>   ExecClearTuple(), and they're not extensible.
>
>   Therefore I think we need to move to callback based implementation. In
>   my tests, by moving the callback invocations into very thin inline
>   functions, the branch prediction accuracy actually goes slightly
>   *up*.
>

Sounds interesting.  Besides branch prediction accuracy, what can you
say about overall performance?

> - Currently we frequently convert from one way of representing a row
>   into another, in the same slot. We do so fairly implicilty in
>   ExecMaterializeSlot(), ExecFetchSlotMinimalTuple(), ExecCopySlot()...
>
>   The more we abstract specific storage representation away, the worse I
>   think that is. I think we should make such conversions explicit by
>   requiring transfers between two slots if a specific type is required.
>
>
> - We have a few codepaths that set fields on tuples that can't be
>   represented in virtual slots. E.g. postgres_fdw sets ctid, oid,
>   ExecProcessReturning() will set tableoid.
>
>
> In my POC I turned TupleTableSlot into basically an abstract base class:
> struct TupleTableSlot
> {
> NodeTag type;
>
> uint16  flags;
> uint16  nvalid;     /* # of valid values in tts_values */
>
> const TupleTableSlotOps *const cb;
>
> TupleDesc   tupleDescriptor;    /* slot's tuple descriptor */
>
> Datum  *values;     /* current per-attribute values */
> bool   *nulls;      /* current per-attribute null flags */
>
> /* can we optimize away? */
> MemoryContext mcxt; /* slot itself is in this context */
> };
>
> which then is inherited from for specific implementations of tuple
> slots:
>
> typedef struct HeapTupleTableSlot
> {
> TupleTableSlot base;
> HeapTuple   tuple;      /* physical tuple */
> uint32  off;            /* saved state for slot_deform_tuple */
> } HeapTupleTableSlot;
>
> ...
> typedef struct MinimalTupleTableSlot
> {
> TupleTableSlot base;
> HeapTuple   tuple;      /* tuple wrapper */
> MinimalTuple mintuple;  /* minimal tuple, or NULL if none */
> HeapTupleData minhdr;   /* workspace for minimal-tuple-only case */
> uint32  off;            /* saved state for slot_deform_tuple */
> } MinimalTupleTableSlot;
>
>
> typedef struct TupleTableSlotOps
> {
> void (*init)(TupleTableSlot *slot);
> void (*release)(TupleTableSlot *slot);
>
> void (*clear)(TupleTableSlot *slot);
>
> void (*getsomeattrs)(TupleTableSlot *slot, int natts);
>
> void (*materialize)(TupleTableSlot *slot);
>
> HeapTuple (*get_heap_tuple)(TupleTableSlot *slot);
> MinimalTuple (*get_minimal_tuple)(TupleTableSlot *slot);
>
> HeapTuple (*copy_heap_tuple)(TupleTableSlot *slot);
> MinimalTuple (*copy_minimal_tuple)(TupleTableSlot *slot);
>
> } TupleTableSlotOps;
>
> when creating a slot one has to specify the type of slot one wants to
> use:
>
> typedef enum TupleTableSlotType
> {
> TTS_TYPE_VIRTUAL,
> TTS_TYPE_HEAPTUPLE,
> TTS_TYPE_MINIMALTUPLE,
> TTS_TYPE_BUFFER
> } TupleTableSlotType;
>
> extern TupleTableSlot *MakeTupleTableSlot(TupleDesc desc,
> TupleTableSlotType st);
>

Can a user define his own slot type?  If so, I assume that the hard-coded
enum is a limitation of the current prototype, and eventually we would have
an extendable array of slot types.  Is that correct?

> You might notice that I've renamed a few fields (tts_values -> values,
> tts_isnull -> nulls, tts_nvalid -> nvalid) that haven't actually changed
> in the above structs / the attached 

Re: Online enabling of checksums

2018-02-25 Thread Tomas Vondra

On 02/25/2018 03:57 PM, Magnus Hagander wrote:
> 
> ...
> 
>> > > I very strongly doubt it's a "very noticeable operational problem". People
>> > > don't restart their databases very often... Let's say it takes 2-3 weeks to
>> > > complete a run in a fairly large database. How many such large databases
>> > > actually restart that frequently? I'm not sure I know of any. And the only
>> > > effect of it is you have to start the process over (but read-only for the
>> > > part you have already done). It's certainly not ideal, but I don't agree
>> > > it's in any form a "very noticeable problem".
>> >
>> > I definitely know large databases that fail over more frequently than
>> > that.
>> >
>>
>> I would argue that they have bigger issues than enabling checksums... By
>> far.
> 
> In one case it's intentional, to make sure the overall system copes. Not
> that insane.
> 
> That I can understand. But in a scenario like that, you can also stop
> doing that for the period of time when you're rebuilding checksums, if
> re-reading the database over and over again is an issue.
> 
> Note, I'm not saying it wouldn't be nice to have the incremental
> functionality. I'm just saying it's not needed in a first version.
> 

I agree with this sentiment. I don't think we can make each patch
perfect for everyone - certainly not in v1 :-/

Sure, it would be great to allow resume after a restart, but if that
means we won't get anything in PG 11 then I think it's not a good
service to our users. OTOH if the patch without a resume addresses the
issue for 99% of users, and we can improve it in PG 12, why not? That
seems exactly like the incremental thing we do for many other features.

So +1 to not make the "incremental resume" mandatory. If we can support
it, great! But I think the patch may seem less complex than it actually
is, and figuring out how the resume should work will take some time.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online enabling of checksums

2018-02-25 Thread Tomas Vondra


On 02/24/2018 03:51 AM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 02/24/2018 03:11 AM, Andres Freund wrote:
>>> On 2018-02-24 03:07:28 +0100, Tomas Vondra wrote:
>>>> I agree having to restart the whole operation after a crash is not
>>>> ideal, but I don't see how adding a flag actually solves it. The problem
>>>> is the large databases often store most of the data (>80%) in one or two
>>>> central tables (think fact tables in star schema, etc.). So if you
>>>> crash, it's likely half-way while processing this table, so the whole
>>>> table would still have relchecksums=false and would have to be processed
>>>> from scratch.
>>>
>>> I don't think it's quite as large a problem as you make it out to
>>> be. Even in those cases you'll usually have indexes, toast tables and so
>>> forth.
>>
>> Hmmm, right. I've been focused on tables and kinda forgot that the other
>> objects need to be transformed too ... :-/
> 
> There's also something of a difference between just scanning a table or
> index, where you don't have to do much in the way of actual writes
> because most of the table already has valid checksums, and having to
> actually write out all the changes.
> 
>>>> But perhaps you meant something like "position" instead of just a simple
>>>> true/false flag?
>>>
>>> I think that'd incur a much larger complexity cost.
>>
>> Yep, that was part of the point that I was getting to - that actually
>> addressing the issue would be more expensive than simple flags. But as
>> you pointed out, that was not quite ... well thought through.
> 
> No, but it's also not entirely wrong. Huge tables aren't uncommon.
> 
> That said, I'm not entirely convinced that these new flags would be
> as unnoticed as is being suggested here, but rather than focus on
> either side of that, I'm thinking about what we want to have *next*-
> we know that enabling/disabling checksums is an issue that needs to
> be solved, and this patch is making progress towards that, but the
> next question is what does one do when a page has been detected as
> corrupted? Are there flag fields which would be useful to have at a
> per-relation level to support some kind of corrective action or
> setting that says "don't care about checksums on this table, even
> though the entire database is supposed to have valid checksums, but
> instead do X with failed pages" or similar.
> 

Those questions are definitely worth asking, and I agree our ability to
respond to data corruption (incorrect checksums) needs improvements. But
I don't really see how a single per-relation flag would make any of that
easier?

Perhaps there are other flags/fields that might help, like for example
the maximum number of checksum errors per relation (although I don't
consider that very useful in practice), but that seems rather unrelated
to this patch.

> Beyond dealing with corruption-recovery cases, are there other use
> cases for having a given table not have checksums?
> 

Well, I see checksums are a way to detect data corruption caused by
storage, so if you have tablespaces backed by different storage systems,
you could disable checksums for objects on the storage you 100% trust.
That would limit the overhead of computing checksums.

But then again, this seems entirely unrelated to the patch discussed
here. That would obviously require flags in catalogs, and if the patch
eventually adds flags those would need to be separate.

> Would it make sense to introduce a flag or field which indicates that
> an entire table's pages has some set of attributes, of which
> 'checksums' is just one attribute? Perhaps a page version, which
> potentially allows us to have a way to change page layouts in the
> future?
> 
> I'm happy to be told that we simply don't have enough information at 
> this point to make anything larger than a relchecksums field-level 
> decision, but perhaps these thoughts will spark an idea about how we 
> could define something a bit broader with clear downstream
> usefulness that happens to also cover the "does this relation have
> checksums?" question.
> 

I don't know. But I think we need to stop moving the goalposts further
and further away, otherwise we won't get anything until PostgreSQL 73.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Small improvement to compactify_tuples

2018-02-25 Thread Yura Sokolov
23.01.2018 06:34, Stephen Frost wrote:
> Greetings,
> 
> * Юрий Соколов (funny.fal...@gmail.com) wrote:
>> On Wed, Nov 29, 2017 at 8:00 AM, Peter Geoghegan  wrote:
>>> On Tue, Nov 28, 2017 at 2:41 PM, Andres Freund  wrote:
 Maybe it's a stupid question. But would we still want to have this after
 the change? These should be just specializations of the template version
 imo.
>>
>> "generic" version operates on bytes, and it will be a bit hard to combine
>> it with
>> templated version. Not impossible, but it will look ugly.
> 
> If that's the case then does it really make sense to make this change..?

I don't think it is really necessary to implement the generic version
through the templated one. It is much better to replace the generic version
with the templated one in places where it matters for performance.
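
To be clear about what "templated" means here, a generic illustration only
(using a simple insertion sort as a stand-in, not the actual patch): the
element type and comparator are fixed at compile time, so the comparison
can be inlined instead of going through qsort()'s byte-oriented,
function-pointer comparisons.

#include <stddef.h>
#include <stdint.h>

/* Generate a specialized sort function for a given type and comparator. */
#define DEFINE_INSERTION_SORT(NAME, TYPE, CMP)              \
static void                                                 \
NAME(TYPE *a, size_t n)                                     \
{                                                           \
    for (size_t i = 1; i < n; i++)                          \
    {                                                       \
        TYPE    key = a[i];                                 \
        size_t  j = i;                                      \
                                                            \
        while (j > 0 && CMP(key, a[j - 1]))                 \
        {                                                   \
            a[j] = a[j - 1];                                \
            j--;                                            \
        }                                                   \
        a[j] = key;                                         \
    }                                                       \
}

/* specialization: sort item offsets in descending order */
#define ITEMOFF_GREATER(x, y)   ((x) > (y))
DEFINE_INSERTION_SORT(sort_itemoff_desc, uint16_t, ITEMOFF_GREATER)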

> 
>> In attach fixed qsort_template version.
>> And version for compactify_tuples with bucket_sort and templated qsort.
> 
> While having the patch is handy, I'm not seeing any performance numbers
> on this version, and I imagine others watching this thread are also
> wondering about things like a test run that just uses the specialized
> qsort_itemIds() without the bucketsort.
> 
> Are you planning to post some updated numbers and/or an updated test
> case that hopefully shows best/worst case with this change?  Would be
> good to get that on a couple of platforms too, if possible, since we've
> seen that the original benchmarks weren't able to be consistently
> repeated across different platforms.  Without someone doing that
> leg-work, this doesn't seem like it'll be moving forward.

Updated numbers are (same benchmark on same notebook, but with new
master, new ubuntu and later patch version) (average among 6 runs):

master               - 16135 tps
with templated qsort - 16199 tps
with bucket sort     - 16956 tps

Difference is still measurable, but less significant. I don't know why.

Rebased version of the first patch (qsort template) is attached.

With regards,
Sokolov Yura.


0001-Header-for-customized-qsort.patch.gz
Description: application/gzip


signature.asc
Description: OpenPGP digital signature


Re: VACUUM FULL name is very confusing to some people (or to most non expert people)

2018-02-25 Thread Pavel Stehule
2018-02-25 18:51 GMT+01:00 Lætitia Avrot :

> Hi all,
>
> For most beginners (and even a lot of advanced users)  there is a strong
> confusion between simple VACUUM and VACUUM FULL. They think "full" is
> simply an option to the maintenance operation vacuum while it's not. It's a
> completely different operation.
>
> I have a hard time explaining it when I teach PostgreSQL Administration
> (even if I stress the matter) and I constantly meet customers that are wrong
> about it.
>
> I think that the way we name these two operations is not helping them. I
> had to work with SQL Server some years ago and they use the word "SHRINK"
> to do something similar to "VACUUM FULL". I don't know if it's the best
> option, I think others can be found (COMPACT, DEFRAGMENT...)
>
> Of course, for compatibility reasons, VACUUM FULL should always be
> available, but I think an alias that is less confusing for people could be
> a good thing.
>
> What do you think ?
>

Although VACUUM and VACUUM FULL are different, the result is the same
(depending on the level of detail) - the data files are optimized for further
processing. You should see VACUUM as a family of commands that do some data
file optimizations: VACUUM, VACUUM FULL, VACUUM FREEZE, VACUUM ANALYZE, ...
Personally I don't think we need to implement a new synonym command for
this case. Why can't you just say to your students "VACUUM FULL is like
SHRINK in SQL Server"?

Regards

Pavel



>
> Cheers,
>
> Lætitia
>


VACUUM FULL name is very confusing to some people (or to most non expert people)

2018-02-25 Thread Lætitia Avrot
Hi all,

For most beginners (and even a lot of advanced users) there is a strong
confusion between simple VACUUM and VACUUM FULL. They think "full" is
simply an option to the maintenance operation vacuum while it's not. It's a
completely different operation.

I have a hard time explaining it when I teach PostgreSQL Administration
(even if I stress the matter) and I constantly meet customers that are wrong
about it.

I think that the way we name these two operations is not helping them. I had
to work with SQL Server some years ago and they use the word "SHRINK" to do
something similar to "VACUUM FULL". I don't know if it's the best option, I
think others can be found (COMPACT, DEFRAGMENT...)

Of course, for compatibility reasons, VACUUM FULL should always be
available, but I think an alias that is less confusing for people could be
a good thing.

What do you think ?

Cheers,

Lætitia


Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

2018-02-25 Thread Chapman Flack
On 07/17/17 11:29, Michael Paquier wrote:
> On Thu, Jul 6, 2017 at 3:48 PM, Heikki Linnakangas  wrote:
>> On 07/03/2017 06:30 PM, Chapman Flack wrote:
>>> Although it's moot in the straightforward approach of re-zeroing in
>>> the loop, it would still help my understanding of the system to know
>>> if there is some subtlety that would have broken what I proposed
>>> earlier, which was an extra flag to AdvanceXLInsertBuffer() that
>>> ...
>>
>> Yeah, I suppose that would work, too.
> 
> FWIW, I would rather see any optimization done in
> AdvanceXLInsertBuffer() instead of seeing a second memset re-zeroing
> the WAL page header after its data has been initialized by
> AdvanceXLInsertBuffer() once.

Here is a patch implementing the simpler approach Heikki suggested
(though my original leaning had been to wrench on AdvanceXLInsertBuffer
as Michael suggests). The sheer simplicity of the smaller change
eventually won me over, unless there's a strong objection.

As noted before, the cost of the extra small MemSet is proportional
to the number of unused pages in the segment, and that is an indication
of how busy the system *isn't*. I don't have a time benchmark of the
patch's impact; if I should, what would be a good methodology?

Before the change, what some common compression tools can achieve on
a mostly empty (select pg_switch_wal()) segment on my hardware are:

gzip  27052 in 0.145s
xz 5852 in 0.678s
lzip   5747 in 1.254s
bzip2  1445 in 0.261s

bzip2 is already the clear leader (I don't have lz4 handy to include in
the comparison) at around 1/20th of the size gzip can achieve, and that's
before this patch. After:

gzip  16393 in 0.143s
xz     2640 in 0.520s
lzip   2535 in 1.198s
bzip2   147 in 0.238s

The patch gives gzip almost an extra factor of two, and about the same
for xz and lzip, and bzip2 gains nearly another order of magnitude.

I considered adding a regression test for unfilled-segment compressibility,
but wasn't sure where it would most appropriately go. I assume a TAP test
would be the way to do it.

-Chap
>From 35684bdaa1d27c3f4b7198541f3c92bb2b4cb2f4 Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Sun, 25 Feb 2018 11:44:47 -0500
Subject: [PATCH] Zero headers of unused pages after WAL switch.

When writing zeroed pages to the remainder of a WAL segment
after a WAL switch, ensure that the headers of those pages are
also zeroed, as their initialized values otherwise reduce the
compressibility of the WAL segment file by general tools.
---
 src/backend/access/transam/xlog.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47a6c4d..a91ec7b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1556,7 +1556,16 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 		{
 			/* initialize the next page (if not initialized already) */
 			WALInsertLockUpdateInsertingAt(CurrPos);
-			AdvanceXLInsertBuffer(CurrPos, false);
+			/*
+			 * Fields in the page header preinitialized by AdvanceXLInsertBuffer
+			 * to nonconstant values reduce the compressibility of WAL segments,
+			 * and aren't needed in the freespace following a switch record.
+			 * Re-zero that header area. This is not performance-critical, as
+			 * the more empty pages there are for this loop to touch, the less
+			 * busy the system is.
+			 */
+			currpos = GetXLogBuffer(CurrPos);
+			MemSet(currpos, 0, SizeOfXLogShortPHD);
 			CurrPos += XLOG_BLCKSZ;
 		}
 	}
-- 
2.7.3



Re: Online enabling of checksums

2018-02-25 Thread Magnus Hagander
On Sat, Feb 24, 2018 at 10:48 PM, Magnus Hagander wrote:

> On Sat, Feb 24, 2018 at 4:29 AM, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>> Hi,
>>
>> I see the patch also does throttling by calling vacuum_delay_point().
>> Being able to throttle the checksum workers not to affect user activity
>> definitely seems like a useful feature, no complaints here.
>>
>> But perhaps binding it to vacuum_cost_limit/vacuum_cost_delay is not the
>> best idea? I mean, enabling checksums seems rather unrelated to vacuum,
>> so it seems a bit strange to configure it by vacuum_* options.
>>
>> Also, how am I supposed to set the cost limit? Perhaps I'm missing
>> something, but the vacuum_delay_point call happens in the bgworker, so
>> setting the cost limit before running pg_enable_data_checksums() will not
>> get there, right? I really don't want to be setting it in the config
>> file, because then it will suddenly affect all user VACUUM commands.
>>
>> And if this patch gets improved to use multiple parallel workers, we'll
>> need a global limit (something like we have for autovacuum workers).
>>
>> In any case, I suggest mentioning this in the docs.
>>
>>
> Ah yes. I actually have it on my TODO to work on that, but I forgot to put
> that in the email I sent out. Apologies for that, and thanks for pointing
> it out!
>
> Right now you have to set the limit in the configuration file. That's of
> course not the way we want to have it long term (but as long as it is that
> way it should at least be documented). My plan is to either pick it up from
> the current session that calls pg_enable_data_checksums(), or to simply
> pass it down as parameters to the function directly.  I'm thinking the
> second one (pass a cost_delay and a cost_limit as optional parameters to
> the function) is the best one because as you say actually overloading it on
> the user visible GUCs seems a bit ugly. Once there I think the easiest is
> to just pass it down to the workers through the shared memory segment.
>
>
PFA an updated patch that adds this, and also fixes the problem in
pg_verify_checksums spotted by Michael Banck.
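
For illustration only, a minimal sketch of the pass-down idea -- the
user-supplied throttling parameters copied into a shared struct that each
worker consults per block; all names here are hypothetical and not taken
from the patch:

#include <unistd.h>

/* Hypothetical shared-memory block; the launcher would fill it in from the
 * cost_delay/cost_limit arguments of pg_enable_data_checksums(). */
typedef struct ChecksumWorkerShmemSketch
{
    int     cost_delay_ms;  /* sleep length once the budget is spent */
    int     cost_limit;     /* page-cost budget between sleeps */
} ChecksumWorkerShmemSketch;

/* Called by the worker for every block it processes, mimicking the
 * cost-based vacuum delay behaviour mentioned above. */
static void
throttle_sketch(const ChecksumWorkerShmemSketch *shm, int *cost_balance)
{
    if (shm->cost_limit <= 0)
        return;                 /* throttling disabled */

    *cost_balance += 1;         /* assume a flat cost of 1 per page */
    if (*cost_balance >= shm->cost_limit)
    {
        usleep((useconds_t) shm->cost_delay_ms * 1000);
        *cost_balance = 0;
    }
}

int
main(void)
{
    ChecksumWorkerShmemSketch shm = {.cost_delay_ms = 10, .cost_limit = 200};
    int balance = 0;
    int blkno;

    for (blkno = 0; blkno < 1000; blkno++)
        throttle_sketch(&shm, &balance);    /* sleeps ~10 ms every 200 blocks */
    return 0;
}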

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4c998fe51f..dc05ac3e55 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8537,7 +8537,8 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
 or hide corruption, or other serious problems.  However, it may allow
 you to get past the error and retrieve undamaged tuples that might still be
 present in the table if the block header is still sane. If the header is
-corrupt an error will be reported even if this option is enabled. The
+corrupt an error will be reported even if this option is enabled. This
+option can only be enabled when data checksums are enabled. The
 default setting is off, and it can only be changed by a superuser.

   
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2f59af25a6..a011ea1d8f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19481,6 +19481,71 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 
   
 
+  
+   Data Checksum Functions
+
+   
+The functions shown in  can
+be used to enable or disable data checksums in a running cluster.
+See  for details.
+   
+
+   
+Checksum SQL Functions
+
+ 
+  
+   Function
+   Return Type
+   Description
+  
+ 
+ 
+  
+   
+
+ pg_enable_data_checksums
+
+pg_enable_data_checksums(cost_delay int, cost_limit int)
+   
+   
+bool
+   
+   
+
+ Initiates data checksums for the cluster. This will switch the data checksums mode
+ to in progress and start a background worker that will process
+ all data in the database and enable checksums for it. When all data pages have had
+ checksums enabled, the cluster will automatically switch to checksums
+ on.
+
+
+ If cost_delay and cost_limit are
+ specified, the speed of the process is throttled using the same principles as
+ Cost-based Vacuum Delay.
+
+   
+  
+  
+   
+
+ pg_disable_data_checksums
+
+pg_disable_data_checksums()
+   
+   
+bool
+   
+   
+Disables data checksums for the cluster.
+   
+  
+ 
+
+   
+
+  
+
   
Database Object Management Functions
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 22e6893211..c81c87ef41 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -210,6 +210,7 @@ Complete list of usable sgml source 

Re: Online enabling of checksums

2018-02-25 Thread Magnus Hagander
On Sat, Feb 24, 2018 at 11:06 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-02-24 22:56:57 +0100, Magnus Hagander wrote:
> > On Sat, Feb 24, 2018 at 10:49 PM, Andres Freund 
> wrote:
> > > > We did consider doing it at a per-table basis as well. But this is also
> > > > an overhead that has to be paid forever, whereas the risk of having to
> > > > read the database files more than once (because it'd only have to read
> > > > them on the second pass, not write anything) is a one-off operation.
> > > > And for all those that have initialized with checksums in the first
> > > > place don't have to pay any overhead at all in the current design.
> > >
> > > Why does it have to be paid forever?
> > >
> >
> > The size of the pg_class row would be there forever. Granted, it's not
> > that big an overhead given that there are already plenty of columns there.
> > But the point being you can never remove that column, and it will be there
> > for users who never even considered running without checksums. It's
> > certainly not a large overhead, but it's also not zero.
>
> But it can be removed in the next major version, if we decide it's a
> good idea? We're not bound on compatibility for catalog layout.
>

Sure.

But we can also *add* it in the next major version, if we decide it's a
good idea?


> FWIW, there's some padding space available where we currently could
> store two booleans without any space overhead. Also, if we decide that
> the boolean columns (which don't matter much in comparison to the rest
> of the data, particularly relname), we can compress them into a
> bitfield.
>
> I don't think this is a valid reason for not supporting
> interrupability. You can make fair arguments about adding incremental
> support incrementally and whatnot, but the catalog size argument doesn't
> seem part of a valid argument.
>

Fair enough.
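
As an aside on the padding point above, an illustrative stand-alone sketch
(made-up layout, not the real pg_class; column names borrowed for flavor)
showing that a bool dropped into an existing alignment hole costs nothing,
and that several flags can be folded into one bit field:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* On the usual 64-bit ABIs an 8-byte member followed by a 4-byte member
 * leaves a 4-byte alignment hole at the end, so a bool added there is free. */
typedef struct WithoutFlagSketch
{
    int64_t something_wide;
    int32_t something_narrow;
} WithoutFlagSketch;

typedef struct WithFlagSketch
{
    int64_t something_wide;
    int32_t something_narrow;
    bool    relhaschecksums;    /* lands in the former padding */
} WithFlagSketch;

/* And several existing booleans could be folded into one bit field. */
typedef struct PackedFlagsSketch
{
    unsigned int relhasindex:1;
    unsigned int relisshared:1;
    unsigned int relhaschecksums:1;
} PackedFlagsSketch;

int
main(void)
{
    printf("without flag: %zu bytes\n", sizeof(WithoutFlagSketch));
    printf("with flag:    %zu bytes\n", sizeof(WithFlagSketch));   /* same size */
    printf("packed flags: %zu bytes\n", sizeof(PackedFlagsSketch));
    return 0;
}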



> > > I very strongly doubt it's a "very noticeable operational problem".
> > > People don't restart their databases very often... Let's say it takes
> > > 2-3 weeks to complete a run in a fairly large database. How many such
> > > large databases actually restart that frequently? I'm not sure I know
> > > of any. And the only effect of it is you have to start the process over
> > > (but read-only for the part you have already done). It's certainly not
> > > ideal, but I don't agree it's in any form a "very noticeable problem".
> >
> > I definitely know large databases that fail over more frequently than
> > that.
> >
>
> I would argue that they have bigger issues than enabling checksums... By
> far.

> In one case it's intentional, to make sure the overall system copes. Not
> that insane.

That I can understand. But in a scenario like that, you can also stop doing
that for the period of time when you're rebuilding checksums, if re-reading
the database over and over again is an issue.

Note, I'm not saying it wouldn't be nice to have the incremental
functionality. I'm just saying it's not needed in a first version.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Online enabling of checksums

2018-02-25 Thread Magnus Hagander
On Sun, Feb 25, 2018 at 1:21 AM, Greg Stark  wrote:

> > The change of the checksum state is WAL logged with a new xlog record.
> > All the buffers written by the background worker are forcibly enabled full
> > page writes to make sure the checksum is fully updated on the standby even
> > if no actual contents of the buffer changed.
>
> Hm. That doesn't sound necessary to me. If you generate a checkpoint
> (or just wait until a new checkpoint has started) then go through and
> do a normal xlog record for every page (any xlog record, a noop record
> even) then the normal logic for full page writes ought to be
> sufficient. If the noop record doesn't need a full page write it's
> because someone else has already come in and done one and that one
> will set the checksum. In fact if any page has an lsn > the checkpoint
> start lsn for the checkpoint after the flag was flipped then you
> wouldn't need to issue any record at all.
>

What would be the actual benefit though? We'd have to invent a noop WAL
record, and just have some other part of the system do the full page write?
So why not just send the full page in the first place?

Also if that wasn't clear -- we skip the full page write if there is
already a checksum on the page and that checksum is correct.

(We do trigger a checkpoint at the end, and wait for it to complete)
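
Spelled out as a sketch of the per-block decision (the two helpers are
hypothetical stand-ins for the buffer-manager and WAL calls, not the
functions the patch actually uses):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool
page_has_valid_checksum(uint32_t blkno)
{
    return (blkno % 2) == 0;    /* fake: pretend even blocks already verify */
}

static void
force_full_page_write(uint32_t blkno)
{
    printf("dirty + full-page image for block %u\n", blkno);
}

/*
 * Per-block logic: pages that already carry a correct checksum are left
 * alone; everything else is dirtied and WAL-logged with a full page image
 * so the standby ends up with a checksummed copy as well.
 */
static void
process_block(uint32_t blkno)
{
    if (page_has_valid_checksum(blkno))
        return;
    force_full_page_write(blkno);
}

int
main(void)
{
    uint32_t blkno;

    for (blkno = 0; blkno < 8; blkno++)
        process_block(blkno);
    return 0;
}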

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: New gist vacuum.

2018-02-25 Thread Michail Nikolaev
> I'll attach patch file next message.

Updated patch is attached.


gist-vacuum-count.patch
Description: Binary data


Re: handling of heap rewrites in logical decoding

2018-02-25 Thread Craig Ringer
On 25 February 2018 at 09:57, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> A heap rewrite during a DDL command publishes changes for relations
> named like pg_temp_%u, which are not real tables, and this breaks
> replication systems that are not aware of that.  We have a hack in the
> pgoutput plugin to ignore those, but we knew that was a hack.  So here
> is an attempt at a more proper solution: Store a relisrewrite column in
> pg_class and filter on that.
>
> I have put this into reorderbuffer.c, so that it affects all plugins.
> An alternative would be to have each plugin handle it separately (the
> way it is done now), but it's hard to see why a plugin would not want
> this fixed behavior.
>



>
> A more fancy solution would be to make this column an OID that links
> back to the original table.  Then, a DDL-aware plugin could choose
> replicate the rewrite rows.  However, at the moment no plugin works that
> way, so this might be overdoing it.
>



I'm pretty sure we _will_ want the ability to decode and stream rewrite
contents for non-IMMUTABLE table rewrites.

Filtering out by default is OK by me, but I think making it impossible to
decode is a mistake. So I'm all for the oid option and had written a
suggestion for it before I saw you already mentioned it  in the next part
of your mail.

The main issue with filtering out rewrites by default is that I don't see
how, if we default to ignore/filter-out, plugins would indicate "actually I
want to choose about this one" or "I understand table rewrites". I'd prefer
not to add another change callback.
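
One way the opt-in could look, sketched with made-up structures rather than
the actual reorderbuffer/output-plugin types: the default filters rewrite
relations out, and a plugin that understands rewrites flips a capability
flag instead of implementing a new callback.

#include <stdbool.h>
#include <stdio.h>

/* Miniature stand-ins for the relation and plugin state involved; the real
 * code would look at pg_class (the proposed relisrewrite column) and the
 * output plugin's startup options. */
typedef struct RelationSketch
{
    const char *relname;
    bool        is_rewrite_target;  /* i.e. one of the pg_temp_%u rewrites */
} RelationSketch;

typedef struct PluginSketch
{
    bool        wants_rewrites;     /* hypothetical opt-in capability flag */
} PluginSketch;

/* Decide whether a change on this relation reaches the plugin at all. */
static bool
should_stream_change(const PluginSketch *plugin, const RelationSketch *rel)
{
    if (rel->is_rewrite_target && !plugin->wants_rewrites)
        return false;               /* filtered out by default */
    return true;
}

int
main(void)
{
    RelationSketch rels[] = {
        {"some_table", false},
        {"pg_temp_16385", true},
    };
    PluginSketch dumb = {false};
    PluginSketch ddl_aware = {true};
    int i;

    for (i = 0; i < 2; i++)
        printf("%s: dumb=%d ddl_aware=%d\n", rels[i].relname,
               should_stream_change(&dumb, &rels[i]),
               should_stream_change(&ddl_aware, &rels[i]));
    return 0;
}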

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: prokind column (was Re: [HACKERS] SQL procedures)

2018-02-25 Thread John Naylor
On 2/25/18, Tom Lane  wrote:
> We need a plan for when/how to apply this, along with the proposed
> bootstrap data conversion patch, which obviously conflicts with it
> significantly.

The bulk changes in the bootstrap data patch are scripted rather than
patched, so the prokind patch will pose little in the way of
conflicts. I can't verify this just yet since Peter's second patch
doesn't apply for me against c4ba1bee68ab. Also, as of version 7 my
patch left out default values and human-readable oids, since I wanted
to get the new generated headers reviewed and up to project standards
first. Since I'll likely have to adjust the patches for those features
anyway, there's plenty of room for me to adjust to the changes to
pg_proc.h as well.

> My thought here is that the data conversion patch is going to break
> basically every pending patch that touches src/include/catalog/,
> so we ought to apply it at a point where that list of patches is short
> and there's lots of time for people to redo them.  Hence, end of the
> dev cycle is the right time.

I agree.

-John Naylor